The task of building a Natural Language Processing (NLP) text summarizer in one night from scratch is a challenging but rewarding endeavor. This project aims to demonstrate the feasibility of quickly creating a basic NLP summarization tool, using various NLP techniques and algorithms. The summarizer will be designed to effectively condense large amounts of text into a brief summary, highlighting the most important information. By doing so, it can save time and improve efficiency for individuals and organizations that need to quickly process large amounts of information. The end result of this project will not only be a functional NLP summarizer, but also a deeper understanding of the processes and techniques involved in building NLP applications.

We start by laying out the following high-level plan:

  1. Install Required Libraries: Start by installing the necessary libraries, including NumPy, Pandas, and NLTK (see the setup sketch right after this list). You can use the following command to install them: pip install numpy pandas nltk
  2. Load Data: Load the data you want to summarize into a Pandas dataframe. You can use the pandas.read_csv function for this.
  3. Preprocessing: Clean and preprocess the text data. This involves converting all the text to lowercase, removing stop words, stemming or lemmatizing words, and removing punctuation.
  4. Tokenization: Tokenize the text data into sentences or words, using the nltk.sent_tokenize or nltk.word_tokenize function.
  5. Vectorization: Convert the tokenized sentences or words into numerical vectors using the CountVectorizer or TfidfVectorizer class from the sklearn library.
  6. Text Rank Algorithm: Implement the Text Rank Algorithm to generate a score for each sentence based on its importance and relevance. The score will be used to select the most important sentences for the summary.
  7. Select Sentences: Select the top N sentences with the highest scores to form the summary. You can use the numpy.argsort function to sort the scores and select the top N sentences.
  8. Generate Summary: Finally, concatenate the selected sentences to form the final summary.
  9. Testing: Test your summarization tool on a sample of your data and evaluate its performance using metrics like ROUGE or BLEU scores.
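
One practical note for step 1: besides pip-installing the libraries, NLTK needs its tokenizer, stop word, and WordNet data downloaded once. A minimal setup sketch:

python

import nltk

# one-time downloads used by the steps below
nltk.download("punkt")      # Punkt sentence tokenizer models
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # WordNet data for the lemmatizer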

Note: This is just a high-level overview, and each step requires careful implementation and tuning for optimal results.

Now that we have a high-level understanding of what we want to create, we expand on each point further:

  1. Install Required Libraries:
    • NLTK: Natural Language Toolkit library for text preprocessing and tokenization.
    • NumPy: Library for numerical operations and data manipulation.
    • Pandas: Library for data analysis and manipulation.
  2. Load Data:
    • Use the pandas.read_csv function to load the data into a Pandas dataframe.
    • Ensure the data is loaded correctly by checking the shape and head of the dataframe.
    • Store the text data in a variable for preprocessing and summarization.
  3. Preprocessing:
    • Convert all text to lowercase using the lower method.
    • Remove stop words using the nltk.corpus stopwords list.
    • Perform stemming or lemmatization to reduce words to their base form using the PorterStemmer or WordNetLemmatizer class from the nltk.stem module.
    • Remove punctuation and special characters using the string module.
  4. Tokenization:
    • Use the nltk.sent_tokenize function to tokenize the text into sentences.
    • Use the nltk.word_tokenize function to tokenize the text into words.
    • Store the tokenized sentences or words in a variable for vectorization.
  5. Vectorization:
    • Use the CountVectorizer or TfidfVectorizer class from the sklearn library to convert the tokenized sentences or words into numerical vectors.
    • Fit the vectorizer to the text data and store the transformed vectors in a variable.
  6. Text Rank Algorithm:
    • Implement the Text Rank Algorithm to calculate a score for each sentence based on its importance and relevance.
    • The algorithm uses the similarity between sentences to determine the score, considering factors such as frequency of words and co-occurrence of words.
    • Store the scores for each sentence in a variable for selection.
  7. Select Sentences:
    • Use the numpy.argsort function to sort the scores in descending order and select the top N sentences.
    • N is a user-defined parameter representing the number of sentences desired in the summary.
    • Store the selected sentences in a variable for concatenation.
  8. Generate Summary:
    • Concatenate the selected sentences to form the final summary.
    • Use the join method to join the sentences into a single string.
    • Store the final summary in a variable for testing or output.
  9. Testing:
    • Test the summarization tool on a sample of the data to evaluate its performance.
    • Use metrics like ROUGE or BLEU scores to measure the overlap between the generated summary and a human-written reference summary (see the evaluation sketch after this list).
    • Tune the parameters and implementation to improve the performance of the summarization tool.
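
For step 9, here is a hedged evaluation sketch. It assumes the third-party rouge-score package (pip install rouge-score) and a hand-written reference summary; both are illustrative choices, not part of the pipeline above:

python

from rouge_score import rouge_scorer

# illustrative strings; in practice, compare each generated summary against a
# human-written reference summary for the same document
reference = "the cat sat on the mat and watched the birds outside"
generated = "the cat sat on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)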

Next, we will develop detailed Python code to implement all of the steps that we just discussed. Since we have written them out in detail, the code shouldn’t be a problem (I hope).

Code Iteration #1:

python

import numpy as np
import pandas as pd
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
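# NLTK data files are needed once before running this: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")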
# Load data into a pandas dataframe
df = pd.read_csv("sample_data.csv")
# Preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuations
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Stemming or lemmatization
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    words = [stemmer.stem(word) for word in words] # or use lemmatizer.lemmatize(word)
    # Join words back into a sentence
    text = " ".join(words)
    return text
df["text"] = df["text"].apply(preprocess_text)
# Tokenization
def tokenize_sentences(text):
    return sent_tokenize(text)
df["sentences"] = df["text"].apply(tokenize_sentences)
# Vectorization
vectorizer = CountVectorizer() # or use TfidfVectorizer
vectorized_text = vectorizer.fit_transform(df["text"])
# Text Rank Algorithm
def text_rank(vectorized_text, sentences, top_n=3):
    # Calculate similarity between sentences
    similarity_matrix = vectorized_text * vectorized_text.T
    # Calculate sentence scores
    sentence_scores = np.array(similarity_matrix.mean(axis=1)).flatten()
    sentence_scores = sentence_scores / np.sum(sentence_scores)
    # Sort sentences by scores
    top_sentences = np.argsort(sentence_scores)[::-1][:top_n]
    return [sentences[i] for i in top_sentences]
df["summary"] = df.apply(lambda x: text_rank(vectorized_text, x["sentences"], top_n=3), axis=1)

Note: This is just a first draft of my code and might not work for your specific dataset or requirement. We will be carefully tuning the code and parameters for optimal performance later in the article.

The plan to optimize the code is as follows:

  1. Preprocessing: a. Converting to lowercase, b. Removing punctuations, c. Tokenizing words, d. Removing stop words, e. Stemming/Lemmatization, f. Joining words back into a sentence.
  2. Tokenization: a. Tokenizing the text into sentences.
  3. Vectorization: a. Using either CountVectorizer or TfidfVectorizer.
  4. Text Rank Algorithm: a. Calculating similarity between sentences, b. Calculating sentence scores, c. Sorting sentences by scores, d. Selecting top N sentences based on scores.
  5. Tune the code: a. Consider changing the stemmer to a lemmatizer, b. Experiment with different vectorization methods to determine the best one for the task, c. Experiment with different values for top_n in text_rank to determine the optimal number of sentences in the summary, d. Consider adding additional preprocessing steps such as removing numbers, e. Consider adding additional stop words to the stop words list, f. Handle missing values in the input data (a short sketch of items d, e, and f follows this list).
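
A short sketch of plan items 5d, 5e, and 5f (the extra stop words are purely illustrative):

python

import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv("sample_data.csv")
df = df.dropna(subset=["text"])                               # 5f: drop rows with missing text
df["text"] = df["text"].str.replace(r"\d+", "", regex=True)   # 5d: remove numbers
extra_stop_words = {"also", "however", "thus"}                # 5e: hypothetical additions
stop_words = set(stopwords.words("english")) | extra_stop_words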

Here’s an optimized version of the code (iteration #2):

python

import numpy as np
import pandas as pd
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Load data into a pandas dataframe
df = pd.read_csv("sample_data.csv")
# Preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuations
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Stemming or lemmatization
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words] # Use stemmer for faster processing
    # Join words back into a sentence
    text = " ".join(words)
    return text
df["text"] = df["text"].apply(preprocess_text)
# Tokenization
def tokenize_sentences(text):
    return sent_tokenize(text)
df["sentences"] = df["text"].apply(tokenize_sentences)
# Vectorization
vectorizer = TfidfVectorizer() # Use TfidfVectorizer for better results
vectorized_text = vectorizer.fit_transform(df["text"])
# Text Rank Algorithm
def text_rank(vectorized_text, sentences, top_n=3):
    # Calculate similarity between sentences
    similarity_matrix = (vectorized_text * vectorized_text.T).toarray()
    # Calculate sentence scores
    sentence_scores = np.array(similarity_matrix.mean(axis=1)).flatten()
    sentence_scores = sentence_scores / np.sum(sentence_scores)
    # Sort sentences by scores
    top_sentences = np.argsort(sentence_scores)[::-1][:top_n]
    return [sentences[i] for i in top_sentences]
df["summary"] = df.apply(lambda x: text_rank(vectorized_text, x["sentences"], top_n=3), axis=1)

Changes made from iteration #1:

  • Changed the vectorizer from CountVectorizer to TfidfVectorizer for better results.
  • Committed to PorterStemmer for stemming (the WordNetLemmatizer option from iteration #1 is dropped) for faster processing.
  • Used toarray to convert the similarity matrix to a dense array.
  • Sentence scores are still normalized by dividing by the sum of scores.

The code can be optimized further as follows (iteration #3):

python

import numpy as np
import pandas as pd
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
# Load data into a pandas dataframe
df = pd.read_csv("sample_data.csv")
# Preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuations
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    # Join words back into a sentence
    text = " ".join(words)
    return text
df["text"] = df["text"].apply(preprocess_text)
# Tokenization
def tokenize_sentences(text):
    return sent_tokenize(text)
df["sentences"] = df["text"].apply(tokenize_sentences)
# Vectorization
vectorizer = TfidfVectorizer()
vectorized_text = vectorizer.fit_transform(df["text"])
# Text Rank Algorithm
def text_rank(vectorized_text, sentences, top_n=3):
    # Calculate similarity between sentences
    similarity_matrix = (vectorized_text * vectorized_text.T).toarray()
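    # NOTE: vectorized_text has one TF-IDF row per document rather than per
    # sentence, so this matrix compares documents; a sentence-level TextRank
    # would fit the vectorizer on each document's own sentences instead.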
    # Calculate sentence scores
    sentence_scores = np.array(similarity_matrix.mean(axis=1)).flatten()
    sentence_scores = sentence_scores / np.sum(sentence_scores)
    # Sort sentences by scores
    top_sentences = np.argsort(sentence_scores)[::-1][:top_n]
    return [sentences[i] for i in top_sentences]
df["summary"] = df.apply(lambda x: text_rank(vectorized_text, x["sentences"], top_n=3), axis=1)

Changes made:

  • Removed the now-unused WordNetLemmatizer import; stemming still uses PorterStemmer for faster processing
  • Removed the now-unused CountVectorizer import; vectorization still uses TfidfVectorizer for better results

Any decent programmer writes comments, not only so that they themselves can understand the code later on, but also so that others can, which keeps the codebase consistent and clear. We will now walk through the code and comment on each part to make sure the direction and vision are clear.

python

import numpy as np

This line imports the numpy library and renames it as np for easier access. Numpy is a library used for scientific computing in Python, including support for a powerful N-dimensional array object.

python

import pandas as pd

This line imports the pandas library and renames it as pd for easier access. Pandas is a library used for data manipulation and analysis, providing data structures for efficiently storing large datasets and tools for working with them.

python

import string

This line imports the string module which provides a collection of string constants and classes, including a constant string.punctuation which contains all ASCII punctuation characters.
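
For reference, the constant contains the standard ASCII punctuation characters:

python

import string

print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~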

python

from nltk.tokenize import sent_tokenize, word_tokenize

This line imports the sent_tokenize and word_tokenize functions from the nltk.tokenize module of the nltk (Natural Language Toolkit) library. These functions are used for tokenizing text into sentences or words, respectively.
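
A quick illustration of the difference between the two (this assumes the punkt data has been downloaded):

python

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. Summarization saves time."
print(sent_tokenize(text))  # ['NLP is fun.', 'Summarization saves time.']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'Summarization', 'saves', 'time', '.']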

python

from nltk.corpus import stopwords

This line imports the stopwords corpus from the nltk.corpus module of the nltk library. The stopwords corpus contains a list of stop words, which are commonly occurring words that are usually removed from text data before further processing.
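
For example (the exact contents of the list vary slightly between NLTK versions):

python

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print("the" in stop_words, "summarizer" in stop_words)  # True False
print(len(stop_words))  # number of stop words in this NLTK version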

python

from nltk.stem import PorterStemmer

This line imports the PorterStemmer class from the nltk.stem module of the nltk library. The PorterStemmer is used to perform stemming, which is the process of reducing words to their root or base form.
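
Stemming is crude but fast; for example:

python

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "runs", "studies"]])  # ['run', 'run', 'studi']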

python

from sklearn.feature_extraction.text import TfidfVectorizer

This line imports the TfidfVectorizer class from the sklearn.feature_extraction.text module of the scikit-learn library. The TfidfVectorizer is used for transforming text data into a numerical representation (vector) using the term frequency-inverse document frequency (TF-IDF) method.
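
A toy example of the transformation (the two documents are made up):

python

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)    # sparse matrix with one row per document
print(vectorizer.get_feature_names_out())  # vocabulary learned from the documents
print(matrix.shape)                        # (2, number of unique terms)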

python

# Load data into a pandas dataframe
df = pd.read_csv("sample_data.csv")

This line uses the read_csv function from the pandas library to load a CSV file named sample_data.csv into a pandas dataframe called df. The dataframe will be used for processing and storing the text data.
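
A quick sanity check after loading (this assumes sample_data.csv has a column named text, which the rest of the code relies on):

python

import pandas as pd

df = pd.read_csv("sample_data.csv")
print(df.shape)                 # (number of rows, number of columns)
print(df.head())                # first five rows
print(df["text"].isna().sum())  # count of missing text values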

python

# Preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuations
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    # Join words back into a sentence
    text = " ".join(words)
    return text
df["text"] = df["text"].apply(preprocess_text)
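
This function lowercases the text, strips punctuation, tokenizes it into words, removes stop words, stems each word, and joins the cleaned words back into a single string. Applying it to the text column of df replaces each document with its preprocessed version.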

Calculate similarity between sentences

python

similarity_matrix = (vectorized_text * vectorized_text.T).toarray()

The code calculates the similarity between sentences by multiplying the vectorized text matrix with its transpose, then converting it to an array, storing the result in similarity_matrix.

Calculate sentence scores

python

sentence_scores = np.array(similarity_matrix.mean(axis=1)).flatten()
sentence_scores = sentence_scores / np.sum(sentence_scores)

The code calculates sentence scores by first taking the mean of similarity_matrix along axis 1, which calculates the mean of each row, and then flattening the resulting array into a 1D array. The scores are then normalized by dividing each element by the sum of all elements.
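
A tiny worked example with made-up numbers:

python

import numpy as np

# toy similarity matrix for three sentences
similarity_matrix = np.array([[1.0, 0.2, 0.4],
                              [0.2, 1.0, 0.1],
                              [0.4, 0.1, 1.0]])
sentence_scores = similarity_matrix.mean(axis=1)             # [0.533, 0.433, 0.5]
sentence_scores = sentence_scores / np.sum(sentence_scores)  # ~[0.364, 0.295, 0.341]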

Sort sentences by scores

python

top_sentences = np.argsort(sentence_scores)[::-1][:top_n]

The code sorts the sentences by their scores in descending order and selects the top n sentences with the highest scores, where n is passed as an argument to the text_rank function as top_n. The indices of the top sentences are stored in top_sentences.
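
Continuing the toy numbers from above:

python

import numpy as np

sentence_scores = np.array([0.364, 0.295, 0.341])  # normalized toy scores
top_n = 2
top_sentences = np.argsort(sentence_scores)[::-1][:top_n]
print(top_sentences)  # [0 2] - sentences 0 and 2 score highest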

Return the top sentences

python

return [sentences[i] for i in top_sentences]

The code returns a list of the top sentences by indexing into sentences using the indices stored in top_sentences.

Summarization of our Summarization tool

The code is for a text summarization algorithm that uses the TextRank algorithm to summarize a given text in a csv file. It has the following steps:

Importing necessary libraries:

  • numpy as np
  • pandas as pd
  • string
  • nltk.tokenize (sent_tokenize and word_tokenize)
  • nltk.corpus (stopwords)
  • nltk.stem (PorterStemmer)
  • sklearn.feature_extraction.text (TfidfVectorizer)

Loading the data into a pandas dataframe:

  • df = pd.read_csv("sample_data.csv")

Preprocessing the text:

  • Convert to lowercase
  • Remove punctuations
  • Tokenize words
  • Remove stop words
  • Stemming
  • Join words back into a sentence

Tokenizing sentences:

  • df["sentences"] = df["text"].apply(tokenize_sentences)

Vectorizing the text:

  • vectorizer = TfidfVectorizer()
  • vectorized_text = vectorizer.fit_transform(df["text"])

Applying the TextRank algorithm:

  • Calculate similarity between sentences
  • Calculate sentence scores
  • Sort sentences by scores
  • Return top N sentences
  • df["summary"] = df.apply(lambda x: text_rank(vectorized_text, x["sentences"], top_n=3), axis=1)

Part of the TextRank step (step 5) is the calculation of sentence scores using the similarity matrix and the mean of the matrix along axis 1.

In this step, the similarity matrix is created by multiplying the vectorized text with its transpose and converting the result to an array. The similarity matrix contains the similarity scores between all pairs of sentences in the text.

Next, the mean of each row of the similarity matrix is calculated and stored in a 1D array called sentence_scores. This represents the average similarity score of each sentence with all the other sentences in the text.

To normalize the scores, the sentence_scores are divided by the sum of all scores, so that the scores sum up to 1. This is done to ensure that the scores are proportional to their importance in the text.

In the end, the top_sentences are selected based on the sorted sentence_scores in descending order and the top_n sentences with the highest scores are returned as the summary of the text.

Step 5 is the implementation of the TextRank algorithm, which is used to summarize the processed text data. This algorithm works by first calculating the similarity between sentences, then determining the importance of each sentence based on the similarities.

The algorithm starts by calculating the similarity between each pair of sentences by taking the dot product of their Tf-idf vectors, which are stored in the vectorized_text variable. The resulting similarity matrix shows the similarities between each sentence in the text.

Next, the sentence scores are calculated by taking the mean of each row in the similarity matrix, which represents the similarity between a sentence and all other sentences. The sentence scores are then normalized by dividing each score by the sum of all scores, so that the scores add up to 1.

Finally, the top n sentences are selected based on their sentence scores, where n is specified by the top_n argument in the text_rank function. The sentences are sorted by their scores in descending order, and the top n sentences are returned as the summary. These top sentences are stored in the “summary” column of the dataframe.

Step 2 in the code is the text preprocessing step. This step is crucial for improving the performance and accuracy of the text summarization model. The text preprocessing step is achieved by defining a function preprocess_text and applying it to the text column of the pandas dataframe df.

The preprocess_text function takes in a string of text as input and performs several operations to clean and preprocess the text:

  1. The text is converted to lowercase using the .lower() method.
  2. Punctuation marks are removed using the str.translate method, with a translation table built from the string.punctuation constant via str.maketrans.
  3. The text is tokenized into words using the word_tokenize function from the nltk.tokenize module.
  4. Stop words are removed using the stopwords corpus from the nltk library.
  5. The words are stemmed using the PorterStemmer class from the nltk.stem module.
  6. The words are joined back into a sentence using the join() method.

Finally, the function returns the preprocessed text as output. This preprocessed text is stored in the text column of the df dataframe.

Step 3 of the code performs text tokenization on the preprocessed text data. The function tokenize_sentences(text) is defined to break down the input text into individual sentences using the sent_tokenize method from the nltk.tokenize library.

The sent_tokenize method uses an instance of Punkt tokenizer trained on a large corpus of text data to split the input text into sentences. This is important for later analysis, where the similarity between sentences will be used to calculate sentence scores.

The function is then applied to the “text” column of the pandas dataframe “df” using the apply method, with the result stored in a new column called “sentences”.

In more detail, step 3 performs tokenization on the text data. The tokenize_sentences function is defined, which takes in a text argument and returns a list of sentences. The text is passed through the sent_tokenize function from the nltk.tokenize module, which splits the text into sentences. The resulting list of sentences is then assigned to a new column "sentences" in the pandas dataframe df, by applying the tokenize_sentences function to each value in the "text" column of the dataframe. The resulting dataframe now has two columns of interest: "text" and "sentences". The "text" column consists of preprocessed text, and the "sentences" column consists of lists of sentences, where each list represents a single text item.

Step 1 of the code involves loading the data into a Pandas dataframe using the read_csv method. The read_csv method reads a CSV (Comma Separated Value) file and converts it into a Pandas dataframe. A Pandas dataframe is a two-dimensional data structure that provides an easy way to manipulate, clean and analyze data. In this step, the data is loaded from a file called “sample_data.csv”. The loaded data is stored in the df variable, which is now a Pandas dataframe.

The Text Rank Algorithm is used to extract the most important sentences from a long piece of text, summarizing it into a smaller, more concise format. It uses techniques from information retrieval, graph-based ranking, and natural language processing to achieve this. In this blog post, we will go over the steps involved in implementing the Text Rank Algorithm in python using the popular libraries: Numpy, Pandas, NLTK, and scikit-learn.

Step 1: Load Data into a Pandas Dataframe

The first step is to load the data you want to summarize into a pandas dataframe. This can be done by using the read_csv() function from pandas and passing the name of your csv file as an argument. In our case, we'll be loading the "sample_data.csv" file.

python

import pandas as pd
df = pd.read_csv("sample_data.csv")

Step 2: Preprocessing

Before we can start summarizing the text, we need to preprocess it. This involves several steps such as converting the text to lowercase, removing punctuation, tokenizing the words, removing stop words, and stemming the words.

python

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuations
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    # Join words back into a sentence
    text = " ".join(words)
    return text
df["text"] = df["text"].apply(preprocess_text)

Step 3: Tokenization

In this step, we are going to split the preprocessed text into sentences. This is done using the sent_tokenize() function from the nltk library.

python

from nltk.tokenize import sent_tokenize
def tokenize_sentences(text):
    return sent_tokenize(text)
df["sentences"] = df["text"].apply(tokenize_sentences)

Step 4: Vectorization

In this step, we are going to convert the preprocessed text into a numerical representation. This is done using the TfidfVectorizer from scikit-learn. Tf-idf stands for Term Frequency-Inverse Document Frequency, and it is a measure of the importance of a word in a document. The TfidfVectorizer calculates the Tf-idf for each word in the document and returns a matrix.

python

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorized_text = vectorizer.fit_transform(df["text"])
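
Step 5: Text Rank Algorithm

With the vectorized text in hand, the last step applies the text_rank function from iteration #3 to score the text and select the top N sentences for each row (the function is repeated here so the walk-through stands on its own):

python

import numpy as np
def text_rank(vectorized_text, sentences, top_n=3):
    # Calculate similarity between sentences
    similarity_matrix = (vectorized_text * vectorized_text.T).toarray()
    # Calculate sentence scores
    sentence_scores = np.array(similarity_matrix.mean(axis=1)).flatten()
    sentence_scores = sentence_scores / np.sum(sentence_scores)
    # Sort sentences by scores
    top_sentences = np.argsort(sentence_scores)[::-1][:top_n]
    return [sentences[i] for i in top_sentences]
df["summary"] = df.apply(lambda x: text_rank(vectorized_text, x["sentences"], top_n=3), axis=1)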

The code is a simple implementation of a text summarization algorithm that uses the TextRank algorithm to summarize large pieces of text into a smaller, more readable format. The algorithm makes use of several NLP techniques such as tokenization, stemming, and vectorization to analyze and represent text data.

The first step is to load a sample data set into a pandas dataframe. This is done using the pd.read_csv method, which reads in a csv file and converts it into a pandas dataframe object.

The next step is the preprocessing of text data. The preprocessing step is crucial to the performance of the text summarization algorithm, as it prepares the text data for analysis and representation. The preprocessing step includes lowercasing, removing punctuation marks, tokenizing words, removing stop words, stemming, and joining words back into a sentence.

Once the text data has been preprocessed, the next step is tokenization. Tokenization involves splitting the text into smaller pieces, called tokens. In this code, tokenization is performed on the sentences, resulting in a list of sentences for each text document.

The next step is vectorization, which involves converting the tokenized text data into numerical representations. This is done using the TfidfVectorizer class from the sklearn library, which calculates the Term Frequency-Inverse Document Frequency (TF-IDF) for each word in the text. The resulting vectorized text data can then be used to calculate similarity between different text documents.

Finally, the TextRank algorithm is applied to the vectorized text data to produce a summary of the text. The algorithm uses the similarity matrix calculated during the vectorization step to determine the relevance of each sentence in the text. Sentences are then ranked based on their relevance, and the top N sentences are selected to form the summary. In this code, N is set to 3, meaning that the top 3 sentences in the text will be selected as the summary.
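
As a possible next improvement (not part of the code above): the draft ranks document-level TF-IDF vectors, so a sentence-level variant would fit a TfidfVectorizer on each document's own sentences before ranking them. A minimal, self-contained sketch of that idea, using an inline example string instead of sample_data.csv:

python

import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(text, top_n=3):
    sentences = sent_tokenize(text)                  # keep the raw sentences for output
    if len(sentences) <= top_n:
        return sentences
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    similarity = (tfidf * tfidf.T).toarray()         # sentence-to-sentence similarity
    scores = similarity.mean(axis=1)                 # average similarity per sentence
    top = np.argsort(scores)[::-1][:top_n]           # indices of the highest-scoring sentences
    return [sentences[i] for i in sorted(top)]       # keep the original sentence order

example = ("Text summarization shortens long documents. "
           "Extractive methods pick the most representative sentences. "
           "TextRank scores each sentence by its similarity to the rest of the text. "
           "The highest-scoring sentences then form the summary.")
print(summarize(example, top_n=2))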

In conclusion, the code provides a simple, TextRank-based implementation of text summarization. It makes use of NLP techniques such as tokenization, stemming, and vectorization to analyze and represent text data, and produces a condensed summary of the text. The code provides a good starting point for further exploration and improvement of text summarization algorithms.