How to Effectively Train a Text Summarization AI Model Using Python?


By Faraz

Master the art of training Text Summarization AI models with Python! Expert insights, step-by-step guidance, and practical tips await. Dive in now!



Table of Contents

  1. Types of Text Summarization
  2. Implementing Text Summarization in Python (using the Seq2Seq model)
  3. Find A Valid Training Dataset
  4. Preprocess Data To Clean It
  5. Make Tokenizers (To Convert Words To Integers)
  6. Finalize the Model
  7. Final Words

Text summarization means shortening a text while keeping its intended meaning the same. With so much content competing for attention today, there's a rising need for clear, concise writing that delivers all the key points quickly, so that every party gains value in less time.

This is where AI summarizers come into play. These automated tools are built on ML models trained on vast datasets. They use NLP (Natural Language Processing) to read a text and predict suitable words for its shortened version.

However, you can also train a model yourself with Python to build your own text summarizer. This article walks through the steps.


1. Types of Text Summarization

There are two types of text summarization.

  • Extractive

This approach extracts the most important sentences from a source text and displays them as they are. The resulting text is much shorter than the original one, without altering the original meaning.


An extractive summarizer doesn’t alter the sentences in any way. It just copies them from the source text to make a summary that preserves the context.

  • Abstractive

An abstractive summarizer doesn’t copy sentences from the source text. Instead, it learns what the input text says and writes new sentences to form the summary.


For today’s post, we will show the procedure of training an AI model to create abstractive summaries. The reason to pick abstractive over extractive summaries is that they are closer to how a human would summarize: by rephrasing ideas rather than copying sentences.

So, you can use the summarized text to digest a lengthy, intricate text at your convenience. This can be helpful for taking inspiration from published content or for learning manual summarization techniques.

2. Implementing Text Summarization in Python (using the Seq2Seq model)

To effectively train a text summarization AI model in Python, we’ll use the Seq2Seq (sequence-to-sequence) model. The Seq2Seq model consists of two main components: an encoder, which reads the input text, and a decoder, which writes the summary. We recommend reading a complete LSTM tutorial to understand these concepts in detail.

Moving on, to train our AI model, let’s do the following steps.

Import the Right Libraries

There are many libraries out there for Python that can achieve our task. However, we’ll limit ourselves to the following ones.

  • BeautifulSoup
  • Nltk
  • Numpy
  • Pandas
  • Re
  • Keras

Let us briefly explain what each does, so that you have an idea of what we’re doing here.

BeautifulSoup is an HTML parser that makes it easy to extract content from web pages. It is helpful to remove HTML tags and extract pure text.

NLTK stands for Natural Language Toolkit. It is a Python library for Natural Language Processing that offers tokenizers, corpora, and word lists. Here, we use it for its list of English stop words.

NumPy is used to carry out mathematical operations on large, multidimensional arrays and matrices. This makes it suitable for repetitive numerical tasks, like the one we have.

Pandas is another popular Python library. Its DataFrame is a two-dimensional table that can hold columns of different types, which makes loading and filtering a dataset easy.

Re is Python’s built-in regular expression (regex) module. In short, it finds and replaces text that matches a given pattern, which is handy for stripping unwanted characters from an ocean of words and phrases.

Keras is an open-source, high-level deep-learning API created by François Chollet, a Google engineer. It enables creators to build AI models with little code. In 2017, Keras was integrated into TensorFlow as tf.keras.

Finally, import the libraries in a suitable Python environment by applying the following code.

import re
import warnings

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
# Tokenizer and pad_sequences are legacy Keras utilities; newer Keras releases may not include them
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")

3. Find a Valid Training Dataset

With the libraries in place, let’s move on to finding the right dataset. The one we’ll use for our demonstration is the Amazon Fine Food Reviews dataset. The data includes around 560,000 reviews posted on the Amazon website from October 1999 to October 2012.

The reason to use this training dataset is that product reviews generally cover all types of vocabulary and writing styles. It gives the machine a good representation of how humans write in real life.

Thus, the data is well suited to teaching the AI model text summarization. Each review also comes with a short, human-written summary, which gives the model a target to learn from. With enough training, the abstractive summaries can resemble those written by humans.

However, if you want to create extractive summaries instead, then use the TextRank algorithm. It builds a two-dimensional similarity matrix between sentences, scores each sentence by how strongly it is connected to the others, and keeps the highest-ranked ones.
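For readers curious about the extractive route, the idea behind TextRank can be sketched in a few lines. This is a simplified illustration, not the reference algorithm: the function name `textrank_summary`, the naive sentence splitter, and the word-overlap (Jaccard) similarity are all assumptions made for the example.

```python
import re
from itertools import combinations

def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration and return the top ones."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    if n == 0:
        return []
    words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]

    # Two-dimensional similarity matrix: shared words / all distinct words.
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        union = words[i] | words[j]
        if union:
            sim[i][j] = sim[j][i] = len(words[i] & words[j]) / len(union)

    # Power iteration: a sentence scores highly when similar sentences do.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_weight = sum(sim[j])
                if sim[j][i] > 0 and out_weight > 0:
                    rank += sim[j][i] / out_weight * scores[j]
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores

    # Keep the best sentences, restored to their original order.
    best = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n])
    return [sentences[i] for i in best]

review = ("The coffee tastes great. The coffee arrived quickly and tastes fresh. "
          "My dog barked all night. Great coffee and great price.")
summary = textrank_summary(review, top_n=2)
print(summary)
```

The off-topic sentence about the dog shares no words with the others, so it receives the lowest score and is left out of the summary.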

Nevertheless, we’ll stick with the review dataset to train abstractive summaries. For ease of computation, we’ll limit the sample size to 100,000 by following the code below.

data=pd.read_csv("Reviews.csv", nrows=100000)
data.drop_duplicates(subset=['Text'],inplace=True)
data.dropna(axis=0,inplace=True)

The last two lines eliminate redundancy from the data by removing duplicate reviews and rows with NULL or NaN values. After that, the dataset is ready for the cleaning steps that follow.

4. Preprocess Data To Clean It

Removing NULL values and duplicate reviews was the initial step toward cleaning the data. However, it requires much more than that. Below, we have listed some more steps you need to follow to make your training dataset pristine.

  • Mapping contracted words
  • Converting everything to lowercase
  • Removing anything in parentheses
  • Removing stop words and short words
  • Eliminating special characters and punctuation

To do these steps, we need to follow some coding techniques. First, we’ll declare a dictionary that holds some contraction definitions, like:

contraction_mapping = {"can't": "cannot"}  # ... and so on
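As a quick illustration of how such a mapping is applied, here is a minimal sketch. Only the "can't" entry comes from the article; the other entries and the helper name `expand_contractions` are made up for the example.

```python
# Illustrative subset only; a real mapping would hold many more entries.
contraction_mapping = {
    "can't": "cannot",   # from the article
    "don't": "do not",   # hypothetical extra entry
    "it's": "it is",     # hypothetical extra entry
}

def expand_contractions(text):
    """Replace each space-separated token that appears in the mapping."""
    return " ".join(contraction_mapping.get(tok, tok) for tok in text.split(" "))

print(expand_contractions("i can't say it's bad"))
# prints: i cannot say it is bad
```

Note that the lookup is an exact match, so the text should be lowercased first and the keys must use the same apostrophe character as the reviews.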

We’ll also remove short words and stop words from our text. This is because stop words like ‘and’, ‘there’, ‘is’, and ‘the’ don’t add any value to the content of the summary. A similar argument applies to very short words, such as stray abbreviations, that aren’t meaningful to the overall understanding of an input text.

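# Requires a one-time nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def text_cleaner(text):
    newString = text.lower()
    newString = BeautifulSoup(newString, "lxml").text    # strip HTML tags
    newString = re.sub(r'\([^)]*\)', '', newString)      # drop anything in parentheses
    newString = re.sub('"', '', newString)               # drop double quotes
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)           # strip possessive 's
    newString = re.sub("[^a-zA-Z]", " ", newString)      # keep letters only
    tokens = [w for w in newString.split() if w not in stop_words]
    long_words = []
    for i in tokens:
        if len(i) >= 3:                                  # removing short words
            long_words.append(i)
    return (" ".join(long_words)).strip()

cleaned_text = []
for t in data['Text']:
    cleaned_text.append(text_cleaner(t))

# Assumption: the target summaries come from the dataset's 'Summary' column
# and are cleaned the same way; the split below expects both columns.
cleaned_summary = [text_cleaner(s) for s in data['Summary']]
data['cleaned_text'] = cleaned_text
data['cleaned_summary'] = cleaned_summary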

In the above code, the variable stop_words stores the English stop words (as mentioned above) from the nltk library. The word list has to be downloaded once with nltk.download('stopwords').

Then, the text_cleaner function performs all the steps we mentioned above: it converts the text to lowercase, strips HTML tags, expands contractions, and removes parentheses, special characters, punctuation, stop words, and short words.
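To see what the cleaning steps do to a single review, here is a dependency-free sketch of the same pipeline. It is a simplified stand-in, not the article's function: the tiny `STOP_WORDS` and `CONTRACTIONS` sets and the regex-based tag removal are assumptions that replace nltk and BeautifulSoup so the example runs anywhere.

```python
import re

# Tiny stand-ins so the sketch runs without nltk or BeautifulSoup (assumptions).
STOP_WORDS = {"the", "is", "and", "a", "it", "this", "i", "my"}
CONTRACTIONS = {"can't": "cannot", "it's": "it is"}

def clean_review(text):
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # crude HTML tag removal
    text = re.sub(r"\([^)]*\)", "", text)      # drop anything in parentheses
    text = " ".join(CONTRACTIONS.get(t, t) for t in text.split())
    text = re.sub(r"'s\b", "", text)           # strip possessive 's
    text = re.sub(r"[^a-z]", " ", text)        # keep letters only
    tokens = [w for w in text.split() if w not in STOP_WORDS and len(w) >= 3]
    return " ".join(tokens)

sample = "This is the BEST coffee (ever)!<br />I can't stop drinking my dog's share."
print(clean_review(sample))
# prints: best coffee cannot stop drinking dog share
```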

5. Make Tokenizers (To Convert Words To Integers)

Now, it’s time to tokenize the text so that each word is mapped to an integer the machine can work with.

Before that, we need to split our data into two parts: a training set and a holdout set. The holdout set lets us check how well the end product works on reviews it has not seen.

max_len_text=80
max_len_summary=10
from sklearn.model_selection import train_test_split
x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

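Notice that we set the maximum length of text and summary as 80 and 10, respectively. This was done because the reviews averaged around 80 words in length. Longer sequences get truncated to these limits, and shorter ones get padded with zeros.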

You can change this value according to the training dataset you have chosen for your AI model. In our case, we figured that these constraints would work the best.
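One way to pick these limits for your own dataset is to check what share of the texts fits within a candidate length. The sketch below assumes a hypothetical helper called `coverage` and uses a toy list in place of the cleaned reviews.

```python
def coverage(texts, max_len):
    """Fraction of texts whose word count is at most max_len."""
    lengths = [len(t.split()) for t in texts]
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Toy stand-in for the cleaned reviews (hypothetical data).
reviews = ["good coffee", "great taste would buy again", "bad", "love " * 12]

for limit in (2, 5, 10, 15):
    print(limit, coverage(reviews, limit))
```

A limit that covers most of the texts keeps truncation rare without padding every sequence to the length of the longest outlier.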

Now, let’s define a text tokenizer for our training dataset.

#prepare a tokenizer for reviews on training data
x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(list(x_tr))
 
#convert text sequences into integer sequences
x_tr	=   x_tokenizer.texts_to_sequences(x_tr)
x_val   =   x_tokenizer.texts_to_sequences(x_val)
 
#padding zero up to maximum length
x_tr	=   pad_sequences(x_tr,  maxlen=max_len_text, padding='post')
x_val   =   pad_sequences(x_val, maxlen=max_len_text, padding='post')
 
x_voc_size   =  len(x_tokenizer.word_index) +1
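If the tokenizer feels like a black box, the following pure-Python sketch mimics what the three steps above do: build a word index, convert texts to integers, and pad. The helper names `fit_word_index` and `to_padded` are made up for this illustration, and the behaviour is modelled on the Keras defaults as I understand them (unknown words dropped, over-long sequences cut from the front).

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_padded(texts, word_index, maxlen):
    """Convert texts to integer rows, then truncate or zero-pad to maxlen."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.split() if w in word_index]  # drop unknown words
        seq = seq[-maxlen:]                                          # keep the last maxlen ids
        rows.append(seq + [0] * (maxlen - len(seq)))                 # pad at the end ('post')
    return rows

train = ["great coffee great price", "coffee arrived late"]
word_index = fit_word_index(train)
print(word_index)
print(to_padded(["great coffee", "coffee arrived very late"], word_index, maxlen=4))
voc_size = len(word_index) + 1   # +1 because index 0 is reserved for padding
```

This also shows why the vocabulary size is the length of the word index plus one: the embedding layer needs a row for the padding index 0 as well.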

Next, prepare a summary tokenizer.

#preparing a tokenizer for a summary of training data
y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(list(y_tr))
 
#convert summary sequences into integer sequences
y_tr	=   y_tokenizer.texts_to_sequences(y_tr)
y_val   =   y_tokenizer.texts_to_sequences(y_val)
 
#padding zero up to maximum length
y_tr	=   pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
y_val   =   pad_sequences(y_val, maxlen=max_len_summary, padding='post')
 
y_voc_size  =   len(y_tokenizer.word_index) +1

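Make sure the start and end words are defined correctly: every summary needs a start marker and an end marker added before the summary tokenizer is fitted, because the decoding code later looks up the words 'start' and 'end'. Then, move on to the next and final step.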
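The article's code does not show how the start and end words get into the summaries, so here is one possible sketch. The marker strings `_START_` and `_END_` and the helper `add_markers` are assumptions; with the Keras tokenizer's default settings (lowercasing and filtering out underscores), such markers end up in the vocabulary as 'start' and 'end'. The second half shows the usual one-step shift between what the decoder reads and what it must predict, using toy ids.

```python
import numpy as np

def add_markers(summary):
    """Wrap a cleaned summary in start and end markers (assumed marker strings)."""
    return "_START_ " + summary + " _END_"

print(add_markers("great coffee"))

# Toy padded summaries: 1 = 'start', 2 = 'end', 0 = padding (hypothetical ids).
y = np.array([[1, 5, 7, 2, 0],
              [1, 9, 2, 0, 0]])

decoder_input = y[:, :-1]    # what the decoder reads: everything but the last position
decoder_target = y[:, 1:]    # what it must predict: the same rows shifted one step left
print(decoder_input)
print(decoder_target)
```

With a compiled Keras model, arrays shaped like these would typically be passed to `model.fit` as `[x_tr, y_tr[:, :-1]]` for the inputs and `y_tr[:, 1:]` for the targets, with a loss such as `sparse_categorical_crossentropy`. Treat that as a sketch of the usual pattern rather than the exact training call.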

6. Finalize the Model

We are now at the step where every single element comes together to create a functional AI model. To follow along, it helps to know some LSTM concepts, such as the hidden state and the cell state.

LSTM stands for long short-term memory. It is a type of recurrent neural network (RNN) that the AI model will use to predict the right words for our summary sentences.
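To make the hidden state and cell state concrete, here is a NumPy sketch of a single LSTM time step. It is a textbook-style illustration with random weights, not the Keras implementation; the function name `lstm_step` and the gate ordering inside the stacked weight matrices are choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b hold the four gates stacked as [i, f, g, o]."""
    z = W @ x + U @ h_prev + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget and output gates
    g = np.tanh(g)                                 # candidate cell values
    c = f * c_prev + i * g                         # new cell state (long-term memory)
    h = o * np.tanh(c)                             # new hidden state (output of this step)
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, input_dim)):          # a sequence of 5 toy input vectors
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape, c.shape)
```

The pair (h, c) left after the last step is what an encoder hands over to the decoder as its initial state.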

from tensorflow.keras import backend as K
K.clear_session()
latent_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)

# LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Set up the decoder.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

# LSTM using encoder_states as initial state
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention layer
# NOTE: attn_layer is used here and at inference time but is never defined in this article.
# AttentionLayer is assumed to be a user-supplied class (it is not built into Keras) that takes
# [encoder_outputs, decoder_outputs] and returns (attention_output, attention_weights).
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concat attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Dense layer
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

In the above code, we’ve built an encoder from three stacked LSTM layers. Its final hidden and cell states initialize the decoder, and its outputs feed the attention step.
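The model code concatenates an attention output with the decoder output, but the attention layer itself is not shown. As a rough picture of what such a layer computes, here is a NumPy sketch of additive (Bahdanau-style) attention for one decoder step. Which attention variant the article intends is an assumption, and the weight names `W_enc`, `W_dec`, and `v` are made up for the example.

```python
import numpy as np

def additive_attention(encoder_outputs, decoder_state, W_enc, W_dec, v):
    """Return (context vector, attention weights) for one decoder step."""
    # One score per source position: v . tanh(W_enc h_t + W_dec s)
    scores = np.tanh(encoder_outputs @ W_enc.T + decoder_state @ W_dec.T) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over source positions
    context = weights @ encoder_outputs        # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(1)
src_len, dim, attn_dim = 6, 4, 3
enc = rng.normal(size=(src_len, dim))          # one vector per source word
dec = rng.normal(size=(dim,))                  # current decoder output
W_enc = rng.normal(size=(attn_dim, dim))
W_dec = rng.normal(size=(attn_dim, dim))
v = rng.normal(size=(attn_dim,))

context, weights = additive_attention(enc, dec, W_enc, W_dec, v)
decoder_concat = np.concatenate([dec, context])   # what the dense softmax layer would see
print(weights.round(3), decoder_concat.shape)
```

The weights show how much each source word contributes at this step, which is why attention helps the decoder focus on the relevant part of a long review.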

Now, let’s set up inference for the encoder and decoder of the text summarizer.

# encoder inference
encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
 
# decoder inference
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
 
# Get the embeddings of the decoder sequence
dec_emb2= dec_emb_layer(decoder_inputs)
 
# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
 
#attention inference
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
 
# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat)
 
# Final decoder model
decoder_model = Model(
[decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
[decoder_outputs2] + [state_h2, state_c2])

Afterward, let’s define a function that shows the implementation of the inference.

# Lookup tables taken from the fitted tokenizers
# (assumed: these names are used below but are not defined elsewhere in the article)
target_word_index = y_tokenizer.word_index
reverse_target_word_index = y_tokenizer.index_word
reverse_source_word_index = x_tokenizer.index_word

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1.
    target_seq = np.zeros((1, 1))

    # Choose the 'start' word as the first word of the target sequence.
    target_seq[0, 0] = target_word_index['start']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample the most probable token; index 0 is padding and has no word,
        # so it is treated as the end of the summary.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index.get(sampled_token_index, 'end')

        if sampled_token != 'end':
            decoded_sentence += ' ' + sampled_token

        # Exit condition: either hit max length or find the end word.
        if sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary - 1):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence

Finally, we’ll define a function that converts the integer sequences to a word sequence for our abstractive summary.

def seq2summary(input_seq):
    # Rebuild the summary text, skipping padding and the start/end words
    newString = ''
    for i in input_seq:
        if i != 0 and i != target_word_index['start'] and i != target_word_index['end']:
            newString = newString + reverse_target_word_index[i] + ' '
    return newString

def seq2text(input_seq):
    # Rebuild the review text, skipping padding
    newString = ''
    for i in input_seq:
        if i != 0:
            newString = newString + reverse_source_word_index[i] + ' '
    return newString

for i in range(len(x_val)):
    print("Review:", seq2text(x_val[i]))
    print("Original summary:", seq2summary(y_val[i]))
    print("Predicted summary:", decode_sequence(x_val[i].reshape(1, max_len_text)))
    print("\n")
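The helper functions rely on lookup dictionaries that map words to integers and back. To show how that mapping works in isolation, here is a small self-contained sketch with a toy vocabulary; the ids are hypothetical, and in the real pipeline the dictionaries would presumably come from the fitted tokenizers (for example `y_tokenizer.word_index` and `y_tokenizer.index_word`).

```python
# Toy stand-in for the summary tokenizer's word index (hypothetical vocabulary).
target_word_index = {"start": 1, "end": 2, "great": 3, "coffee": 4}

# The reverse table simply swaps keys and values.
reverse_target_word_index = {i: w for w, i in target_word_index.items()}

def seq2summary(input_seq):
    """Turn padded integer ids back into words, skipping padding and markers."""
    skip = {0, target_word_index["start"], target_word_index["end"]}
    return " ".join(reverse_target_word_index[i] for i in input_seq if i not in skip)

print(seq2summary([1, 3, 4, 2, 0, 0]))
# prints: great coffee
```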

Below are some of the outputs that this AI model spits out.

[Images: original summaries and predicted abstractive summaries of the product reviews. Source: Analyticsvidhya.com]

The results are quite similar to those produced by an AI summarizer available online.


We must note that the result produced above by the commercial-level summarizer is top-notch. It even included punctuation and expressions in its results, which our model doesn’t, since our cleaning step strips punctuation from the training data. However, there is always room for improvement in your model-training techniques.

7. Final Words

In this post, we learned the steps that you need to follow to effectively train your AI model for text summarization.

We learned that we need a 3-layer LSTM encoder, paired with an LSTM decoder, to build the capability of predicting the right words in a summary. The code will run only if the required libraries are installed on the system.

There are still chances of an error in the whole process, so you’ll need to be patient while tracking down bugs in the code. Debugging is a game of patience, and mastering it will make your AI models the best!

That’s it for the post! We hope you enjoyed reading our content!
