Text Summarization is a process of generating a compact and meaningful synopsis from a huge volume of text. Sources for such text include news articles, blogs, social media posts, all kinds of documentation, and many more. If you are new to NLP and want to read more about text summarization, this article will help you understand the basic and advanced concepts. The purpose of this article is to demonstrate the implementation of an extractive text summarizer using state-of-the-art contextual embeddings.
Approach
The approach to building the summarizer can be divided into the following steps:
- Convert the article/passage to a list of sentences using nltk’s sentence tokenizer.
- For each sentence, extract a contextual embedding using Sentence Transformer.
- Apply K-means clustering to the embeddings. The idea is to group sentences that are contextually similar and to pick, from each cluster, the one sentence that is closest to the cluster mean (centroid).
- For each sentence embedding, calculate the distance from the centroid of its cluster. Sometimes a centroid coincides with an actual sentence embedding, in which case the distance is zero.
- For each cluster, select the embedding (sentence) with the lowest distance from the centroid and return the summary based on the order in which the sentences appeared in the original text.
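The selection logic in the last three steps can be sketched on toy data before bringing in real embeddings. This is a minimal illustration using only NumPy; the helper name `pick_representatives`, the 2-D vectors, and the cluster labels are made up for the example, and plain Euclidean distance is assumed.

```python
import numpy as np

def pick_representatives(embeddings, labels):
    """Return the index of the sentence closest to each cluster mean, in text order."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    chosen = []
    for cluster_id in np.unique(labels):
        members = np.where(labels == cluster_id)[0]        # sentence indices in this cluster
        centroid = embeddings[members].mean(axis=0)        # cluster mean
        distances = np.linalg.norm(embeddings[members] - centroid, axis=1)
        chosen.append(int(members[np.argmin(distances)]))  # closest sentence wins
    return sorted(chosen)                                  # restore original order

# Toy 2-D "embeddings" for five sentences, already assigned to two clusters.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
toy_labels = [0, 0, 0, 1, 1]
representatives = pick_representatives(toy_embeddings, toy_labels)
print(representatives)
```

In the first toy cluster the centroid coincides with the second sentence, so its distance is zero, which is the special case mentioned in the steps above.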
Sentence Transformer
Sentence Transformer is a Python package that lets you represent sentences and paragraphs as dense vectors. The package is compatible with state-of-the-art models such as BERT, RoBERTa, and XLM-RoBERTa. Choosing the right model matters, because models perform best on the tasks they were trained for. For our use case, we will use models trained for STS (Semantic Textual Similarity). To read more about Sentence Transformer, check out the following links:
- Source code: UKPLab / sentence-transformers
- List of pre-trained models: Pretrained Models
- List of STS models: SentenceTransformer Pretrained Models
Implementation
Now that we understand the use case and the approach, let’s dive into the implementation of the summarizer. We will be using the following Python packages:
- Pandas
- Numpy
- Sentence Transformer
- NLTK’s KMeansClusterer
So let’s import the above-mentioned packages:
import nltk
import pandas as pd
from sentence_transformers import SentenceTransformer
from nltk.cluster import KMeansClusterer
import numpy as np
The next step is to initialize the SentenceTransformer with an appropriate model. As mentioned above, I am going to use an STS-based model, namely stsb-roberta-base. Feel free to experiment with other models.
model = SentenceTransformer('stsb-roberta-base')
We are going to summarize a recent news article describing the new Aston Martin Formula 1 car.
article='''
This week's launch of Aston Martin’s new Formula 1 car was one of the most hyped events of the pre-season so far, as fans were intrigued by how the new-look AMR21 would be painted. Unlike the car launches that came before it, Aston Martin left very little to the imagination, releasing detailed shots of the entire car. The first thing to note is that the team spent both of its development tokens on redesigning the chassis, in order that it could unlock aerodynamic performance from the central portion of the car. This is, in part, a legacy of the team’s approach for 2020, having assimilated the overall design package of the previous year’s championship winning Mercedes including a more conventional position for the side-impact protection spars (SIPS). The low-slung arrangement, as introduced by Ferrari in 2017, is now considered critical from an aerodynamic perspective, with the sidepod inlet positioned much like a periscope. This is typically above the fairing that surrounds the SIPS, which is used to inhibit the turbulence created by the front tyre and therefore also aids the transit of cool air that’s supplied to the radiators within the sidepods. This image of the car depicts how the bargeboards are used to filter the turbulence created by the front tyre and convert it into something more usable. Meanwhile, the airflow fed from the front of the car, including the cape, is forced around the underside of the sidepod whilst the fairing around the SIPS shields the airflow entering the sidepod inlet. This should result in a much cleaner flow arriving at the radiators, with the air having not been worked too hard by numerous surfaces en route. The inlet itself is extremely narrow with the team recovering some of that with the sculpting on the sides of the chassis. The narrowness of the inlet also draws your attention to the substantial fin that grows out of the sidepod’s shoulder and helps to divert airflow down over the revamped sidepod packaging behind. 
This is an area where the team has clearly focused its resources, knowing that getting this right will reap aerodynamic rewards for other areas of the car. The sidepod design draws inspiration from the new bodywork that the team installed in Mugello last season (below) but falls short of having the full ramp to floor transition, instead favouring the dipped midriff like we’ve seen adopted elsewhere. The rear portion of the sidepods and the engine cover have extremely tight packaging, with the AMR21 akin to the W12 with the bodywork almost shrink wrapped to the componentry inside. And, much like the W12, it also features a bodywork blister around the inlet plenum, a feature of the power unit which is believed to be bigger this season as a result of some of the performance and durability updates introduced by HPP. The AMR12 also features a very small rear cooling outlet that not only shows how efficient they expect the Mercedes-AMG F1 M12 E Performance power unit to be, but also how much they have focused on producing a car that recovers the downforce lost by the introduction of the new regulations. The extremely tight packaging creates a sizable undercut beneath the cooling outlet too, which buys back some of the floor that has been lost to the new regulations and drives home the performance of the coke bottle region. This is aided further by the token-free adoption of the Mercedes gearbox carrier and rear suspension from last season, an arrangement that Mercedes was particularly proud of because of the aerodynamic gains that it facilitates. The new arrangement sees the suspension elements lifted clear of the diffuser ceiling, which has become more prominent as the teams push the boundaries of the regulations, while the rear leg of the lower wishbone being positioned so far back also results in the ability to extract more performance from the diffuser. 
Aston Martin is the first team to unmask all the aerodynamic tricks it will use to make up the difference on the edge of the diagonal floor cut-out. The first of these tricks shares a similarity to the design shown by AlphaTauri, with a trio of outwardly directed fins installed just behind the point where the floor starts to taper in. The airflow structures emitted from these fins will undoubtedly interact with the AlphaTauri-esque floor scroll and floor notch just ahead of them and help to mitigate some of the losses that have been created due to fully enclosed holes being outlawed and the reduced floor width ahead of the rear tyre. It’s here where we find a solution akin to the one that Ferrari tested at the end of 2020 too, as a series of fins form an arc. This should help influence the airflow ahead of the rear tyre and reduce the impact that tyre squirt has on the diffuser. Interestingly, it has also added two offset floor strakes inboard of this where teams normally only opt for one strake, with Mercedes in the pre-hybrid era being an advocate of such designs. A new solution appears on the rear wing too, as the thickness of the upper front corner of the endplate has been altered to allow for another upwash strike. Teams had already started to look for ways to redesign this region last year, with the removal of the louvres in 2019 resulting in an increase in drag. The upwash strike is positioned in order that it can affect the tip vortex that’s generated by the top flap and endplate juncture and will undoubtedly be a design aspect that the rest of the field will take note of. While Aston Martin did show us a lot of its new car, it did keep one element secret for now – the rear brake ducts (not pictured, above). It does seem like a strange omission given it has shown us so much around the rest of the car but we must remember that this is one aspect of the 2021 cars that’s affected by the new regulations. 
Perhaps the team feels it has found a small pocket of performance in that regard and doesn’t want to unnecessarily hand its rivals a chance to see it ahead of testing.
'''
Now, we convert the above article to a list of sentences. We will use nltk’s sent_tokenize() function, which requires NLTK’s Punkt tokenizer data to be downloaded beforehand.
sentences = nltk.sent_tokenize(article)
# strip leading and trailing whitespace
sentences = [sentence.strip() for sentence in sentences]
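Output (shown as returned by sent_tokenize(); the strip step then removes the leading \n from the first sentence):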
["\nThis week's launch of Aston Martin’s new Formula 1 car was one of the most hyped events of the pre-season so far, as fans were intrigued by how the new-look AMR21 would be painted.",
'Unlike the car launches that came before it, Aston Martin left very little to the imagination, releasing detailed shots of the entire car.',
'The first thing to note is that the team spent both of its development tokens on redesigning the chassis, in order that it could unlock aerodynamic performance from the central portion of the car.',
'This is, in part, a legacy of the team’s approach for 2020, having assimilated the overall design package of the previous year’s championship winning Mercedes including a more conventional position for the side-impact protection spars (SIPS).',
'The low-slung arrangement, as introduced by Ferrari in 2017, is now considered critical from an aerodynamic perspective, with the sidepod inlet positioned much like a periscope.',
'This is typically above the fairing that surrounds the SIPS, which is used to inhibit the turbulence created by the front tyre and therefore also aids the transit of cool air that’s supplied to the radiators within the sidepods.',
'This image of the car depicts how the bargeboards are used to filter the turbulence created by the front tyre and convert it into something more usable.',
'Meanwhile, the airflow fed from the front of the car, including the cape, is forced around the underside of the sidepod whilst the fairing around the SIPS shields the airflow entering the sidepod inlet.',
'This should result in a much cleaner flow arriving at the radiators, with the air having not been worked too hard by numerous surfaces en route.',
'The inlet itself is extremely narrow with the team recovering some of that with the sculpting on the sides of the chassis.',
'The narrowness of the inlet also draws your attention to the substantial fin that grows out of the sidepod’s shoulder and helps to divert airflow down over the revamped sidepod packaging behind.',
'This is an area where the team has clearly focused its resources, knowing that getting this right will reap aerodynamic rewards for other areas of the car.',
'The sidepod design draws inspiration from the new bodywork that the team installed in Mugello last season (below) but falls short of having the full ramp to floor transition, instead favouring the dipped midriff like we’ve seen adopted elsewhere.',
'The rear portion of the sidepods and the engine cover have extremely tight packaging, with the AMR21 akin to the W12 with the bodywork almost shrink wrapped to the componentry inside.',
'And, much like the W12, it also features a bodywork blister around the inlet plenum, a feature of the power unit which is believed to be bigger this season as a result of some of the performance and durability updates introduced by HPP.',
'The AMR12 also features a very small rear cooling outlet that not only shows how efficient they expect the Mercedes-AMG F1 M12 E Performance power unit to be, but also how much they have focused on producing a car that recovers the downforce lost by the introduction of the new regulations.',
'The extremely tight packaging creates a sizable undercut beneath the cooling outlet too, which buys back some of the floor that has been lost to the new regulations and drives home the performance of the coke bottle region.',
'This is aided further by the token-free adoption of the Mercedes gearbox carrier and rear suspension from last season, an arrangement that Mercedes was particularly proud of because of the aerodynamic gains that it facilitates.',
'The new arrangement sees the suspension elements lifted clear of the diffuser ceiling, which has become more prominent as the teams push the boundaries of the regulations, while the rear leg of the lower wishbone being positioned so far back also results in the ability to extract more performance from the diffuser.',
'Aston Martin is the first team to unmask all the aerodynamic tricks it will use to make up the difference on the edge of the diagonal floor cut-out.',
'The first of these tricks shares a similarity to the design shown by AlphaTauri, with a trio of outwardly directed fins installed just behind the point where the floor starts to taper in.',
'The airflow structures emitted from these fins will undoubtedly interact with the AlphaTauri-esque floor scroll and floor notch just ahead of them and help to mitigate some of the losses that have been created due to fully enclosed holes being outlawed and the reduced floor width ahead of the rear tyre.',
'It’s here where we find a solution akin to the one that Ferrari tested at the end of 2020 too, as a series of fins form an arc.',
'This should help influence the airflow ahead of the rear tyre and reduce the impact that tyre squirt has on the diffuser.',
'Interestingly, it has also added two offset floor strakes inboard of this where teams normally only opt for one strake, with Mercedes in the pre-hybrid era being an advocate of such designs.',
'A new solution appears on the rear wing too, as the thickness of the upper front corner of the endplate has been altered to allow for another upwash strike.',
'Teams had already started to look for ways to redesign this region last year, with the removal of the louvres in 2019 resulting in an increase in drag.',
'The upwash strike is positioned in order that it can affect the tip vortex that’s generated by the top flap and endplate juncture and will undoubtedly be a design aspect that the rest of the field will take note of.',
'While Aston Martin did show us a lot of its new car, it did keep one element secret for now – the rear brake ducts (not pictured, above).',
'It does seem like a strange omission given it has shown us so much around the rest of the car but we must remember that this is one aspect of the 2021 cars that’s affected by the new regulations.',
'Perhaps the team feels it has found a small pocket of performance in that regard and doesn’t want to unnecessarily hand its rivals a chance to see it ahead of testing.']
Note: As we are using the transformer model, it’s recommended not to remove stopwords, punctuation, etc. as they help to capture more context compared to preprocessed text.
To apply the various transformations to the data efficiently, we will use a Pandas DataFrame. Let’s convert the above list to a DataFrame:
data = pd.DataFrame(sentences)
data.columns = ['sentence']
The next step is to represent the sentences as dense vectors. We will create a small UDF (user-defined function) that returns a vector for a given sentence, using the Sentence Transformer instance we created above.
def get_sentence_embeddings(sentence):
    # encode() takes a list of sentences, so wrap the sentence and unwrap the result
    embedding = model.encode([sentence])
    return embedding[0]
Create a new column ‘embeddings’ using the above UDF.
data['embeddings']=data['sentence'].apply(get_sentence_embeddings)
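Since encode() accepts a list of sentences, the whole column can also be encoded in a single call rather than row by row. The sketch below assumes this batching is acceptable for your data. The helper `add_embeddings` and the stand-in `fake_encode` are hypothetical names introduced only so the example runs without downloading a model.

```python
import numpy as np

def add_embeddings(sentences, encode):
    """Encode all sentences in one call and return one vector per sentence."""
    matrix = np.asarray(encode(list(sentences)))  # shape: (n_sentences, dim)
    return [row for row in matrix]

# Stand-in encoder (hypothetical): maps a sentence to [length, number of spaces].
def fake_encode(batch):
    return np.array([[len(s), s.count(' ')] for s in batch], dtype=float)

vectors = add_embeddings(["a b", "hello"], fake_encode)
print(vectors)

# With the real model, the same helper would be called as:
# vectors = add_embeddings(data['sentence'], model.encode)
```

Passing the encoder in as an argument keeps the helper independent of any particular model, which also makes it easy to try the other pre-trained models mentioned earlier.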
Now that we have the text embeddings, let’s cluster them using NLTK’s KMeansClusterer.
NUM_CLUSTERS = 10
iterations = 25

X = np.array(data['embeddings'].tolist())

kclusterer = KMeansClusterer(
    NUM_CLUSTERS,
    distance=nltk.cluster.util.cosine_distance,
    repeats=iterations,
    avoid_empty_clusters=True)

assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
Note: The intuition for the NUM_CLUSTERS parameter is the number of sentences the end-user expects in the summary.
As you can see, NLTK’s KMeansClusterer lets us pass cosine_distance as the measure of distance between two vectors.
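For intuition, cosine distance is one minus the cosine of the angle between two vectors, so it depends on direction only and not on length. The following is a small NumPy sketch of that formula on toy vectors. It is written from the definition and is not NLTK’s own source.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction, 2 means opposite."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

same_direction = cosine_distance([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])       # right angle
opposite = cosine_distance([1, 0], [-1, 0])        # pointing away from each other
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, two sentence vectors that point in the same direction count as identical regardless of their magnitude.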
Output:
[9,0,2,9,4,4,2,4,8,5,6,0,1,5,3,5,5,5,8,5,6,6,9,8,9,2,9,4,5,6,7]
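A quick way to inspect this assignment is to count how many sentences landed in each cluster. The sketch below uses only the standard library and simply copies the list printed above.

```python
from collections import Counter

# Cluster assignment copied from the output above (one label per sentence).
assigned_clusters = [9, 0, 2, 9, 4, 4, 2, 4, 8, 5, 6, 0, 1, 5, 3, 5, 5, 5, 8,
                     5, 6, 6, 9, 8, 9, 2, 9, 4, 5, 6, 7]

cluster_sizes = Counter(assigned_clusters)
largest_cluster, largest_size = cluster_sizes.most_common(1)[0]
singletons = sorted(c for c, n in cluster_sizes.items() if n == 1)

print(len(assigned_clusters), "sentences in", len(cluster_sizes), "clusters")
print("largest cluster:", largest_cluster, "with", largest_size, "sentences")
print("single-sentence clusters:", singletons)
```

Clusters that contain a single sentence will contribute that sentence to the summary no matter how far it sits from the centroid, which is worth keeping in mind when choosing NUM_CLUSTERS.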
Next, we compute the distance between each sentence vector and the centroid (also called the mean) of its cluster. To achieve this, we first assign a centroid to each row based on its cluster number.
data['cluster'] = pd.Series(assigned_clusters, index=data.index)
data['centroid'] = data['cluster'].apply(lambda x: kclusterer.means()[x])
To compute the distance, we will use scipy’s distance_matrix function, which computes the Euclidean distance by default (note that the clustering step above used cosine distance instead).
from scipy.spatial import distance_matrix

def distance_from_centroid(row):
    # the embedding and the centroid have different types, hence tolist() below
    return distance_matrix([row['embeddings']], [row['centroid'].tolist()])[0][0]

data['distance_from_centroid'] = data.apply(distance_from_centroid, axis=1)
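The same per-sentence distances can also be computed for all rows at once. Below is a hedged sketch with NumPy only; the helper `distances_to_centroids`, its `metric` argument, and the toy numbers are assumptions made for illustration. The optional cosine branch is included in case you prefer the ranking to use the same metric as the clustering step.

```python
import numpy as np

def distances_to_centroids(X, means, labels, metric="euclidean"):
    """Distance of every row of X to the centroid of its assigned cluster."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(means, dtype=float)[np.asarray(labels)]  # one centroid per row
    if metric == "euclidean":
        return np.linalg.norm(X - centroids, axis=1)
    if metric == "cosine":
        dots = np.sum(X * centroids, axis=1)
        norms = np.linalg.norm(X, axis=1) * np.linalg.norm(centroids, axis=1)
        return 1.0 - dots / norms
    raise ValueError("metric must be 'euclidean' or 'cosine'")

toy_X = [[3.0, 5.0], [1.0, 0.0], [0.0, 2.0]]
toy_means = [[0.0, 1.0], [1.0, 0.0]]   # centroid of cluster 0 and of cluster 1
toy_labels = [0, 1, 0]

euclid = distances_to_centroids(toy_X, toy_means, toy_labels)
cosine = distances_to_centroids(toy_X, toy_means, toy_labels, metric="cosine")
print(euclid, cosine)
```

The third toy vector shows how the two metrics can disagree: it is one unit away from its centroid in Euclidean terms, yet its cosine distance is zero because it points in the same direction.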
The final step is to generate a summary. To do this, we will use the following steps:
- Group sentences based on the cluster column.
- Sort each group in ascending order of the distance_from_centroid column and select the first row (the sentence with the smallest distance from the mean).
- Sort the sentences based on their sequence in the original text.
The above-mentioned steps can be implemented using one line of code:
summary=' '.join(data.sort_values('distance_from_centroid',ascending = True).groupby('cluster').head(1).sort_index()['sentence'].tolist())
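If the one-liner is hard to follow, the same three steps can be spelled out separately. The sketch below runs on a small hypothetical DataFrame and uses groupby with idxmin, which is an alternative formulation rather than the exact chain used above. The column names match the ones in this article, while `build_summary` and the toy values are invented for the example.

```python
import pandas as pd

# Toy stand-in for the real DataFrame: five sentences in two clusters.
toy = pd.DataFrame({
    "sentence": ["A.", "B.", "C.", "D.", "E."],
    "cluster": [1, 0, 1, 0, 1],
    "distance_from_centroid": [0.4, 0.3, 0.1, 0.2, 0.5],
})

def build_summary(frame):
    # Steps 1 and 2: per cluster, find the row label with the smallest distance.
    closest = frame.groupby("cluster")["distance_from_centroid"].idxmin()
    # Step 3: select those rows and restore the original sentence order.
    picked = frame.loc[closest].sort_index()
    return " ".join(picked["sentence"].tolist())

toy_summary = build_summary(toy)
print(toy_summary)
```

Sorting by the index works here because the DataFrame was built from the sentence list in order, so the index doubles as the position of each sentence in the original text.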
Extracted Summary:
The first thing to note is that the team spent both of its development tokens on redesigning the chassis, in order that it could unlock aerodynamic performance from the central portion of the car. This is, in part, a legacy of the team’s approach for 2020, having assimilated the overall design package of the previous year’s championship winning Mercedes including a more conventional position for the side-impact protection spars (SIPS). This is typically above the fairing that surrounds the SIPS, which is used to inhibit the turbulence created by the front tyre and therefore also aids the transit of cool air that’s supplied to the radiators within the sidepods. This is an area where the team has clearly focused its resources, knowing that getting this right will reap aerodynamic rewards for other areas of the car. The sidepod design draws inspiration from the new bodywork that the team installed in Mugello last season (below) but falls short of having the full ramp to floor transition, instead favouring the dipped midriff like we’ve seen adopted elsewhere. And, much like the W12, it also features a bodywork blister around the inlet plenum, a feature of the power unit which is believed to be bigger this season as a result of some of the performance and durability updates introduced by HPP. The AMR12 also features a very small rear cooling outlet that not only shows how efficient they expect the Mercedes-AMG F1 M12 E Performance power unit to be, but also how much they have focused on producing a car that recovers the downforce lost by the introduction of the new regulations. The airflow structures emitted from these fins will undoubtedly interact with the AlphaTauri-esque floor scroll and floor notch just ahead of them and help to mitigate some of the losses that have been created due to fully enclosed holes being outlawed and the reduced floor width ahead of the rear tyre. 
This should help influence the airflow ahead of the rear tyre and reduce the impact that tyre squirt has on the diffuser. Perhaps the team feels it has found a small pocket of performance in that regard and doesn’t want to unnecessarily hand its rivals a chance to see it ahead of testing.
As you can see, the approach we followed does a decent job of generating a summary that describes the car. If you read carefully, the sentences are pretty much in line with one another, given that we picked the top sentence from each cluster. For example, the initial part of the summary talks about SIPS (Side Impact Protection Spars), i.e. sentences 2, 3, and 4, while the following sentences talk about the bodywork, i.e. sentences 5 and 6, and so on. So, by implementing extractive text summarization, we describe each aspect of the car in a couple of sentences.
The code explained here is also available on my Github. Feel free to fork it and play with it.
References:
Sentence Embedding Based Semantic Clustering Approach for Discussion Thread Summarization
nltk.cluster.kmeans – NLTK 3.5 documentation
This article was originally published on Medium and re-published to TOPBOTS with permission from the author.