Summarization has become a very helpful way of tackling information overload. In my earlier story, I shared how you can create your personal text summarizer using the extractive method. If you have tried that, you may have noticed that, because no new sentences are generated from the original content, the extractive summary can at times be difficult to understand.
In this story, I will share how I use Google’s T5 (Text-to-Text Transfer Transformer) model to create a human-like summarizer with just a few lines of code!
As a bonus, I will also share my text summarizer pipelines where I combine both extractive and abstractive methods to generate meaningful summaries for PDF documents of any length…
Text Summarization Techniques
There are two techniques for summarizing long content:
i. Extractive summary — extracts important sentences from long content.
ii. Abstractive summary — creates a summary by generating new sentences from original content.
Abstractive summarization is the comparatively more difficult technique because it involves deep learning, but thanks to Google’s pre-trained models that are available to the public, creating a meaningful abstractive summary is no longer a daunting machine learning task!
T5 Text Summarizer
You can build a simple yet incredibly powerful abstractive text summarizer using Google’s T5 pre-trained model. I will use HuggingFace’s state-of-the-art Transformers framework and PyTorch to build a summarizer.
Install packages
Please ensure you have both Python packages installed.
pip install torch
pip install transformers
Load model and tokenizer
Load T5’s pre-trained model and its tokenizer.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')
tokenizer = AutoTokenizer.from_pretrained('t5-base')
There are five T5 models to choose from: t5-small, t5-base, t5-large, t5-3b and t5-11b. They differ in the number of parameters. I will choose the ‘t5-base’ model, which has a total of 220 million parameters. Feel free to try different T5 models.
Input text
Let’s load a CNN news article about ‘Netflix needs a Next Big Thing’ — simply because this is rather interesting business news — and see how well our summarizer performs.
text = """New York (CNN Business)Netflix is synonymous with streaming, but its competitors have a distinct advantage that threatens the streaming leader's position at the top.
Disney has Disney+, but it also has theme parks, plush Baby Yoda dolls, blockbuster Marvel movies and ESPN. Comcast (CMCSA), Amazon (AMZN), ViacomCBS (VIACA), CNN's parent company WarnerMedia and Apple (AAPL) all have their own streaming services, too, but they also have other forms of revenue.
As for Netflix (NFLX), its revenue driver is based entirely on building its subscriber base. It's worked out well for the company — so far. But it's starting to look like the king of streaming will soon need something other than new subscribers to keep growing.
The streaming service reported Tuesday it now has 208 million subscribers globally, after adding 4 million subscribers in the first quarter of 2021. But that number missed expectations and the forecasts for its next quarter were also pretty weak.
That was a big whiff for Netflix — a company coming off a massive year of growth thanks in large part to the pandemic driving people indoors — and Wall Street's reaction has not been great.
The company's stock dropped as much as 8% on Wednesday, leading some to wonder what the future of the streamer looks like if competition continues to gain strength, people start heading outdoors and if, most importantly, its growth slows.
"If you hit a wall with [subscriptions] then you pretty much don't have a super growth strategy anymore in your most developed markets," Michael Nathanson, a media analyst and founding partner at MoffettNathanson, told CNN Business. "What can they do to take even more revenue out of the market, above and beyond streaming revenues?"
Or put another way, the company's lackluster user growth last quarter is a signal that it wouldn't hurt if Netflix — a company that's lived and died with its subscriber numbers — started thinking about other ways to make money.
An ad-supported Netflix? Not so fast
There are ways for Netflix to make money other than raising prices or adding subscribers. The most obvious: selling advertising.
Netflix could have 30-second commercials on their programming or get sponsors for their biggest series and films. TV has worked that way forever, why not Netflix?
That's probably not going to happen, given that CEO Reed Hastings has been vocal about the unlikelihood of an ad-supported Netflix service. His reasoning: It doesn't make business sense.
"It's a judgment call... It's a belief we can build a better business, a more valuable business [without advertising]," Hastings told Variety in September. "You know, advertising looks easy until you get in it. Then you realize you have to rip that revenue away from other places because the total ad market isn't growing, and in fact right now it's shrinking. It's hand-to-hand combat to get people to spend less on, you know, ABC and to spend more on Netflix."
Hastings added that "there's much more growth in the consumer market than there is in advertising, which is pretty flat."
He's also expressed doubts about Netflix getting into live sports or news, which could boost the service's allure to subscribers, so that's likely out, too, at least for now.
So if Netflix is looking for other forms of near-term revenue to help support its hefty content budget ($17 billion in 2021 alone) then what can it do? There is one place that could be a revenue driver for Netflix, but if you're borrowing your mother's account you won't like it.
Netflix could crack down on password sharing — a move that the company has been considering lately.
"Basically you're going to clean up some subscribers that are free riders," Nathanson said. "That's going to help them get to a higher level of penetration, definitely, but not in long-term."
Lackluster growth is still growth
Missing projections is never good, but it's hardly the end of the world for Netflix. The company remains the market leader and most competitors are still far from taking the company on. And while Netflix's first-quarter subscriber growth wasn't great, and its forecasts for the next quarter alarmed investors, it was just one quarter.
Netflix has had subscriber misses before and it's still the most dominant name in all of streaming, and even lackluster growth is still growth. It's not as if people are canceling Netflix in droves.
Asked about Netflix's "second act" during the company's post-earnings call on Tuesday, Hastings again placed the company's focus on pleasing subscribers.
"We do want to expand. We used to do that thing shipping DVDs, and luckily we didn't get stuck with that. We didn't define that as the main thing. We define entertainment as the main thing," Hastings said.
He added that he doesn't think Netflix will have a second act in the way Amazon has had with Amazon shopping and Amazon Web Services. Rather, Netflix will continue to improve and grow on what it already does best.
"I'll bet we end with one hopefully gigantic, hopefully defensible profit pool, and continue to improve the service for our members," he said. "I wouldn't look for any large secondary pool of profits. There will be a bunch of supporting pools, like consumer products, that can be both profitable and can support the title brands."
"""
Tokenize Text
T5 can also be used to perform other tasks, such as text generation and translation. Adding the T5-specific prefix “summarize: ” tells the model to perform the summarization task.
tokens_input = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=tokenizer.model_max_length, truncation=True)
Here we will tokenize our text to the model’s maximum acceptable token input length. If the tokenized input exceeds the model’s maximum token length, it will be truncated.
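To see how much of an article is lost to truncation, you can compare the untruncated token count with the model limit. The sketch below is a minimal illustration, not part of the original notebook: `truncation_report` is a hypothetical helper, and the count of 1300 tokens is a made-up figure used only for the demo.

```python
def truncation_report(n_tokens, max_length=512):
    """Return (kept, dropped) token counts for an input of n_tokens.

    max_length defaults to 512, the input limit this article uses for t5-base.
    """
    kept = min(n_tokens, max_length)
    dropped = n_tokens - kept
    return kept, dropped


# Hypothetical usage with the tokenizer loaded earlier (not run here):
#   n_tokens = len(tokenizer.encode("summarize: " + text))
#   kept, dropped = truncation_report(n_tokens, tokenizer.model_max_length)

# Illustrative numbers only: pretend the article encodes to 1300 tokens.
kept, dropped = truncation_report(1300, max_length=512)
print(f"kept {kept} tokens, dropped {dropped}")
```

If the dropped count is large, the summary can only reflect the first part of the text.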
Generate Summary
Let’s generate a summary by passing in the encoded tokens and then decode the generated summary back to text.
summary_ids = model.generate(tokens_input, min_length=80, max_length=150, length_penalty=15, num_beams=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
The model takes encoded tokens and the following input arguments:
- min_length: minimum number of tokens in the generated summary.
- max_length: maximum number of tokens in the generated summary.
- length_penalty: used with beam search; a value > 1 encourages the model to generate a longer summary, a value < 1 encourages a shorter summary.
- num_beams: number of beams for beam search; a value of 2 lets the model keep two candidate sequences at each step instead of committing to the single most likely next token.
Note: Setting the minimum and maximum lengths to 80 and 150 tokens with a length penalty of 15 lets the model generate a reasonable summary of 60 to 90 words. We will use the default values for the rest of the input arguments (not shown above).
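To build some intuition for length_penalty, here is a small sketch of length-normalized beam scoring. It assumes that finished hypotheses are ranked by their summed log-probability divided by length raised to the power of length_penalty, which is my understanding of how the Transformers beam search works; treat the formula as an assumption. The function names and the two hypotheses are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score of a finished beam hypothesis (assumed formula)."""
    return sum_logprobs / (length ** length_penalty)


def best_hypothesis(hypotheses, length_penalty):
    """Return the label of the hypothesis with the highest normalized score."""
    best = max(hypotheses, key=lambda h: beam_score(h[1], h[2], length_penalty))
    return best[0]


# (label, summed log-probability, length in tokens): made-up numbers
hypotheses = [("short", -4.0, 4), ("long", -9.0, 12)]

for penalty in (0.0, 1.0, 2.0):
    print(penalty, best_hypothesis(hypotheses, penalty))
```

Because log-probabilities are negative, dividing by a larger power of the length shrinks the penalty that a long hypothesis accumulates, so higher values tilt the ranking toward longer summaries.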
Output Summary
Netflix (NFLX) reported Tuesday it now has 208 million subscribers globally. that number missed expectations and the forecasts for its next quarter were also pretty weak. the streaming service's stock dropped as much as 8% on Wednesday, leading some to wonder what the future of the streamer looks like. if competition continues to gain strength, people start heading outdoors and if, most importantly, its growth slows, it wouldn't hurt if Netflix started thinking about other ways to make money - like selling ads.
Wow! It looks like a pretty decent summary.
…but if you read the full text and read the summary again, you will notice that the latter part of the full text did not get summarized — this is because the tokenized input gets truncated after it exceeds the maximum model token input length of 512.
If you are worried about missing important details in the latter part of the text, you can use a simple trick to solve the issue: apply extractive summarization to the original text first, followed by abstractive summarization.
BERT Extractive Summary
Before we proceed, make sure you have installed the BERT extractive summarizer with pip from your terminal.
pip install bert-extractive-summarizer
BERT stands for Bidirectional Encoder Representations from Transformers. It’s a Transformer-based machine learning technique for Natural Language Processing (NLP) developed by Google. The bert-extractive-summarizer package uses BERT to embed each sentence, clusters the embeddings, and selects the sentences closest to the cluster centroids as the summary.
from summarizer import Summarizer

bert_model = Summarizer()
ext_summary = bert_model(text, ratio=0.5)
Below is the extractive summary generated by BERT. I purposely set it to produce a summary that is 50% in length of the original text by setting the summary ratio to 0.5. Feel free to use a different ratio to adjust your long document to the appropriate length.
New York (CNN Business)Netflix is synonymous with streaming, but its competitors have a distinct advantage that threatens the streaming leader's position at the top. Disney has Disney+, but it also has theme parks, plush Baby Yoda dolls, blockbuster Marvel movies and ESPN. It's worked out well for the company - so far. But that number missed expectations and the forecasts for its next quarter were also pretty weak. Or put another way, the company's lackluster user growth last quarter is a signal that it wouldn't hurt if Netflix - a company that's lived and died with its subscriber numbers - started thinking about other ways to make money. Not so fast
There are ways for Netflix to make money other than raising prices or adding subscribers. His reasoning: It doesn't make business sense. "It's a judgment call... It's a belief we can build a better business, a more valuable business [without advertising]," Hastings told Variety in September. " You know, advertising looks easy until you get in it. Then you realize you have to rip that revenue away from other places because the total ad market isn't growing, and in fact right now it's shrinking. It's hand-to-hand combat to get people to spend less on, you know, ABC and to spend more on Netflix." So if Netflix is looking for other forms of near-term revenue to help support its hefty content budget ($17 billion in 2021 alone) then what can it do? Netflix could crack down on password sharing - a move that the company has been considering lately. "Basically you're going to clean up some subscribers that are free riders," Nathanson said. " That's going to help them get to a higher level of penetration, definitely, but not in long-term." The company remains the market leader and most competitors are still far from taking the company on. We used to do that thing shipping DVDs, and luckily we didn't get stuck with that. We define entertainment as the main thing," Hastings said. He added that he doesn't think Netflix will have a second act in the way Amazon has had with Amazon shopping and Amazon Web Services. Rather, Netflix will continue to improve and grow on what it already does best. I wouldn't look for any large secondary pool of profits.
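The ratio of 0.5 used above was chosen by hand. One rough way to pick a ratio for a longer document is to scale it so the shortened text fits the T5 input limit. The sketch below is only a heuristic under stated assumptions: `pick_ratio` and its `margin` parameter are hypothetical, and it assumes the summarizer's ratio (a share of sentences, as far as I know) shrinks the token count roughly in proportion.

```python
def pick_ratio(n_tokens, budget=512, margin=0.9):
    """Suggest an extractive ratio so the shortened text fits the token budget.

    n_tokens: token count of the full text (e.g. from the T5 tokenizer).
    budget:   model input limit; 512 is the limit this article uses for t5-base.
    margin:   safety factor, because the ratio applies to sentences, not tokens.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return min(1.0, margin * budget / n_tokens)


# Illustrative token counts only
for n in (300, 1024, 5000):
    print(n, round(pick_ratio(n), 3))
```

A text that already fits gets a ratio of 1.0, which means the extractive step can be skipped.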
Let’s now feed the extractive summary through our T5 model.
The Extractive-Abstractive Summary
Netflix's lackluster user growth is a signal that it wouldn't hurt if it started thinking about other ways to make money. the company remains the market leader and most competitors are still far from taking the company on. the company could crack down on password sharing - a move that the company has been considering lately. "it's a judgment call... it's a belief we can build a better business, a more valuable business," hastings said.
Wow… the generated summary now covers the entire context of the original text.
For your convenience, I have summarized the code below.
You can also click here to go to my GitHub to get the Jupyter Notebooks for T5 text summarizer and text summarizer pipelines preparation, and the pipeline scripts that you can run on your terminal to summarize multiple PDF documents.
# make sure you have pip installed torch and transformers packages
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')
tokenizer = AutoTokenizer.from_pretrained('t5-base')

# text is the CNN article string defined in the "Input text" section above

def abs_sum(text, model, tokenizer):
    # prepend the T5 task prefix and truncate to the model's maximum input length
    tokens_input = tokenizer.encode("summarize: " + text, return_tensors='pt',
                                    max_length=tokenizer.model_max_length,
                                    truncation=True)
    # generate the summary with beam search, then decode it back to text
    summary_ids = model.generate(tokens_input, min_length=80, max_length=150,
                                 length_penalty=15, num_beams=2)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

summary = abs_sum(text, model, tokenizer)

# Generated summary
# -------------------
# Netflix (NFLX) reported Tuesday it now has 208 million subscribers globally.
# that number missed expectations and the forecasts for its next quarter were
# also pretty weak. the streaming service's stock dropped as much as 8% on
# Wednesday, leading some to wonder what the future of the streamer looks like.
# if competition continues to gain strength, people start heading outdoors and
# if, most importantly, its growth slows, it wouldn't hurt if Netflix started
# thinking about other ways to make money - like selling ads.

# ========================================================================
# Use BERT summarizer to reduce the text by 50% followed by T5 summarizer
# make sure you have pip installed bert-extractive-summarizer
from summarizer import Summarizer

bert_model = Summarizer()
ext_summary = bert_model(text, ratio=0.5)
summary_2 = abs_sum(ext_summary, model, tokenizer)

# Generated summary
# ------------------
# Netflix's lackluster user growth is a signal that it wouldn't hurt
# if it started thinking about other ways to make money. the company remains
# the market leader and most competitors are still far from taking the company
# on. the company could crack down on password sharing - a move that the
# company has been considering lately. "it's a judgment call... it's a belief
# we can build a better business, a more valuable business," hastings said.
BONUS….T5 Text Summarizer Pipelines
I have built a text summarizer pipeline that can extract text from PDF documents, summarize the text and store both the original text and the summary into an SQLite database, and output the summary to a text file.
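The storage step of such a pipeline can be sketched with Python's built-in sqlite3 module. This is a minimal illustration and not the actual pipeline script: the table name, column names and helper functions are all hypothetical, and the texts are placeholders rather than real model output.

```python
import sqlite3
import tempfile
from pathlib import Path


def init_db(conn):
    """Create the (hypothetical) summaries table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "id INTEGER PRIMARY KEY, source TEXT, original TEXT, summary TEXT)"
    )
    conn.commit()


def save_summary(conn, source, original, summary):
    """Store the original text and its summary; return the new row id."""
    cur = conn.execute(
        "INSERT INTO summaries (source, original, summary) VALUES (?, ?, ?)",
        (source, original, summary),
    )
    conn.commit()
    return cur.lastrowid


def export_summary(conn, row_id, path):
    """Write the stored summary for row_id to a text file."""
    row = conn.execute(
        "SELECT summary FROM summaries WHERE id = ?", (row_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no summary with id {row_id}")
    Path(path).write_text(row[0], encoding="utf-8")


# Demo with an in-memory database and placeholder texts
conn = sqlite3.connect(":memory:")
init_db(conn)
row_id = save_summary(conn, "netflix.pdf", "full article text", "short summary")
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "summary.txt"
    export_summary(conn, row_id, out_file)
    print(out_file.read_text(encoding="utf-8"))
```

Keeping the original text next to the summary makes it possible to re-run the summarizer later with different settings without extracting the PDF again.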
To summarize a long PDF document, you can first apply extractive summarization to shorten the text before you feed it through the T5 model to generate a human-like summary.
Note: key in ‘1.0’ if you only want to summarize the text with the T5 model.
Note: key in a ratio below ‘1.0’ (e.g. ‘0.5’) if you wish to shorten the text with BERT extractive summarization before running it through T5 summarization. It takes longer to generate a summary this way because each text is run through two different summarizers.
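The ratio logic described in these two notes can be sketched as a small function. This is an illustration under assumptions rather than the pipeline's actual code: `summarize` is a hypothetical name, and `extract` and `abstract` are callables you supply. The stand-ins below only make the sketch runnable without downloading any model.

```python
def summarize(text, ratio, extract, abstract):
    """Run optional extractive shortening, then abstractive summarization.

    ratio == 1.0 skips the extractive step; a ratio below 1.0 shortens first.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in the interval (0, 1]")
    if ratio < 1.0:
        text = extract(text, ratio)
    return abstract(text)


# Stand-ins for the real models. With the objects from this article you could
# pass, for example:
#   extract  = lambda t, r: bert_model(t, ratio=r)
#   abstract = lambda t: abs_sum(t, model, tokenizer)
def fake_extract(text, ratio):
    sentences = text.split(". ")
    keep = max(1, round(len(sentences) * ratio))
    return ". ".join(sentences[:keep])


def fake_abstract(text):
    return text.upper()


sample = "A b. C d. E f. G h"
print(summarize(sample, 1.0, fake_extract, fake_abstract))
print(summarize(sample, 0.5, fake_extract, fake_abstract))
```

Passing the two summarizers in as arguments keeps the control flow easy to test and makes it simple to swap in a different model later.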
Conclusion… and future work
There you go: you only need 7 lines of code (including importing libraries and modules) to get Google’s T5 pre-trained model to summarize content for you.
To make the summary of long content more meaningful, you can apply extractive summarization to shorten the text first, followed by abstractive summarization.
T5 pre-trained models support transfer learning, which means we can train the models further on our custom datasets.
For future work, it will be interesting to see how the models perform after they have been custom-trained to summarize specific contents, e.g. medical journals and engineering journals.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.