In this article, we will learn about …
… the difference between extractive and abstractive text summarization.
… what the ROUGE score is.
… why and where it fails.
Text Summarization
We refer to text summarization as the process of training an Artificial Intelligence (AI) model to produce a smaller chunk of text out of a bigger chunk of text, where the "smaller chunk" could be anything a human would write as a summary: a headline, key facts, results, the essence of the bigger chunk, and so on.
How do we know if the summary that the machine wrote is good or bad?
… and frankly, what does good or bad even mean?
Before we answer this, let’s talk just a bit more about summarization.
Extractive vs Abstractive
There are two types of text summarization that a human, and nowadays a machine, can do [1].
- Extractive: Words and phrases are directly extracted from the text.
- Abstractive: Words and phrases are generated so that they are semantically consistent with the original text, ensuring its key information is preserved.
“What about an example?” you are thinking. I know, I sometimes read minds. 😉 For instance, take the sentence “The quick brown fox jumps over the lazy dog”: an extractive summary might simply pick out “fox jumps over dog”, while an abstractive summary might rephrase it as “A fox leaps over a dog.”
As you can see, the latter, abstractive summarization, more closely resembles the way a human writes a summary. Due to that complexity, its feasibility truly relies on advances in AI and Deep Learning [2].
Evaluation using ROUGE
In practice, one of the most common metrics used to measure the performance of a summarization model is the ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) [3].
The algorithm behind the ROUGE score considers sequences of consecutive tokens, a.k.a. n-grams. The n-grams from one text (e.g. the human-written summary) are compared to the n-grams of the other text (e.g. the machine-written summary). A large overlap of n-grams results in a high ROUGE score, and a low overlap results in a low ROUGE score. There are many variations of ROUGE, such as ROUGE-1 (unigrams), ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence), and each of them can be reported as Precision, Recall or F-score [4].
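To make this concrete, here is a minimal sketch in Python of how an n-gram based ROUGE score can be computed. This is not the official ROUGE implementation; the `rouge_n` helper and the example sentences are made up purely for illustration.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1):
    """Minimal illustration of ROUGE-N: n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, cand_counts = ngrams(reference), ngrams(candidate)
    # Clipped overlap: each reference n-gram can only be matched once.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f": f_score}

# Illustrative sentences only.
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
```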
ROUGE is a proxy metric for abstractive summarization.
When the aim is abstractive summarization, ROUGE is best used only as an initial indicator of how much the machine-written summary overlaps with the human-written summary, because it does not take into account the semantic meaning or the factual accuracy of the summaries.
For text summarization we want to look at ROUGE longest common subsequence (ROUGE-L), as it rewards the longest in-order overlap between the machine-written and human-written summaries.
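Below is a similarly minimal sketch of the ROUGE-L idea, scoring the longest common subsequence instead of contiguous n-grams. Again, the helper names and sentences are illustrative only and are not the exact texts from the figures.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if tok_a == tok_b
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l(reference: str, candidate: str):
    """Minimal illustration of ROUGE-L precision, recall and F-score."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall, precision = lcs / len(ref), lcs / len(cand)
    f_score = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f": f_score}

# Illustrative sentences in the spirit of the fox-and-dog example.
print(rouge_l("the quick brown fox jumps over the lazy dog",
              "the lazy dog jumps over the quick brown fox"))
```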
Let’s see an example!
In Figure 1 below, the top left corner shows first the human-written summary and then the machine-written summary.
The two have a fairly good overlap of words, so the ROUGE score is very high (79%), and the machine-generated summary is also semantically accurate. Great: ROUGE gave us a good indication.
In the next scenario (Figure 2) we again have the human-written and the machine-written summaries in the top left corner. However, this time, when we carefully read the machine-written summary, we can see that the machine actually got it wrong: the fox is the one jumping over the dog, not the other way around!
Here ROUGE failed to give us a good indication, because it still shows a high ROUGE score (77%) for the machine-written summary even though that summary is factually incorrect.
And here comes our last scenario (Figure 3), in which the machine-generated summary is factually correct, but ROUGE again fails to give us a good indication: the score (55%) suggests that the summary is mediocre.
If you are interested in abstractive summarization, then you cannot expect a lot of the words and phrases in the human-written summary to overlap with those in the machine-written summary.
To reproduce these results you can use this rouge library.
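As a rough sketch, calling the Python rouge package looks roughly like the snippet below. The summaries are placeholders in the spirit of the fox-and-dog example rather than the exact texts from the figures, so the numbers will differ, and the API shown is sketched from memory and should be double-checked against the library's documentation.

```python
# pip install rouge
from rouge import Rouge

# Placeholder summaries (not the exact texts from the figures above).
human_summary = "The quick brown fox jumps over the lazy dog."
machine_summary = "The lazy dog jumps over the quick brown fox."

rouge = Rouge()
# get_scores(hypothesis, reference) returns, per pair, a dict with
# 'rouge-1', 'rouge-2' and 'rouge-l', each holding recall (r),
# precision (p) and F-score (f).
scores = rouge.get_scores(machine_summary, human_summary)
print(scores[0]["rouge-l"])
```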
So, to ROUGE or not to ROUGE?
When dealing with abstractive summarization, we should use the ROUGE metric to get a sense of the overlap. For semantic and factual accuracy I highly advise you to ALWAYS consult with subject matter experts in the domain of the human-written summaries!
Open Questions
- How do we measure abstractiveness/extractiveness? — stay tuned for my next short blog post. 😉
- Can you recommend any tools that indicate semantic and factual accuracy? — This is still a challenge for the NLP community [5, 6].
- What about the summarization models? — You can find a nice overview of different summarization models and how they rank against others in NLP Progress [7].
- … ?
Resources
[1] Abigail See et al., “Taming Recurrent Neural Networks for Better Summarization”, 2017. http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
[2] Dima Suleiman et al., “Deep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation Measures, and Challenges”, Mathematical Problems in Engineering, vol. 2020, Article ID 9365340, 29 pages (2020). https://doi.org/10.1155/2020/9365340
[3] Chin-Yew Lin, “ROUGE: A Package for Automatic Evaluation of Summaries” (2004). https://www.aclweb.org/anthology/W04-1013.pdf
[4] Kavita Ganesan, “An intro to ROUGE, and how to use it to evaluate summaries” (2017). https://www.freecodecamp.org/news/what-is-rouge-and-how-it-works-for-evaluation-of-summaries-e059fb8ac840/
[5] Yuhui Zhang et al., “A Close Examination of Factual Correctness Evaluation in Abstractive Summarization” (2020). https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/reports/custom/report53.pdf
[6] Wojciech Kryscinski et al., “Evaluating the Factual Consistency of Abstractive Text Summarization” (2020). https://arxiv.org/abs/1910.12840
[7] Sebastian Ruder, “Summarization”, NLP Progress. http://nlpprogress.com/english/summarization.html
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.