AI Summarization: Extractive and Abstractive Techniques
Learn how to create an AI-powered summarization tool using Hugging Face and OpenAI, combining extractive and abstractive methods for concise, accurate results.
The proliferation of digital content has made it harder to digest lengthy texts such as reports, research papers, and news articles. AI-powered summarization tools help by extracting the essential information from long documents. Summarization is not only about shortening text; it also means preserving the original material's context, tone, and intent.
This tutorial introduces two complementary techniques — extractive summarization and abstractive summarization — and shows you how to combine them for robust results. You'll use pre-trained models from Hugging Face for extractive tasks and OpenAI's GPT for abstractive rewriting, resulting in summaries that are both concise and contextually accurate.
What Is Summarization?
Summarization in natural language processing (NLP) is the process of condensing text into a shorter version while retaining its core meaning.
Types of Summarizations
1. Extractive Summarization
Selects key sentences or phrases directly from the original text.
Example:
- Original: "The smartphone has great battery life, but the camera quality is subpar."
- Extractive summary: "great battery life ... camera quality is subpar" (phrases copied verbatim from the original).
2. Abstractive Summarization
Generates a completely new version of the text, paraphrasing or rewriting it in a concise form.
Example:
- Original: "The smartphone has great battery life, but the camera quality is subpar."
- Abstractive summary: "The phone excels in battery performance but lacks a good camera."
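To make the extractive idea concrete, here is a minimal sketch of a frequency-based sentence selector. It is not the method used later in this tutorial; the stopword list, the naive sentence splitter, and the function names are all assumptions chosen for illustration. The point it shows is that an extractive summary returns sentences from the source unchanged.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "but", "in", "of", "to", "it", "has"}


def split_sentences(text):
    """Naive splitter: breaks after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences verbatim, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because every returned sentence is copied from the input, nothing is paraphrased. The trade-off is that the result can read choppily, which is where an abstractive rewrite helps.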
Why Combine Both Methods?
- Extractive summarization ensures that important details are preserved verbatim.
- Abstractive summarization enhances readability and contextual relevance, making the summary suitable for diverse audiences.
Together, they form a powerful approach that blends precision with creativity.
Building the AI Summarization Tool
Prerequisites
Before you start, ensure you have the following:
- Programming knowledge: Basic understanding of Python.
- Software tools: Python 3.7 or later, along with the following libraries:
pip install openai transformers torch python-dotenv nltk pandas
- API access:
  - OpenAI API key. Sign up on the OpenAI platform to create one.
  - Hugging Face token (optional). Create a Hugging Face account to generate one.
Setting Up the Hugging Face Token
- Go to the Hugging Face website.
- If you already have an account, click on Sign In. If not, click Sign Up to create a new account.
- Navigate to the dropdown menu and select Settings.
- In the Settings menu, look for the Access Tokens tab on the left side of the page and click on it.
- Under the Access Tokens section, you will see a button to create a new token. Click on New Token.
- Give your token a descriptive name (e.g., "Summarization Tool Token").
- Choose Read as the token scope for basic access to models and datasets. If you need write access for uploading models or data, choose Write.
- Click Generate Token. The token will be displayed on the screen.
- After generating the token, copy it. Make sure to save it in a secure place, as you will need it for authentication when making API calls.
Keeping Hugging Face API Token Safe
Store your Hugging Face API token in an environment variable instead of hard-coding it in your script. To do this:
- Create a .env file in the root of your project (touch .env).
- Inside this file, add your Hugging Face token (HF_TOKEN=your_hugging_face_token_here).
- Load the file with the python-dotenv package (pip install python-dotenv), then read the variable with Python's os module. Note that os.getenv on its own does not read .env files.
import os
from dotenv import load_dotenv
load_dotenv()  # read key=value pairs from .env into the process environment
huggingface_token = os.getenv('HF_TOKEN')
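Because os.getenv returns None when a variable is missing, a misconfigured environment can go unnoticed until an API call fails. The following is a small sketch of a fail-fast helper; the name require_env is hypothetical and not part of any library.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling require_env("HF_TOKEN") at startup surfaces a missing token immediately instead of at the first request.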
Step 1: Extracting Key Sentences
What Is Extractive Summarization?
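This method identifies and selects the sentences that best represent the text's main ideas. We'll use Hugging Face's BART model (facebook/bart-large-cnn), which is fine-tuned for summarization on news articles. Strictly speaking, BART is a sequence-to-sequence model that generates its output rather than copying sentences, but this checkpoint tends to stay close to the source wording, so it serves as the condensing first pass here.
from transformers import pipeline
# Load BART fine-tuned for summarization (used here as the extractive-style first pass)
extractor = pipeline("summarization", model="facebook/bart-large-cnn")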
# Example long text (replace this with your actual content)
text = """
Artificial Intelligence is transforming industries by automating tasks, improving efficiency,
and enabling new capabilities. In healthcare, AI-powered tools assist in diagnosis and personalized treatment,
while in finance, they help detect fraud and manage risk. Despite its advantages, AI raises ethical and privacy concerns.
"""
# Extractive step: max_length and min_length are counted in tokens, so tune them to your input size
extracted_summary = extractor(text, max_length=100, min_length=50, do_sample=False)
print("Extracted Summary:", extracted_summary[0]['summary_text'])
What Happens Here?
- Input. A long-form document or paragraph.
- Output. A condensed version built from the most important sentences.
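The facebook/bart-large-cnn checkpoint accepts roughly 1,024 input tokens, so a genuinely long document has to be split before it reaches the model. Here is a minimal sketch of sentence-aligned chunking. The chunk_text name and the 400-word default are assumptions, and word counts are only a rough stand-in for token counts.

```python
import re


def chunk_text(text, max_words=400):
    """Group whole sentences into chunks of at most roughly max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed to the summarizer separately and the partial summaries joined. A single sentence longer than the budget still becomes its own oversized chunk, so this sketch does not guarantee a hard limit.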
Step 2: Rewriting With Abstractive Summarization
What Is Abstractive Summarization?
This method rewrites the extracted sentences into a more concise and human-like summary. It is particularly useful for creating audience-specific outputs (e.g., simplified for laymen or detailed for professionals).
import os
from openai import OpenAI
# Read the API key from the environment instead of hard-coding it
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def abstractive_summary(text, audience="layman"):
    prompt = (
        f"Rewrite the following text into a concise summary suitable for a {audience} audience:\n\n"
        f"{text}\n\n"
        "Summary:"
    )
    # Client interface for openai>=1.0 (openai.ChatCompletion was removed in 1.0)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
# Abstractive summarization
abstractive_text = abstractive_summary(extracted_summary[0]['summary_text'], audience="layman")
print("Abstractive Summary:", abstractive_text)
Example Outputs
- Extracted summary: "AI is transforming industries by improving efficiency and enabling new capabilities. In healthcare, it assists in diagnosis, while in finance, it helps detect fraud."
- Abstractive summary: "AI enhances efficiency across industries, aiding healthcare diagnoses and financial fraud detection."
Customizing for Different Audiences
Adjust the summaries based on the audience by modifying the GPT prompt.
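One way to organize audience-specific prompts is a lookup table keyed by audience name. The sketch below is an assumption about how this could be structured, not part of the original tool; AUDIENCE_INSTRUCTIONS and build_prompt are hypothetical names, and the instruction strings are illustrative.

```python
# Hypothetical mapping from audience name to rewrite instruction.
AUDIENCE_INSTRUCTIONS = {
    "layman": "Rewrite this text into simple terms that anyone can understand.",
    "technical": "Rewrite this text to emphasize technical details for industry professionals.",
}


def build_prompt(text, audience="layman"):
    """Build the prompt for a known audience; reject unknown audience names."""
    try:
        instruction = AUDIENCE_INSTRUCTIONS[audience]
    except KeyError:
        raise ValueError(f"Unknown audience: {audience!r}") from None
    return f"{instruction}\n\n{text}\n\nSummary:"
```

Keeping the instructions in one place makes it easy to add an audience without touching the code that calls the model, and rejecting unknown names avoids silently sending a vague prompt.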
Examples
Layman Audience Prompt
"Rewrite this text into simple terms that anyone can understand."
Technical Audience Prompt
"Rewrite this text to emphasize technical details for industry professionals."
Full Pipeline
Combine all steps into one function:
def summarize_document(text, audience="layman"):
    # Step 1: condense the text with the BART summarizer
    extracted_summary = extractor(text, max_length=100, min_length=50, do_sample=False)
    key_sentences = extracted_summary[0]['summary_text']
    # Step 2: generate an audience-specific abstractive summary
    final_summary = abstractive_summary(key_sentences, audience=audience)
    return final_summary
# Example usage
text = "Your lengthy document goes here."
summary = summarize_document(text, audience="layman")
print("Final Summary:", summary)
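Since the full pipeline downloads a model and calls a paid API, it can help to check the wiring offline first. The sketch below passes the two stages in as functions so they can be swapped for stand-ins. run_pipeline, fake_extract, and fake_rewrite are hypothetical names introduced only for this illustration.

```python
def run_pipeline(text, extract_fn, rewrite_fn, audience="layman"):
    """Apply the extractive pass, then the abstractive pass."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")
    key_sentences = extract_fn(text)
    return rewrite_fn(key_sentences, audience)


# Stand-in stages so the control flow can be tested without a model or API key.
def fake_extract(text):
    """Pretend extractor: keep only the first sentence."""
    return text.split(".")[0] + "."


def fake_rewrite(text, audience):
    """Pretend rewriter: tag the text with the audience."""
    return f"[{audience}] {text}"


print(run_pipeline("AI helps doctors. It also helps banks.", fake_extract, fake_rewrite, "technical"))
```

In real use, the stand-ins would be replaced by thin wrappers around the extractor pipeline and the abstractive_summary function defined earlier.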
Real-World Applications
- News aggregation. Quickly summarize breaking news for readers.
- Academic research. Provide concise summaries of complex studies.
- Business reports. Simplify lengthy reports for decision-makers.
By following the steps in this article, you can build AI-driven summarization tools that serve a wide range of audiences and use cases. From here, you can extend the tool to handle more complex content and to produce summaries in multiple languages.
Combining extractive and abstractive methods helps keep your summaries accurate, short, and contextually relevant, which makes the approach valuable in content-heavy domains. The best way to explore the possibilities is to start building.
Happy coding!
Opinions expressed by DZone contributors are their own.