
Simple Text Summarizer Using Extractive Method

In this article, we will build a text summarizer using the extractive method, which is easy to build and reliable when it comes to results.

By Pushkara Sharma · Jul. 01, 20 · Tutorial

Have you seen applications like Inshorts that condense articles or news into a 60-word summary? That's exactly what we are going to build today. In this article, we will build a text summarizer using the extractive method, which is easy to build and reliable when it comes to results. Don't worry, I will also explain what this extractive method is.

You may find many articles about text summarizers, but what makes this one unique is its short, beginner-friendly, high-level description of the code snippets.

So, text summarization can be done in two ways:

  • Extractive method — selecting the n most important sentences from the article, i.e., the ones most likely to convey its message. This approach is easy to implement and beginner friendly, which is the main reason for choosing it for this tutorial.
  • Abstractive method — this method uses deep-learning concepts such as the encoder-decoder architecture and LSTM (Long Short-Term Memory) networks, which are difficult for beginners to understand. It generates an entirely new summary, containing sentences that are not even present in the original article, and it can produce sentences that have no meaning at all.

Now that we are clear about why we have chosen the extractive method, let's jump straight to the coding section.

Prerequisites

I assume that you are familiar with Python and already have Python 3 installed on your system. I used a Jupyter notebook for this tutorial, but you can use any IDE you like.

Installing Required Libraries

For this project, you need to have the following packages installed. If any are missing, you can simply run pip install PackageName (a one-line install command is shown after the list below). The scraping part is optional; you can skip it and use any local text file you want a summary of.

  • bs4 — for BeautifulSoup, which is used for parsing HTML pages.
  • lxml — the package used to process HTML and XML with Python.
  • nltk — for performing natural language processing tasks.
  • urllib — for requesting a web page (this one ships with Python 3, so there is nothing to install).
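
For example, in a Jupyter notebook cell (assuming pip3 is on your PATH), the external packages can be installed in one go:

Python

!pip3 install bs4 lxml nltk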

Let's Start Coding

First, we have to import all the libraries that we will use. bs4 and urllib will be used for scraping the article. re (regular expressions) is used for removing unwanted text from the article. The fourth line installs the nltk (Natural Language Toolkit) package, the most important package for this tutorial. After that, we download some of the data required for text processing, like punkt (used for sentence tokenizing) and stopwords (words like "is," "the," and "of" that do not contribute much meaning).

Python

import bs4
import urllib.request as url
import re
#!pip3 install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk import word_tokenize
stop_word = stopwords.words('english')
import string



Here, I simply take the URL of the article from the user.

Python

url_name = input("Enter the URL of the text you want to summarize: ")



In this snippet of code, we request the page source with urllib, parse the page with BeautifulSoup to find all the paragraph tags, and append their text to the article variable.

Python

web = url.urlopen(url_name)
page = bs4.BeautifulSoup(web, 'html.parser')
elements = page.find_all('p')
article = ''
for i in elements:
    article += i.text
article
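
If you would rather skip the scraping step entirely, a local text file works just as well — a minimal sketch, assuming a plain-text file named article.txt sits next to the notebook:

Python

# Assumption: article.txt is a local plain-text file you want summarized
with open('article.txt', encoding='utf-8') as f:
    article = f.read()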



Now, we remove unwanted characters from the string variable article that contains the whole article to be summarized. For this, we use the built-in replace function for literal characters and re.sub for the regular-expression parts: trimming surrounding whitespace and stripping bracketed reference numbers like [1] (note that str.replace does not interpret regex patterns, so those need re.sub).

Python

# str.replace does not understand regex patterns, so trim the
# leading/trailing whitespace with re.sub instead
processed = re.sub(r'^\s+|\s+$', '', article)
processed = processed.replace('\n', ' ')
processed = processed.replace("\\", '')
processed = processed.replace(",", '')
processed = processed.replace('"', '')
processed = re.sub(r'\[[0-9]*\]', '', processed)
processed



Here, we simply use the sent_tokenize function of nltk to build a list containing one sentence of the article at each index.

Python

sentences = sent_tokenize(processed)
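
As a quick illustration of what this produces (a toy example, not taken from the article):

Python

from nltk import sent_tokenize  # assumes the 'punkt' data downloaded above

demo = "NLP is fun. It powers chatbots, search, and summarizers."
print(sent_tokenize(demo))
# ['NLP is fun.', 'It powers chatbots, search, and summarizers.']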



After that, we convert the characters of the article to lowercase. Then we loop through every word of the article and check whether it is a stopword or punctuation (we have already removed the punctuation, but we keep this check just in case). If the word is neither, we add it to the dictionary and count its frequency.

In the screenshot, you can see a dictionary containing every word with its count in the article (the higher a word's frequency, the more important it is). Now you know why we removed stopwords like "of," "the," and "for" — otherwise, they would come out on top.

Python

frequency = {}
processed1 = processed.lower()
for word in word_tokenize(processed1):
    if word not in stop_word and word not in string.punctuation:
        if word not in frequency.keys():
            frequency[word] = 1
        else:
            frequency[word] += 1
frequency


[Screenshot: A dictionary containing every word of the article with its frequency]
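
As a side note, the same counts can be built more idiomatically with collections.Counter — an equivalent sketch, not the author's original code:

Python

from collections import Counter

# Equivalent to the loop above: count non-stopword, non-punctuation tokens
frequency = Counter(
    word for word in word_tokenize(processed1)
    if word not in stop_word and word not in string.punctuation
)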


Here, we calculate the importance of every word in the dictionary by simply dividing each word's frequency by the maximum frequency among them. In the screenshot, you can clearly see that the importance of the word language comes out on top, as it has the maximum frequency of 22.

Python

max_fre = max(frequency.values())
for word in frequency.keys():
    frequency[word] = frequency[word] / max_fre
frequency


[Screenshot: Importance of every word of the article]
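
To make the normalization concrete, here is a toy example with made-up counts (not the article's real numbers):

Python

counts = {'language': 22, 'processing': 11, 'data': 2}
max_c = max(counts.values())
# Each weight is the word's count divided by the largest count
weights = {w: c / max_c for w, c in counts.items()}
print(weights)  # {'language': 1.0, 'processing': 0.5, 'data': 0.09090909090909091}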


After doing that, we calculate the importance of every sentence of the article. To do this, we iterate through every sentence and, for every word in the sentence that appears in our dictionary, add that word's score to the sentence's running total.

In the screenshot, you can clearly see that every sentence now has a score representing how important it is.

Python

sentence_score = {}
for sent in sentences:
    for word in word_tokenize(sent):
        # lowercase the token so it matches the keys of the frequency
        # dict, which was built from the lowercased text
        word = word.lower()
        if word in frequency.keys():
            if len(sent.split(' ')) < 30:  # skip very long sentences
                if sent not in sentence_score.keys():
                    sentence_score[sent] = frequency[word]
                else:
                    sentence_score[sent] += frequency[word]
sentence_score


[Screenshot: Score or importance of every sentence]


In the end, we use heapq to find the four sentences with the highest scores. You can choose any number of sentences you want. Then we simply join the selected sentences to form a single summary string.

The final output summary for the Natural Language Processing article can be seen in the screenshot below.

Python

import heapq
summary = heapq.nlargest(4, sentence_score, key=sentence_score.get)
summary = ' '.join(summary)
final = "SUMMARY:- \n  " + summary
textfinal = 'TEXT:-    ' + processed
textfinal = textfinal.encode('ascii', 'ignore')
textfinal = str(textfinal)
final


[Screenshot: Final extracted summary]
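
For anyone unfamiliar with heapq.nlargest, here is a tiny standalone demonstration with made-up scores:

Python

import heapq

scores = {'Sentence A.': 2.5, 'Sentence B.': 1.2, 'Sentence C.': 3.1}
# nlargest iterates the dict's keys and ranks them by their score
print(heapq.nlargest(2, scores, key=scores.get))
# ['Sentence C.', 'Sentence A.']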




YES! WE DID IT!

The extracted summary may not be up to the mark, but it is capable of conveying the main idea of the given article. It is also more reliable, since it only outputs a selected number of sentences from the article itself rather than generating output of its own.

I will also try to write a tutorial on the abstractive method, but that will be a much greater challenge to explain.

Thank you for your time, and I hope you like this tutorial.
