Refcard #331

Introduction to Text Mining

Thanks to text mining, you can extract information from written text. This is something we do, naturally, every day, in conversations or when we read. Like driving a car, once we learn how to do it, we take it for granted. This Refcard will introduce text mining, as well as key methods and techniques for success.

Download Refcard
Free PDF for Easy Reference

Written By

Christopher Lamb
Principal Member of Technical Staff, Sandia National Laboratories
Table of Contents
  • What Is Text Mining?
  • What Problem Does Text Mining Solve?
  • How Can It Differentiate Products and Services?
  • Getting Started
  • Development Environment
  • Key Methods and Techniques
  • Real-World Applications
  • Conclusion
Section 1

What Is Text Mining?

Text mining is a loosely defined term for extracting useful information from otherwise unstructured text. Two phrases in that definition deserve close attention: “extracting useful information” and “unstructured text.”

Useful information, in this context, could be anything from basic facts expressed by the text to advanced sentiment analysis indicating the state of mind of the author at the time the text was created. 

Unstructured text means that the information is not stored in a structured format like XML or a database table. The text is still structured in some way, usually dictated by the language in which it’s written and the custom of the medium. 
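
To make the distinction concrete, here is a minimal sketch that is not taken from the Refcard. It stores the same fact once as XML and once as a plain sentence. The record, the sentence, and the regular expression are all made up for illustration, and the pattern would fail on a differently worded sentence, which is exactly the difficulty text mining has to address.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data: the same fact, structured and unstructured.
structured = "<book><title>Paradise Lost</title><year>1667</year></book>"
unstructured = "Milton first published Paradise Lost in 1667, in ten books."

# Structured: the field is addressed by name, no interpretation needed.
year_from_xml = int(ET.fromstring(structured).findtext("year"))

# Unstructured: we have to guess where the fact lives in the sentence.
# This pattern is an illustrative assumption, not a general solution.
match = re.search(r"published .*? in (\d{4})", unstructured)
year_from_text = int(match.group(1)) if match else None

print(year_from_xml, year_from_text)
```

Both lookups recover the same year, but only the XML lookup would keep working if the surrounding wording changed.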


This is a preview of the Introduction to Text Mining Refcard. To read the entire Refcard, please download the PDF from the link above.

Section 2

What Problem Does Text Mining Solve?

With text mining, you can extract information from written text. This is something we do, naturally, every day, in conversations or when we read. Like driving a car, once we learn how to do it, we take it for granted. 

Also like driving, understanding text long resisted automation. It has become tractable only recently, thanks to the growth in computing power, new algorithms, and machine learning techniques. Advances in this area promise to reshape customer service, business intelligence, and many other fields.



Section 3

How Can It Differentiate Products and Services?

Effective text mining opens up new application areas while improving the quality of existing ones. Customer service systems with integrated text analysis and effective voice-to-text capabilities can build analytic pipelines supporting real-time sentiment analysis, allowing representatives to engage with customers knowing their emotional state prior to saying a word. 

Internet analysis tools can plumb the web for information, visiting competitors' sites and extracting information about their current capabilities to feed business intelligence systems. Overall, text mining tools can make business systems more robust, insightful, and powerful.



Section 4

Getting Started

Text mining is a complex area that is moving quickly into machine learning and artificial intelligence. This Refcard walks you through the underlying techniques specific to unstructured text analysis, which you need to know before moving into more complex areas like semantic analysis and meaning extraction.

A typical text analysis workflow mirrors the outline in Figure 1. Here, we have two groups of interrelated activities: traditional text analysis and AI-enabled text analysis. With tools like TensorFlow and PyTorch, AI software development is more straightforward than ever, but it’s still far from simple.



Section 5

Development Environment

In this Refcard, we’re going to focus mostly on Python 3 and the Natural Language Toolkit (NLTK). R is another popular platform for text processing, but I prefer Python because of its extensive collection of libraries. I also suggest using Anaconda for this kind of work, as it lets you create isolated Python environments that you can use for a variety of things.

Setting up your environment is well documented on the Anaconda site. Follow those instructions, and then create a new environment; you can name it whatever you’d like. I’m going to name mine text_analysis.

I use a variety of other tools too, like IPython and Jupyter. I suggest you install those too (conda install python jupyter should do the trick). We’ll use Jupyter notebooks to track our examples, and I’ll make copies available for you to work through as well on GitHub.
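
Once the environment is active, a quick check like the following can confirm that the tools are importable. This is a sketch of my own rather than part of the Refcard's notebooks; the helper name missing_packages and the import names in REQUIRED (in particular jupyter_core for Jupyter) are assumptions you may need to adjust for your installation.

```python
import importlib.util
import sys

# Import names assumed for the tools mentioned above; adjust as needed.
REQUIRED = ("nltk", "IPython", "jupyter_core")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version_info.major, sys.version_info.minor)
print("Missing packages:", missing_packages() or "none")
```

If anything is reported missing, install it into the active environment before continuing.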



Section 6

Key Methods and Techniques

Word Frequency

Word frequency measures how often each word occurs in a given text, which gives insight into the topics discussed and the key concepts.

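Python

from nltk.probability import FreqDist

# paradise_lost is assumed to be a sequence of word tokens for the poem;
# this preview does not show how it was loaded.
distribution = FreqDist(paradise_lost)
print(distribution.most_common(50))  # the 50 most frequent tokens with their counts
distribution.plot(50, cumulative=False)  # frequency plot; requires matplotlib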



Here, we count how often each token occurs in the text, print the 50 most common tokens, and plot their frequencies.
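
If you want to see the idea without installing NLTK or downloading a corpus, the sketch below does the same kind of counting with the standard library. NLTK's FreqDist is a subclass of collections.Counter, so most_common behaves the same way on both. The three-line excerpt (in modernized spelling) and the regular-expression tokenizer are simplifications of my own, not what the Refcard uses.

```python
import re
from collections import Counter

# Opening lines of Paradise Lost, used here as a tiny stand-in corpus.
text = """Of Man's first disobedience, and the fruit
Of that forbidden tree whose mortal taste
Brought death into the World, and all our woe"""

# Naive tokenizer (an assumption for this sketch): lowercase runs of
# letters and apostrophes. NLTK's tokenizers handle far more cases.
tokens = re.findall(r"[a-z']+", text.lower())

counts = Counter(tokens)
top_three = counts.most_common(3)
print(top_three)
```

Even on three lines, the most frequent tokens are function words like "of", "and", and "the", which is why word-frequency analysis usually goes hand in hand with some form of filtering.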

We can also examine the characteristics of words, like so: 

Python

from nltk.probability import FreqDist

# Keep only tokens longer than 10 characters, then count them.
long_words = [w for w in paradise_lost if len(w) > 10]
distribution = FreqDist(long_words)
print(distribution.most_common(50))



The group of words in paradise_lost can be treated as a Python list, and then used as an argument to the various distribution tools, like FreqDist or ConditionalFreqDist. 
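
A conditional frequency distribution keeps a separate count per condition. The sketch below illustrates that idea with the standard library only; it is not NLTK's implementation, and both the short token list and the choice of word length as the condition are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical token list standing in for paradise_lost.
tokens = ["of", "man's", "first", "disobedience", "and", "the", "fruit",
          "of", "that", "forbidden", "tree"]

# One Counter per condition; the condition chosen here is word length.
by_length = defaultdict(Counter)
for word in tokens:
    by_length[len(word)][word] += 1

print(sorted(by_length))            # the word lengths that occur
print(by_length[2].most_common())   # counts for two-letter words
```

With NLTK, passing (condition, token) pairs to ConditionalFreqDist gives a comparable per-condition breakdown.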



Section 7

Real-World Applications

Healthcare

Healthcare, especially telehealth, is a field ripe for disruption via text analysis. Anything that can be spoken can be transcribed into text and then analyzed. Furthermore, any correspondence between a physician and patient can be examined as well. This information can be used in any number of domains. 

Fraud is always an issue, whether in a hospital or a private practice. By evaluating and extracting information from medical records, doctors' notes, and correspondence, systems can help locate instances of fraudulent prescriptions, for example, or more egregious fraud.

Information extracted from written records can be fed into expert systems or neural networks as well. Such a system could help healthcare providers diagnose rare disorders that might otherwise go undiagnosed.



Section 8

Conclusion

Text mining has more potential than ever to affect the day-to-day operations of enterprises. The kinds of techniques we’ve covered in this Refcard are as applicable to insurance as they are to social media analysis.

The ability to extract structured, meaningful information from otherwise opaque unstructured text opens new capabilities for customer interaction, operational refinement, or crime prevention. The ability to review customer service interactions with call-center personnel provides new ways to increase consumer satisfaction and identify new sales relationships that had been hidden in the mess of unstructured data that organizations collect every day. 

