Introduction to Text Mining

Section 1

What Is Text Mining?

Text mining is a loosely defined term for extracting useful information from otherwise unstructured text. Two phrases in that definition deserve close attention: “extracting useful information” and “unstructured text.”

Useful information, in this context, could be anything from basic facts expressed by the text to advanced sentiment analysis indicating the state of mind of the author at the time the text was created.

Unstructured text means that the information is not stored in a structured format like XML or a database table. The text still has some structure, usually dictated by the language in which it’s written and the conventions of the medium.
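
To make the distinction concrete, here is a minimal Python sketch. The record, the sentence, and the regular expression are all invented for illustration, and real-world extraction is rarely this tidy.

```python
import re

# Structured: the fact is addressable by a known field name.
# (Hypothetical record, invented for this example.)
structured_record = {"customer": "Ada", "order_total": 42.50}
total_from_record = structured_record["order_total"]

# Unstructured: the same fact is buried in prose and has to be extracted.
unstructured_text = "Ada called on Tuesday about her order, which came to $42.50."
match = re.search(r"\$(\d+(?:\.\d{2})?)", unstructured_text)
total_from_text = float(match.group(1)) if match else None

print(total_from_record, total_from_text)
```

The structured lookup relies on a schema; the unstructured lookup relies on a guess about how people write dollar amounts, which is exactly the kind of guess text mining techniques try to make more robust.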

This is a preview of the Introduction to Text Mining Refcard. To read the entire Refcard, please download the PDF from the link above.

Section 2

What Problem Does Text Mining Solve?

With text mining, you can extract information from written text. This is something we do, naturally, every day, in conversations or when we read. Like driving a car, once we learn how to do it, we take it for granted.

Also like driving a car, it has long resisted automation. Only recently has it become more tractable, thanks to the explosion of computational power, continued algorithm development, and machine learning techniques. Advances in this area promise to revolutionize customer service, business intelligence, and a myriad of other fields.

Section 3

How Can It Differentiate Products and Services?

Effective text mining opens up new application areas while improving the quality of existing ones. Customer service systems with integrated text analysis and effective voice-to-text capabilities can build analytic pipelines that support real-time sentiment analysis, letting representatives gauge a customer’s emotional state before saying a word.

Internet analysis tools can plumb the web for information, visiting competitors’ sites and extracting information about their current capabilities to feed business intelligence analysis systems. Overall, text mining tools can make business systems more robust, insightful, and powerful.

Section 4

Getting Started

Text mining is a complex area that is moving quickly into machine learning and artificial intelligence. This Refcard walks you through the underlying techniques specific to unstructured text analysis, which you need to know before moving into more complex areas like semantic analysis and meaning extraction.

A typical text analysis workflow mirrors the outline in Figure 1. Here, we have two groups of interrelated activities: traditional text analysis and AI-enabled text analysis. With tools like TensorFlow and PyTorch, AI software development is more straightforward than ever, but it’s still far from simple.

Section 5

Development Environment

In this Refcard, we’re going to focus mostly on Python 3 and the Natural Language Toolkit (NLTK). R is another popular platform for text processing, but I prefer Python because of its extensive collection of libraries. I also suggest using Anaconda for this kind of work, as it lets you create isolated Python environments for different projects.

Setting up your environment is well documented on the Anaconda site. Follow those instructions, and then create a new environment. You can name it whatever you’d like; I’m going to name mine text_analysis.

I use a variety of other tools as well, like IPython and Jupyter. I suggest you install those too (conda install ipython jupyter should do the trick). We’ll use Jupyter notebooks to track our examples, and I’ll make copies available on GitHub for you to work through.
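
Once the environment is set up, a quick sanity check can confirm that the tools are importable. The sketch below uses only the standard library; the helper name check_environment is hypothetical, and the module names passed to it (nltk, IPython, jupyter_core) are my assumption about what a typical install exposes.

```python
import sys
from importlib.util import find_spec


def check_environment(packages):
    """Return a dict mapping each top-level module name to True if importable."""
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    # Assumed import names; adjust to match what you actually installed.
    report = check_environment(["nltk", "IPython", "jupyter_core"])
    for name, available in report.items():
        print(f"{name}: {'found' if available else 'missing'}")
```

Run it from inside the activated environment. Anything reported as missing can usually be added with conda install.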

Section 6

Key Methods and Techniques

Word Frequency

Word frequency analysis counts how often each token appears in a given text, providing insight into the topics discussed and the key concepts.

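# paradise_lost is assumed to be an existing sequence of word tokens
from nltk.probability import FreqDist
# Count how often each token occurs
distribution = FreqDist(paradise_lost)
print(distribution.most_common(50))
# Plot the 50 most frequent tokens (plotting requires matplotlib)
distribution.plot(50, cumulative=False)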

Here, we’re counting the tokens in the text and graphing the 50 most frequent ones. This allows you to see the most common tokens at a glance.
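
If you want to see the counting step without NLTK installed, the standard library gets you most of the way. NLTK’s FreqDist is built on collections.Counter, so the sketch below is a rough stand-in rather than a replacement. The sample tokens and the helper name most_common_tokens are made up for illustration.

```python
from collections import Counter

# A tiny stand-in for a tokenized text (hypothetical sample data).
tokens = "of man's first disobedience and the fruit of that forbidden tree".split()


def most_common_tokens(tokens, n):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)


print(most_common_tokens(tokens, 3))
```

Splitting on whitespace is a simplification; a real tokenizer would also handle punctuation and case.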

We can also examine the characteristics of words, like so:

# Keep only the tokens longer than 10 characters, then count them
long_words = [w for w in paradise_lost if len(w) > 10]
distribution = FreqDist(long_words)
print(distribution.most_common(50))

The group of words in paradise_lost can be treated as a Python list, and then used as an argument to the various distribution tools, like FreqDist or ConditionalFreqDist.
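
A conditional frequency distribution counts samples separately for each condition. NLTK’s ConditionalFreqDist accepts an iterable of (condition, sample) pairs; the sketch below mimics that idea with the standard library only, so it is an approximation of the concept, not NLTK’s implementation. The function name conditional_freq and the sample tokens are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Group samples by condition and count them: {condition: Counter(samples)}."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


# Hypothetical sample tokens.
tokens = ["unstructured", "text", "is", "still", "text", "in", "a", "language"]

# Condition: word length. Sample: the word itself.
by_length = conditional_freq((len(w), w) for w in tokens)
print(by_length[4].most_common())
```

Swapping the condition for something else, such as the first letter or the source document, gives a different view of the same tokens.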

Section 7

Real-World Applications

Healthcare

Healthcare, especially telehealth, is a field ripe for disruption via text analysis. Anything that can be spoken can be transcribed into text and then analyzed. Furthermore, any correspondence between a physician and patient can be examined as well. This information can be used in any number of domains.

Fraud is always an issue, whether in a hospital or a private practice. By evaluating and extracting information from medical records, doctors’ notes, and correspondence, systems can help locate instances of fraudulent prescriptions, for example, or more egregious fraud.

Information extracted from written records can be fed into expert systems or neural networks as well. This kind of system could help healthcare providers diagnose rare disorders in patients who otherwise may not be diagnosed properly.

Section 8

Conclusion

Text mining has more potential than ever to impact day-to-day operations for enterprises. The kinds of techniques we’ve covered in this Refcard are as applicable to insurance as they are to social media analysis.

The ability to extract structured, meaningful information from otherwise opaque unstructured text opens new capabilities for customer interaction, operational refinement, or crime prevention. The ability to review customer service interactions with call-center personnel provides new ways to increase consumer satisfaction and identify new sales relationships that had been hidden in the mess of unstructured data that organizations collect every day.
