Reasons to Replace Dictionary Based Text Mining with Machine Learning Techniques
In this post, we will highlight the issues with the dictionary-based approaches and how Machine Learning can replace these approaches with greater accuracy.
More than 80% of the data in most organizations relates to how customers engage with the product. Monitoring this relationship through text mining matters when an enterprise designs its major strategies. The sheer volume of user-generated content calls for automated text mining and analysis, because crowdsourced mining and analysis are error-prone, expensive, and do not scale.
Machine learning approaches have been gaining momentum with researchers because of their adaptability and accuracy in automated text mining. However, most organizations still rely on pre-tagged lexicon (dictionary) approaches for most of their text mining.
In this post, we will highlight the issues with the dictionary-based approaches and how Machine Learning can replace these approaches with greater accuracy and adaptability when datasets change.
Opinion Mining— How Enterprises Can Use AI to Understand Consumer Behavior
Every day, people share opinions and sentiments on subjects such as products, news, and institutions. When consumers face a trade-off in a purchase decision, they turn to reviews and discussions posted by other consumers before buying. As a result, opinion mining has gained importance. It gives enterprises more, and more relevant, information about products and services at the click of a mouse, and it helps them arrive at better-informed decisions.
For example, in the sentence, "The battery life of this mobile is very bad and does not even last 4 hours," the opinion target is the battery life of the mobile, and the opinion is negative. Many everyday applications need this level of analysis, and often a deeper one, to decide which components or features of a product to market extensively or to improve in the next upgrade.
Opinion mining is a challenge that spans Natural Language Processing (NLP), text analytics, and computational linguistics. Here, we discuss the state of the art in work on opinion mining over open-web user-generated content, such as reviews, comments, and interactions on microblogging websites, forums, and social networks.
Keyword Search (Bag-Of-Words Approach) — The Traditional Approach Towards Opinion Mining
In the BoW model, a sentence or a document is treated as a "bag" of words. The model takes into account the words and how often they occur in the sentence or document, and disregards the semantic relationships between them. The marketer makes lists of positive and negative sentiment words (seeds) and sees which predominate in a given document, marking the document as "no opinion" if there are few words of either type. The algorithm grows these lists by searching an online dictionary for synonyms and antonyms of the seeds.
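The seed-and-grow procedure described above can be sketched in a few lines of Python. This is only an illustration: the seed words, the tiny `THESAURUS` dictionary standing in for an online dictionary, and the `min_hits` threshold are all assumptions made for the example, not part of any particular product.

```python
import re

# Hypothetical seed words and a tiny stand-in for an online thesaurus.
SEED_POSITIVE = {"good"}
SEED_NEGATIVE = {"bad"}
THESAURUS = {
    "good": {"synonyms": {"great", "excellent"}, "antonyms": {"poor"}},
    "bad": {"synonyms": {"terrible", "awful"}, "antonyms": {"fine"}},
}

def expand(seeds, opposite_seeds):
    """Grow a seed set with its synonyms and with antonyms of the opposite set."""
    grown = set(seeds)
    for word in seeds:
        grown |= THESAURUS.get(word, {}).get("synonyms", set())
    for word in opposite_seeds:
        grown |= THESAURUS.get(word, {}).get("antonyms", set())
    return grown

POSITIVE = expand(SEED_POSITIVE, SEED_NEGATIVE)
NEGATIVE = expand(SEED_NEGATIVE, SEED_POSITIVE)

def lexicon_sentiment(text, min_hits=1):
    """Count lexicon hits in the bag of words; word order is ignored."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    # Too few sentiment words, or a tie, is reported as "no opinion".
    if pos + neg < min_hits or pos == neg:
        return "no opinion"
    return "positive" if pos > neg else "negative"

battery_label = lexicon_sentiment(
    "The battery life of this mobile is very bad and does not even last 4 hours"
)
```

On the battery sentence used earlier, the only lexicon hit is "bad", so the document is labeled negative. A sentence with no lexicon words at all falls through to "no opinion".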
As an example, a conventional way to filter all price-related messages from a set of user reviews about a product is to run a keyword search on "price" and closely related terms (pricing, charge, $, paid).
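A keyword filter of this kind might look like the following sketch. The keyword list comes from the example above; the sample reviews and the function name `filter_price_reviews` are made up for illustration.

```python
import re

# Keywords from the example above; "$" is matched as a literal symbol.
PRICE_KEYWORDS = ("price", "pricing", "charge", "paid")
PRICE_PATTERN = re.compile(
    r"\$|\b(?:" + "|".join(PRICE_KEYWORDS) + r")\b", re.IGNORECASE
)

def filter_price_reviews(reviews):
    """Keep only the reviews that contain at least one price keyword."""
    return [r for r in reviews if PRICE_PATTERN.search(r)]

# Hypothetical reviews for illustration.
reviews = [
    "The pricing is fair for what you get.",
    "I paid $40 and it broke in a week.",
    "Way too expensive for a plastic case.",
    "Battery life is great.",
]
price_reviews = filter_price_reviews(reviews)
```

The filter keeps the first two reviews. It misses the third, which is clearly about price but uses "expensive", a word nobody added to the list.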
This method, however, has shortcomings, which make it ineffective to carry out large-scale sophisticated text mining tasks.
The Problems With This Approach
The human limitation with manual ontology: It is almost impossible to think of all the relevant keywords, and their variants, that represent a particular concept. Every gap in a manually built and maintained ontology shows up as missed or misclassified text, which lowers accuracy.
Lack of domain expertise: When dictionaries are created in one substantive area and then applied to other problems, serious errors can occur. A phrase that has a negative connotation in most contexts, like "higher crude prices," may have a positive connotation for a crude oil company. Such approaches also fail on phrases like "fix the broken economy" and on double negatives like "taste was not bad," which occur frequently in everyday conversation.
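The failure cases above are easy to reproduce. The sketch below uses a small, hypothetical general-purpose lexicon and a plain word count, with no handling of negation or domain.

```python
import re

# Hypothetical general-purpose lexicon, built without any domain in mind.
POSITIVE = {"good", "great", "tasty"}
NEGATIVE = {"bad", "broken", "higher"}

def naive_polarity(text):
    """Score a text by counting lexicon hits, ignoring context and word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "no opinion"

examples = ["taste was not bad", "fix the broken economy", "higher crude prices"]
results = {text: naive_polarity(text) for text in examples}
```

All three phrases come out as negative, because the count sees "bad", "broken", and "higher" and nothing else. The negation in the first phrase and the crude oil company's point of view in the third are invisible to it.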
It's time for a new approach.
Machine Learning enables users to apply AI to unstructured enterprise content. It is one of the most prominent techniques gaining the interest of researchers due to its adaptability and accuracy. The process comprises four stages: data collection, pre-processing, training, and testing and validating the results. In the training stage, a collection of tagged data is provided. A model is created from this training dataset and then applied to new, unseen text for classification. Gather enough opinions, analyze them correctly, and you have an accurate gauge of the feelings of the silent majority. This covers not only how people feel, but also the drivers underlying why they feel the way they do.
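To make the four stages concrete, here is a minimal sketch that uses a multinomial Naive Bayes classifier written with the standard library only. The choice of Naive Bayes is an assumption made for the example, since the article does not name a specific algorithm, and the tiny tagged dataset is invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Stage 1: data collection (hypothetical tagged reviews).
train_data = [
    ("great battery and friendly support", "positive"),
    ("love the screen, works great", "positive"),
    ("excellent value, very happy", "positive"),
    ("battery is bad and support is slow", "negative"),
    ("terrible screen, stopped working", "negative"),
    ("very unhappy, poor value", "negative"),
]
test_data = [
    ("happy with the great screen", "positive"),
    ("slow and terrible support", "negative"),
]

# Stage 2: pre-processing (lowercase and split into word tokens).
def preprocess(text):
    return re.findall(r"[a-z']+", text.lower())

# Stage 3: training (count words per label).
def train(labeled_docs):
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = preprocess(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in preprocess(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Stage 4: testing and validating on text the model has not seen.
model = train(train_data)
predictions = [predict(model, text) for text, _ in test_data]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test_data)) / len(test_data)
```

Unlike a fixed dictionary, the word weights here come from the tagged data, so retraining on reviews from a different domain changes what each word signals. A real deployment would need far more data and a proper held-out validation set.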
Pattern Discovery — How Text Classification Comes Close to the Human-Like Classification of Text
In a classification scenario, we run a pattern discovery algorithm over a small set of labeled training data to compute text patterns that are highly correlated with the occurrence of a specific label (that is, if the pattern occurs, then with high probability so does the label). The classifier identifies relationships between the words and stores them for analyzing unseen future documents. Consider the task of classifying user-feedback emails sent to a large company into emails expressing positive and negative sentiment. In this context, a frequent text pattern that has a high correlation to the negative label might be "I will switch to XYZCorp," where XYZCorp is the name of a competitor. Once the classifier has learned this, it can assign labels to new documents much as a human would.
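One simple way to approximate this idea is to count word n-grams and keep those that occur often enough and almost always alongside one label. The sketch below does exactly that; it is an illustrative stand-in rather than the specific pattern discovery algorithm the article has in mind, and the emails, thresholds, and function names are assumptions.

```python
import re
from collections import Counter

def ngrams(text, n_max=4):
    """Return the set of word n-grams (2 to n_max words) in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = set()
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

def discover_patterns(labeled_docs, label, min_support=2, min_precision=0.9):
    """Find n-grams that are frequent and strongly tied to the given label."""
    total = Counter()       # documents containing the n-gram
    with_label = Counter()  # of those, documents carrying the label
    for text, doc_label in labeled_docs:
        for gram in ngrams(text):
            total[gram] += 1
            if doc_label == label:
                with_label[gram] += 1
    patterns = {}
    for gram, count in total.items():
        precision = with_label[gram] / count
        if count >= min_support and precision >= min_precision:
            patterns[gram] = precision
    return patterns

# Hypothetical feedback emails for illustration.
emails = [
    ("I will switch to XYZCorp next month", "negative"),
    ("After this outage I will switch to XYZCorp", "negative"),
    ("I will recommend you to my friends", "positive"),
    ("Thanks, I will renew my plan", "positive"),
]
negative_patterns = discover_patterns(emails, "negative")
```

Here "i will switch to" is kept as a negative pattern because it appears in two emails and both are negative. The shorter "i will" appears in all four emails, split evenly between labels, so it is discarded.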
"What Is Driving the Sentiment?" — Text Classification at Work
By understanding what is driving the sentiment, opinion data can be used to expose critical areas of strength and weakness. This data allows executives to make the targeted, strategic overhauls needed to reinvigorate profitability or reclaim slipping market share.
Within the public sector, this same data can be used to build strategies and campaigns that resonate with the electorate and react to voters' changing needs. By isolating the specific, topic-level drivers of positive and negative sentiment, opinion mining allows for the development of an incredibly deep level of social insight — a window into how people think and feel.
By analyzing conversations for both sentiment and the topics driving that sentiment, a retail bank might discover that queue length and waiting times feature uppermost among customers' criticisms.
A fast-food chain might be interested to know that relative to their closest competitor, many consider their portion size too small, though their friendly customer service is a plus.
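Once each message carries both a topic label and a sentiment label, finding the drivers is a matter of aggregation. The sketch below assumes those labels already come from upstream classifiers; the bank messages and the `sentiment_drivers` function are hypothetical.

```python
from collections import Counter, defaultdict

def sentiment_drivers(classified):
    """Rank topics by the share of negative messages.

    classified: iterable of (topic, sentiment) pairs from upstream classifiers.
    Returns a list of (topic, negative_share, message_count) tuples.
    """
    by_topic = defaultdict(Counter)
    for topic, sentiment in classified:
        by_topic[topic][sentiment] += 1
    rows = []
    for topic, counts in by_topic.items():
        total = sum(counts.values())
        rows.append((topic, counts["negative"] / total, total))
    # Most negative topics first; ties broken by message volume.
    return sorted(rows, key=lambda row: (-row[1], -row[2]))

# Hypothetical classified messages for a retail bank.
messages = [
    ("queue length", "negative"),
    ("queue length", "negative"),
    ("queue length", "negative"),
    ("waiting time", "negative"),
    ("waiting time", "negative"),
    ("waiting time", "positive"),
    ("staff", "positive"),
    ("staff", "positive"),
    ("staff", "negative"),
]
drivers = sentiment_drivers(messages)
```

With this sample, queue length ranks first and waiting time second, which mirrors the retail bank example above.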
You can either use one of our off-the-shelf text classification solutions, like Sentiment Analysis and Emotion Analysis, or build your own classifier using the Custom Classifier API. All the APIs are available as an Excel plugin and a Google Sheets add-on, so you can do text mining from the comfort of your spreadsheet.
For enterprises, text classification models can be licensed for on-premise or private cloud deployment to ensure low latency and compliance with privacy laws.
Published at DZone with permission of Shashank Gupta, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.