Machine Learning Helps Humans Perform Text Analysis

Learn how machine learning supports human text analysis at Exaptive, with examples from data landscape exploration and the PubMed Explorer.

by Joshua Southerland · Jun. 08, 18 · AI Zone · Analysis

The rise of Big Data created the need for data applications to consume data residing in disparate databases with wildly differing schemas. The traditional approach to performing analytics on this sort of data has been to warehouse it: to move all the data into one place under a common schema so it can be analyzed.

This approach is no longer feasible given the volume of data being produced, the variety of data requiring specifically optimized schemas, and the velocity at which new data is created. A much more promising approach is based on semantic linked data, which models data as a graph (a network of nodes and edges) instead of as a series of relational tables.
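To make the graph model concrete, here is a minimal sketch of data stored as edges rather than table rows. The node names, the predicate names, and the helper functions are all hypothetical and chosen only for illustration; a real semantic data store would typically use a dedicated graph or triple store.

```python
from collections import defaultdict

# Each edge is a (subject, predicate, object) triple. All identifiers are made up.
triples = [
    ("paper:1", "written_by", "author:smith"),
    ("paper:1", "mentions", "term:insulin"),
    ("paper:2", "mentions", "term:insulin"),
    ("paper:2", "published_in", "journal:example"),
]

def build_graph(edges):
    """Index the triples by node so edges can be followed in both directions."""
    graph = defaultdict(list)
    for subject, predicate, obj in edges:
        graph[subject].append((predicate, obj))
        graph[obj].append(("inverse:" + predicate, subject))
    return graph

def neighbors(graph, node, predicate):
    """Return the nodes reachable from `node` over edges labeled `predicate`."""
    return sorted(target for pred, target in graph[node] if pred == predicate)

graph = build_graph(triples)
papers_about_insulin = neighbors(graph, "term:insulin", "inverse:mentions")
print(papers_about_insulin)
```

Adding a new kind of relationship only means adding triples with a new predicate; no table schema has to change.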

To augment that approach, we've found that we can use machine learning to improve the semantic data models as the dataset evolves. Our specific use case is text data spread across millions of documents. We've found that machine learning facilitates the storage and exploration of data that would otherwise be too vast to yield valuable insights.

Machine Learning and Exaptive

Machine learning (ML) allows a model to improve over time as new training data arrives, without requiring additional human effort. For example, a common text-classification benchmark task is to train a model on messages from multiple discussion boards and then use it to predict the topic of discussion (space, computers, religion, etc.) for new messages. Besides classifying new texts, ML approaches can also attempt to identify authors or find similar documents. The ability to identify similar documents can lead to a recommender system for new content that a user might find interesting.
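The idea of recommending similar documents can be sketched with plain word counts and cosine similarity. This is only an illustration under simple assumptions (whitespace tokenization, a three-document corpus invented for the example, hypothetical function names), not the method used in Exaptive's applications.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, corpus, top_n=1):
    """Rank corpus documents by similarity to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(
        corpus,
        key=lambda doc: cosine_similarity(q, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:top_n]

corpus = [
    "the shuttle launch was delayed by weather",
    "my new graphics card overheats under load",
    "the orbit of the satellite decays slowly",
]
recommended = most_similar("satellite launch and orbit news", corpus, top_n=2)
print(recommended)
```

Raw counts treat every word as equally informative, which is one reason a weighting scheme is usually applied before comparing real documents.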

Users often want ML-based models to be black boxes: they want to put data in and get answers out without having to know the details of how this is achieved. However, there is usually also a desire to understand the resulting model and why a recommendation is given. These two desires align nicely when the goal is to understand a collection of texts, such as search results, where the user may want a summary of a 100-page list of 1,000 ranked results. For this use case, we build a data application featuring a landscape visualization, which conveys the documents' similarities as well as their relationships to the key terms identified when learning the model.

ML and Data Landscape Exploration

In one application of data landscape technology, we processed over 100 million documents that had been machine-read from scans via optical character recognition (OCR). Some of the documents were hundreds of years old. As a feature engineering step, we recorded counts for roughly 200,000 words and then estimated the importance of each word to each document. This measure is known as term frequency-inverse document frequency (TF-IDF). Singular value decomposition (SVD) was then used to find high-level concepts, each of which is defined by many words.
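The two steps above, TF-IDF weighting followed by SVD, can be sketched on a toy corpus. This is a minimal illustration, assuming one common TF-IDF variant (relative term frequency times log(N / document frequency)), NumPy's dense SVD, and four invented sentences; the article does not say which variant or library was used, and a corpus of 100 million documents would need sparse, truncated methods instead.

```python
import math
from collections import Counter

import numpy as np

docs = [
    "the patient received a new heart medicine",
    "the doctor prescribed medicine to the patient",
    "the market price of grain rose sharply",
    "grain traders watched the market price",
]

def tfidf_matrix(texts):
    """Return (vocabulary, matrix) with one row per document, one column per word."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    n = len(tokenized)
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append(
            [counts[w] / len(doc) * math.log(n / doc_freq[w]) for w in vocab]
        )
    return vocab, np.array(rows)

vocab, X = tfidf_matrix(docs)

# SVD factors X into U * diag(S) * Vt. Keeping the k largest singular values
# describes every document by k "concepts", each a weighted mix of many words.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_concepts = U[:, :k] * S[:k]   # shape: (documents, k)
term_concepts = Vt[:k]            # shape: (k, words)

print(np.round(doc_concepts, 3))
```

A word that appears in every document, such as "the" here, gets a weight of zero, so it cannot influence the concepts that SVD finds.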

At that point in the process, documents were described by high-level concepts that align with areas such as medicine, economics, religion, and politics. The concepts that are learned are data-dependent: if only medical documents are used, the model's resources will go toward finding more finely detailed categories. We then clustered the documents in that topic space to find which documents are similar. The silhouette coefficient allowed us to automatically select a good number of clusters. Next, we projected the documents down to a two-dimensional scatterplot using a combination of SVD and multi-dimensional scaling (MDS). Based on the density of the documents, we fit a contour map, which looks like a topographic map. Color varies across the contour map according to the cluster assignment of the documents in that area. Finally, we solved for landmarks, which correspond to the (x, y) locations of the key driver terms for each cluster.
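Selecting the number of clusters with the silhouette coefficient can be sketched as follows. This is an illustration under assumptions, not the pipeline used for the application: the text does not name the clustering algorithm, so a small k-means with a deterministic farthest-point initialization stands in, the points are synthetic, and all function names are hypothetical.

```python
import numpy as np

def kmeans(points, k, iterations=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iterations):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster empties out
                centers[j] = members.mean(axis=0)
    return labels

def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points; needs >= 2 clusters."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:  # convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        a = dists[i, same].sum() / (same.sum() - 1)  # mean distance within cluster
        b = min(dists[i, labels == c].mean()         # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def choose_k(points, candidates):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Three well-separated synthetic groups standing in for documents in topic space.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 0], [10, 1], [11, 0], [11, 1],
     [0, 10], [0, 11], [1, 10], [1, 11]],
    dtype=float,
)
best_k, scores = choose_k(points, candidates=[2, 3, 4])
print(best_k, scores)
```

The silhouette rewards clusters that are tight and far from their neighbors, so merging two groups or splitting one both lower the average score on this data.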

ML and PubMed® Explorer

Exaptive's PubMed Explorer provides a visual interface for searching PubMed's extensive collection of papers and visualizing search results. One of the visualizations provided is a term landscape. The term landscape is similar to the key term landmarks from the previously described data landscape, but the positions are found more directly, by projecting TF-IDF values straight to 2D. For a collection of search results, the user may then view a two-dimensional landscape where related terms are grouped together spatially. Depending on how this projection is performed, it is easy to obtain either the document locations or the term locations. This allows us to provide the PubMed user with options to create the same visualization for articles or journals instead of topics. As with the previously described visualizations, the documents are categorized using clustering, which makes it possible to distinguish terms and clusters by color.
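One way to see how a single projection can yield either document or term locations is through SVD. The sketch below assumes SVD is the projection and uses a small made-up weight matrix in place of real TF-IDF values; the article does not specify the projection PubMed Explorer actually uses.

```python
import numpy as np

terms = ["gene", "protein", "trial", "dose"]

# Rows are documents, columns are terms. The weights are invented for illustration.
X = np.array([
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.9],
    [0.0, 0.0, 0.9, 0.7],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# The same factorization gives both views of the data:
doc_xy = U[:, :2] * S[:2]     # one (x, y) location per document (row of X)
term_xy = Vt[:2].T * S[:2]    # one (x, y) location per term (column of X)

for term, (x, y) in zip(terms, term_xy):
    print(f"{term}: ({x:.2f}, {y:.2f})")
```

Terms that occur in the same documents, such as "gene" and "protein" in this toy matrix, land close together, which is what groups related terms spatially in a landscape view.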

Artificial Intelligence vs Intelligence Augmentation

Many people associate ML with Artificial Intelligence (AI). At Exaptive, we use it to support Intelligence Augmentation (IA). The difference is that instead of using machine learning to eliminate the need for humans in a process, the technology supports the intelligence of the human researcher. In this way, we use machine reading and ML to help researchers accomplish more than what would otherwise be humanly possible.

Published at DZone with permission of Joshua Southerland, DZone MVB. See the original article here.
