Text Analytics: Finding Insights in Scientific Publications

It's estimated that around two million new scientific publications appear every year. Researchers need help from AI to deal with all this new content.

By Ricardo Balduino · Oct. 26, 17 · Opinion

Can you read fast? Really, really fast?

As you start reading this post, consider that a person reads an average page of text in about two minutes, so this whole post could take you about ten minutes. Now imagine reading 10 to 20 pages of a scientific paper. Next, imagine reading hundreds, thousands, or even millions of such papers. That is not an easy task, if it is feasible at all, for a single person or even for a group of avid readers. And even if a group of people could read many scientific publications in a reasonable time, how would they then combine what they learned, establish correlations between articles and terms of interest, and find common patterns related to a specific subject?

This is one of the major challenges now facing health professionals and researchers. It's estimated that around two million new scientific publications appear every year (with exponential growth over the past decade), which brings the total to well over 50 million publications documented since 1665, according to an article from ResearchGate.

Obviously, the advancement of science — and society — relies on researchers sharing the knowledge and results of their arduous work via scientific publications. However, when it comes to consuming and mining that vast amount of information, there's clearly room for improvement.

Data science can help, especially with techniques and tooling for text analytics and natural language processing (NLP). Text analytics provides the ability to process a large collection of unstructured data (in this case, text from scientific publications) to output data that can be further analyzed to discover new insights. Related to text analytics, NLP provides machines with the ability to understand aspects of human languages, such as the relationships between words, grouping words into phrases, and much more.

Here's Someone Who Can Read Fast! (Not a Real Person, Though...)

IBM Watson Explorer Content Analytics is a powerful tool created to help with this kind of analysis. It collects and analyzes structured and unstructured content from documents, databases, websites and many other types of data repositories.

Note: Typical spreadsheets with rows and columns are examples of structured content; unstructured content includes things like text from articles and emails.

Watson Explorer crawls, parses, and analyzes content to create a searchable index. Researchers can run text analytics across all of the data and query the index to quickly retrieve relevant documents from a ranked list of results. Watson Explorer also offers a rich content mining user interface for exploring data interactively, which helps uncover the facets in the data, the relationships between those facets, and any anomalies among them.

Let's look at a case where Watson Explorer can help us analyze and derive new insights from scientific publications. We start by importing into Watson Explorer a large number of medical journal articles downloaded from a public source such as PubMed, the database maintained by the US National Library of Medicine. Some data gathering and preparation is required to retrieve the abstracts or the full text of the articles before importing them into Watson Explorer, which accepts several input formats. For our example, we have a CSV (comma-separated values) file with one publication title and abstract per row, ready to import into Watson Explorer.
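To make the input format concrete, here is a minimal Python sketch of reading such a file before import. The column names `title` and `abstract`, the sample rows, and the `load_publications` helper are assumptions made for illustration; the article does not specify the exact layout of the CSV file.

```python
import csv
import io

# Hypothetical sample rows; a real file would be prepared from a PubMed download.
SAMPLE_CSV = '''title,abstract
"Hepatitis B in pregnancy","HBsAg screening of expectant mothers, and infant vaccination."
"Cytomegalovirus infection","A review of DNA viruses and infection outcomes."
'''


def load_publications(source):
    """Read one publication (title, abstract) per row from a file-like object."""
    reader = csv.DictReader(source)
    return [
        {"title": row["title"].strip(), "abstract": row["abstract"].strip()}
        for row in reader
    ]


publications = load_publications(io.StringIO(SAMPLE_CSV))
for pub in publications:
    print(pub["title"], "->", len(pub["abstract"].split()), "words")
```

Quoting each field keeps commas inside an abstract from being read as column separators, which matters for free text like this.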

For this example, let's look at publications related to infectious disease. Figure 1 shows that we took a collection of almost 170K medical publications (spanning several decades) and imported them into Watson Explorer. Watson Explorer's text analytics engine parses and organizes the input text, breaking it down into parts of speech of natural language, such as nouns, verbs, adjectives, and so on. The tool also gives the count for each word parsed. By selecting one of the words, we can see it highlighted in the medical paper abstracts to the right. We can also query terms of interest in a query bar on the top and see those terms highlighted in the text.

Figure 1: Visualizing parts of speech of natural language in Watson Explorer.
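As a rough illustration of the word counts and highlighting described above, the following Python sketch counts tokens across a couple of abstracts and marks a selected word. It is not how Watson Explorer works internally: it uses a simple regular expression instead of real part-of-speech tagging, and the sample abstracts and function names are made up for the example.

```python
import re
from collections import Counter

# Hypothetical abstracts standing in for the imported collection.
ABSTRACTS = [
    "Infection with hepatitis B virus remains common.",
    "The S gene encodes HBsAg, a marker of infection.",
]

# A token is a letter followed by letters, digits, or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def word_counts(texts):
    """Count lower-cased tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in TOKEN.findall(text))
    return counts


def highlight(text, word):
    """Wrap each whole-word, case-insensitive match of `word` in brackets."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda match: "[" + match.group(0) + "]", text)


counts = word_counts(ABSTRACTS)
print(counts["infection"])
print(highlight(ABSTRACTS[0], "infection"))
```

A real NLP pipeline would also label each token as a noun, verb, adjective, and so on, which is what lets the tool group the counts by part of speech.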

Understanding Natural Language and Domain-Specific Terms

Besides the default natural language parsing available with Watson Explorer, we can also add taxonomies and word dictionaries to help us parse and analyze the text to focus on a specific domain or domains. In the example, we use the 2017 MeSH (Medical Subject Headings) taxonomy (or hierarchical organization of medical terms) created, maintained, and provided by the US National Library of Medicine.

Figure 2 shows the MeSH taxonomy represented in the Watson Explorer tree browser as Facets. There are several levels of concepts ranging from more generic (for example, Diseases, Anatomy, and so on) to more specific concepts (for example, the actual disease names, the anatomical names for body parts, and so on). Watson Explorer uses the concept levels when traversing the content. For example, by selecting the Disease concept at Level 0, we see not the "Disease" word itself selected in the abstracts, but each and every disease (by name) found in the text.

Figure 2: MeSH taxonomy represented as concept levels in Watson Explorer.
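The idea of selecting a general concept and seeing every specific term beneath it can be sketched in a few lines of Python. The tiny hierarchy below is a simplified, hypothetical fragment written for this example, not the actual MeSH tree structure, and the plain substring matching is an assumption that a real annotator would replace with proper term recognition.

```python
# Hypothetical, simplified hierarchy: each concept maps to its direct children.
TAXONOMY = {
    "Diseases": ["Virus Diseases", "Bacterial Infections"],
    "Virus Diseases": ["Hepatitis B", "Cytomegalovirus Infections"],
    "Bacterial Infections": ["Tuberculosis"],
}


def descendants(concept, taxonomy):
    """Return every concept below `concept`, depth first."""
    found = []
    for child in taxonomy.get(concept, []):
        found.append(child)
        found.extend(descendants(child, taxonomy))
    return found


def find_terms(text, concept, taxonomy):
    """List the descendant terms of `concept` that appear in `text`."""
    lowered = text.lower()
    return [term for term in descendants(concept, taxonomy) if term.lower() in lowered]


abstract = "Mothers co-infected with HIV and hepatitis B were screened for tuberculosis."
print(find_terms(abstract, "Diseases", TAXONOMY))
```

Note that the word "Diseases" never appears in the sample abstract; the match comes entirely from the more specific terms below it in the hierarchy.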

We also make use of the DrugBank database, a resource for detailed drug data, such as chemical, pharmacological, and pharmaceutical data. Figure 3 shows this database visually represented as a word dictionary in Watson Explorer via nodes (or Facets) in the tree browser, where we can see and traverse the various drug attributes such as drug name, protein details, and so on. We can also make queries against the corpus and find results that pertain to the drug information available in the dictionary, also shown in Figure 3.

Figure 3: DrugBank database added as word dictionary in Watson Explorer.

Using Powerful Queries and Visualizations

Now, with the default parts of speech in natural language, the MeSH taxonomy, and the DrugBank database available in Watson Explorer, we can make queries and navigate the content. Let's use the default views and search capabilities in Watson Explorer. Note that the intent here is not to provide an exhaustive list of capabilities, but an example of how we can use the tool to query data and to visualize concepts and relationships.

In Watson Explorer's Dashboard, seen in Figure 4, several views are created and populated as we make selections in the Facets tree. For example, we can select General Nouns (from Parts of Speech) vs. Gene Name (from the DrugBank dictionary). From this selection, the dashboard shows the top nouns across several months of publications. One chart is particularly interesting for further investigation: the Facet Pairs chart shows the correlation between nouns and gene names, and "Infection" has a high correlation with the "S Gene", represented by the brown circle in the diagram. Clicking that circle prompts Watson Explorer to update the query bar, so we don't have to type the query text to make powerful queries. Watson Explorer then highlights the words "infection" and "HBsAg" (the hepatitis B surface antigen, which the S gene encodes) in the abstracts. The subset of abstracts that satisfies the query is now 219 out of the almost 170,000 that we started with, as also seen in Figure 4.

Figure 4: Drilling down using the Facet Pairs chart in Watson Explorer.
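The article does not say how Watson Explorer computes the correlation between two facet values. One common way to score such a pair is lift: how often the two values appear together, divided by how often they would appear together if they were independent. The Python sketch below uses lift as an assumed stand-in, with hypothetical documents that have already been reduced to the facet values found in them.

```python
def lift(docs, facet_a, facet_b):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    total = len(docs)
    with_a = sum(1 for doc in docs if facet_a in doc)
    with_b = sum(1 for doc in docs if facet_b in doc)
    with_both = sum(1 for doc in docs if facet_a in doc and facet_b in doc)
    if total == 0 or with_a == 0 or with_b == 0:
        return 0.0
    return (with_both * total) / (with_a * with_b)


# Hypothetical documents, each reduced to a set of facet values.
DOCS = [
    {"noun:infection", "gene:S"},
    {"noun:infection", "gene:S"},
    {"noun:infection"},
    {"noun:vaccine"},
    {"noun:vaccine", "gene:P"},
    {"noun:treatment"},
]

print(lift(DOCS, "noun:infection", "gene:S"))
print(lift(DOCS, "noun:vaccine", "gene:S"))
```

A value above 1 means the pair shows up together more often than chance would suggest, which is the kind of signal a chart like Facet Pairs draws attention to.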

Now, let's fine-tune the query using the concepts from the MeSH taxonomy. On the Concepts tab in Watson Explorer, seen in Figure 5, we can visualize the concepts that are part of MeSH. As we move the concepts slider from more general to more specific concepts, the tool shows the number of concepts at each level. The circles representing the concepts vary in size and color depending on how strongly those concepts correlate with the results returned by the current query; for example, dark orange and red indicate higher correlation. We can move the Relevancy slider up and down to include or exclude concepts from the view. Let's say we are interested in seeing how DNA viruses relate to the original query (S gene vs. Infection). If we select that circle and add it to the query, Watson Explorer returns a smaller subset of abstracts (136, down from 219) and adds DNA viruses such as hepatitis B virus, cytomegalovirus, and other related viruses to the highlighted words on the right, as also seen in Figure 5.

Figure 5: Concepts and their relevancy in Watson Explorer.
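Adding a concept to the query behaves like adding one more condition that every result must satisfy, which is why the result set shrinks. Here is a minimal Python sketch of that narrowing, assuming each document is represented by the set of facet values found in it. The documents, facet labels, and `run_query` helper are hypothetical.

```python
# Hypothetical documents with the facet values an annotator found in each.
DOCS = [
    {"id": 1, "facets": {"noun:infection", "gene:S", "concept:DNA Viruses"}},
    {"id": 2, "facets": {"noun:infection", "gene:S", "concept:DNA Viruses"}},
    {"id": 3, "facets": {"noun:infection", "gene:S"}},
    {"id": 4, "facets": {"noun:infection"}},
]


def run_query(docs, required):
    """Return the documents that contain every required facet value."""
    return [doc for doc in docs if required <= doc["facets"]]


query = {"noun:infection", "gene:S"}
first = run_query(DOCS, query)

# Refining the query adds a condition; it never adds documents.
refined = run_query(DOCS, query | {"concept:DNA Viruses"})
print(len(first), "->", len(refined))
```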

Notice in Figure 5 that the abstracts' publication dates start in 1985. We can refine the query by looking at the Trends tab in Watson Explorer and selecting specific dates to focus on.

In Figure 6, the Trends tab shows various words in the context of our current search, trending across months and years, with the bars indicating the prevalence of those words on given dates. After analyzing the trends, we see that the word "antigen" has a yellow bar at the August 2014 mark, indicating a stronger correlation of that word with the other terms in our query. When we click that bar, Watson Explorer reduces the resulting abstracts to 2 (from 136) and adds the word "antigen" to the highlighted words in the text on the right, as also seen in Figure 6.

Figure 6: Viewing trending topics in Watson Explorer.
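A simplified version of a trend view can be built by counting, per month, how many publications mention a term. The Python sketch below does only that raw count; the color coding in the Trends tab reflects a correlation measure that the article does not detail, so it is left out. The dates and abstracts are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical (publication date, abstract) pairs.
PUBLICATIONS = [
    (date(2014, 8, 3), "Surface antigen screening in expectant mothers."),
    (date(2014, 8, 21), "Antigen levels after antenatal treatment."),
    (date(2014, 9, 2), "Vaccination schedules for infants."),
    (date(2015, 1, 15), "Hepatitis B antigen persistence."),
]


def monthly_counts(publications, term):
    """Count publications mentioning `term`, keyed by (year, month)."""
    counts = Counter()
    needle = term.lower()
    for published, abstract in publications:
        if needle in abstract.lower():
            counts[(published.year, published.month)] += 1
    return counts


trend = monthly_counts(PUBLICATIONS, "antigen")
for (year, month), n in sorted(trend.items()):
    print(f"{year}-{month:02d}: {n}")
```

Selecting one bar in such a view amounts to filtering the results to the publications that fall in that month and mention the term.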

As a result of this last query, the right side of Figure 6 now shows two medical publications. Expanding the first publication on the list, we can read the abstract. It states that studies performed in Sub-Saharan Africa in 2014 concluded that expectant mothers who were co-infected with HIV and hepatitis B virus, and who received treatment to avoid transmitting HIV to the unborn baby (a technique known as antenatal antiretroviral therapy), had a much lower risk of transmitting hepatitis B to the baby when this was combined with early identification of infants requiring vaccination after birth. Antenatal antiretroviral therapy is, in fact, a practice recommended by the World Health Organization (WHO), according to this bulletin.

Conclusion and Acknowledgements

The example shows how Watson Explorer goes beyond the typical text query provided by popular search engines. The tool can ingest a large corpus of medical publications, parse and index the content based on natural language constructs, taxonomies, and dictionaries, and provide a user interface that helps us fine-tune queries and find medical publications of interest.

Published at DZone with permission of Ricardo Balduino, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
