Text Analytics: Finding Insights in Scientific Publications

It's estimated that around two million new scientific publications appear every year. Researchers need help from AI to deal with all this new content.

Can you read fast? Really, really fast?

As you start reading this post, consider that a person reads an average page of text in about two minutes, so the whole post could take you about ten minutes. Now imagine reading a 10- to 20-page scientific paper. Next, imagine reading hundreds, thousands, or even millions of such papers. That is not an easy task, if it is feasible at all, for a single person or even for a group of avid readers. And even if a group of people could read that many scientific publications in a reasonable time, how would they then combine their acquired knowledge, establish correlations between articles and terms of interest, find common patterns related to a specific subject, and so on?

This is one of the major challenges now facing health professionals and researchers. It's estimated that around two million new scientific publications appear every year (with an exponential growth in the past decade), which brings the total to well over 50 million publications documented since 1665, according to an article from ResearchGate.

Obviously, the advancement of science — and society — relies on researchers sharing the knowledge and results of their arduous work via scientific publications. However, when it comes to consuming and mining that vast amount of information, there's clearly room for improvement.

Data science can help, especially with techniques and tooling for text analytics and natural language processing (NLP). Text analytics provides the ability to process a large collection of unstructured data (in this case, text from scientific publications) to output data that can be further analyzed to discover new insights. Related to text analytics, NLP provides machines with the ability to understand aspects of human languages, such as the relationships between words, grouping words into phrases, and much more.

Here's Someone Who Can Read Fast! (Not a Real Person, Though...)

IBM Watson Explorer Content Analytics is a powerful tool created to help with this kind of analysis. It collects and analyzes structured and unstructured content from documents, databases, websites and many other types of data repositories.

Note: Typical spreadsheets with rows and columns are examples of structured content; unstructured content includes things like text from articles and emails.

Watson Explorer crawls, parses, and analyzes content to create a searchable index. Researchers can run text analytics across all the data and query the index to quickly retrieve relevant documents from a ranked list of results. Watson Explorer also offers a rich content mining user interface for exploring data interactively, which helps uncover facets, the relationships between them, and anomalies among them.

Let's look at a case where Watson Explorer can help us analyze and derive new insights from scientific publications. We start by importing into Watson Explorer a large number of medical journal articles downloaded from a public source such as PubMed, the citation database maintained by the US National Library of Medicine. Some data gathering and preparation are required to retrieve the abstracts or the full text of the articles before importing them into Watson Explorer, which accepts several input formats. For our example, we have a CSV (comma-separated values) file with one publication title and abstract per row, ready to import into Watson Explorer.
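The preparation step can be sketched in a few lines of Python. This is a minimal sketch, not the procedure used for the article: the column names "title" and "abstract", the function name load_publications, and the two sample rows are all assumptions made for illustration.

```python
import csv
import io


def load_publications(handle):
    """Read rows with 'title' and 'abstract' columns, skipping incomplete rows."""
    reader = csv.DictReader(handle)
    publications = []
    for row in reader:
        # Missing cells come back as None or "", so normalize before checking.
        title = (row.get("title") or "").strip()
        abstract = (row.get("abstract") or "").strip()
        if title and abstract:
            publications.append({"title": title, "abstract": abstract})
    return publications


# Hypothetical two-row sample standing in for a real export.
SAMPLE_CSV = '''title,abstract
"Hepatitis B in pregnancy","Screening for HBsAg, the surface antigen, guides infant vaccination."
"Untitled draft",
'''

publications = load_publications(io.StringIO(SAMPLE_CSV))
print(len(publications), publications[0]["title"])
```

Quoting each field keeps commas inside an abstract from being read as column separators, and dropping rows without an abstract avoids importing empty documents.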

For this example, let's look at publications related to infectious disease. Figure 1 shows that we took a collection of almost 170K medical publications (spanning several decades) and imported them into Watson Explorer. Watson Explorer's text analytics engine parses and organizes the input text, breaking it down into parts of speech of natural language, such as nouns, verbs, adjectives, and so on. The tool also gives the count for each word parsed. By selecting one of the words, we can see it highlighted in the medical paper abstracts to the right. We can also query terms of interest in a query bar on the top and see those terms highlighted in the text.

Figure 1: Visualizing parts of speech of natural language in Watson Explorer.
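To give a feel for the per-word counts described above, here is a rough sketch using only the Python standard library. It does not tag parts of speech, which would need an NLP library, and it is not how Watson Explorer works internally. The stopword list, the function name word_counts, and the sample abstracts are assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "with"}


def word_counts(abstracts):
    """Count how often each non-stopword token appears across all abstracts."""
    counts = Counter()
    for text in abstracts:
        # Lowercase, then keep runs of letters, digits, and hyphens.
        tokens = re.findall(r"[a-z][a-z0-9-]*", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts


abstracts = [
    "Infection with hepatitis B virus is common.",
    "The infection rate of the virus fell after vaccination.",
]
counts = word_counts(abstracts)
print(counts["infection"], counts["virus"], counts["vaccination"])
```

A tool with part-of-speech tagging would go one step further and keep separate counts for nouns, verbs, adjectives, and so on.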

Understanding Natural Language and Domain-Specific Terms

Besides the default natural language parsing available with Watson Explorer, we can also add taxonomies and word dictionaries to help us parse and analyze the text to focus on a specific domain or domains. In the example, we use the 2017 MeSH (Medical Subject Headings) taxonomy (or hierarchical organization of medical terms) created, maintained, and provided by the US National Library of Medicine.

Figure 2 shows the MeSH taxonomy represented in the Watson Explorer tree browser as Facets. There are several levels of concepts, ranging from more generic (for example, Diseases, Anatomy, and so on) to more specific (for example, the actual disease names, the anatomical names for body parts, and so on). Watson Explorer uses the concept levels when traversing the content. For example, by selecting the Diseases concept at Level 0, we see not the word "Diseases" itself selected in the abstracts, but each and every disease (by name) found in the text.

Figure 2: MeSH taxonomy represented as concept levels in Watson Explorer.
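The idea that selecting a general concept matches every specific term beneath it can be sketched with a nested dictionary. The miniature hierarchy below is hypothetical and far smaller than the real MeSH tree, and the function names are assumptions; the sketch only illustrates the traversal, not Watson Explorer's implementation.

```python
import re

# Hypothetical miniature hierarchy; the real MeSH tree is far larger.
TAXONOMY = {
    "Diseases": {
        "Virus Diseases": {"hepatitis b": {}, "influenza": {}},
        "Bacterial Infections": {"tuberculosis": {}},
    },
    "Anatomy": {"Digestive System": {"liver": {}}},
}


def find_subtree(tree, concept):
    """Return the children of `concept`, searching depth-first; None if absent."""
    for name, children in tree.items():
        if name == concept:
            return children
        found = find_subtree(children, concept)
        if found is not None:
            return found
    return None


def leaves(tree):
    """Collect the most specific terms (nodes without children)."""
    result = []
    for name, children in tree.items():
        if children:
            result.extend(leaves(children))
        else:
            result.append(name)
    return result


def matches(text, concept):
    """List the specific terms under `concept` that occur in `text`."""
    subtree = find_subtree(TAXONOMY, concept)
    if subtree is None:
        return []
    lowered = text.lower()
    return [
        term
        for term in leaves(subtree)
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]


text = "Hepatitis B and tuberculosis were found; the liver was affected."
print(matches(text, "Diseases"))  # ['hepatitis b', 'tuberculosis']
print(matches(text, "Anatomy"))   # ['liver']
```

Note that the word "Diseases" never appears in the sample text; the matches come entirely from the specific terms under that concept.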

We also make use of the DrugBank database, a resource for detailed drug data, such as chemical, pharmacological, and pharmaceutical data. Figure 3 shows this database visually represented as a word dictionary in Watson Explorer via nodes (or Facets) in the tree browser, where we can see and traverse the various drug attributes such as drug name, protein details, and so on. We can also make queries against the corpus and find results that pertain to the drug information available in the dictionary, also shown in Figure 3.

Figure 3: DrugBank database added as word dictionary in Watson Explorer.

Using Powerful Queries and Visualizations

Now, with the default parts of speech in natural language, the MeSH taxonomy, and the DrugBank database available in Watson Explorer, we can make queries and navigate the content. Let's use the default views and search capabilities in Watson Explorer. Note that the intent here is not to provide an exhaustive list of capabilities, but an example of how we can use the tool to query data and to visualize concepts and relationships.

In Watson Explorer's Dashboard, seen in Figure 4, several views are created and populated as we make selections in the Facets tree. For example, we can select General Nouns (from Parts of Speech) vs. Gene Name (from the DrugBank dictionary). From this selection, the top nouns appear across several months of publications. One particularly interesting chart could help us with further investigation: the Facet Pairs chart shows correlations between nouns and gene names, and "Infection" has a high correlation with the "S Gene", represented by the brown circle in the diagram. Clicking that circle prompts Watson Explorer to update the query bar, meaning we don't necessarily have to type the query text to make powerful queries. Watson Explorer then highlights the words "infection" and "HBsAg" (the hepatitis B surface antigen, which the S gene encodes) in the abstracts. The subset of abstracts that satisfies the query is now 219 of the almost 170,000 that we started with, as also seen in Figure 4.

Figure 4: Drilling down using the Facet Pairs chart in Watson Explorer.
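The article does not say which statistic Watson Explorer uses to score facet pairs, so the following sketch swaps in a simple, well-known measure, lift: how much more often two values co-occur than they would if they were independent. The function name, the facet values, and the six sample documents are assumptions for illustration.

```python
def lift(documents, value_a, value_b):
    """Ratio of observed co-occurrence to the co-occurrence expected by chance.

    A result near 1.0 suggests independence; above 1.0, the two values
    appear together more often than chance alone would predict.
    """
    total = len(documents)
    with_a = sum(1 for doc in documents if value_a in doc)
    with_b = sum(1 for doc in documents if value_b in doc)
    with_both = sum(1 for doc in documents if value_a in doc and value_b in doc)
    if not with_a or not with_b:
        return 0.0
    return (with_both * total) / (with_a * with_b)


# Each hypothetical document is reduced to the set of facet values found in it.
documents = [
    {"infection", "S gene"},
    {"infection", "S gene"},
    {"infection"},
    {"vaccine"},
    {"vaccine", "S gene"},
    {"treatment"},
]

print(round(lift(documents, "infection", "S gene"), 2))  # 1.33
print(lift(documents, "vaccine", "S gene"))              # 1.0
```

In this toy corpus "infection" and "S gene" co-occur more than chance would predict, which is the kind of pair a chart like Facet Pairs would draw attention to.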

Now, let's fine-tune the query using the concepts from the MeSH taxonomy. On the Concepts tab in Watson Explorer, seen in Figure 5, we can visualize the concepts that are part of MeSH. As we move the concepts slider from more general to more specific concepts, the tool shows the number of concepts at each level. The circles representing the concepts vary in size and color depending on how strongly those concepts correlate with the results returned by the current query; for example, dark orange and red indicate higher correlation. We can move the Relevancy slider up and down to include or exclude concepts from the view. Let's say we are interested in seeing how DNA viruses relate to the original query (S gene vs. Infection). If we select that circle and add it to the query, Watson Explorer returns a smaller subset of abstracts (136, down from 219) and adds DNA viruses such as hepatitis B virus, cytomegalovirus, and other related viruses to the highlighted words to the right, as also seen in Figure 5.

Figure 5: Concepts and their relevancy in Watson Explorer.
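Each concept added to the query narrows the result set, which is what a conjunctive (AND) search over an inverted index does. The sketch below illustrates that general technique under stated assumptions: the function names, document ids, and facet values are hypothetical, and nothing here describes Watson Explorer's actual index.

```python
from collections import defaultdict


def build_index(doc_facets):
    """Map each facet value to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, facets in doc_facets.items():
        for facet in facets:
            index[facet].add(doc_id)
    return index


def search(index, terms):
    """Return ids of documents that contain every term (logical AND)."""
    result = None
    for term in terms:
        ids = index.get(term, set())
        result = set(ids) if result is None else result & ids
    return result if result is not None else set()


# Hypothetical documents, each reduced to its facet values.
doc_facets = {
    1: {"infection", "S gene", "DNA viruses"},
    2: {"infection", "S gene"},
    3: {"infection"},
    4: {"DNA viruses"},
}
index = build_index(doc_facets)

broad = search(index, ["infection", "S gene"])
narrow = search(index, ["infection", "S gene", "DNA viruses"])
print(sorted(broad), sorted(narrow))  # [1, 2] [1]
```

Because every added term can only remove documents from the intersection, the result set shrinks or stays the same with each refinement, as it does when going from 219 abstracts to 136.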

Notice in Figure 5 that the abstracts' publication dates start in 1985. We can refine the query by looking at the Trends tab in Watson Explorer and selecting specific dates to focus on.

Figure 6 shows the Trends tab, which displays various words in the context of our current search as they trend over months and years, with the bars indicating the prevalence of those words on given dates. After analyzing the trends, we see that the word "antigen" has a yellow bar at the August 2014 mark, indicating a stronger correlation of that word with the other terms in our query. Clicking that bar narrows the resulting abstracts to 2 (from 136) and adds the word "antigen" to the highlighted words in the text to the right, as also seen in Figure 6.

Figure 6: Viewing trending topics in Watson Explorer.
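A trend view starts from counts of a term per time bucket. The sketch below only counts matching abstracts per month; the highlighted bar in Watson Explorer reflects a correlation measure that the article does not specify, so it is not reproduced here. The function name, dates, and sample texts are assumptions.

```python
from collections import Counter
from datetime import date


def monthly_counts(records, term):
    """Count, per 'YYYY-MM' month, the records whose text mentions `term`."""
    counts = Counter()
    needle = term.lower()
    for published, text in records:
        if needle in text.lower():
            counts[published.strftime("%Y-%m")] += 1
    return counts


# Hypothetical (publication date, abstract text) pairs.
records = [
    (date(2014, 8, 3), "Surface antigen screening in mothers."),
    (date(2014, 8, 19), "Antigen levels after vaccination."),
    (date(2013, 5, 1), "Antigen testing in infants."),
    (date(2014, 8, 25), "Antiviral treatment outcomes."),
]

counts = monthly_counts(records, "antigen")
peak_month = max(counts, key=counts.get)
print(peak_month, counts[peak_month])  # 2014-08 2
```

Selecting the peak month and adding it to the query would then act as one more filter on the result set.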

After this last query, the right side of Figure 6 shows two medical publications. Expanding the first publication on the list, we can read its abstract. It states that studies performed in Sub-Saharan Africa in 2014 concluded that expectant mothers who were co-infected with HIV and hepatitis B virus, and who received treatment to avoid transmitting HIV to the unborn baby (a technique known as antenatal antiretroviral therapy), had a much lower risk of transmitting hepatitis B to the baby when this was combined with early identification of infants requiring vaccination after birth. Antenatal antiretroviral therapy is, in fact, a practice recommended by the World Health Organization (WHO) according to this bulletin.

Conclusion and Acknowledgements

The example shows how Watson Explorer goes beyond the typical text query offered by popular search engines. The tool ingests a large corpus of medical publications, parses and indexes the content based on natural language constructs, taxonomies, and dictionaries, and provides a user interface that helps us fine-tune queries and find medical publications of interest.

Published at DZone with permission of
