Machine Learning, Graphs, and the Fake News Epidemic (Part 2)


In this continuing series about the problem that is fake news, take a closer look at building a graph to help detect fake news that will serve as the model to eventually feed some useful algorithms.


In Part 1, we discussed why designing a fully automated fake news detector is currently infeasible and introduced a semi-automated, graph-based solution that would use machine learning to work alongside human fact checkers to scalably flag and quarantine fake news.

This post will provide an overview of such a solution, explain how to build the news graph, and show how to use it to leverage the relationships that exist in the news sphere.

To do this, we need to recognize potentially related articles and then identify and compare their stances towards each of the individual claims made by the target article. This may sound like a long leap from the single-stance detection problem posed by the Fake News Challenge (FNC), but with graph technology and the right stored data, it is well within reach.

The Power of Pattern Matching

In the case of our user story, let's define similar articles as any articles that mention an arbitrary number z of entities or topics in common.

By storing the entities and topics mentioned by articles in Neo4j as nodes connected to their respective articles, we can use Cypher, Neo4j’s graph query language, to search through millions of articles in real time and return all articles similar to a specified article. Where an equivalent SQL SELECT statement can involve multiple JOINs and WHERE clauses, Cypher expresses the same search as a single path pattern, with no explicit JOINs.

The following Cypher query uses pattern matching to return all articles which mention at least two of the same topics or entities as a specified article, a1, with the title The Fake News Epidemic:

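// Relationship types are left unspecified here; restrict them to your schema's mention relationship.
MATCH p = (a1:Article {title: 'The Fake News Epidemic'})--(n)--(a2:Article)
WITH count(p) AS commonality, a2.article_id AS article_id
WHERE commonality >= 2
RETURN article_id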

The MATCH statement searches the graph for paths from a1 to other articles, denoted a2, through a mutually mentioned topic or entity, n. Watch the video at the top of this post for a sense of what this pattern might look like in our graph.

This result is then passed to the WITH statement, which counts the number of matching paths from each node, a2, and denotes it as the commonality. The final lines return the article_id of all nodes with a commonality of at least two.
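
The same counting logic can be sketched outside the database. The following Python snippet is a minimal illustration, not the implementation used in this series: it assumes a hypothetical in-memory mapping, mentions, from article IDs to the set of topics and entities each article mentions. The article IDs, topic names, and the function name similar_articles are all made up for the example.

```python
from collections import Counter

# Hypothetical data: article_id -> set of topics/entities the article mentions.
mentions = {
    "a1": {"fake news", "machine learning", "graphs"},
    "a2": {"fake news", "graphs", "elections"},
    "a3": {"machine learning", "cooking"},
    "a4": {"fake news", "machine learning", "graphs"},
}


def similar_articles(target_id, mentions, z=2):
    """Return IDs of articles sharing at least z topics/entities with target_id."""
    commonality = Counter()
    for node in mentions[target_id]:
        for other_id, other_nodes in mentions.items():
            if other_id != target_id and node in other_nodes:
                # One shared node corresponds to one path a1 -> node <- a2.
                commonality[other_id] += 1
    return sorted(a for a, count in commonality.items() if count >= z)


print(similar_articles("a1", mentions))
```

Here z plays the role of the threshold in the WHERE clause: raising it narrows the result to articles that overlap more heavily with the target.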

Similarly, by extracting important claims from each article and adding them to our graph, we can use Cypher to return all claims made by the article. From there, our problem is again reduced to the simple task of comparing each claim from the article against the body of each article in our first result set.
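
The comparison step described above amounts to a nested loop over claims and similar articles. The sketch below is a rough illustration under stated assumptions: detect_stance is a hypothetical placeholder for the stance-detection model, which this post treats as a black box, and its substring rule and stance labels exist only so the example runs.

```python
def detect_stance(claim, body):
    """Hypothetical stand-in for the black-box stance model.

    Toy rule for illustration only: 'discuss' if the claim text appears
    in the body, otherwise 'unrelated'.
    """
    return "discuss" if claim.lower() in body.lower() else "unrelated"


def compare_claims(claims, similar_bodies, stance_fn=detect_stance):
    """Map each (claim, article_id) pair to the stance of that article's body."""
    return {
        (claim, article_id): stance_fn(claim, body)
        for claim in claims
        for article_id, body in similar_bodies.items()
    }


# Hypothetical claims from the target article and bodies of similar articles.
claims = ["the moon is made of cheese"]
similar_bodies = {
    "a2": "Some sites assert that the Moon is made of cheese.",
    "a4": "Graph databases store relationships as first-class data.",
}
stances = compare_claims(claims, similar_bodies)
```

Passing the model in as stance_fn keeps the loop independent of whichever stance detector is eventually plugged in.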

An implementation of the stance-detection model — as well as the graph algorithms used for topic, entity and claim extraction — will be discussed in later blog posts. For now, we will think of them as black box operations and continue on to an overview of how the news graph is assembled.

The News Graph

To fully utilize our news graph, we need to structure it to focus on important relationships in our dataset. After adding more nodes to store the authors and sources of the articles in our graph, as well as some useful properties for each of our nodes and relationships, we arrive at the following graph schema.

A fake news detection graph data model

These additional author and source nodes will allow us to extend our measurement for controversiality to those nodes, as well.

By traversing out one level from authors and sources on the WROTE and PUBLISHED relationships, respectively, we can average the controversiality of the articles they are connected to in order to gauge their own overall controversiality. We can also use graph clustering methods on these nodes to identify communities that tend to agree with one another consistently.
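
The averaging step can be sketched in a few lines of Python. This is an illustration only: the article records, the names, and the 0-to-1 controversiality scale are assumptions made for the example, since the post does not specify how the score is stored.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical article records: (article_id, author, source, controversiality).
articles = [
    ("a1", "alice", "Daily Graph", 0.25),
    ("a2", "alice", "Node News", 0.75),
    ("a3", "bob", "Daily Graph", 0.5),
]


def average_by(articles, key_index):
    """Average article controversiality per author (index 1) or source (index 2)."""
    scores = defaultdict(list)
    for record in articles:
        scores[record[key_index]].append(record[3])
    return {name: mean(values) for name, values in scores.items()}


author_scores = average_by(articles, 1)  # one hop out along WROTE
source_scores = average_by(articles, 2)  # one hop out along PUBLISHED
```

Each grouping corresponds to a one-level traversal from an author or source node to the articles attached to it.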

While this is only one potential implementation of a fake news detection graph — with room for modification and improvement — its advantages over a relational model are clear.

Building the News Graph

Getting from a disjoint set of articles to this tightly woven graph, however, requires some additional processing through our “black box” algorithms. Subsequent blog posts will discuss these algorithms in detail and include some sample code and results. For now, a general understanding of their purpose in our system will suffice.

Notice the diagram below, which models the way data flows between our graph and various algorithms in order to construct a database matching the schema we specified earlier. Also note that the dotted lines indicate directed data flow, rather than graph edges, and each of the colored diamonds indicates an algorithm used to assemble a part of our graph, not nodes in our database.

Data flow diagram for a fake news detection using Neo4j

The next post in the series will show how we can use Cypher to load data into Neo4j and preprocess it to create inputs to our various algorithms.


Published at DZone with permission of Nir Avrahamov , DZone MVB. See the original article here.
