
Building Better Knowledge


Diffbot’s knowledge graph connects the web’s knowledge within a structured database that you can query to find answers more easily.


Image by thinboyfatter on Flickr (CC BY 2.0)

Imagine you have a school project to complete. You need to write about a particular topic that you know nothing about. So you head to the library and sit by the row of books that cover the subject. You can search through the row, read the books, and find out what you need. You’ll probably have to start with a basic book, build some knowledge of the key aspects of the topic, and then look up those elements in the indexes of more advanced books, flipping back and forth until you have a command of the subject.

Just as you are about to start, in wanders a university professor with years of knowledge of the subject stored in her head. She has read all the books on that particular shelf in the library; in fact, she’s written some of them. She sits down at your table and says, “I'm an expert on topic X. Ask me anything.”

By asking questions, you can take advantage not only of the facts she has at her fingertips, but also of the connections between them that she has built over the years. You can quickly get to the heart of the knowledge using your language skills to question her. If she hadn’t arrived, you would have been building those connections for yourself from scratch using the lookup mechanisms within the books: chapter headings and indexes.

OK, it’s a slightly clunky analogy, because you have to trust that this professor has up-to-date, comprehensive knowledge that is free from bias. And, as human capacity is limited, while she may be an expert on this topic, she will not be able to help you write your papers on planetary physics, the geography of Scotland, and medieval architecture, which are also due next week (you are at a school for high achievers!). You need to hunt down different people to question for those, or hit the books again.

If you’re jumping up and down saying “Use the Internet!”, consider how it works. Is it more like a library or like the professor? On the web, the information is stored in sets of pages across a range of sites. It is similar to the library because you have to find the most relevant pages to determine the most important facts and uncover the salient terminology to drop into additional web searches to find yet more information. You are working hard to extract and connect the key facts, even if you don’t have to leave the house and take a bus, because your search engine serves up the pages you need to read, not the answers you want to know.

Search Engines

Which brings us to the capabilities of a search engine. On the surface, it appears to understand a question sufficiently to answer it with a spread of information about the key topic. Ask Google about the actor Gary Oldman, and you’ll get something like the screen capture below, where the results returned include a range of different resources (images, video, text) plus an information box with some key facts from Wikipedia and links to related information such as his films and the place of his birth. You’ll also see a list of people and questions that may be relevant, based on searches that other people have made.

The information box described above comes from a product known as the “Google Knowledge Graph.” There is no official information about exactly how the product works, although this paper is a useful resource. We know that it draws upon public sources such as Wikipedia, and also amasses data on what people search for on the web, with some level of human input in the form of curation. For that reason alone, Google’s Knowledge Graph is arguably somewhat limited, not least when you consider that Wikipedia covers only noteworthy people, companies, and locations, rather than everyone documented online. The product exists primarily as a service to Google’s advertisers, and it also provides content for Google Home smart speakers.

As an illustration, if you ask “Is Gary Oldman married?” you probably won’t get a direct answer. Instead, you’ll be presented with a set of links to follow and read, rather like the row of library books from our analogy. This will certainly be the case for any questions beyond the basics presented in the information box to the side. I’ve noticed recently that Google is now able to provide some direct answers. For example, if you search for “Is Gary Oldman an only child?” you’ll get back an information box about his sister (Laila Morse). However, this depends on whether Google has already found a direct answer to your question as a string of text somewhere on the web, such as “Laila Morse is Gary Oldman’s sister.”

If it’s not already written down, Google cannot infer it from other information in the way humans would (e.g. by finding out how many children his mother had given birth to).

For this kind of inference, we need to turn back to our fictional professor or find a technology that can automatically determine answers from connections built across the wealth of information stored online.
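To make the idea of inference concrete, here is a minimal sketch in Python. It assumes facts are held as (subject, predicate, object) triples; the `has_mother` predicate, the placeholder parent name, and both function names are illustrative inventions, not anything Google or Diffbot exposes. The point is that no “sibling” fact is stored anywhere: it is derived from two parent facts.

```python
from collections import defaultdict

# Hypothetical facts stored as (subject, predicate, object) triples.
# "Mother A" is a placeholder, not real biographical data.
FACTS = [
    ("Gary Oldman", "has_mother", "Mother A"),
    ("Laila Morse", "has_mother", "Mother A"),
]


def infer_siblings(facts):
    """Derive sibling sets from shared-mother facts; none are stored directly."""
    children_of = defaultdict(set)
    for subject, predicate, obj in facts:
        if predicate == "has_mother":
            children_of[obj].add(subject)

    siblings = defaultdict(set)
    for children in children_of.values():
        for child in children:
            siblings[child] |= children - {child}
    return siblings


def is_only_child(person, facts):
    """True when the recorded facts give this person no siblings."""
    return not infer_siblings(facts).get(person)


print(is_only_child("Gary Oldman", FACTS))  # False
```

A string-matching approach can only answer this if someone has already written the sentence down; the sketch answers it from the connections alone.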

Introducing Diffbot

Back in August of this year, Diffbot, a Silicon Valley startup, released what they’ve called the Diffbot Knowledge Graph (DKG) to provide “knowledge-as-a-service to power intelligent applications.” They are using a combination of machine learning, computer vision, and natural language processing to scrape the contents of the entire web into a knowledge graph. Note that this isn’t the same as Google’s product, the “Knowledge Graph,” but refers to the more general concept. If you are unsure of what a knowledge graph is, I’ve written previously on the subject in an article called WTF is a Knowledge Graph? To summarise, a knowledge graph is a single, structured source of data, stored as a graph with semantic (self-descriptive) properties and support for inference.

Diffbot uses a search engine called Gigablast to crawl and store the entire web, collecting documents in various formats, such as HTML web pages and PDF attachments. Diffbot then uses computer vision to understand the structure of those documents, breaking them down into structural elements such as headers, blocks of text, tables, and so on. After working out the structure, the content is parsed using a combination of natural language processing and machine learning, pulling out facts, figures, and relationships between them with better than human accuracy, and building a corpus of knowledge that is added to the DKG.
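The two stages described above (work out the structure, then parse the content into facts) can be illustrated with a deliberately tiny sketch. This is not Diffbot’s method: a punctuation heuristic stands in for computer vision, and a single regular expression stands in for natural language processing and machine learning. The sample page, the `sister_of` predicate, and the function names are all assumptions made for illustration.

```python
import re

# Toy document standing in for a crawled page (hypothetical layout).
PAGE = """Gary Oldman

Laila Morse is Gary Oldman's sister.

Gary Oldman is an actor."""

# Assumed pattern for one relationship type; a real system would learn many.
SISTER = re.compile(r"^(?P<a>[A-Z][\w ]+?) is (?P<b>[A-Z][\w ]+?)'s sister\.$")


def split_blocks(page):
    """Stage 1: break the page into structural elements (header vs. text)."""
    blocks = []
    for raw in page.split("\n\n"):
        text = raw.strip()
        if not text:
            continue
        # Crude heuristic: a block ending in a full stop is body text.
        kind = "text" if text.endswith(".") else "header"
        blocks.append((kind, text))
    return blocks


def extract_facts(blocks):
    """Stage 2: parse text blocks into (subject, predicate, object) facts."""
    facts = []
    for kind, text in blocks:
        match = SISTER.match(text) if kind == "text" else None
        if match:
            facts.append((match["a"], "sister_of", match["b"]))
    return facts


print(extract_facts(split_blocks(PAGE)))
# [('Laila Morse', 'sister_of', 'Gary Oldman')]
```

The output is a fact in graph form rather than a page of text, which is the shape of data that the rest of this article relies on.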

Diffbot is effectively capturing the knowledge of the web within a database, and connecting it so that it can be used to provide the answers to complex questions posed as queries. Once you are used to the syntax needed to ask the questions, this is a far more effective resource than the output of a search engine: sets of links to pages of text.

At the time of release, the DKG contained more than 1 trillion facts and 10 billion entities, which was nearly 500 times larger than the Google Knowledge Graph product, and is growing by over 100 million facts a month. The DKG is fully autonomous and built solely using artificial intelligence, rather than relying on a level of manual curation. The value of this approach is that the knowledge graph can be constantly rebuilt, from scratch, keeping the DKG data fresh and accurate, since sources that are inconsistent or found to be plain inaccurate can simply be excluded and others added. In the battle against ‘fake news,’ this is a useful weapon.

Storing knowledge within a graph makes it rapidly available: it’s possible to build products that use the connected data. Knowledge graphs are the closest a computer can come to a contextual understanding of how our world works, because they relate concepts and items to one another. If you are building an AI assistant that can understand complex queries, it’s going to need to understand complex relationships.

Example

Let’s close with an example use case. Ask your AI assistant (Siri, Alexa, Google Assistant) the following: “How many companies are in New York?”

If you receive an answer, it’s one that has been scraped from a webpage somewhere and is based on a string of text found online and stored as a fact. If you ask a more obscure question, such as “How many companies are in York, England?”, you may not get a direct answer, but rather a set of links to places where you can potentially find the information (although it’s not guaranteed). Here’s a screen from Siri’s response when I asked recently:

Turning back to New York and asking a multifaceted question:

“How many companies in New York are hiring people with JavaScript skills?”

Again, the assistant can’t answer that because it’s not recorded as scrapable text on a web page, and it doesn’t have the capability to work out the answer based on its other knowledge.

However, if the assistant were drawing upon the contents of a knowledge graph, giving it an understanding of ‘things’ and the connections between those things, it would be able to use its ontology to work out:

Companies have:
  Locations
  Employees
Employees have:
  Employers
  Skills
Skills have:
  People who have them

This grants the capacity to understand the question and use contextual knowledge of the subject to find an answer. This is what Diffbot’s Knowledge Graph is being developed to do.
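As a rough illustration of how that ontology supports the multifaceted question, here is a small Python sketch over invented data. Every company, person, and skill below is hypothetical, the predicate names are my own, and because the ontology listed above has no notion of “hiring,” the sketch approximates it with “employs at least one person with the skill.” It is not Diffbot’s query language.

```python
# Invented sample data: every company, person, and skill is hypothetical.
TRIPLES = [
    ("Acme Web", "located_in", "New York"),
    ("Hudson Data", "located_in", "New York"),
    ("Minster Apps", "located_in", "York, England"),
    ("Acme Web", "employs", "Ana"),
    ("Hudson Data", "employs", "Ben"),
    ("Minster Apps", "employs", "Cat"),
    ("Ana", "has_skill", "JavaScript"),
    ("Ben", "has_skill", "Python"),
    ("Cat", "has_skill", "JavaScript"),
]


def objects(triples, subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    """All subjects linked to `obj` by `predicate`."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def companies_with_skill(triples, location, skill):
    """Companies in `location` that employ at least one person with `skill`."""
    people_with_skill = subjects(triples, "has_skill", skill)
    return {
        company
        for company in subjects(triples, "located_in", location)
        if objects(triples, company, "employs") & people_with_skill
    }


matches = companies_with_skill(TRIPLES, "New York", "JavaScript")
print(len(matches), sorted(matches))  # 1 ['Acme Web']
```

The answer is never written down anywhere in the data; it falls out of following the company-to-location, company-to-employee, and employee-to-skill connections.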

Mike Tung, founder and CEO of Diffbot, says, “What we’ve built is the first knowledge graph that organizations can use to access the full breadth of information contained on the web. Unlocking that data and giving organizations instant access to those deep connections completely changes knowledge-based work as we know it.” You can see a recent video of Mike Tung at O'Reilly Media's Strata Data conference in New York City, where he described the future of automating business processing with large-scale knowledge graphs.


Closing Comments

I’d like to note that I haven’t been paid by Diffbot to write this article and am not affiliated with them in any way. Knowledge graphs are an abiding interest and something I’ve written about previously when I worked at GRAKN.AI. This technology looks like a promising new way to work with knowledge, and I hope to write more about Diffbot soon by taking it for a spin and writing a tutorial about it.

Let me know what you think in the comments. If you have tried it yourself, how did you get on?


