Transmuting Documents Into Graphs

Learn how to scan documents using an NLP library and then turn that data into interesting visualizations using big data techniques.

By Max De Marzi · Jun. 01, 18 · Tutorial

Alchemy is a philosophical and proto-scientific tradition practiced throughout Europe, Africa, and Asia. Its aim is to purify, mature, and perfect certain objects. In popular culture, we often see shadowy figures trying to turn lead into gold to make themselves immensely rich or to ruin the world economy. In our case, we will not be transmuting lead into gold, but documents into graphs, which is just as good. In the past, we used the "Alchemy API," but it was purchased by IBM and retired. You can get similar functionality with IBM Watson, but let's do something else instead. Let's add Entity Extraction right into Neo4j.

The concept is to take a document (a text file, Word file, PDF, PowerPoint, Excel spreadsheet, etc.), have Tika detect and extract its metadata and text, and then run that text through a set of NLP models from OpenNLP to find interesting entities. Let's go ahead and build a stored procedure.

If I think I may have too much in the stored procedure itself, sometimes I just make a callable and stream the results. That's all we are doing here. There is one big spoiler in here already: we are going to ingest documents in more than just English.
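The callable class is only excerpted in this post, but its shape follows from how it is constructed in the procedure below. A minimal sketch (the exact field names are my assumption):

public class IngestDocumentCallable implements Callable<GraphResult> {
    private final String file;
    private final String language;
    private final GraphDatabaseService db;
    private final Log log;

    public IngestDocumentCallable(String file, String language, GraphDatabaseService db, Log log) {
        this.file = file;
        this.language = language;
        this.db = db;
        this.log = log;
    }

    // call() does the actual work; its body is walked through below.
}

The stored procedure itself just constructs one of these and streams its result: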

@Procedure(name = "com.maxdemarzi.en.ingest", mode = Mode.WRITE)
@Description("CALL com.maxdemarzi.en.ingest")
public Stream<GraphResult> IngestEnglishDocument(@Name("file") String file) throws Exception {
    IngestDocumentCallable callable = new IngestDocumentCallable(file, "English", db, log);
    return Stream.of(callable.call());
}
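That spoiler about other languages means there should be a matching Spanish entry point as well (we call it at the end of this post). A sketch, assuming it mirrors the English procedure exactly (the method name is my guess):

@Procedure(name = "com.maxdemarzi.es.ingest", mode = Mode.WRITE)
@Description("CALL com.maxdemarzi.es.ingest")
public Stream<GraphResult> IngestSpanishDocument(@Name("file") String file) throws Exception {
    IngestDocumentCallable callable = new IngestDocumentCallable(file, "Spanish", db, log);
    return Stream.of(callable.call());
}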

Our procedure is going to return a GraphResult, so we will need a place to hold our nodes and relationships as we find or create them:

@Override
public GraphResult call()  {
    List<Node> nodes = new ArrayList<>();
    List<Relationship> relationships = new ArrayList<>();
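GraphResult itself is just a simple holder with public fields, which is how Neo4j procedure result records are typically shaped. A minimal sketch, assuming nothing beyond the two lists we use:

public class GraphResult {
    public List<Node> nodes;
    public List<Relationship> relationships;

    public GraphResult(List<Node> nodes, List<Relationship> relationships) {
        this.nodes = nodes;
        this.relationships = relationships;
    }
}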

We don't know what the file type of the document is going to be, so we will use an AutoDetectParser to deal with it, following the Tika documentation.

BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();

Next, we will parse and capture the text of the document:

String text = "";
try (InputStream stream = new FileInputStream(new File(file))) {
    parser.parse(stream, handler, metadata);
    text = handler.toString();
} catch (Exception e) {
    log.error(e.getMessage());
}
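As a side effect, Tika also fills in the Metadata object during parsing, so you can pull individual fields out by name if you want them. A quick sketch (the logging is just for illustration):

// Tika populated the metadata during parse(); grab whatever is useful.
String contentType = metadata.get("Content-Type");
log.info("Parsed %s as %s", file, contentType);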

With our text in hand, we can use an OpenNLPNERecogniser to find interesting entities and store them in a map. I'm choosing between two languages here, but of course you could add more.

Map<String, Set<String>> recognized;
switch (language) {
    case "English":
        recognized = ENGLISH_NER.recognise(text);
        break;
    case "Spanish":
        recognized = SPANISH_NER.recognise(text);
        break;
    default:
        recognized = new HashMap<>();
}
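If you are curious what the recognizer actually found, logging the map is an easy sanity check. A quick sketch using the recognized map and the log we already have in scope:

// Log each entity type and the distinct strings found for it,
// e.g. PERSON -> [Ada Lovelace, Alan Turing] (hypothetical values).
recognized.forEach((type, values) -> log.info("%s -> %s", type, values));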

So wait, what are ENGLISH_NER and SPANISH_NER anyway? They are OpenNLPNERecogniser objects that take a map as input directing them to the locations of the pre-trained model files for each language. You can find more pre-trained language models on the web, or you can train your own.

private static final Map<String, String> ENGLISH = new HashMap<String, String>() {{
    put(PERSON, "models/en-ner-person.bin");
    put(LOCATION, "models/en-ner-location.bin");
    put(ORGANIZATION, "models/en-ner-organization.bin");

    put(TIME, "models/en-ner-time.bin");
    put(DATE, "models/en-ner-date.bin");
    put(PERCENT,"models/en-ner-percentage.bin");
    put(MONEY,"models/en-ner-money.bin");
}};

static final OpenNLPNERecogniser ENGLISH_NER = new OpenNLPNERecogniser(ENGLISH);
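The Spanish recognizer is wired up the same way. A sketch, assuming you have downloaded pre-trained Spanish models and named the files analogously (the paths here are my assumption):

private static final Map<String, String> SPANISH = new HashMap<String, String>() {{
    put(PERSON, "models/es-ner-person.bin");             // assumed file name
    put(LOCATION, "models/es-ner-location.bin");         // assumed file name
    put(ORGANIZATION, "models/es-ner-organization.bin"); // assumed file name
}};

static final OpenNLPNERecogniser SPANISH_NER = new OpenNLPNERecogniser(SPANISH);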

With that out of the way, let's get back to our procedure. We first go ahead and create a new document node with our text, file name, and language. Then we add it to our list of nodes to return in our result set.

try(Transaction tx = db.beginTx() ) {
    Node document = db.createNode(Labels.Document);
    document.setProperty("text", text);
    document.setProperty("file", file);
    document.setProperty("language", language);
    nodes.add(document);

Then, for every type of entity our language model recognized, we check whether the entity already exists and create it if not, then add a relationship from the document to the entity. We add the entities and relationships to our result sets as we go, and finally we call success() so our transaction gets committed.

for (Map.Entry<String, Set<String>> entry : recognized.entrySet()) {
    Label label = Schema.LABELS.get(entry.getKey());
    for (String value : entry.getValue()) {
        // Reuse the entity if we have seen it before; otherwise create it.
        Node entity = db.findNode(label, "id", value);
        if (entity == null) {
            entity = db.createNode(label);
            entity.setProperty("id", value);
        }
        nodes.add(entity);
        Relationship has = document.createRelationshipTo(entity, RelationshipTypes.HAS);
        relationships.add(has);
    }
}
tx.success();
}
// Finally, return everything we collected.
return new GraphResult(nodes, relationships);

Let's compile our procedure, add it to the plugins folder of Neo4j, restart Neo4j, and try a few documents. But first, let's go ahead and create some indexes on the "id" property for each of our entity types:

CALL com.maxdemarzi.schema.generate;
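That schema procedure isn't shown in this post, but all it needs to do is create one index on "id" per entity label. A minimal sketch, assuming the Neo4j 3.x schema API and that Schema.LABELS maps entity type names to labels:

@Procedure(name = "com.maxdemarzi.schema.generate", mode = Mode.SCHEMA)
@Description("CALL com.maxdemarzi.schema.generate")
public void generate() {
    // One index on "id" per entity label (Person, Location, Organization, etc.).
    // A real implementation would skip labels that are already indexed.
    for (Label label : Schema.LABELS.values()) {
        db.schema().indexFor(label).on("id").create();
    }
}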

Now we can try one by calling:

CALL com.maxdemarzi.en.ingest('data/en_sample.txt');

Cool, it works! How about a PDF file instead of a text file:

CALL com.maxdemarzi.en.ingest('data/en_sample.pdf');

Nice. How about a Spanish PDF:

CALL com.maxdemarzi.es.ingest('data/es_sample.pdf');

Sweet! It looks like we could use a little disambiguation and some cleanup, but this is enough to get us started. The source code, as always, is on GitHub. Be sure to take a look at the Tika documentation to learn more about what it can do.

Finally, if you want to see a plugin with more NLP functionality, check out the work our partners at GraphAware have done with their Neo4j NLP plugin.
