
Using Lucene and Cascalog for Fast Text Processing at Scale

By Mitch Pronschinske · Nov. 09, 2011

This post explains text processing and analytics techniques used at the startup Yieldbot. Their technology uses open-source tools including Cascalog, Lucene, Hadoop, and Clojure's Java Interop. The following post was authored by Soren Macbeth, a Data Scientist at Yieldbot.

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain-specific APIs or interfaces for custom processing functions. When combined with Clojure's awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, and Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find all of the code used in these examples over on GitHub.

Our goal is to tokenize a string of text. This is almost always the first step in any sort of text processing, so it's a good place to start. For our purposes we'll define a token broadly as a basic unit of language that we'd like to analyze; typically a token is a word. There are many different methods for doing tokenization. Lucene contains many different tokenization routines which I won't cover in any detail here, but you can read the docs to learn more. We'll be using Lucene's StandardAnalyzer, which is a good basic tokenizer: it lowercases all input, removes a basic list of English stop words, and is pretty smart about handling punctuation and the like.
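
To make that behavior concrete, here is a hypothetical REPL interaction using the load-analyzer and tokenize-text helpers defined later in this post (the exact token set depends on your Lucene version and stop-word list):

;; Hypothetical REPL session; assumes the helper functions further down
;; have already been loaded, along with the imports they rely on.
user=> (def analyzer (load-analyzer StandardAnalyzer/STOP_WORDS_SET))
#'user/analyzer
user=> (tokenize-text analyzer "The Quick Brown Fox is fast!")
("quick" "brown" "fox" "fast")  ; lowercased; "The" and "is" dropped as stop words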

First, let's mock up our Cascalog query. Our inputs are going to be 1-tuples of a string that we would like to break into tokens.

(defn tokenize-strings [in-path out-path]
  (let [src (hfs-textline in-path)]                  ; source tap: one input string per line
    (?<- (hfs-textline out-path :sinkmode :replace)  ; sink tap, replacing any existing output
         [!line ?token]                              ; output fields: the original line and one token
         (src !line)
         (tokenize-string !line :> ?token)           ; expand each line into its tokens
         (:distinct false))))                        ; keep duplicate (line, token) pairs

I won't waste a ton of time explaining Cascalog's syntax, since the wiki and docs are already very good at that. What we're doing here is reading in a text file that contains the strings we'd like to tokenize, one string per line. Each of these strings will be passed into the tokenize-string function, which will emit one or more 1-tuples, one for each token generated.
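
As a sketch of how this runs end to end (the paths here are hypothetical), hfs-textline writes each output tuple as a tab-separated line, so a single input line fans out into one output line per token:

;; Hypothetical invocation; in-path and out-path can be local or HDFS paths.
(tokenize-strings "/data/raw-lines" "/data/tokens")

;; Given an input line such as:
;;   Analytics at Yieldbot
;; the sink would contain one (line, token) pair per surviving token
;; ("at" is dropped as a stop word):
;;   Analytics at Yieldbot    analytics
;;   Analytics at Yieldbot    yieldbot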

Next, let's write our tokenize-string function. We'll use a handy Cascalog feature here called a stateful operation. It looks like this:

(defmapcatop tokenize-string {:stateful true}
  ([] (load-analyzer StandardAnalyzer/STOP_WORDS_SET))  ; setup: build the analyzer once per task
  ([analyzer text]
     (emit-tokens (tokenize-text analyzer text)))       ; per tuple: emit one 1-tuple per token
  ([analyzer] nil))                                     ; cleanup: nothing to release here

The 0-arity version gets called once per task, at the beginning. We'll use it to instantiate the Lucene analyzer that will be doing our tokenization. The 1+n-arity version receives the result of the 0-arity function as its first parameter, plus any other parameters we define; this is where the actual work happens. The final 1-arity function is used for cleanup.
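
To see why the stateful form matters, here is a hypothetical non-stateful version for contrast. It is functionally equivalent, but it constructs a fresh StandardAnalyzer for every single input tuple, which is exactly the per-tuple overhead the :stateful variant avoids:

;; Hypothetical non-stateful sketch, shown only for comparison.
(defmapcatop tokenize-string-naive [text]
  (let [analyzer (load-analyzer StandardAnalyzer/STOP_WORDS_SET)]  ; rebuilt on every tuple
    (emit-tokens (tokenize-text analyzer text))))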

Next, we'll create the rest of the utility functions we need to load the Lucene analyzer, get the tokens and emit them back out.

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att) (tokenizer-seq tokenizer term-att)))))

(defn load-analyzer [^java.util.Set stopwords]
  (StandardAnalyzer. Version/LUCENE_CURRENT stopwords))

(defn tokenize-text
  "Apply a lucene tokenizer to cleaned text content as a lazy-seq"
  [^StandardAnalyzer analyzer page-text]
  (let [reader (java.io.StringReader. page-text)
        tokenizer (.tokenStream analyzer nil reader)
        term-att (.addAttribute tokenizer TermAttribute)]
    (tokenizer-seq tokenizer term-att)))

(defn emit-tokens
  "Wrap each token in a 1-tuple so Cascalog emits one tuple per token"
  [tokens-seq]
  (partition 1 1 tokens-seq))
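
For reference, here is a sketch of the namespace declaration these snippets assume (the namespace name is hypothetical). The imports target the Lucene 3.x class layout, where TermAttribute still exists; later Lucene versions renamed it to CharTermAttribute:

;; Assumed namespace setup for all of the snippets above.
(ns example.tokenize
  (:use cascalog.api)  ; brings in ?<-, defmapcatop, hfs-textline, etc.
  (:import [org.apache.lucene.analysis TokenStream]
           [org.apache.lucene.analysis.standard StandardAnalyzer]
           [org.apache.lucene.analysis.tokenattributes TermAttribute]
           [org.apache.lucene.util Version]))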

We lean heavily on Clojure's awesome Java Interop here, letting Lucene's Java API do the heavy lifting. While this example is very simple, you can take this framework and drop in any number of the different Lucene analyzers available to do much more advanced work, with little change to the Cascalog code.

By leaning on Lucene, we get battle-hardened, speedy processing without having to write a ton of glue code, thanks to Clojure. Since Cascalog code is Clojure code, we don't have to spend a ton of time switching back and forth between different build and testing environments, and a production deploy is just a `lein uberjar` away.
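
For anyone reproducing this setup, a minimal project.clj might look like the sketch below. The version numbers are assumptions roughly contemporary with this post, not taken from Yieldbot's actual build; adjust them to match your cluster's Hadoop distribution:

;; Hypothetical project.clj; versions are illustrative assumptions.
(defproject lucene-cascalog "0.1.0"
  :description "Tokenizing text with Lucene inside Cascalog queries"
  :dependencies [[org.clojure/clojure "1.2.1"]
                 [cascalog "1.8.1"]
                 [org.apache.lucene/lucene-core "3.0.3"]]
  :dev-dependencies [[org.apache.hadoop/hadoop-core "0.20.2"]])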


Source: http://blog.yieldbot.com/using-lucene-and-cascalog-for-fast-text-proce
