Resource List: Machine Learning, ODPi, Deduping With Scala, OCR and More...

A brief article with great resources for Hadoop, Spark, and Java gurus.

Tim Spann
May. 25, 16 · Opinion

ODPi for Hadoop Standards: ODPi is working with the ASF to consolidate Hadoop and all its versions. There are too many custom distributions with various versions of the 20 or so tools that make up Apache Big Data. Being able to move seamlessly between HDP, CDH, IBM, Pivotal, and MapR would be awesome. For now, HDP, Pivotal, and IBM are part of ODPi.

Structured Data: Connecting a modern relational database to Hadoop is always an architectural challenge that requires decisions. EnterpriseDB (PostgreSQL) has an interesting article on that. The approach it describes lets you read HDFS/Hive tables from EDB with SQL. (GitHub)

Semistructured Data: Using Apache NiFi with Tesseract for OCR: HP and Google have been fine-tuning Tesseract for a while to handle OCR. Using dataflow technology from the NSA, you can automate OCR tasks on a Mac. Pretty cool. On my machine, I needed to install a few things first:

  • Tesseract-OCR FAQ

  • Searching Through PDFs with Tesseract with Apache SOLR

Atlas + Ranger for Tag-Based Policies in Hadoop: Use these new but polished Apache projects to manage everything around security policies in the Hadoop ecosystem. Add to that a cool example with Apache Solr.

Anyone who hasn’t tried Pig yet might want to check out this cool tutorial: Using PIG for NY Exchange Data. Pig runs on Tez and Spark, so it’s a tool data analysts should embrace.

It’s hard to think of Modern Big Data Applications without thinking of Scala.   A number of interesting resources have come out after Scala Days NYC:

  • Data in Motion: Streaming Static Data Efficiently:   This is a new talk on Scala and Akka for Data In Motion.
  • Real-Time Analytics and Algorithms
  • Using Finagle 101 (Scala)
  • Scala Microservices
  • Genetic Algorithms in Scala

Java 8 is still in the race for developing modern data applications, with a number of projects around Spring and Cloud Foundry. These include Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ and run them on Apache YARN. Also, see this article.

For those of you lucky enough to have a Community Account on the Databricks cloud, you can check out the new features of Spark 2.0 on display in that platform before release.

An interesting topic for me is fuzzy matching. I’ve seen a few interesting videos and GitHub pages on that:

  • String Metric GitHub
  • Global Names Matcher
  • Real-time Fuzzy Matching with Spark and ElasticSearch
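
To make the idea of fuzzy matching concrete, here is a minimal sketch in plain Python. It is not taken from any of the projects listed above. It assumes a simple Levenshtein (edit distance) metric, and the function names `similarity` and `best_match`, the sample names, and the 0.8 threshold are all hypothetical choices for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0..1 score (1.0 means identical).
    Case and surrounding whitespace are ignored (an assumption)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(query, candidates, threshold=0.8):
    """Return the closest candidate at or above the threshold, else None."""
    scored = [(similarity(query, c), c) for c in candidates]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


# Hypothetical sample names, for illustration only.
print(best_match("Jon Smith", ["John Smith", "Jane Smythe"]))
```

Comparing every pair this way is quadratic in the number of records, which is presumably why the resources above lean on Spark and Elasticsearch for larger data sets.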

Am I the only person trying to remove duplicates from data? CSV data? People?

  • CSV Dedupe
  • Spark DeDuper 
  • Dedup Variable Person
  • Probable People
  • Dedupe Examples
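
As a starting point for the CSV case, the sketch below removes rows that are duplicates after light normalization, using only the Python standard library. It is a simplified illustration and not how the tools listed above work. The column names, the sample rows, and the helper names `normalize` and `dedupe_rows` are hypothetical.

```python
import csv
import io


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())


def dedupe_rows(rows, key_fields):
    """Keep the first row seen for each normalized key; drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(normalize(row[field]) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data; the column names are made up for illustration.
SAMPLE_CSV = """name,email
Ada Lovelace,ada@example.com
ada  lovelace,ADA@example.com
Grace Hopper,grace@example.com
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
people = dedupe_rows(rows, key_fields=("name", "email"))
print([p["name"] for p in people])  # ['Ada Lovelace', 'Grace Hopper']
```

This only catches duplicates that become identical after normalization. Records that differ by a typo or a nickname need a similarity measure and a threshold instead of an exact key.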

I have also been looking for some good resources on NLP (Natural Language Processing). There are some interesting text problems I am looking at.

  • Machine Learning with NLP on Craig’s List Data using Sparkling Water (H2O + Spark)
  • Word2Vec with DeepLearning4J
  • Word2Vec with Spark
  • Classifying Documents using Naive Bayes on Spark MLlib
  • Simple NLP Search in Scala
  • ScalaNLP
  • Simple NLP Search DataSet Creator
  • NLP on Amazon Reviews
  • Date / Time NLP Parser – intelligent parsing of dates, times and temporal concepts
  • GloVe: Global Vectors for Word Representation – Draft for GloVe on Spark (GitHub)
  • Spark Word2Vec Example 
  • OpenNLP
  • Stanford CoreNLP (Java Library)
  • Analyzing Text using Stanford CoreNLP 
  • Sentiment Analysis with Stanford CoreNLP
  • TweetNLP
  • ArkTweetNLP
  • Spark Wrapper of CoreNLP

    Published at DZone with permission of Tim Spann, DZone MVB. See the original article here.
