This Week in Hadoop and More: Spark, TensorFlow, and JSoup


A recap of news from all over the world of big data including Hive, Spark, Flink, and NiFi.


Back from vacation and ready to rock.

A few interesting presentations from around the world of big data.

Hortonworks has a nice recorded webinar on IoT intelligence using geographically distributed sensors. There's also a great talk on using Apache NiFi with its smaller cousin Apache MiNiFi, another on enabling Kerberos security with Apache HBase, and one on using Apache Zeppelin and Spark for enterprise data science. For deep learning fans, there's a tutorial on using TensorFlow.

Big Data Spain 2016 has also released a lot of excellent presentation content.

Cool Tools

  • Record Query (GitHub) allows you to read JSON, Avro, and other semi-structured formats. Written in Rust, this is a very interesting and useful command-line tool.

  • Trapezium (GitHub) from Verizon is a Spark/Scala/Akka framework for building batch, streaming, and API services that deploy machine learning models.

  • Here is a cool article on how to use Apache NiFi to Convert Rows to Columns in Text Files.

  • Here is an article I wrote on streaming data from a relational database to Hadoop as HBase/Phoenix and Hive/ORC tables and files.

What's Going on Today

I am building a robotic miniature car with a Raspberry Pi 3 B+, sensors, a camera, and WiFi. It will use Python to send MQTT messages to the cloud, which I will pull off with NiFi and land in Hadoop. That data will be graphed and used to track the car as it chases my robotic vacuum. I think this is what robots will watch for entertainment in the future. It's like NASCAR for robots.

This Week's Bit of Java 8

Extracting a link using JSoup:

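import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// PrintableLink (a simple bean) and trim(String, int) are defined elsewhere in the class.
public List<PrintableLink> extract(String url, String type) {
   List<PrintableLink> linksReturned = new ArrayList<>();

   try {
      // Fetch and parse the page, then select every anchor that has an href.
      Document doc = Jsoup.connect(url).get();
      Elements links = doc.select("a[href]");

      for (Element link : links) {
         // "abs:href" resolves the href against the page's base URI.
         String href = link.attr("abs:href");

         // With a type (for example ".pdf"), keep only links ending in it; otherwise keep all.
         if (null == type || href.endsWith(type)) {
            PrintableLink pLink = new PrintableLink();
            pLink.setDescr(trim(link.text(), 100));
            pLink.setLink(href);
            linksReturned.add(pLink);
         }
      }
   } catch (Exception x) {
      // On a fetch or parse failure, report it and return what was collected so far.
      x.printStackTrace();
   }

   return linksReturned;
}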
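
The extract method above relies on a PrintableLink type and a trim helper that the snippet does not show. The following is a minimal sketch of what they might look like, using only the standard library. The field names, the LinkSupport holder class, and the exact truncation rule are assumptions made for illustration, not the original implementation.

```java
// Hypothetical supporting types for the extract(...) snippet.
// Class names, fields, and the truncation rule are assumptions.

/** Simple bean holding one extracted link and its display text. */
class PrintableLink {
    private String descr; // human-readable link text
    private String link;  // absolute URL of the link

    public String getDescr() { return descr; }
    public void setDescr(String descr) { this.descr = descr; }
    public String getLink() { return link; }
    public void setLink(String link) { this.link = link; }

    @Override
    public String toString() { return descr + " <" + link + ">"; }
}

/** Holder for the assumed trim helper. */
final class LinkSupport {
    private LinkSupport() { }

    /** Caps s at width characters; a cut string ends with "." as a marker. */
    static String trim(String s, int width) {
        if (s == null || width <= 0) {
            return "";
        }
        if (s.length() > width) {
            return s.substring(0, width - 1) + ".";
        }
        return s;
    }
}

/** Small demo that builds one link by hand, without any network access. */
class LinkSupportDemo {
    public static void main(String[] args) {
        PrintableLink p = new PrintableLink();
        p.setDescr(LinkSupport.trim("Apache NiFi documentation", 12));
        p.setLink("https://example.com/nifi.pdf");
        System.out.println(p);
    }
}
```

With types like these in place, a call such as extract(someUrl, ".pdf") would return only the links ending in .pdf, while passing null as the type would return every link on the page.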

Too bad Spark SQL wasn't still called Shark. 
