I had the honor of being part of the amazing Hadoop Summit in April this year. WOW, what a blast! For starters, we had it in Dublin, Ireland, and 1,400 of our closest friends joined us for a few days of fun, conversation and learning. The venue was amazing, and there was no one in Dublin who didn’t know we were in town!
My Top 5 Highlights:
The opening act was amazing, with Irish dancing, drums and an illuminated light show. Sorry, the blog just can’t do it justice; you had to be there to experience the energy and enthusiasm. But you can check out the photos and videos on our Facebook page, and don’t forget to tag yourself if you happen to appear in the album.
The opening keynotes were really succinct and powerful. I really enjoyed the discussion on Connected Data Platforms and the need to marry the data-at-rest and data-in-motion worlds.
Naturally, a highlight of the event was the 10 Years of Hadoop Party at the Guinness Storehouse. I am not a beer connoisseur, but believe me, Guinness tastes better in Ireland!
In the memorable category: cycling in cold, rainy Dublin at 5:30am (the day after the party). Memorable, yes... and now that it is over, I think I would do it again!
Meeting and learning from the attendees. Sure we went to many sessions together, but the real learning happened as we networked, shared our pain points, our frustrations and our victories. This was the true power of community.
I had the opportunity to interact with many attendees, and the two things I heard without fail were:
- Hadoop is real — real customers, real deployments, real results
- Summit has better technical sessions than any other conference
I would like to take some credit for this, but honestly, sessions that are chosen by the community for the community win over those chosen any other way. Our open community approach to session selection really works. (Just like open source, really: the power of community.)
For those not able to attend the event, all sessions and slides are now available online for FREE! Yes, every session, every day… just for you!
Don’t have time to watch them all, or want to know which ones to start with? Using the power of community and foot traffic analysis, here are the top 5 most attended sessions:
As more and more Spark projects move into production, getting the most out of Spark in a production environment is becoming more critical. In a production deployment there are many concerns:
- What is the best way to get performance out of Spark? Configuring the right task parallelism and choosing the best number, size, and cores per executor all require a deeper understanding of Spark.
- How do you secure a Spark deployment? What is the right way to integrate Spark with Kerberos authentication and authorization?
- How do you keep up with ongoing Spark releases? Now that you have Spark jobs in production, how do you upgrade your Spark cluster to get new Spark features while minimizing downtime and meeting your own SLAs?
- How do you run Spark on YARN in the best way? How do you share resources across all YARN workloads efficiently?
The Hadoop platform has rapidly evolved into a diverse ecosystem of projects, able to process data at unprecedented scale and speed. Apache NiFi aims to help Hadoop reach its full potential by providing the highest quality data flows to and from the platform, as well as between components of the ecosystem. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic, including a web-based user interface, guaranteed delivery, data provenance, and easy extensibility. NiFi provides out-of-the-box integration with several projects in the Hadoop ecosystem, including HDFS, Spark Streaming, Storm, Kafka, and more. This talk will dive into the inner workings of these integration points, cover plans for future enhancements, and discuss how these technologies can be used with NiFi to build end-to-end systems. Finally, this talk will dive into a real-world use-case and discuss how NiFi fits into the architecture, with a focus on highlighting the similarities and differences between NiFi and other projects.
A lot has changed and a lot has stayed the same with Ingest and Stream Processing over the years. But today there are so many more options than ever for Ingest and Stream Processing that one may wonder why to choose one solution over another. The problem is that in this space, one size does not fit all, and that makes it all the more confusing. This talk aims to give the audience a direction to choose when it comes to Ingest and Stream Processing. We aim to help the audience understand the solutions available to them, and to make the best choice based on their use-case. In this talk, we will go over current and emerging technologies in the marketplace. These include Kafka’s CopyCat, Kafka + Flume’s “Flafka”, Spark Streaming and Storm + Trident. We will evaluate each of them and understand how they are useful in solving problems related to large scale data processing, joining and combining streams. We will also look at the various ways of achieving “at least once” and “exactly once” processing. We will discuss how each of these can be scaled and how we can make sure that data is processed in a timely fashion.
Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation and multi-tenancy for a variety of data processing engines that can co-exist and share a single data-center in a cost-effective manner. Docker is an application container engine that enables developers and sysadmins to build, deploy and run containerized applications. In this talk, we’ll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various docker applications under YARN. Support for container runtimes (including the docker container runtime) was recently added to the Linux Container Executor (YARN-3611 and its sub-tasks). We’ll walk through various aspects of running docker containers under YARN – resource isolation, some security aspects (for example container capabilities, privileged containers, user namespaces) and other work in progress features like image localization and support for different networking modes.
This talk is about the past, present, and future of Hadoop at LinkedIn. It is a story that begins back in 2008 with a group of renegade engineers cobbling together their first cluster from a collection of mismatched Solaris boxes. I’ll describe early big successes that generated confidence, and the growing pains caused by new use cases and new users, and how we responded to explosive increases in capacity requirements. I will also share the nuggets of wisdom we learned along the way: how to support a demanding user population, how to limit and prevent the growth of technical debt, and the factors we take into account when making bets on new technologies like Tez, Spark, and Presto.
Speaker: Carl Steinbach, LinkedIn: Carl Steinbach is a Senior Staff Software Engineer at LinkedIn where he leads the Hadoop Platform Team. He is also a member of LinkedIn’s Technology Leadership Group and its Open Source Committee. Before joining LinkedIn Carl was an early employee at Cloudera. He is an ASF member and former PMC Chair of the Apache Hive Project.
Honorable mentions to those garnering the most views after the event go to:
- Overview of Apache Flink: the 4G of Big Data Analytics Frameworks VIDEO | SLIDES
- Telematics with Hadoop and NiFi VIDEO | SLIDES
- Rocking the World of Big Data at Centrica VIDEO | SLIDES
- TensorFlow Large Scale Deep Learning For Intelligent Computer Systems VIDEO | SLIDES
- Detecting Persistent Threats Using Sequence Statistics VIDEO | SLIDES