How Big Data Tools and Technology Have Changed — In The Last Year
Four themes: 1) Spark supplanting MapReduce and Hadoop; 2) machine learning and clustered computing coming to the forefront; 3) the cloud enabling larger datasets at ever lower prices; and 4) new tools that make the analysis of large, disparate datasets even faster.
To gather insights for DZone's Big Data Research Guide, scheduled for release in August, 2016, we spoke to 15 executives who have created big data solutions for their clients.
Here's who we talked to:
Uri Maoz, Head of U.S. Sales and Marketing, Anodot | Dave McCrory, CTO, Basho | Carl Tsukahara, CMO, Birst | Bob Vaillancourt, Vice President, CFB Strategies | Mikko Jarva, CTO Intelligent Data, Comptel | Sham Mustafa, Co-Founder and CEO, Correlation One | Andrew Brust, Senior Director Marketing Strategy, Datameer | Tarun Thakur, CEO/Co-Founder, Datos IO | Guy Yehiav, CEO, Profitect | Hjalmar Gislason, Vice President of Data, Qlik | Guy Levy-Yurista, Head of Product, Sisense | Girish Pancha, CEO, StreamSets | Ciaran Dynes, Vice President of Products, Talend | Kim Hanmark, Director, Professional Services, TARGIT | Dennis Duckworth, Director of Product Marketing, VoltDB.
We asked these executives, "How have big data tools and technologies changed in the past year?"
Here's what they told us:
- Massive movement. 1) Look at what Spark has done to Hadoop – it’s sucked the energy out of the newer frameworks. 2) Google Cloud is bringing new applications from a machine learning and AI perspective. It’s about the intelligence level, not about the data infrastructure. 3) The big data ecosystem is evolving with new tools being added on top of Hadoop. Keep in mind the application developers who are becoming more aware of big data.
- We have a big-name data scientist on staff who knows multiple languages and platforms. We see a lot of companies using Apache Spark as their big data platform because it analyzes and hands off datasets more quickly. We are also able to see what three dozen clients are doing to see what’s trending, and we’re able to create tests that get information from employees about what knowledge they already have and what knowledge they need to learn. Companies don’t know what they don’t know, and they don’t know the value of the data they already have. The knowledge gap is huge, which is why we are using testing to create standards and metrics. We provide value to clients by providing feedback on the positions they are looking to fill and the specific requirements and skillsets for those positions. Managers need to be trained in how to leverage the value of the information collected.
- 1) Open Source – NoSQL drove a lot with Apache projects and clients cobbling together multiple Open Source packages to solve a single problem with the SMACK stack, which you can do with our single solution. In addition, glue code and developers are needed to stitch all of the different packages together. We’ve lost the knowledge of how to create a solution from scratch. 2) The cloud has provided prominent deployment options since it’s not in the IT business – Operations Expense versus Capital Expense. Healthcare and financial services tend to trail due to privacy regulations. Those industries are leading the way with hybrid cloud solutions.
- There are a lot of tools and platforms; however, you don’t know which one’s right for your business today, let alone tomorrow. Our smart execution solutions help you identify the solution that’s best for your jobs based on cost-based optimization.
- Big data and Hadoop are used by data scientists for large-scale, complicated machine learning work on clustered computing. Last year more developers made big data stores an interactive resource for the business. We need to learn how to make Hadoop accessible to people other than data scientists – from 10 to 20 users to hundreds and thousands of users. This need is driven by economics and by the volume of data, which is projected to increase 47-fold over the next 20 years. There is a greater ability to push down ETL in Hadoop to the cluster. Better economics are being made more real in the organization. More tools help make the environment ready for corporate scale.
- Accelerating innovation with Open Source Apache Heron and Beam. New applications continue to emerge focusing on streaming data. We use Kafka for this and the cloud with AWS and Azure. Innovation with record services and MapR systems and Hortonworks DataFlow.
- 1) An increase in the number of tools enables scale-out from just a few members. There’s a focus on Spark rather than MapReduce. People are unsure what to invest in – what will be the go-to solution. 2) In-memory databases with Apache and NoSQL. Anything that flows to a disk is slower, so we want to keep data in memory so we can look it up more quickly. 3) Machine learning is scaling up. This requires math and computational skills that people get with self-education. We use the term “smart” in marketing more. The scale of the amount of data we are getting requires machine learning. There is an influx of smaller vendors doing interesting things while improving the UX. Happier times with nicer interfaces.
- 1) More companies are looking at big data because less technology is needed to access it. Platforms were expensive but they’re now in the cloud with Dell, AWS, and other supporting business users. 2) Capture, facilitate, and use data on demand and analyze. A differentiator will be how much information you can catalyze on the different big data tools.
- The pace of technology is changing quickly. There is closer integration between the different technologies. Microsoft just integrated Spark into its big data application. They obviously have a commitment to Apache Spark and Azure. There’s strong maneuvering among the big cloud companies to provide dominant B.I. solutions. The cloud game and the big data game go hand-in-hand. Ultimately you need to convert data into insights. While it’s early, Microsoft is offering a low-end freemium B.I. tool. Some of the other players are creating an integrated ecosystem.
- IoT is overly used and loosely defined. Use big data for machine learning algorithms. We have a client that puts raw materials into reactors full of sensors. They received a lot of data but were unable to understand the root causes of the changes in their output. Getting data isn’t where the value is. You need to get to the root cause and make prescriptive changes where they are needed. As you generate more data with more accuracy, you need the tools at the end to show the actions you need to take.
- The technical progress has not been revolutionary. The biggest changes have been around the tools we work with, the volume of data, the sources of data, and the types of data. There’s also been a significant change on the business front with companies investing in big data systems and populating their databases over the last six to eight years. Those companies are now looking for insights, value, and ROI. Big data has become “real” in the last two to three years.
- Tools like Tableau and QlikView were more static. We want to provide granularity on top of big data so customers can get inside machine learning. We enable companies to look at the lower levels of the data from the bottom up and the top down. The tools are scalable and automated.
- We moved from MapReduce and batch processing to streaming, real-time data ingestion using Kafka and Spark. There’s an interest in streaming-style technologies, different levels of processing, analytics, and post-processing. There’s more data, the cost of collection is going down, and the tools are maturing.
- They’re changing all the time. We’ve moved from MapReduce to Spark and are exploring new tools as they come on the market. The tool we choose depends on the application. Tools become obsolete as well so we keep a lookout for new tools all of the time.
- Big data has been the hot keyword for the last couple of years and with more competition it has become cheaper and more efficient in the marketplace. The pricing models to host big data in the cloud have significantly dropped over the last 5 years and that is a direct result of competition. The competition drives innovation in hosting data in the cloud and reflecting that data visually.
What are the most significant changes you've observed in big data tools and technology in the past year?