How Has Big Data Changed in the Past Year?
Real-time data streaming has dramatically changed the way enterprises work with and analyze Big Data.
To gather insights on the state of Big Data today, we spoke with 22 executives from 20 companies who are working in Big Data themselves or providing big data solutions to clients. Here’s who we talked to:
- Nitin Tyagi, Vice President Enterprise Solutions, Cambridge Technology Enterprises
- Ryan Lippert, Senior Marketing Manager and Sean Anderson, Senior Product Marketing Manager, Cloudera
- Sanjay Jagad, Senior Manager, Product Marketing, Coho Data
- Amy Williams, COO, Data Conversion Laboratory (DCL)
- Andrew Brust, Senior Director Market Strategy and Intelligence, Datameer
- Eric Haller, Executive Vice President, Experian DataLabs
- Julie Lockner, Global Product Marketing, Data Platforms, Intersystems
- Jim Frey, V.P. Strategic Alliances, Kentik
- Eric Mizell, Vice President Global Engineering, Kinetica
- Rob Consoli, Chief Revenue Officer, Liaison
- Dale Kim, Senior Director of Industrial Solutions, MapR
- Chris Cheney, CTO, MPP Global
- Amit Satoor, Senior Director, Product and Solution Marketing, SAP
- Guy Levy-Yurista, Head of Product, Sisense
- Jon Bock, Vice President of Product and Marketing, Snowflake Computing
- Bob Brodie, CTO, SUMOHeavy
- Kim Hanmark, Director of Professional Services EMEA, TARGIT
- Dennis Duckworth, Director of Product Marketing, VoltDB
- Alex Gorelik, Founder and CEO and Todd Goldman, CMO, Waterline Data
- Oliver Robinson, Director and Co-Founder, World Programming
We asked, "How has Big Data changed in the past year?" Here's what they told us:
- Use of streams as part of a Big Data strategy. Streams as a means of creating a master source of data. Employ data to differentiate customers with different formats for different targets.
- There’s an uptick in real-time streaming of data. Clients want to react based on the data coming through and what they’ve learned in the past. Produce actionable insights. Time envelopes are shrinking, which is driving machine learning. Fraud detection and maintenance – what do banks do with fraudulent credit card transactions? Models need to be highly accurate to respond to real-time data. There’s no luxury to do analysis in batch. Use Spark for machine learning.
- Be aware of collecting only what you use. Refine how it’s collected. Consider data governance, security, and compliance. Look at the kinds of data to collect. Tokenization replaces data with a surrogate value, and the keys are stored in a separate, secure location.
- Larger volumes of datasets. Different uses of the data to run the business. Creating metadata and sending it to distributors. More is happening now on the technology side, with full automation due to the size of the projects.
- It’s no longer enough to show that you can ingest large amounts of data; you have to be able to prove that your assumptions about what you are doing with that data are valid. In other words, the integrity and accuracy of the data science behind the analytics is now just as important as the ability to gather a lot of data.
- Big data technology is now available for humans anywhere. You can use Alexa and natural language processing to run queries on your data and obtain insights from oceans of data.
- 1) People look at continuous data streams – data that’s constantly being ingested. 2) Evaluation and use of cloud-based solutions which have reached maturity.
- 1) Move to the cloud with its elasticity and lower cost. 2) Hadoop is being treated as part of the data ecosystem provisioning to the data lake when we need more integrated and standardized data.
- Different data, different volume. Now big data is synonymous with “data at rest.” Companies are trying to get in front of fast streaming data, breaking down silos, getting queries in less than one minute. NVIDIA is doing real-time image recognition.
- It’s a dynamic space. 1) Evolution of tools that are enterprise quality. 2) Storage infrastructure of HDFS given the limitation of traditional architectures. 3) Big data acknowledging HDFS as a protocol rather than a file system. Allow customer to access data and compute. Decouple and make infrastructure simpler and easier. Use tools on the compute side. Industry accepts HDFS as a protocol. Efficiency in the stack. Companies can focus on actual applications and how to harness.
- Data cleansing and preparation is the same problem today that it was in 1999; there’s just a lot more data and more technologies available to solve the problem. It’s all about algorithms right now. We’ve accepted the ability to do individual analysis, but algorithms are the secret sauce. Data sources haven’t changed. It’s about how to put the algorithm into action.
- Technology disappears as fast as it appears. Data lakes have settled into acceptance. Hadoop is hard to set up and use. More effort needs to be put into ease of deployment. HDFS is here to stay. MapReduce is fading out and only being used for more specialist projects. We can now mix SAS, MapReduce, R and Pig – you just need to know the different technologies. We’re not seeing as much Julia as I thought we would.
- Twenty-four months ago Hadoop was hot and now it’s table stakes. Infrastructure is a snore. Machine learning is becoming a dinner table topic though most people only see 5% of the iceberg. We need to become more nuanced in our language around machine learning. Transferring a terabyte of data is easier. AWS has trucks with 100 petabytes of data. It’s easier to pack down data. Data storage space has expanded. GPU clusters are more interesting – a 10X lift over Spark.
- Will see evolution from HDFS and MapReduce because they are too restrictive. Expand to Apache Spark to meet more complex data science projects. Start with active archive and move to multi-tenant architecture. Identify the specific tools and user interface that will meet your needs. Users are leveraging Spark looking at machine learning and streams of data. Invest in newer workloads with the ability to run Hadoop on a lot of different platforms. Continue to educate customers on everything big data can do. Expand to business intelligence and machine learning. As there are more components in the ecosystem, it’s harder to curate to the bleeding edge in a fast, easy, and secure way.
- Fit and finish, governance, and operational manageability of Hadoop and other big data technologies. These attributes of the software and stack are needed for discipline and enterprise readiness.
- Move from building data warehouses and lakes to teams, technologies and practices which will ultimately lead to NLP and machine learning.
- Less expensive and easier to get social media data without additional services because sites are built with APIs to enable you to download the data and run queries against the test files.
- We are moving from “Hadoop as cheap storage” to a processing and transformation stage. Customers are now focused on data processing pipelines where multiple transformation and analysis engines can be used in stages on raw data. There is separation of processing from storage. On the business side, the focus is more on getting business value from specific use cases. Rather than building the next greatest Hadoop stack using the latest components, the focus is on what is needed to meet the requirements of a specific use case. As companies realize the value of big data, operational factors are becoming a priority. Security, data governance, and availability are important parts of deployment discussions.
- Big data has changed in three distinct areas: 1. Companies are more aware of big data as an opportunity to improve business performance. 2. The technologies to store and process big data are becoming more readily available. For example, Microsoft has embraced the favorite tools amongst data scientists, embedding them into its more “classical” products like MS SQL Server. That minimizes the learning curve for existing developers, who can use these tools in platforms they are already familiar with. 3. Lastly, in-memory technologies are shortening development cycles because of the speed at which you can load and process data. You simply get the results faster.
- Accessing the engineering toolsets is getting much easier. The big cloud providers such as Microsoft Azure and Amazon Web Services (AWS) are rolling out services that make building and rolling out at scale much easier and more cost-effective. It’s still not as easy as it will be, but it is getting much better as the months roll by.
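One respondent above described tokenization as replacing data with a surrogate value while the keys are kept in a separate, secure location. A minimal sketch of that idea in Python follows. The `TokenVault` class, the `tok_` prefix, and the sample record are hypothetical names invented for illustration, and the in-memory dictionaries only stand in for what would be a separately secured store in practice.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a separate, secured token store."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same input always maps to one surrogate.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The surrogate is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only code with access to the vault can recover the original value.
        return self._token_to_value[token]


vault = TokenVault()
record = {"customer": "A. Example", "card_number": "4111111111111111"}

# The copy that flows into analytics carries the surrogate, not the card number.
safe_record = {**record, "card_number": vault.tokenize(record["card_number"])}
```

Downstream systems can still join and count on the surrogate value, while recovering the original requires a call to `vault.detokenize`, which is the part that would sit behind stricter access controls.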
What are the most significant changes to big data from your perspective?
Opinions expressed by DZone contributors are their own.