Solutions Used to Collect and Analyze Data
Hadoop, Spark, and Tableau were among the most frequently mentioned technical solutions for collecting and analyzing data.
To gather insights on the state of Big Data today, we spoke with 22 executives from 20 companies who are working in Big Data themselves or providing Big Data solutions to clients.
Here's what they told us when we asked, "What are the technical solutions you use to collect and analyze data?"
- There is a wide range of tools on top of Hadoop. Machine Learning simplifies the process with products like H2O, Mahout, Spark, and Arcadia Data.
- Data warehouse with more in memory. There's good maturity with different platforms for different data and different reports. Hadoop is just one of a myriad of analytical tools. Mission-critical data is available in real time to target customers, improve retention, and prevent churn. It should be personalized for finer granularity. Do more in less time.
- The hierarchy is evolving into three layers: the integration layer, where you have access to the data; the data layer, where you apply the workflow and governance; and the visibility layer, with solutions like Tableau and Spotfire. Under the integration layer are databases with Hadoop, Cassandra, and Kafka. Phenomenal tools are coming online. Relational databases cannot handle disparate data sources.
- We use OCR and our own conversion models.
- We designed our own Big Data architecture due to the combination of design constraints and objectives of running a true multi-tenant SaaS offering. Along the way, we used several well-known technologies, such as Go, Docker, ZFS, and Node.js. We also use open-source tools for monitoring our own operations environments, such as ELK, nProbe, and Grafana.
- There are no silver bullets. Everything is available depending on what business problem you are trying to solve. We see customers using Redshift and Hadoop and skipping the data warehouse. You can also connect to live streaming data sources without a data warehouse.
- There’s huge diversity. A lot of innovative ETL, BI, and analytics tools enable you to build analytics into applications. They're embedded inside the application and constantly updated. There's been a change in how people move data, with more mature tools to fit scheduled data pipelines.
- Tableau, Qlik, and custom cataloging. Automation is required to get through all the data being accumulated today.
- Data lake for storage using open-source solutions like Spark, Kafka, NiFi, and Storm, and Hadoop for long-term data storage.
- Multiple tools. Run containers directly on storage nodes and access with HDFS. Have customer video processing using containers for video encoding.
- Investing in a Hadoop cluster is not the same as what you do with it. SAS and SPSS used to be the tools. Now, R and Python are so well built-out and adopted that they threaten the SAS and SPSS ecosystem. Execute R-based algorithms. Tech solutions are the next generation using open-source R and Python libraries. Risk-averse companies still invest in proprietary data stores. They want to see their data run in proven tech stacks. More customers are developing internal skills around mobile application development with less investment in Tableau and BI tools. Users want real-time data on a mobile device.
- Structured data areas. Take semi-structured data into a structured environment. Find the right tool for the job and the best environment for the data scientists. Maximizing the effectiveness of the data scientists is your best bang for the buck because they’re expensive and hard to find. Think about how to use HDFS and Spark without exposing the data to business analysts. Clever GUIs can work for a while, but ultimately you’re back to data programming. Focus on users versus technology.
- SAS used to have a monopoly on analytics tools. That has changed in the last 12 months. The largest requests now are around R and H2O, and some companies are walking away from SAS. The wave of open source is bringing a world of hurt their way. Python is in the lab of 70% of data scientists. It’s easy to learn and it’s what they’re teaching in school. Mainframe is out. Hybrid cloud is in.
- Our clients are all over the place — Kafka, Spark streams, standard NoSQL analytics on top with IoT for predictive maintenance, real-time analytics, real-time ingestion, and processing. Focus on processing, storing, and ingestion.
- Hadoop, previous generation analytics technology, data warehouse, BI, OLAP. We partner with Microsoft and Tableau.
- Python and R as a base for data analytics. SAS as a common third-party solution.
- AWS EMR data into queryable formats. Personalization, Monetate, Dynamic Yield, and Optimizely.
- Our own platform, software, and tools.
- We use various database platforms to collect data. On top of this, we use our in-memory product to load data for further analysis into our visual analytical platform.
- We collect data into a range of data stores, which mostly involves collecting many JSON documents and storing them in Blob storage or S3 buckets. JSON-based datasets are very cheap to store and flexible in the data they can hold. Storing information this way is almost infinitely scalable, and it means we can go back to old data and reprocess it again and again for varying objectives. The thing about Big Data is to simply collect the data and not worry too much about the formatting (although you shouldn’t completely ignore it). We have explored a variety of ways to process the data on Hadoop-based server clusters, and we have settled on using the data pipeline process capability in Microsoft Azure with Hive queries running on top of HDInsight. We deposit the analysis results in Azure Data Warehouse and SQL Server. In the past, we had to build this engineering ourselves, but now it is easier and cheaper to consume the services on-demand from a cloud-based service.
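Several respondents describe the same pattern: keep raw, semi-structured JSON cheaply, then reprocess it later for different objectives or flatten it into a structured form. The sketch below illustrates that idea only, and is not any respondent's actual pipeline. It uses a local directory as a stand-in for a Blob container or S3 bucket, and the event fields (`user`, `type`, `detail`) and function names are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def store_raw(store: Path, doc_id: str, doc: dict) -> Path:
    """Write one JSON document unchanged; the directory stands in for a bucket."""
    path = store / f"{doc_id}.json"
    path.write_text(json.dumps(doc), encoding="utf-8")
    return path


def load_all(store: Path):
    """Yield every stored document, in file-name order."""
    for path in sorted(store.glob("*.json")):
        yield json.loads(path.read_text(encoding="utf-8"))


def events_per_user(docs) -> dict:
    """First objective: count events per user."""
    return dict(Counter(d["user"] for d in docs))


def flatten_purchases(docs) -> list:
    """Second objective: turn semi-structured purchase docs into fixed-column rows."""
    return [
        (d["user"], d["detail"].get("sku"), d["detail"].get("amount", 0.0))
        for d in docs
        if d.get("type") == "purchase"
    ]


# Hypothetical sample events; the field names are invented for illustration.
SAMPLE = [
    {"user": "a", "type": "view", "detail": {"page": "/home"}},
    {"user": "a", "type": "purchase", "detail": {"sku": "X1", "amount": 9.5}},
    {"user": "b", "type": "purchase", "detail": {"sku": "Y2"}},
]


def demo():
    """Store the raw events once, then read them back twice for two objectives."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        for i, doc in enumerate(SAMPLE):
            store_raw(store, f"event-{i:04d}", doc)
        counts = events_per_user(load_all(store))
        rows = flatten_purchases(load_all(store))
    return counts, rows


if __name__ == "__main__":
    print(demo())
```

Because the raw documents are stored untouched, a new question only needs a new function over `load_all`; nothing has to be re-collected. In a real deployment the two reader functions would more likely be Hive or Spark jobs over the bucket, as the respondents describe.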
By the way, here's who we talked to!
- Nitin Tyagi, Vice President Enterprise Solutions, Cambridge Technology Enterprises.
- Ryan Lippert, Senior Marketing Manager, and Sean Anderson, Senior Product Marketing Manager, Cloudera.
- Sanjay Jagad, Senior Manager, Product Marketing, Coho Data.
- Amy Williams, COO, Data Conversion Laboratory (DCL).
- Andrew Brust, Senior Director Market Strategy and Intelligence, Datameer.
- Eric Haller, Executive Vice President, Experian DataLabs.
- Julie Lockner, Global Product Marketing, Data Platforms, InterSystems.
- Jim Frey, V.P. Strategic Alliances, Kentik.
- Eric Mizell, Vice President Global Engineering, Kinetica.
- Rob Consoli, Chief Revenue Officer, Liaison.
- Dale Kim, Senior Director of Industrial Solutions, MapR.
- Chris Cheney, CTO, MPP Global.
- Amit Satoor, Senior Director, Product and Solution Marketing, SAP.
- Guy Levy-Yurista, Head of Product, Sisense.
- Jon Bock, Vice President of Product and Marketing, Snowflake Computing.
- Bob Brodie, CTO, SUMOHeavy.
- Kim Hanmark, Director of Professional Services EMEA, TARGIT.
- Dennis Duckworth, Director of Product Marketing, VoltDB.
- Alex Gorelik, Founder and CEO, and Todd Goldman, CMO, Waterline Data.
- Oliver Robinson, Director and Co-Founder, World Programming.