Technical Solutions Used For Big Data
The top five: 1) Open Source; 2) Apache Spark; 3) Hadoop; 4) Kafka; and, 5) Python.
Here's who we talked to:
Uri Maoz, Head of U.S. Sales and Marketing, Anodot | Dave McCrory, CTO, Basho | Carl Tsukahara, CMO, Birst | Bob Vaillancourt, Vice President, CFB Strategies | Mikko Jarva, CTO Intelligent Data, Comptel | Sham Mustafa, Co-Founder and CEO, Correlation One | Andrew Brust, Senior Director Marketing Strategy, Datameer | Tarun Thakur, CEO/Co-Founder, Datos IO | Guy Yehiav, CEO, Profitect | Hjalmar Gislason, Vice President of Data, Qlik | Guy Levy-Yurista, Head of Product, Sisense | Girish Pancha, CEO, StreamSets | Ciaran Dynes, Vice President of Products, Talend | Kim Hanmark, Director, Professional Services, TARGIT | Dennis Duckworth, Director of Product Marketing, VoltDB.
We asked these executives, "What are the technical solutions you use to work on big data projects?"
Here's what they told us:
- Scale out clustered data to protect the software using ZooKeeper, MapReduce, Raft, a ton of Open Source, RabbitMQ for messaging, C++, and Python.
- When participating in big data solutions, customers tend to use Open Source, Java, Eclipse, Puppet, and Chef. We provide open interfaces for fast ingestion and import using Kafka. We have Open Source ODBC interfaces with Teradata and Vertica to optimize big data analysis.
- Our product is encapsulated as JSON.
- Cloud-based SaaS. We use Open Source delivered as an aggregated platform. Clients can use the data tier to ingest data from many sources, then combine and process it so it sits in the cloud as an analytically ready OLAP store. Data can be stored in an in-memory or columnar database via software automation. This takes the work off the IT organization, so they don’t have to write scripts when loading data. We expose the data to users through a semantic layer with transparent business definitions; users don’t need to know how the data got there, just trust that it’s accurate and thorough. A single distributed version of the truth is delivered to users. We provide a series of things: data discovery, dashboards, predictive analytics, and visualization via a distributed multi-tenant service.
- Ingesting into big data stores. Building an adaptable pipeline that will deal with the evolution of sources and destinations. Ability to compute KPIs as the data is flowing to ensure availability and fidelity. Focus on the containerized architecture framework to decouple data from the source and big data store infrastructure.
- Open Source and Hadoop with NoSQL on top. There’s no gap between generating in code or hand coding Apache. We ship Spark as our core processing engine. We try to stay vendor neutral with regard to architecture. Clients want Open Source, open APIs, and open technologies.
- Enable companies to store and share data. Launched data discovery tools. Go with big data source. Discovery platforms to connect data and perform analytics reports in the same form for sharing information. The continuous finding and sharing of information. The amount of data to analyze is growing, hence the growth of Python and Spark. We’ve added in-memory technology to our stack so we can provide clients access to market information faster. We’ve reduced the time to deliver analytics projects.
- We have our own proprietary technology built on the Microsoft Stack. Process tons of data like three years of point-of-sale data for a large retailer.
- We have a large ecosystem with proprietary technology. We focus on an associative engine for easy blending of data and fast analysis. We give clients the ability to change analysis on the fly. Allow people to query data on the fly. We provide a thin visualization layer and control the analysis. More complex analysis on subsets of the data.
- The algorithms we use are our IP. The architecture we use is Hadoop and Elasticsearch. We get 3.3 billion samples every day and run each sample by 150 million algorithms. We do this in a scalable way and have the ability to provide a good UX.
- Riak KV is an open source product for which we have an enterprise edition for cluster replication. Also use Spark and have Spark Connector reading data in and out of Riak. Mesos project framework written in Go and then Erlang. Kafka for integration. We’re a good Open Source citizen partner. We try to use what customers are using and asking for. We enable customers to do what they need to do.
- More and more built on top of Open Source. We used to use SQL databases. We’re now using Spark, PostgreSQL, and relational databases from Oracle.
- Data Storage:
  - Salesforce Enterprise Edition – Good for up to 500 custom fields and additional related tables of a similar size. Major upside is the ease of use and flexibility of the backend that regular users can access. Our core product works with this out of the box.
  - Heroku Connect – Easily integrates with your Salesforce instance and can be scaled to manage additional instances as well as integrate with other BI tools.
  - Amazon RDS – Cost-effective in hosting large data volumes and can scale as needed. Typically, we work with our state and national voter files in here, creating an API to Salesforce as needed.
- Data Visualization/Reporting:
  - Salesforce Reports and Dashboards – Basic reports and charts to visualize key metrics and generate automated reports to personnel at set times or on alerts we create.
  - Tableau – More robust charts and visualization than Salesforce native.
  - Geopointe/SpatialKey – because who doesn’t want to see things on a map?
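One respondent above mentioned computing KPIs as the data is flowing through the pipeline. As a rough illustration only, here is a minimal Python sketch of that idea. It does not reflect any vendor's actual implementation; the record shape (`source`, `amount`), the `RunningKpi` class, and the `track_kpis` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass
class RunningKpi:
    """Running count, total, and mean, updated one record at a time."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        # Avoid dividing by zero before any record has arrived.
        return self.total / self.count if self.count else 0.0


def track_kpis(records: Iterable[dict], kpis: Dict[str, RunningKpi]) -> Iterator[dict]:
    """Pass records through unchanged while updating per-source KPIs."""
    for record in records:
        # Assumed record shape (hypothetical): {"source": str, "amount": number}.
        kpis.setdefault(record["source"], RunningKpi()).update(record["amount"])
        yield record


# Hypothetical in-memory stream standing in for a real ingestion source.
stream = [
    {"source": "pos", "amount": 10.0},
    {"source": "web", "amount": 4.0},
    {"source": "pos", "amount": 30.0},
]
kpis: Dict[str, RunningKpi] = {}
delivered = list(track_kpis(stream, kpis))  # the destination still sees every record
```

Because `track_kpis` is a generator, the KPIs are updated as each record passes through rather than in a separate batch job after loading, which is the property the respondent described.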
What technical solutions do you use for your big data projects?