[DZone Research] Devs and Big Data
What is the business problem you are trying to solve? How will what you are doing add value to the business and improve the customer experience? And many more.
To understand the current and future state of big data, we spoke to 31 IT executives from 28 organizations. We asked them, "What do developers need to keep in mind working with big data?" Here's what they told us.
- Big data differs from traditional data in volume, variety, number of users, and range of uses. That shifts the work from traditional development to building something for the user. You need to enable self-service for users. Given the amount of experimentation required, developers can’t prebuild solutions; they need to give business users access to explore. Focus on flexibility and self-service, and ask how much your users will be able to do on their own. Everyone wants to use data. At Google, 80% of people access the data catalog every day. Developers need to solve the hard technical problems of performance, scalability, and security and make the data available to business users.
- Data/intelligence is the new marketing operating system in that we are developing programs and applications on top of an information/intelligence foundation layer. The same core principles that apply to successful software development apply to big data. Concepts such as scalability, reliability, and extensibility are critical to efficient and effective data-driven software or program development. Any development initiative must be scalable enough to ingest the volume and velocity of data created in today’s world. The system must be of the highest reliability and integrity to ensure uptime, accuracy, and real-time access to data/intelligence. The system must also be designed for extensibility, using APIs for UI/app development, real-time feeds/subscriptions to data, and the capability to integrate with any partner/external system.
- Think about both data at rest and data on the wire. When data is back in the data center, ensure you have sufficient protection and servers. Protection and security for data at rest need more focus, and developers need to be trained in how to protect data. Protections for data on the wire are getting good.
- Keep security and scale in mind, along with how to derive insights for the entire enterprise and for people other than yourself. We need to think about how we help analysts communicate with groups that aren’t like them. The next step is to think about how the answer is presented to people so that it benefits them.
- Analysts need to make bigger changes than developers. Move away from the relational DB. Data analysts should learn Python and other languages beyond SQL. NoSQL is winning, and relational is going to be for transaction processing. Developers will find themselves in a better position. Adapt to microservices-based coding and platforms. Different use cases of big data require different tools, and microservices allow you to use the right tool without impacting the entire organization. With Kubernetes (K8s), DevOps becomes even more important, since no one in infrastructure focuses on K8s and containers.
- Pick the right tools for the job. Be careful not to start too low-level. Instead of a compiler, look at a platform to get higher-quality outcomes. ML is a learn-by-example paradigm. Use references. Do what others have done. Talk to trusted references, LinkedIn, user groups, and experts.
- Don’t log stupid shit. Analysts and data scientists have to act as engineers to prepare the data. BI is about answering questions about things you know. Data science is about asking questions about the questions. Data science frustrates engineers because data scientists want things that weren’t thought about when the code was created. Start working together earlier. Build interfaces around making data accessible and discoverable, and get good APIs around data, not just functions.
- 1) Understand that there are a lot of different constituents in the data world. This is the people part. People are more technically proficient than 10 to 15 years ago. Business unit managers are much more “techy.” It's not a black hole. 2) AI is super important. Streaming and containers are all mushed together for 2019. If developers are focused on transactional workloads, AI or containers will come running at them and turn the workload into a streaming application. Analytics is moving to the edge: not just capturing data at the edge but running analysis there. Be ready and start to educate. It won’t all happen in the cloud, data center, or notebook.
- Big data is data, and it should not be considered special or different. The data challenges being faced by developers, the 3 or 7 or 14 Vs of big data, however many you count, are the result of changing business conditions and the need for all of this new, fast, geographically dispersed data and computing that is being made possible by the massive cloud platform players and through build-outs of private cloud infrastructure. There are many tools available for developers, and some of them are less than optimal for many use cases. We often see that our customers and prospects have an approved list of tools and data management products that can be used and that are recommended for use. If a developer wants to use something else, they either have to go “rogue” and hope their unsanctioned choice works out well enough to drive certification and adoption of that something else by their company, or they have to fight through the official process for getting a new tool evaluated, tested, and certified. Many developers are unwilling to fight those battles, so they are left using inferior tools and platforms, needing to write additional complex code to make up for shortcomings in the tools and platforms themselves. We see this, for example, with Cassandra. Many large enterprises have adopted Cassandra as one of their few main data management platforms, despite the challenges of using Cassandra, which are particularly acute for operational use cases.
- Developers should continue to learn new and emerging languages to gain insight from data, such as Scala, Python, and R. However, the “lingua franca” of data is SQL. Time has proven that even the newer data management approaches of the past decade, such as HDFS, eventually rely on SQL for analysis. Examples are how the definition of NoSQL has expanded to “Not Only SQL” and the emergence of SQL-on-Hadoop engines. So, developers should continue sharpening their skills on the latest languages and tools, but when it comes to scoring and perfecting their data models, there is no substitute for SQL-based data analytical platforms.
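To illustrate the last point about SQL remaining the common language even when the surrounding code is written in something newer, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module as a stand-in for whatever SQL engine a team actually runs. The `page_views` table, the `top_pages` helper, and the sample rows are hypothetical and exist only for this example; they do not come from any of the respondents.

```python
import sqlite3

# Hypothetical sample data: (user_id, page) pairs invented for illustration.
SAMPLE_ROWS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/pricing"),
    ("u3", "/home"),
    ("u2", "/docs"),
    ("u3", "/pricing"),
]


def top_pages(rows, limit=2):
    """Return the most-viewed pages as (page, views) tuples.

    Python handles loading and orchestration; the analysis itself
    is expressed as a plain SQL aggregate query.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT page, COUNT(*) AS views "
            "FROM page_views "
            "GROUP BY page "
            "ORDER BY views DESC, page ASC "
            "LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for page, views in top_pages(SAMPLE_ROWS):
        print(page, views)
```

The same `GROUP BY` query would carry over largely unchanged to a SQL-on-Hadoop engine or an analytical database, which is the sense in which the respondent calls SQL the “lingua franca.”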
Here’s who we spoke to:
- Cheryl Martin, V.P. Research Chief Data Scientist, Alegion
- Adam Smith, COO, Automated Insights
- Amy O’Connor, Chief Data and Information Officer, Cloudera
- Colin Britton, Chief Strategy Officer, Devo
- OJ Ngo, CTO and Co-founder, DH2i
- Alan Weintraub, Office of the CTO, DocAuthority
- Kelly Stirman, CMO and V.P. of Strategy, Dremio
- Dennis Duckworth, Director of Product Marketing, Fauna
- Nikita Ivanov, founder and CTO, GridGain Systems
- Tom Zawacki, Chief Digital Officer, Infogroup
- Ramesh Menon, Vice President, Product, Infoworks
- Ben Slater, Chief Product Officer, Instaclustr
- Jeff Fried, Director of Product Management, InterSystems
- Bob Hollander, Senior Vice President, Services & Business Development, InterVision
- Ilya Pupko, Chief Architect, Jitterbit
- Rosaria Silipo, Principal Data Scientist and Tobias Koetter, Big Data Manager and Head of Berlin Office, KNIME
- Bill Peterson, V.P. Industry Solutions, MapR
- Jeff Healey, Vertica Product Marketing, Micro Focus
- Derek Smith, CTO and Co-founder and Katie Horvath, CEO, Naveego
- Michael LaFleur, Global Head of Solution Architecture, Provenir
- Stephen Blum, CTO, PubNub
- Scott Parker, Director of Product Marketing, Sinequa
- Clarke Patterson, Head of Product Marketing, StreamSets
- Bob Eve, Senior Director, TIBCO
- Yu Xu, Founder and CEO, and Todd Blaschka, CTO, TigerGraph
- Bala Venkatrao, V.P. of Product, Unravel
- Madhup Mishra, VP of Product Marketing, VoltDB
- Alex Gorelik, Founder and CTO, Waterline Data