
[DZone Research] Devs and Big Data

What is the business problem you are trying to solve? How will what you are doing add value to the business and improve the customer experience? These questions, and many more, shaped the responses below.


To understand the current and future state of big data, we spoke to 31 IT executives from 28 organizations. We asked them, "What do developers need to keep in mind working with big data?" Here's what they told us.

Business Problems

  • Developers want to build cool solutions that matter. Focus on what matters to the business and partner with it to solve real problems. 
  • Be aware that modern big data technology is use-case specific; choose the right solution for your use case. 
  • Think about the business context of what you are building. Understand the regulations and constraints around the data, and keep records, compliance, and security in mind as you build. Bridge to other software and build that capability into big data applications. Connect with the business to leverage its information and understand its constraints and governance. 
  • Developers and solution architects can get myopic; they need the bigger picture. When developing apps and solutions, it helps to know that what you are doing is part of a larger context you can draw from, and that the product you are creating impacts that bigger picture. 
  • 1) Always be thinking about where the data comes from. 2) Ask yourself: do I have the right connections to the people in the business who are stewards of the data? What is the business outcome I am working on, and do I have a business partner who will see the value? You cannot work in a vacuum and be successful. 
  • If you’re in retail, your focus should be on “How do I help provide a better retail experience?” If you’re in oil and gas, you need to ask, “How do I efficiently get oil out of the ground?” Developers need to focus on how they can provide value to their specific business in response to their particular industry rather than spending all their time trying to build horizontal functionality that they can get from the market. There is a great temptation to build something yourself because it might be fun or interesting. Developers have to keep in mind that building your first data pipeline end-to-end from scratch is fun the first time. But when you have to build 10 or 100 or 1000 data pipelines, it is no longer quite so fun. Maintaining the original code you wrote becomes a drag on your ability to create new data pipelines. So the more of the basic process you can automate, the more time you will have to focus on capabilities that are actually specific to your business.
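The last point above argues for automating the basic pipeline process so that pipeline #100 is cheap to build. As a minimal, hypothetical sketch (none of these names come from the interviews), once the extract/transform/load skeleton is factored out, each new pipeline becomes a few lines of configuration rather than a rewrite:

```python
# Hypothetical sketch: a reusable extract/transform/load skeleton.
# Each new pipeline is declared as data, not rewritten from scratch.
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Pipeline:
    name: str
    extract: Callable[[], Iterable[Any]]
    transforms: List[Callable[[Any], Any]]
    load: Callable[[Iterable[Any]], None]

    def run(self) -> None:
        records = self.extract()
        for fn in self.transforms:
            records = map(fn, records)
        self.load(list(records))


# Defining pipeline #2 (or #1000) is now a few lines, not a rewrite:
sink: list = []
p = Pipeline(
    name="orders",
    extract=lambda: [{"amount": "12.5"}, {"amount": "7"}],
    transforms=[lambda r: {**r, "amount": float(r["amount"])}],
    load=sink.extend,
)
p.run()
```

The business-specific logic lives in `extract`, `transforms`, and `load`; the orchestration is written once and amortized across every pipeline after the first.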

Scale

  • There are two sides to consider: 1) A lot of people are afraid of big data. Accept that it’s nothing new; it’s just more records. 2) That said, it is a lot more records. Have a clear definition of what’s meant by big data. Regardless of the definition, all of the familiar issues still need to be covered: security, data access, the usual pain points. Don’t get too cozy; be careful with it, because everything is exacerbated when everything is 100X larger. 
  • The scale of big data is very different in volume, variety, users, and use, and that shifts development from traditional delivery to building something for the user. You need to enable self-service: given the amount of experimentation required, developers can’t prebuild solutions; they need to give business users access to explore. Focus on flexibility and ask, "How much will my users be able to do on their own?" Everyone wants to use data. At Google, 80% of people access the data catalog every day. Developers need to solve the hard technical problems of performance, scalability, and security, and make the data available to business users. 
  • Data/intelligence is the new marketing operating system in that we are developing programs and applications on top of an information/intelligence foundation layer. The same core principles that apply to successful software development apply to big data. Concepts such as scalability, reliability, extensibility are critical to efficient and effective data-driven software or program development. Any development initiative must be scalable enough to ingest the volume and velocity of data created in today’s world. The system must be of the highest reliability and integrity to ensure uptime, accuracy, and real-time access to data/intelligence. And, the system must be designed for extensibility utilizing APIs for UI/app development, real-time feeds/subscriptions to data, and the capability to integrate with any partner/external system.

Security

  • Consider data at rest and data on the wire. When data is back in the data center, ensure you have sufficient protection on the servers. Protection for data at rest needs more focus, and developers need to be trained in how to protect data; protections for data on the wire are getting good.
  • Think about security and scale, and about how to derive insights for the entire enterprise, for people other than you. We need to think about how we help analysts communicate with groups that aren’t like them. The next step is to think about how the answer is presented to people so that it benefits them.
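One narrow, concrete slice of the data-at-rest training mentioned above is integrity: detecting whether stored records were tampered with. A minimal sketch using Python's standard `hmac` module; the hard-coded key and the record format are placeholders for illustration, and a real deployment would also need encryption and proper key management:

```python
# Illustrative only: tag stored records with an HMAC so tampering at
# rest can be detected on read. The hard-coded key is a placeholder;
# real systems need key management and encryption as well.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-key"  # placeholder, not a practice


def seal(payload: bytes) -> str:
    """Compute an integrity tag to store alongside the record."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


record = b'{"user": 42, "amount": 10.0}'
tag = seal(record)
ok = verify(record, tag)                          # untouched record passes
tampered = verify(b'{"user": 42, "amount": 999}', tag)  # altered record fails
```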

Many Others

  • Analysts need to make bigger changes than developers: move away from the relational DB, and learn Python and other languages beyond SQL. NoSQL is winning, and relational is going to be transaction processing. Developers will find themselves in a better position if they adapt to microservices-based coding and platforms; different use cases of big data require different tools, and microservices allow you to use the right tool without impacting the entire organization. With Kubernetes (K8s), DevOps becomes even more important, since no one else in infrastructure focuses on K8s and containers.
  • Pick the right tools for the job, and be careful not to start too low-level: instead of a compiler, look at a platform to get higher-quality outcomes. ML is a learn-by-example paradigm, so use references and do what others have done. Talk to trusted sources: LinkedIn, user groups, experts.
  • When evaluating a project, look at the data sets. Try connecting data sources to see how much you can connect and how long it takes, and look at connection and data-loading speed. Be open to new solutions like graph databases.
  • Be practical. In Silicon Valley, developers are leading the way on new and innovative technology; experience with Kafka and Spark will lead to your next job, so get your arms around new technology and adopt it. Organizations outside Silicon Valley want to do the same but can’t ramp up that quickly; find a compromise that lets you leverage these technologies (Hadoop, Kafka, Spark) while being OK with tools and platforms that help you use them.
  • The cost of running big data analysis can be high. Using a thoughtful and efficient approach to solving large-scale analysis is necessary to avoid a big compute bill.
  • Do not rush to build a predictive model! An efficient and appropriate data transformation and data cleaning process is usually the key to meaningful results, even predictions. Invest your time wisely and dedicate enough processing energy to the data preparation step.
  • Don’t log stupid shit. Analysts and scientists have to be engineers to prepare the data. BI is about answering questions about things you know; data science is about asking questions about the questions. Data science frustrates engineers because it asks for things that weren’t thought about when the code was created. Start working together earlier: interface around making data accessible and discoverable, and build good APIs around the data, not just functions.
  • 1) Understand that there are a lot of different constituents in the data world; the people part matters. People are more technically proficient than they were 10 to 15 years ago, and business unit managers are much more “techy.” It's not a black hole. 2) AI is super important. Streaming, containers: it's all mushed together for 2019. If you’re focused on transactional work, AI or containers will come running at you and turn it into a streaming application. Analytics is moving to the edge, not just capturing data at the edge but running analysis there. Be ready and start to educate; it won’t all happen in the cloud, data center, or notebook.
  • Master some of the core formats and standards. Apache Parquet is the most popular way to store data for analysis, and with the consolidation of Hortonworks and Cloudera, we see Parquet as the winner of the arms race. Become familiar with Apache Arrow since it is supported in more than a dozen programming languages; it's the standard way applications organize and process data in memory, and it yields more efficient code and memory usage.
  • The data engineering skillset is going to be critically important and sought after in the future. Understand how data can be effectively stored, addressed, moved around, and brought to the analytics; that understanding is also a way of breaking into the data science world. Understand the bugs that cause data errors, and be able to look at the data to see where the errors are creeping in. Debugging isn’t just about the procedural code; it’s about how the data was treated and what happened to it on its journey.
  • Big data is not a magic bullet. There are a lot of challenges in how it runs on a distributed system. Use the right technology, and know how to tune and optimize. Spark is a powerful compute framework, but it gets complicated on a distributed system; understand how your application connects to it.
  • As a knee-jerk reaction to proprietary big data technologies, a lot of developers are resorting to open source projects and building a DIY technology stack without understanding all the intricacies that come with it. They end up assembling Lego-like building blocks of technologies like streaming, windowing, NoSQL, in-memory databases (IMDBs), etc. 1) Keep in mind that DIY stacks are hard to create but harder to maintain over the years; the stack often needs to be replaced once the lead developer leaves the company. 2) While the library of open source technologies is rich, layering various technologies often leads to big latency issues, resulting in poor customer experience and sometimes loss of business. 3) All these technologies ultimately drive up the hardware costs of running the application. Keeping it simple can be cheaper, more performant, and easier to manage over the years.
  • Big data is data; it should not be considered special or different. The data challenges being faced by developers, the 3 or 7 or 14 Vs of big data, however many you count, are the result of changing business conditions and the need for all of this new, fast, geographically dispersed data and computing being made possible by the massive cloud platform players and by build-outs of private cloud infrastructure. Many tools are available to developers, and some of them are less than optimal for many use cases. We have often seen among our customers and prospects an approved list of tools, data management products that can be used and are recommended. If a developer wants to use something else, they either have to go “rogue” and hope their unsanctioned choice works out well enough to drive certification and adoption by their company, or they have to fight through the official process for getting a new tool evaluated, tested, and certified. Many developers are unwilling to fight those battles, so they are left using inferior tools and platforms and writing additional complex code to make up for shortcomings in those tools and platforms. We see this, for example, with Cassandra: many large enterprises have adopted it as one of their few main data management platforms despite the challenges of using it, which are particularly acute for operational use cases.
  • Developers need to keep in mind that storing big data is not enough. Real-time transaction processing and decision making based on the data are often the long-term goal and should be kept in mind when selecting technologies.
  • Developers should continue to learn new and emerging languages to gain insight from data, such as Scala, Python, and R. However, the “lingua franca” of data is SQL. Time has proven that even newer data management approaches in the past decade, such as HDFS, eventually rely on SQL for analysis. An example is how the definition of NoSQL has expanded to “Not Only SQL” and the emergence of SQL on Hadoop engines. So, developers should continue sharpening their skills on the latest languages and tools, but when it comes to scoring and perfecting their data models, there is no substitution for SQL-based data analytical platforms.
  • Ultimately, it’s all about the data. If you try to lock in the data model or build inflexible applications, you will be in trouble down the road. If you require pristine, perfectly clean data, your application won’t work in the real world at all.
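One bullet above suggests measuring connection and data-loading speed when evaluating a project. A minimal, hypothetical harness for that kind of check, using Python's built-in sqlite3 as a stand-in for whatever store is under evaluation:

```python
# Hypothetical evaluation harness: time how long it takes to connect
# to a source and bulk-load a sample of records, as a quick signal
# when comparing candidate data stores.
import sqlite3
import time


def timed_load(rows):
    t0 = time.perf_counter()
    conn = sqlite3.connect(":memory:")  # stand-in for a real connection
    connect_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    conn.commit()
    load_s = time.perf_counter() - t1

    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return connect_s, load_s, count


connect_s, load_s, count = timed_load(
    [(i, f"row-{i}") for i in range(10_000)]
)
```

The absolute numbers matter less than running the same harness, with the same sample, against each candidate.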
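The advice above about not rushing to build a predictive model can be made concrete with a tiny cleaning pass. This sketch is illustrative only (the thresholds and the glitch value are invented): it drops missing entries and clips extreme outliers using the median absolute deviation, which, unlike a mean/standard-deviation rule, is not itself distorted by the outlier:

```python
# Illustrative only: a minimal data-preparation step before modeling.
# Drops missing values, then clips outliers using the median absolute
# deviation (MAD), a robust spread estimate.
from statistics import median


def clean(values, k=5.0):
    # 1) Drop missing entries.
    present = [v for v in values if v is not None]
    # 2) Clip anything more than k MADs from the median.
    med = median(present)
    mad = median(abs(v - med) for v in present)
    lo, hi = med - k * mad, med + k * mad
    return [min(max(v, lo), hi) for v in present]


# One missing reading and one sensor glitch (500.0) among normal values:
raw = [9.0, 10.0, 11.0, 10.5, 9.5, None, 10.2, 9.8, 10.1, 9.9, 500.0]
cleaned = clean(raw)
```

A model fit on `raw` would be dominated by the 500.0 glitch; after this pass the glitch is pulled back toward the bulk of the data before any modeling begins.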
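The point above about SQL as the “lingua franca” of data holds even in a few lines of stdlib Python: the same declarative query carries across engines. A small sketch using the built-in sqlite3 module as a stand-in for any SQL-based analytical platform (the table and values are invented for illustration):

```python
# Sketch: the same declarative GROUP BY aggregation reads identically
# on SQLite, Postgres, Hive, or a SQL-on-Hadoop engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Aggregation expressed once, in SQL, rather than per-engine code:
rows = conn.execute(
    "SELECT user_id, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()
```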

Here’s who we spoke to:

Topics:
big data, dzone research, data engineering, data science

Opinions expressed by DZone contributors are their own.
