Skills Developers Need for Big Data

To work with big data, developers need to understand the business problem they are working on, along with the deployment architectures and data.

To gather insights on the state of big data in 2018, we talked to 22 executives from 21 companies who are helping clients manage and optimize their data to drive business value. We asked them, "What skills do developers need to have to work on big data projects?" Here's what they told us.

Know the Business Problem

  • Look at your work from a data-centric perspective. What data do you have, what do you want to know, and how can you fill in the gaps to solve the problem or answer the question? (A minimal sketch of this first pass follows this list.)
  • Developers need a variety of skills to work on big data projects; three are crucial for success. First, they must have a clear understanding of the range of business objectives within a company and how those align with the capabilities of various technologies. Second, in the context of an application, they need to understand the business value of the datasets they're working with. Finally, they need the ability to build and manage an application as a member of a self-contained team that's part of a larger organization.
  • Understand the use case and figure out the best solution stack to achieve your goals. Take a skeptical view of tools and toolsets: evaluate free tools, share your findings with other developers, and compare notes.
  • Develop talent in the core basics: build on math and standards, and understand data structures, frameworks, and models. Understand the business application, meaning how the information will be used in the business. A number of intuitive tools are available to reduce the initial difficulty. The ideal combination of skills involves statistical and mathematical knowledge, experience with data modeling, some programming experience, and a degree of business domain acumen. While it's fairly rare to find individuals with that full combination (a true data scientist), certain toolsets and systems can reduce the need for serious programming experience, help with the data modeling, and even lessen the reliance on a deep understanding of the mathematical models behind the predictions.
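
To make the data-centric perspective from the first bullet concrete, here is a minimal Python sketch of that first pass over a dataset: inventory what you have, locate the gaps, and decide whether they can be filled. The DataFrame and its column names are hypothetical, and pandas is assumed to be available.

```python
import pandas as pd

# Hypothetical customer dataset; columns are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["east", "west", None, "east"],
    "lifetime_value": [1200.0, None, 340.0, 95.0],
})

# Step 1: what data do you have?
print(df.dtypes)

# Step 2: where are the gaps that block the business question?
missing = df.isna().mean().sort_values(ascending=False)
print(missing)  # fraction of missing values per column

# Step 3: inspect the incomplete rows and decide whether the gaps
# can be filled (imputation, another source) or whether the
# question needs to be reframed.
print(df[df.isna().any(axis=1)])
```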

Deployment Architectures

  • The future is AI/ML. Think microservices: scale in the cloud and tie into cognitive and AI/ML tools (a small service sketch follows this list). This requires a different, bigger-picture mentality.
  • Be cognizant of the cloud, microservices, geographic distribution, and security. 
  • Understand the architecture of popular open source systems. Keep up with the trends — what’s changing. 
  • System architecture, software engineering, machine learning, advanced analytics.
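
As a rough illustration of the microservices mentality described above, the sketch below wraps a scoring function in a small Flask service. The endpoint, payload shape, and placeholder "model" are assumptions made for the example, not a prescribed design; in practice the model would be loaded from a registry, and the service would be containerized and scaled by the platform.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Placeholder standing in for a real AI/ML model; assume the
    # real thing is versioned, monitored, and loaded at startup.
    return sum(features) / len(features)

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json(force=True)  # e.g. {"features": [0.2, 0.7]}
    return jsonify({"prediction": predict(payload["features"])})

if __name__ == "__main__":
    # Containerize this and let the platform handle scaling,
    # service discovery, and geographic distribution.
    app.run(host="0.0.0.0", port=8080)
```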

Data

  • While developers rule right now, platforms will converge in order to scale, so you'd better understand Kafka (a producer/consumer sketch follows this list). You don't have to know all of the coding, since there will be tools that remove the connectivity challenges. Stay in touch with what is happening with the data you are collecting and ensure it is being used ethically and securely.
  • Leverage a data fabric to simplify processes, and treat data as a general resource alongside containers and microservices. That transformation simplifies processes and lets teams pursue business moments, adds the intelligence to make processes more tailored and reactive, and exposes quality issues and their root causes. It makes developers' jobs easier, so they're able to contribute more.
  • Learn to integrate resources for building applications and recommendation engines, complementing the software stack, ML libraries, and compute resources. Think about how people will be using data going forward: structure data so that it's easy to use, and make that part of the problem trivial.
  • Embrace non-relational data models like documents and semi-structured data. Quite often you will need to denormalize data for the sake of analysis (see the pandas sketch after this list).
  • Understand the basic data vocabulary of structure, dimensions, and variables, and what kind of analysis can be done with a given variable. Know the gotchas for data: what are the minimum quality standards, and what tests can you run to determine data integrity? (A few such tests appear in the pandas sketch after this list.)
  • Learn how to work with data at scale and handle concurrency across many users. Application developers pick up languages quickly; the harder part is understanding how the data ecosystem works.
  • Developers need programming languages, probability and statistics, applied math, and algorithms for the rising trend of machine learning. They also need to understand the context of data: how it will be consumed by the end user and how it will be reused. They need to think in terms of distributed computing and architecture to properly separate data management into distinct zones and keep the big data architecture organized, agile, and secure. DevOps principles should be applied, too: by being involved throughout the software delivery process, data experts can help the rest of the team understand the types of data challenges their software will face in production. The result of big data and DevOps teams working together will be apps whose real-world behavior matches as closely as possible their behavior in development and testing environments.
  • Data engineering and data science are the big divisions. Basic knowledge of data science might suffice, but deep knowledge of the different data technologies is necessary. Despite NoSQL's popularity, SQL is still the standard for querying data. Developers need to be aware of the different deployment options: cloud-native, containers, and the other popular choices. But my personal view is that developers need to know the underlying concepts of databases, without which the ton of technologies in the space can seem daunting to learn. A good understanding of database and system concepts, such as consistency guarantees, transactional boundaries, system architecture, and the division of responsibilities, will help developers understand the landscape, categorize the technologies, and identify the ones they should be looking into.
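
For the Kafka recommendation above, here is a minimal producer/consumer sketch using the kafka-python client. The broker address, topic name, and event shape are assumptions for illustration.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Assumes a broker at localhost:9092 and a topic named "events".
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # one decoded event per record
    break  # demo: read a single message and stop
```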
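
The points above about denormalizing for analysis and testing data integrity can be sketched in a few lines of pandas; the tables, schema, and checks are hypothetical examples.

```python
import pandas as pd

# Two normalized tables (hypothetical schema).
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "customer_id": [1, 2, 2],
    "amount": [50.0, 20.0, -5.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["east", "west"],
})

# Denormalize for analysis: one wide table, one row per order.
flat = orders.merge(customers, on="customer_id", how="left")

# Basic integrity tests before trusting the data:
assert flat["order_id"].is_unique, "duplicate order ids"
assert flat["region"].notna().all(), "orders with unknown customers"
print(flat[flat["amount"] < 0])  # domain check: negative order totals
```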

Other

  • Understand that the big data world is decentralized and distributed by nature, and understand the pitfalls of high availability, latency, and debugging. Understand the concepts of in-memory processing with Spark and data locality with Hadoop, as well as the open source options for AI/ML with Apache Spark (a PySpark sketch follows this list). You are not restricted to big frameworks; look into more simplified frameworks.
  • Technology is proliferating, and the skills required are very different for Hadoop, MapReduce, and Python. Most developers gravitate toward a particular interest: decide where you want to focus and go deeper in that area, whether that's building apps in JavaScript, Node, or Java, or mobile apps for iOS. In analytics, there's a lot of SQL, and it's not going away. C and C++ are good for performance, while Java and Python probably have the most database support. Pick a language that's broad. If you want to be involved in ML, learn Python.
  • Developers should not need to be aware of any specialized development languages and should be able to focus on identifying the core business logic needed to deliver big data projects. A systematic and AI-driven data platform can then translate the business logic into the underlying processing, enabling developers to be future-proofed for any changes to processing technology frameworks.
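
To illustrate the in-memory point made in the first bullet above, here is a minimal PySpark sketch that caches a toy dataset and fits a logistic regression with Spark's built-in ML library. The data is invented for the example; in practice it would be read from a distributed store such as HDFS, where Spark also exploits data locality when scheduling work.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("in-memory-ml-demo").getOrCreate()

# Toy training data; real jobs would read from HDFS/S3 instead.
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"],
)

# cache() keeps the dataset in memory across the iterative
# passes the optimizer makes during training.
train.cache()

model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)

spark.stop()
```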

Here’s who we spoke to:

  • Emma McGrattan, S.V.P. of Engineering, Actian
  • Neena Pemmaraju, VP, Products, Alluxio Inc.
  • Tibi Popp, Co-founder and CTO, Archive360
  • Laura Pressman, Marketing Manager, Automated Insights
  • Sébastien Vugier, SVP, Ecosystem Engagement & Vertical Solutions, Axway
  • Kostas Tzoumas, Co-founder and CEO, Data Artisans
  • Shehan Akmeemana, CTO, Data Dynamics
  • Peter Smails, V.P. of Marketing and Business Development, Datos IO
  • Tomer Shiran, Founder and CEO, and Kelly Stirman, CMO, Dremio
  • Ali Hodroj, Vice President Products and Strategy, GigaSpaces
  • Flavio Villanustre, CISO and V.P. of Technology, HPCC Systems
  • Fangjin Yang, Co-founder and CEO, Imply
  • Murthy Mathiprakasam, Director of Product Marketing, Informatica
  • Iran Hutchinson, Product Manager & Big Data Analytics Software/Systems Architect, InterSystems
  • Dipti Borkar, V.P. of Products, Kinetica
  • Adnan Mahmud, Founder and CEO, LiveStories
  • Jack Norris, S.V.P. Data and Applications, MapR
  • Derek Smith, Co-founder and CEO, Naveego
  • Ken Tsai, Global V.P., Head of Cloud Platform and Data Management, SAP
  • Clarke Patterson, Head of Product Marketing, StreamSets
  • Seeta Somagani, Solutions Architect, VoltDB
