
Devs and Data, Part 2: Ingesting Data at High Velocity


We take a look at how developers and data pros ingest data and how they work with one of the three Vs of big data: velocity.


This article is part of the Key Research Findings from the new DZone Guide to Big Data: Volume, Variety, and Velocity. 

Introduction

Welcome back! In Part 1, we covered how the software industry is becoming much more data-driven and how the field of big data is growing. In this post, we examine how technologists perform data ingestion when dealing with high-velocity data.

As a quick reminder of our methodology: for this year's big data survey, we received 459 responses with a 78% completion rate. Based on this sample size, we calculated the margin of error for the survey to be 5%.
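For the curious, the standard margin-of-error formula for a proportion at a 95% confidence level, assuming worst-case variance (p = 0.5), roughly reproduces that figure. The survey's exact methodology isn't stated, so treat this as an illustration only:

```python
# Illustrative sanity check on the reported 5% margin of error,
# assuming the standard 95%-confidence formula with p = 0.5.
import math

n = 459   # survey responses
z = 1.96  # z-score for 95% confidence
p = 0.5   # worst-case proportion

moe = z * math.sqrt(p * (1 - p) / n)
print(f"{moe:.1%}")  # ~4.6%, consistent with the reported 5%
```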

Data Velocity

Types of Data to Ingest

When we asked respondents which data types give them the most trouble with regard to data velocity, two types saw noticeable increases over last year: relational data (flat tables) and event data. In 2018, 33% of respondents reported relational data as a velocity issue; this year, that rose to 38%. For event data, the percentage of respondents reporting issues rose from 23% to 30%. Interestingly, relational data seems to be a far bigger issue for R users than for Python developers: among those who use Python for data science, only 8% reported relational data as a velocity issue, while 30% of R users told us they've had problems with it.


We also asked respondents which data sources give them trouble when dealing with high-velocity data. Two of the reported issues fell drastically from our 2018 survey: the percentage of survey-takers reporting server logs as an issue fell by 10%, and those reporting user-generated data fell from 39% in 2018 to 20% this year. Despite these positive trends, the share of respondents who said files (e.g., documents, media) give them trouble rose from 26% last year to 36%.

Tools of the Data Trade

The tools and frameworks that data professionals and developers use for data ingestion also saw interesting shifts over the past year. 66% of survey-takers reported using Apache Kafka to perform data ingestion, up from 61% last year; Kafka has been the most popular ingestion framework for a while now, and its popularity only continues to climb. For stream processing, Spark Streaming came out on top, with 49% of respondents telling us they use this framework (a 14% increase over last year). For data serialization, however, respondents were split between two popular choices: 36% told us they work with Avro (up from 18% in 2018), and 30% reported using Parquet (also up from 18% in 2018).
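As a rough illustration of how two of these pieces fit together, here is a minimal sketch in Python (using the kafka-python and fastavro libraries) that serializes event records with Avro and publishes them to a Kafka topic. The broker address, topic name, and event schema are all illustrative assumptions, not details from the survey:

```python
# Minimal sketch: Avro-serialize event records and publish them to Kafka.
# Assumes kafka-python and fastavro are installed, and a broker at
# localhost:9092 with a topic named "events" -- all illustrative values.
import io
import time

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer

# Hypothetical event schema, for illustration only.
schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def publish(event: dict) -> None:
    # Serialize the record to Avro binary, then send it to the topic.
    buf = io.BytesIO()
    schemaless_writer(buf, schema, event)
    producer.send("events", buf.getvalue())

publish({"user_id": "u-42", "action": "click", "ts": int(time.time() * 1000)})
producer.flush()  # block until buffered messages are delivered
```

In a real pipeline, a stream processor such as Spark Streaming would typically sit downstream of the topic, consuming and transforming these records as they arrive.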

That's all for this look into data ingestion and high-velocity data. Tune back in on Monday, when we'll look at what our respondents had to say about data management and volume.




