Generative AI and the Future of Data Engineering

Generative AI is revolutionizing the world. Proper data engineering and observability are critical for its success. Here's what it means for data engineering.

By Lior Gavish · Jul. 12, 23 · Opinion

Maybe you’ve noticed the world has dumped the internet, mobile, social, cloud, and even crypto in favor of an obsession with generative AI.

But is there more to generative AI than a fancy demo on Twitter? And how will it impact data? 

Let’s assess.

How Generative AI Will Disrupt Data

With the advent of generative AI, large language models became much more useful to the vast majority of humans. 

Need a drawing of a dinosaur riding a unicycle for your three-year-old’s birthday party? Done. How about a draft of an email to employees about your company’s new work-from-home policy? Easy as pie. 

It’s inevitable that generative AI will disrupt data, too. After speaking with hundreds of data leaders across companies from Fortune 500s to startups, we came up with a few predictions:

Access to Data Will Become Much Easier – And More Ubiquitous

Chat-like interfaces will allow users to ask questions about data in natural language. People who are not proficient in SQL and business intelligence will no longer need to ask an analyst or analytics engineer to create a dashboard for them. Simultaneously, those who are proficient will be able to answer their own questions and build data products more quickly and efficiently.

This will not displace SQL and business intelligence (or data professionals, for that matter), but it will lower the bar for data access and open it up to more stakeholders across more use cases. As a result, data will become more ubiquitous and more useful to organizations, with the opportunity to drive greater impact.

Simultaneously, Data Engineers Will Become More Productive

In the long term, bots may eat us (just kidding - mostly), but for the foreseeable future, generative AI won’t replace data engineers; it will just make their lives easier - and that’s great. Check out what GitHub Copilot does if you need more evidence.

While generative AI will relieve data professionals of some of their more ad hoc work, it will also give data people AI-assisted tools to more easily build, maintain, and optimize data pipelines. Generative AI models are already great at creating SQL/Python code, debugging it, and optimizing it, and they will only get better.

These enhancements may be baked into current staples of your data stack, or they may arrive as entirely new solutions engineered by a soon-to-be-launched seed-stage startup. Either way, the outcome will be more data pipelines and more data products to be consumed by end users.

Still, like any change, these advancements won’t be without their hurdles. Greater data access and greater productivity increase both the criticality of data and its complexity, making data harder to govern and trust. 

I don’t predict that bots shaped like Looker dashboards and Tableau reports will run amok. Still, I do foresee a world in which pipelines turn into figurative Frankenstein’s monsters, and business users rely on data with little insight into where the data came from or guidance on what to use. Data governance and reliability will become much more important in this brave new world.

Software engineering teams have long been practicing DevOps and automating their tooling to improve developer workflows, increase productivity, and build more useful products - all while ensuring the reliability of complex systems. 

Similarly, we are going to have to step up our game in the data space and become more operationally disciplined than ever before. Data observability will play a similar role for data teams, helping them manage the reliability of data - and data products - at scale, and it will only become more critical and powerful.

Building, Tuning, and Leveraging LLMs 

Last month, Datadog announced that they are integrating with ChatGPT to better manage the performance and reliability of OpenAI APIs by tracking usage patterns, cost, and performance. 
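To make "tracking usage patterns, cost, and performance" concrete, here is a minimal sketch of the idea in Python. It is not Datadog's integration or the OpenAI client: `UsageTracker`, `fake_model`, and the price per 1,000 tokens are all hypothetical names and values chosen for illustration, and a real setup would wrap an actual LLM client and ship these numbers to a monitoring backend.

```python
import time
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Records token usage and latency for each model call (illustrative only)."""

    # Hypothetical price per 1,000 tokens; real pricing varies by provider and model.
    price_per_1k_tokens: float = 0.002
    calls: list = field(default_factory=list)

    def record(self, call_model, prompt):
        """Run call_model(prompt), which must return (text, tokens_used), and log it."""
        start = time.perf_counter()
        text, tokens_used = call_model(prompt)
        latency = time.perf_counter() - start
        self.calls.append({"tokens": tokens_used, "latency_s": latency})
        return text

    def summary(self):
        """Aggregate call count, total tokens, estimated cost, and worst latency."""
        total_tokens = sum(c["tokens"] for c in self.calls)
        return {
            "calls": len(self.calls),
            "total_tokens": total_tokens,
            "estimated_cost": total_tokens / 1000 * self.price_per_1k_tokens,
            "max_latency_s": max((c["latency_s"] for c in self.calls), default=0.0),
        }


def fake_model(prompt):
    # Stand-in for a real LLM client: echoes the prompt in upper case and
    # reports the number of whitespace-separated words as "tokens".
    return prompt.upper(), len(prompt.split())


tracker = UsageTracker()
tracker.record(fake_model, "summarize yesterday's pipeline failures")
print(tracker.summary())
```

Passing the model call in as a function keeps the tracker independent of any particular provider, which is the main design choice in this sketch.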

Monitoring the OpenAI API is a big step, but what happens when data teams start using LLMs as part of their data processing pipelines? What happens when teams use their own datasets to fine-tune LLMs or even create them from scratch? Needless to say, broken pipelines and faulty data will severely impact the quality and reliability of the end product.

During Snowflake’s Q1 2023 earnings call, Frank Slootman, CEO of Snowflake, argued that “generative AI is powered by data. That’s how models train and become progressively more interesting and relevant... You cannot just indiscriminately let these [LLMs] loose on data that people don’t understand in terms of its quality and its definition and its lineage.”

We’ve already seen the implications of unreliable model training before the advent of LLMs. Just last year, Equifax, the global credit giant, shared that an ML model trained on bad data caused them to send lenders incorrect credit scores for millions of consumers. And not long before that, Unity Technologies reported a revenue loss of $110M due to bad ads data fueling its targeting algorithms. 

According to Slootman (and likely execs at Equifax and Unity now, too), having AI simply isn’t enough to succeed with it - you need to manage its reliability, too. Not just that, but teams need an automated, scalable, end-to-end, and comprehensive approach to managing the detection, resolution, and, ultimately, prevention of bad models powered by bad data. 

Data observability will play a key role in bringing LLMs to production and making them reliable enough for companies and individuals to adopt in production use cases.

Data observability gives teams critical insights into the health of their data at each stage in the pipeline, automatically monitoring data and letting you know when systems break. Data observability also surfaces rich context with field-level lineage, logs, correlations, and other insights that enable rapid triage, incident resolution, and effective communication with stakeholders impacted by data reliability issues - all critical for both trustworthy analytics and AI products.
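As a rough illustration of what "automatically monitoring data" can mean in practice, the sketch below runs three common health checks (volume, freshness, and null rate) over a batch of rows. It is a simplified assumption of how such checks might look, not any vendor's implementation: `check_table_health`, `Alert`, the field names, and every threshold are hypothetical, and real observability tools typically learn thresholds from history and add lineage on top.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    check: str   # which check fired: "volume", "freshness", or "null_rate"
    detail: str  # human-readable explanation


def check_table_health(rows, now, *, timestamp_field="updated_at",
                       required_fields=("user_id",),
                       max_staleness=timedelta(hours=24),
                       min_rows=1, max_null_rate=0.05):
    """Return a list of Alerts for a batch of rows (dicts). Thresholds are illustrative."""
    alerts = []

    # Volume: did we receive at least the expected number of rows?
    if len(rows) < min_rows:
        alerts.append(Alert("volume", f"expected at least {min_rows} rows, got {len(rows)}"))

    # Freshness: is the newest record recent enough?
    timestamps = [r[timestamp_field] for r in rows if r.get(timestamp_field) is not None]
    if not timestamps:
        alerts.append(Alert("freshness", "no timestamps found"))
    elif now - max(timestamps) > max_staleness:
        alerts.append(Alert("freshness", f"newest record is from {max(timestamps)}"))

    # Null rate: are required fields populated often enough?
    for name in required_fields:
        if rows:
            null_rate = sum(1 for r in rows if r.get(name) is None) / len(rows)
            if null_rate > max_null_rate:
                alerts.append(Alert("null_rate", f"{name}: {null_rate:.0%} null"))

    return alerts


now = datetime(2023, 7, 12, 12, 0)
batch = [{"user_id": None, "updated_at": now - timedelta(days=3)} for _ in range(10)]
for alert in check_table_health(batch, now):
    print(alert.check, "-", alert.detail)
```

Running checks like these before data reaches a dashboard or a fine-tuning job is one way to catch a broken pipeline before it affects the end product.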


Published at DZone with permission of Lior Gavish. See the original article here.

Opinions expressed by DZone contributors are their own.
