
Databricks: An Understanding Inside the WH

This article presents an understanding of Databricks with helpful links and an explanation of the author's knowledge of the topic.

By Barath Ravichander · Dec. 21, 23 · Opinion

Below is a summarized write-up of my understanding of Databricks. There are many data warehouse offerings on the market, but here we will focus on Databricks alone.

Databricks takes a similar approach to a data catalog backed by a Hive metastore. Your data resides in S3, not in a storage database living on a local HDD or SSD.

Once the data is in S3, the process resembles the Data Catalog in AWS Glue: crawlers read the data and make it ready for users.
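As a toy illustration of what a crawler does, the sketch below infers column names and types from a small CSV sample. This is plain Python, not the actual Glue crawler, and the type names are hypothetical; a real crawler does the same job at scale against files in S3.

```python
import csv
import io

def infer_schema(csv_text: str) -> dict:
    """Infer a simple column -> type mapping from CSV text,
    roughly what a catalog crawler does at much larger scale."""
    reader = csv.DictReader(io.StringIO(csv_text))
    order = {"int": 0, "double": 1, "string": 2}
    schema = {}
    for row in reader:
        for col, value in row.items():
            # Widen types as needed: int -> double -> string.
            try:
                int(value)
                inferred = "int"
            except ValueError:
                try:
                    float(value)
                    inferred = "double"
                except ValueError:
                    inferred = "string"
            prev = schema.get(col, "int")
            schema[col] = max(prev, inferred, key=lambda t: order[t])
    return schema

sample = "id,price,city\n1,9.99,Austin\n2,12.50,Boston\n"
print(infer_schema(sample))  # {'id': 'int', 'price': 'double', 'city': 'string'}
```

Once a schema like this is registered in the catalog, query engines can treat the raw files as tables.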

The source data can arrive in any format, but it is stored internally in Parquet format.

Data lives in your S3 bucket, and on top of it sits Unity Catalog, which applies fine-grained governance to your data before it is consumed.

Data Loading

One interesting feature I liked in Databricks is Auto Loader. Incremental file ingestion is common practice in OLTP and OLAP databases, but the twist here is that the incoming file can be in any format: after understanding the data structure, Auto Loader loads it in Parquet format. Say you have a CSV with four rows and four columns; as soon as Auto Loader identifies the file in the specified location, the data is loaded into Databricks in Parquet format.
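The core idea behind Auto Loader is that it only ever processes files it has not seen before. The stdlib-only sketch below imitates that semantics with an in-memory ledger; the real Auto Loader uses Spark streaming checkpoints rather than a Python set, so treat this purely as an illustration of the behavior.

```python
import os
import tempfile

def discover_new_files(directory: str, ledger: set) -> list:
    """Return files not yet processed, then record them in the ledger.
    Auto Loader tracks already-ingested files similarly, but durably,
    via streaming checkpoints instead of an in-memory set."""
    new_files = [f for f in sorted(os.listdir(directory)) if f not in ledger]
    ledger.update(new_files)
    return new_files

# Demo: drop two files, scan, drop one more, scan again.
with tempfile.TemporaryDirectory() as d:
    ledger = set()
    for name in ("a.csv", "b.csv"):
        open(os.path.join(d, name), "w").close()
    print(discover_new_files(d, ledger))  # ['a.csv', 'b.csv']
    open(os.path.join(d, "c.csv"), "w").close()
    print(discover_new_files(d, ledger))  # ['c.csv']
```

On the second scan, only the newly arrived file is returned, which is what makes re-running the ingestion safe.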

There are many other ways to load data, such as dbt tools, and if you are on AWS, you can also use Glue — you can read more on Glue with Delta Lake.

Another way to load, familiar from any data warehouse, is COPY INTO: using a SQL statement, you give the source path and copy the data into a Delta table by name.

You can also work in SQL, Python, R, and Scala.

SQL
 
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>  -- e.g., CSV, JSON, or PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');


Reference SQL Source: COPY INTO

The data is stored in Delta Lake in three tiers: Bronze, Silver, and Gold.

Bronze: Raw data ingestion and any historical data.

Silver: Cleansed, filtered, or otherwise augmented data.

Gold: Business-level aggregations.
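The three tiers can be sketched as plain transformations. This is a toy Python illustration with made-up records, not Databricks APIs; in practice each tier is a Delta table and the transformations run in Spark.

```python
# Bronze: raw records exactly as ingested, malformed rows included.
bronze = [
    {"order_id": "1", "region": "east", "amount": "100"},
    {"order_id": "2", "region": "west", "amount": "250"},
    {"order_id": "3", "region": "east", "amount": "bad"},  # malformed row
    {"order_id": "4", "region": "east", "amount": "50"},
]

# Silver: cleansed and typed -- drop rows whose amount is not numeric.
silver = [
    {**row, "amount": int(row["amount"])}
    for row in bronze
    if row["amount"].isdigit()
]

# Gold: business aggregation -- total sales per region.
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0) + row["amount"]

print(gold)  # {'east': 150, 'west': 250}
```

The point of the layering is that each tier is reproducible from the one below it, so a bug in the Gold aggregation never forces you to re-ingest raw data.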

Instance Types

There are two ways to spin up Databricks compute: serverless or on-demand instances. The on-demand instances can be Photon-enabled, running on Graviton or other instance types.

It's easy to spin up one on the Databricks page: Databricks on AWS.

You can also calculate your instance pricing on the Pricing Calculator page.

IDE Setups

IDEs matter from a developer's perspective: they shape how your teams collaborate, run code, and commit it to the relevant code repositories.

There are a couple of options: notebooks, which development teams can collaborate on; the SQL Editor; or one of the many extensions available for your IDE. The one I liked was the VS Code plugin.

Conclusion

Databricks offers a data warehouse ecosystem that reads data directly from S3, so there is no need for a separate storage layer. The combination of ingestion, a data/AI platform, and data warehousing is the Databricks Lakehouse.

Above is my understanding of how Databricks works, based on my initial knowledge. I will keep adding more details to these posts. Please share your experience with Databricks.

Links are given at regular checkpoints.


Opinions expressed by DZone contributors are their own.
