Databricks: An Understanding Inside the WH

This article summarizes the author's understanding of Databricks, with helpful links along the way.

By Barath Ravichander · Dec. 21, 23 · Opinion

Below is a short write-up of my understanding of Databricks. There are many data warehouses on the market, but here we focus on Databricks alone.

Databricks is conceptually similar to a data catalog backed by a Hive metastore. Your data resides in S3 object storage rather than in database-managed storage on an HDD or SSD.

Once the data is in S3, the process is similar to the Glue Data Catalog, where crawlers read the data and make it ready for users.

The source data can be in many formats, but internally it is stored in Parquet format.

The data stays in your S3 bucket, and Unity Catalog sits on top of it to provide fine-grained governance over that data.

Data Loading

One interesting feature that I liked in Databricks is Auto Loader. Incremental file loading is common practice with any OLTP or OLAP database, but the small difference here is that the file can be in any supported format: once the data structure is understood, the data is loaded into Parquet format. Say you have a CSV with four rows and four columns. The data is loaded into Databricks in Parquet format as soon as Auto Loader identifies the file in the specified location.
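As a rough sketch of the Auto Loader flow described above, the PySpark snippet below reads newly arrived files from a landing path and writes them to a Delta table. It assumes a Databricks notebook, where a `spark` session already exists. The helper names, paths, and table name are hypothetical placeholders, not values from this article.

```python
def autoloader_options(file_format: str, schema_location: str) -> dict:
    """Build the reader options for an Auto Loader (cloudFiles) stream."""
    return {
        "cloudFiles.format": file_format,              # format of the incoming files
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }


def start_autoloader(spark, source_path, target_table,
                     schema_location, checkpoint_location, file_format="csv"):
    """Incrementally load new files from source_path into a Delta table.

    `spark` is the session that Databricks notebooks provide. Every path and
    table name passed in is a hypothetical placeholder.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format, schema_location))
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_location)
        .trigger(availableNow=True)  # process the files available now, then stop
        .toTable(target_table)
    )


# Hypothetical usage inside a Databricks notebook:
# start_autoloader(spark, "s3://my-bucket/landing/", "bronze.orders",
#                  "s3://my-bucket/_schemas/orders",
#                  "s3://my-bucket/_checkpoints/orders")
```

The checkpoint location is what lets the stream remember which files it has already processed, so rerunning the job only picks up new arrivals.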

There are many other ways to load data, such as dbt tools, and we can also use Glue (if you are using AWS). You can read more on Glue with Delta Lakes.

Another way to load data, familiar from other data warehouses, is COPY INTO. In a SQL query, you give the source path and the name of the Delta table to copy into.

You can also work in SQL, Python, R, and Scala.

SQL
 
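-- Load files from the source path into an existing Delta table.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>  -- placeholder: for example CSV, JSON, or PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')  -- infer and merge the schema across source files
COPY_OPTIONS ('mergeSchema' = 'true');   -- let the target table schema evolve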


Reference SQL Source: COPY INTO
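If you prefer to drive COPY INTO from Python, one possible approach is to build the statement as a string and hand it to `spark.sql` in a notebook. The helper below is a minimal sketch. `build_copy_into` is a hypothetical name, it assumes trusted inputs (there is no escaping), and the table and path are the same placeholders as in the SQL above.

```python
def build_copy_into(table: str, source_path: str, file_format: str,
                    merge_schema: bool = True) -> str:
    """Return a COPY INTO statement for the given table, path, and format."""
    flag = "true" if merge_schema else "false"
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {file_format.upper()}\n"
        f"FORMAT_OPTIONS ('mergeSchema' = '{flag}')\n"
        f"COPY_OPTIONS ('mergeSchema' = '{flag}')"
    )


statement = build_copy_into("my_table", "/path/to/files", "csv")
print(statement)

# In a Databricks notebook, where `spark` is predefined, this could be run as:
# spark.sql(statement)
```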

The data is stored in Delta Lake in three tiers: Bronze, Silver, and Gold.

Bronze: Raw data ingestion and any historical data.

Silver: Cleansed, filtered, or augmented data.

Gold: Business-level aggregations.
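To make the three tiers concrete, here is a small illustration in plain Python rather than Spark. The records and the `to_silver` and `to_gold` helpers are hypothetical and only mimic what each tier is for. A real pipeline would read and write Delta tables instead of lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    {"order_id": "1", "region": "east", "amount": "10.50"},
    {"order_id": "2", "region": "west", "amount": "n/a"},   # unparseable amount
    {"order_id": "3", "region": "east", "amount": "4.50"},
    {"order_id": "3", "region": "east", "amount": "4.50"},  # duplicate row
]


def to_silver(rows):
    """Cleanse: drop bad amounts and duplicate order ids, then cast types."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip duplicates
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"],
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate: total amount per region for business reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'east': 15.0}
```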

Instance Types

There are two ways to spin up Databricks compute: serverless or on-demand instances. The on-demand instances can run the Photon engine on Graviton or other instance types.

It's easy to spin up one on the Databricks page: Databricks on AWS.

You can also calculate your instance pricing on the Pricing Calculator page.
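For the on-demand route, a cluster definition might look roughly like the sketch below. The field names follow the Databricks Clusters API as I understand it, and every value (cluster name, runtime version string, node type, worker count) is an assumption chosen for illustration. Check the current API reference before relying on any of them.

```python
import json


def cluster_spec(name: str, node_type: str, workers: int, photon: bool = True) -> dict:
    """Build a request body for creating an on-demand cluster.

    All values are hypothetical placeholders; field names are assumed to
    match the Databricks Clusters API.
    """
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # assumed runtime version string
        "node_type_id": node_type,             # for example, a Graviton instance type
        "num_workers": workers,
        "runtime_engine": "PHOTON" if photon else "STANDARD",
        "aws_attributes": {"availability": "ON_DEMAND"},
    }


spec = cluster_spec("demo-cluster", "m6gd.xlarge", workers=2)
print(json.dumps(spec, indent=2))
```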

IDE Setups

IDEs matter from a developer's perspective because they shape how your teams collaborate, run code, and commit it to the relevant code repositories.

There are a few options: notebooks, which are collaborative across development teams; the SQL Editor; or one of the many extensions available for IDEs. The one I liked was the VS Code plugin.

Conclusion

Databricks offers a data warehouse ecosystem that reads data directly from S3, so there is no need for a separate warehouse-managed storage layer. The combination of ingestion, the Data/AI platform, and data warehousing is the Databricks Lakehouse.

The above is my understanding of how Databricks works, based on my initial knowledge. I will keep adding more details to these blogs. Please share your experience with Databricks.

Links are given at regular checkpoints.
