Lakehouse: Manus? MCP? Let’s Talk About Lakehouse and AI

This is the second article in the “Lakehouse: What’s the Big Deal?” series, where I will periodically discuss Lakehouse. Your comments and discussions are welcome.

By Mingyu Chen · Mar. 26, 2025 · Analysis

Since OpenAI launched ChatGPT in late 2022, AI has become an unavoidable topic in every field. Many companies have even transformed into AI companies overnight. The data analytics domain is no exception — Databricks, Snowflake, and Elasticsearch have all redefined themselves as AI data platforms or AI-ready data analytics and search products. Setting aside the “hype”, in today’s article, we’ll explore what relationship actually exists between Lakehouse and AI.

Before diving into this topic, let’s start with a simple example demonstrating the connection between data and AI.

First Experience With Apache Doris MCP Server

What Is MCP?

Model Context Protocol (MCP) is a standard, proposed by Anthropic in late 2024, for how applications provide context to large language models (LLMs). MCP’s official introduction compares the protocol to a USB-C port for AI applications: it gives AI models a convenient, uniform way to connect to different data sources and tools.

MCP adopts a client-server architecture with the following core concepts:

  • MCP client. Client programs that access MCP servers, such as Cursor, Claude Desktop, IDEs, etc.
  • MCP server. A lightweight program that provides specific functional interfaces through the standardized MCP protocol.
  • Local data sources and remote services. Resources that can be accessed by MCP servers, such as local files, web links, remote services, etc.

The following diagram shows the relationship between these concepts:
(Figure: The MCP architecture)

Overall, MCP provides a scalable, standardized interface between large models and data sources, making it easy to build AI agents that collaborate with one another. The recently viral Manus AI essentially unleashes the power of large models through the packaging and productization of AI agents.
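
To make the client-server model concrete, here is a minimal sketch of an MCP server written with the official MCP Python SDK. The server name and tool below are purely illustrative:

Python
# Minimal MCP server sketch using the official MCP Python SDK (pip install "mcp").
# The server name and tool are illustrative only.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")  # the name shown to MCP clients such as Cursor

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers; a real server would expose data-access tools instead."""
    return a + b

if __name__ == "__main__":
    # Serve over stdio, the transport Cursor uses for command-based servers
    mcp.run(transport="stdio")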

As of February 2025, more than 10 tools, including Claude Desktop, have integrated MCP, with community contributions exceeding 1,000 MCP servers covering multiple domains. Gartner predicts that by 2026, 30% of enterprise AI projects will adopt standardized protocols similar to MCP.

Apache Doris + MCP

Apache Doris is an open-source analytical database. Apache Doris MCP Server is an MCP server that exposes Doris data services; through it, we can let an LLM directly access and explore data stored in Doris. Let me demonstrate this briefly.

Let’s assume we have the latest version of the Cursor IDE installed locally and an accessible Doris cluster. Additionally, we need uv installed locally.

The complete code can be found at: https://github.com/morningman/mcp-doris

1. Adding Doris MCP Server

Open the Cursor Settings page, click on the “MCP” tab, and click “Add new MCP server.”
(Figure: Add new MCP server)

In the opened dialog, fill in the following information:

  • Name: The name of the MCP server; any name will do.
  • Type: Select “command.”
  • Command: Fill in the following content:
Shell
env DORIS_HOST=<doris-host> DORIS_PORT=<query-port> DORIS_USER=<doris-user> DORIS_PASSWORD=<doris-pwd> uv run --with mcp-doris --python 3.13 mcp-doris

Then, click “Add” to add an Apache Doris MCP Server. A small green dot indicates that the MCP server is running normally.

(Figures: Add an Apache Doris MCP server; the MCP server is running normally)

2. Interacting With Doris MCP Server

The current Doris MCP Server is relatively simple: it can only list databases and tables and execute SQL queries. Now we can open Cursor’s “Chat” page and switch to “Agent” mode.

Note that if you’re using an older version of Cursor, you’ll need to open the “Composer” page to interact with the MCP server.

Next, we can interact with the Doris database using natural language.


In the video demonstration, I asked Cursor to list the databases in Doris and analyze the contents of the TPCH database, and finally, I used natural language to ask a question about the TPCH dataset:

In the TPCH database, find which supplier should be selected to place an order for a given part in a given region.

Cursor, through multiple interactions with the MCP server, ultimately provided and executed an SQL example:

SQL
SELECT 
    s.s_suppkey, 
    s.s_name,
    s.s_address,
    s.s_phone,
    p.p_partkey,
    p.p_name,
    ps.ps_supplycost,
    ps.ps_availqty,
    n.n_name AS nation,
    r.r_name AS region
FROM 
    tpch.supplier s
    JOIN tpch.nation n ON s.s_nationkey = n.n_nationkey
    JOIN tpch.region r ON n.n_regionkey = r.r_regionkey
    JOIN tpch.partsupp ps ON s.s_suppkey = ps.ps_suppkey
    JOIN tpch.part p ON ps.ps_partkey = p.p_partkey
WHERE 
    r.r_name = 'REGION_NAME' AND  -- Replace with desired region
    p.p_partkey = PART_KEY        -- Replace with desired part
ORDER BY 
    ps.ps_supplycost ASC,         -- Primary: lowest cost
    ps.ps_availqty DESC           -- Secondary: highest availability
LIMIT 1;


As we can see, through the MCP server, we quickly implemented ChatBI-like capabilities. We can also add more MCP servers with different capabilities, and Cursor can combine them to complete complex tasks.
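
Cursor is not the only possible client, either. As a rough sketch, the same server can be driven programmatically with the MCP Python SDK’s stdio client. The tool name “query” and its argument below are assumptions about mcp-doris; inspect the result of list_tools() for the real names:

Python
# Sketch: driving an MCP server programmatically via the MCP Python SDK.
# The tool name "query" is an assumption; check list_tools() for real names.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="uv",
    args=["run", "--with", "mcp-doris", "--python", "3.13", "mcp-doris"],
    env={"DORIS_HOST": "127.0.0.1", "DORIS_PORT": "9030",
         "DORIS_USER": "root", "DORIS_PASSWORD": ""},
)

async def main():
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discover what the server exposes
            print([tool.name for tool in tools.tools])
            result = await session.call_tool("query", {"sql": "SHOW DATABASES"})
            print(result.content)

asyncio.run(main())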

We can further enrich the Doris MCP Server’s capabilities: for example, by adding interfaces that query system information to help diagnose operational issues in a Doris cluster, or by letting the server read database, table, and field comments so it better understands the meaning of the data and gives more accurate answers. A sketch of the latter idea follows.
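
The following hypothetical FastMCP tool reads column comments from information_schema over Doris’s MySQL protocol. It assumes pymysql and is not part of the current mcp-doris implementation:

Python
# Hypothetical extension: expose column comments so the LLM can ground its
# answers in schema documentation. Not part of the current mcp-doris.
import os
import pymysql
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("doris-schema")

def _connect():
    # Doris speaks the MySQL protocol on its query port
    return pymysql.connect(
        host=os.environ["DORIS_HOST"],
        port=int(os.environ["DORIS_PORT"]),
        user=os.environ["DORIS_USER"],
        password=os.environ.get("DORIS_PASSWORD", ""),
    )

@mcp.tool()
def describe_table(database: str, table: str) -> str:
    """Return column names, types, and comments for one table."""
    sql = (
        "SELECT COLUMN_NAME, COLUMN_TYPE, COLUMN_COMMENT "
        "FROM information_schema.columns "
        "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s"
    )
    with _connect() as conn, conn.cursor() as cur:
        cur.execute(sql, (database, table))
        rows = cur.fetchall()
    return "\n".join(f"{c}: {t} -- {cm or 'no comment'}" for c, t, cm in rows)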

Lakehouse: What Is the Big Deal?

Returning to our main topic, let’s discuss the relationship between Lakehouse and AI, and why Lakehouse can become the data foundation of the AI era.

Actually, the MCP server example above doesn’t showcase any particular advantage of Lakehouse or analytical databases; it seems we could achieve similar capabilities by replacing Doris with MySQL or any other system with data access capabilities. So what exactly are the advantages of the Data Lake and of data analytics engines like Doris in the AI domain?

Lakehouse in the AI Era

First is the Data Lake. In my previous article, I mentioned two of its core strengths: data sharing, and support for diverse workloads and collaboration. A Data Lake allows different compute engines to work collaboratively on shared data and integrates seamlessly with various tools and platforms. During AI development, data scientists and engineers can therefore process data with familiar tools, without complex migration or format conversion, which ensures data consistency and real-time access and keeps AI workflows running efficiently.

On the other hand, support for storing diverse data formats is another factor that equips the Data Lake for AI scenarios. Beyond tabular data, storing unstructured data and vector data lets us confidently meet the requirements of complex scenarios like Manus, which demand multiple capabilities working over shared data.

Of course, widely applied open lake formats such as Iceberg and Paimon still essentially belong to the structured and semi-structured tabular category, which is only one part of a data lake’s storage capability. I will discuss how a unified catalog service manages all data on the lake in future articles in this series.

Next is the data analytics engine. On the surface, the main functionality an engine like Apache Doris provides hasn’t changed significantly from the data era to the AI era: data analysis through SQL. In the AI era, however, application scenarios place higher demands on its capabilities:

Higher Performance

From the MCP server demonstration earlier, we can see that when we pose a question to an LLM in natural language, the model breaks it down into multiple sub-problems to process. With DeepSeek popularizing deep-thinking capabilities, a complex problem implies a longer chain of thought and more sub-problems.

If you carefully watch the model work through the problem in the example, you can see that it repeatedly interacts with the MCP server (and thus with Doris), exploring its way to a solution. Behind a single answered question are numerous data interactions, sometimes an outright explosion of generated requests.

Therefore, the response capability of the backend data source directly affects the user experience. Analytics engines like Doris, with their high performance and concurrent processing capabilities, make these interactive AI scenarios viable.
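
To illustrate the access pattern (a sketch, not a benchmark), the snippet below fires a burst of sub-queries concurrently, roughly the way an agent’s chain of thought fans out. The queries, connection details, and pymysql dependency are all assumptions:

Python
# Illustrative sketch, not a benchmark: an agent's chain of thought fans out
# into many small sub-queries. Connection details and queries are assumptions.
from concurrent.futures import ThreadPoolExecutor
import pymysql

SUB_QUERIES = [  # stand-ins for sub-problems an LLM might generate
    "SELECT COUNT(*) FROM tpch.supplier",
    "SELECT COUNT(*) FROM tpch.part",
    "SELECT r_name FROM tpch.region",
    "SELECT MIN(ps_supplycost), MAX(ps_supplycost) FROM tpch.partsupp",
]

def run_one(sql: str):
    conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return sql, cur.fetchall()
    finally:
        conn.close()

# Each user question can trigger a burst like this, so backend latency under
# concurrency directly shapes the perceived "thinking" time.
with ThreadPoolExecutor(max_workers=8) as pool:
    for sql, rows in pool.map(run_one, SUB_QUERIES):
        print(sql, "->", rows)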

Richer SQL Expression Capabilities

SQL provides reliable expressive power for problems, grounded in relational algebra. The big data field’s trajectory from SQL to NoSQL and back to SQL shows that SQL remains one of the most important capabilities in the data analytics domain.

On top of two-dimensional relational data, UDFs let us extend the engine with further capabilities, such as analyzing unstructured data or calling an LLM to analyze data. The remote UDF support in Apache Doris goes further, decoupling server-side and client-side computing and enabling more flexible data analysis functions.

Open Data API or Open Data Format?

In mainstream open data lakehouse architectures, openness typically takes the form of an Open Data Format; representative projects include Iceberg, Hudi, Paimon, and Delta Lake. These projects usually start from the storage perspective, defining a unified format standard for upstream systems to build on.

On the other hand, compute engines and data warehouse systems represented by BigQuery and Redshift expose their stored data externally through an Open Data API.

The following table lists the differences between these two “Open” forms:

(Table: Open Data API vs. Open Data Format)

As we can see, in terms of openness, Open Data Format has the edge, while in terms of functional completeness, Open Data API performs better.

Therefore, a data analytics engine in the AI era needs to support both open standards to meet the requirements of different scenarios. Apache Doris, a pioneer in this regard, not only supports access to mainstream open data formats such as Iceberg, but also provides its own open data API based on Apache Arrow Flight, enabling seamless integration with statistical analysis and model training.
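
For a taste of the API side, here is a sketch of reading Doris through an Arrow Flight SQL endpoint with the ADBC Flight SQL driver. The URI, port, and credentials are placeholders, and Arrow Flight must be enabled on the cluster:

Python
# Sketch: reading Doris via Arrow Flight SQL with the ADBC driver
# (pip install adbc-driver-flightsql pyarrow). URI/credentials are placeholders.
import adbc_driver_flightsql.dbapi as flight_sql

conn = flight_sql.connect(
    uri="grpc://127.0.0.1:9090",
    db_kwargs={"username": "root", "password": ""},
)
with conn.cursor() as cur:
    cur.execute("SELECT r_name FROM tpch.region")
    table = cur.fetch_arrow_table()  # columnar Arrow result, ready for pandas/ML
    print(table.to_pandas())
conn.close()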

To Be Continued

This article used MCP server + Doris as a starting point to explore how Lakehouse supports AI applications through unity, openness, functionality, and high performance. Essentially, AI is born from the results of data analysis and understanding, while AGI produces results by acting on data. Returning to first principles, powerful data analysis capabilities remain the prerequisite for everything.

In future articles, I will continue to discuss more features of Lakehouse, as well as the positioning of real-time data warehouses and real-time query engines like Doris in this architecture. Feel free to leave comments.

Tags: AI, Data Lake, Open Data, Apache

Opinions expressed by DZone contributors are their own.
