Building a Reusable Framework to Standardize API Ingestion in an On-Prem Lakehouse

A reusable ingestion framework standardizes connector behavior, reducing maintenance and making onboarding easier to scale.

By Kuladeep Sandra · Ashwin Ramesh Kumar · May. 21, 26 · Analysis

In many enterprise lakehouse environments, the biggest ingestion challenge is not data volume; it is inconsistency.

As platforms grow, data starts arriving from many different systems through REST APIs, SOAP services, SFTP drops, database extracts, queues, and other interfaces. In many teams, these integrations are built one by one to solve immediate business needs. Over time, that creates a fragmented connector landscape where every source behaves a little differently.

One connector may implement retries one way. Another may use a different authentication pattern. A third may handle pagination, validation, and failures entirely differently. The result is a platform that works, but becomes harder to operate, extend, and support as the number of sources increases.

That is the problem this framework was designed to solve.

The Core Problem

Custom connectors are easy to justify in the short term. Each one is tailored to its source and can be delivered quickly. But as the number of integrations grows, so does the operational burden.

Failures become harder to troubleshoot because each connector has its own behavior. Onboarding new engineers takes longer because they must learn multiple implementation styles. Adding new sources becomes slower because common concerns, such as retry handling, pagination, authentication integration, validation, and dead-letter routing, are repeatedly rebuilt.

At that point, the issue is no longer ingestion itself. It is the lack of a consistent ingestion model.

The Design Objective

The goal was to create a reusable ingestion framework that made new source onboarding simple, operational behavior consistent, and maintenance easier over time.

The framework needed to do four things well:

  • Standardize common operational concerns
  • Minimize source-specific code
  • Remain easy to debug
  • Support a wide range of batch and near-batch integrations without becoming overly abstract

The solution was a connector framework built around a base class, declarative configuration, and clearly defined lifecycle hooks.

In this model, engineers only need to implement two source-specific methods:

  • fetch() to call the source and retrieve raw data
  • parse() to convert that raw payload into a DataFrame with a defined schema

Everything else is managed by the framework.

The Connector Pattern

The base connector owns the cross-cutting concerns that should not be reimplemented for every source.

That includes:

  • Session creation
  • Authentication resolution
  • Pagination orchestration
  • Retry handling
  • Rate-limit enforcement
  • Validation
  • Writing valid records
  • Routing invalid records to dead-letter storage

A simplified version of the pattern looks like this:

Python
 
from abc import ABC, abstractmethod

import requests
import yaml
from pyspark.sql import DataFrame, SparkSession

class BaseApiConnector(ABC):
    """Owns the shared ingestion lifecycle; subclasses supply fetch() and parse()."""

    def __init__(self, config_path: str, spark: SparkSession):
        # Load the declarative source configuration.
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.spark = spark
        self.session = self._build_session()

    def _build_session(self) -> requests.Session:
        # One shared HTTP session per connector, with configured headers and auth.
        session = requests.Session()
        session.headers.update(self.config.get("headers", {}))
        auth = self._resolve_auth()
        if auth:
            session.auth = auth
        return session

    @abstractmethod
    def fetch(self, params: dict) -> dict:
        """Call the source and return raw response data."""
        ...

    @abstractmethod
    def parse(self, raw_data: dict) -> DataFrame:
        """Convert raw response into a DataFrame."""
        ...

    def run(self) -> None:
        # Framework-owned lifecycle: fetch all pages, parse, validate, write.
        all_data = self._paginated_fetch()
        df = self.parse(all_data)
        validated, dead = self._validate(df)
        self._write(validated)
        self._write_dead_letters(dead)

    # _resolve_auth, _paginated_fetch, _validate, _write, and _write_dead_letters
    # are framework helpers omitted from this simplified listing.

The key idea is the separation of responsibility.

The framework owns common ingestion behavior. The connector implementation only owns source-specific logic.

That keeps the design simple without forcing every connector to solve the same operational problems repeatedly.

Why Declarative Configuration Matters

A reusable framework only works if source behavior can be defined without constantly changing code.

Each source is therefore described through configuration. A typical configuration includes:

  • Source metadata
  • Connection settings
  • Authentication reference
  • Pagination strategy
  • Schema expectations
  • Retry overrides
  • Rate-limit settings

For example:

YAML
 
source:
  name: customer-api
  type: rest_api
  schedule: "every few hours"

connection:
  base_url: https://example-api.company.com
  auth:
    type: oauth2_client_credentials
    secret_reference: customer-api-credentials
  timeout_seconds: 30

pagination:
  type: cursor
  cursor_field: "nextPageToken"
  page_size: 1000

schema:
  required_columns: [id, name, status, created_at]
  output_path: /bronze/domain/entity
  format: parquet

retry:
  max_attempts: 3
  backoff_strategy: exponential

rate_limit:
  requests_per_second: 5


This approach has two major advantages.

First, it reduces the amount of code required to onboard a new source. Second, it makes source behavior more transparent. Engineers can understand how a connector behaves by reading its configuration rather than tracing through custom implementations.
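Because so much behavior now lives in configuration, it can help to check a source definition before a connector runs. The following is a minimal sketch, assuming the YAML has already been parsed into a dictionary shaped like the example above. The function name `validate_config`, the list of required sections, and the set of keys treated as inline secrets are illustrative assumptions, not part of the framework described here.

```python
from typing import List

# Assumed conventions for this sketch; a real framework would define its own.
REQUIRED_SECTIONS = ("source", "connection", "pagination", "schema")
SUPPORTED_PAGINATION = {"cursor", "offset", "token"}
INLINE_SECRET_KEYS = {"password", "client_secret", "api_key", "token"}


def validate_config(config: dict) -> List[str]:
    """Return human-readable problems; an empty list means the config passed."""
    problems = []

    # Every source definition must carry the core sections.
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"missing or malformed section: {section}")

    connection = config.get("connection")
    if isinstance(connection, dict):
        if not connection.get("base_url"):
            problems.append("connection.base_url is required")
        auth = connection.get("auth")
        if isinstance(auth, dict):
            # Credentials should be referenced, never embedded.
            for key in sorted(INLINE_SECRET_KEYS & set(auth)):
                problems.append(f"connection.auth.{key} must not be stored inline")

    pagination = config.get("pagination")
    if isinstance(pagination, dict):
        if pagination.get("type") not in SUPPORTED_PAGINATION:
            problems.append(
                f"unsupported pagination.type: {pagination.get('type')!r}"
            )

    return problems
```

Returning a list of problems rather than raising on the first one lets an engineer fix a new source definition in a single pass.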

Sensitive values should not be stored directly in configuration. Instead, configuration should reference a centralized secret management mechanism and resolve credentials securely at runtime.
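One way to picture runtime resolution is a small lookup that turns a `secret_reference` into credentials. The sketch below uses environment variables purely as a stand-in for a secret manager. The `resolve_secret` function, the `SecretNotFoundError` type, and the variable naming convention are all assumptions made for illustration; a real deployment would call its secret store's client instead.

```python
import os
from typing import Mapping


class SecretNotFoundError(KeyError):
    """Raised when a referenced secret is not available at runtime."""


def resolve_secret(secret_reference: str,
                   env: Mapping[str, str] = os.environ) -> dict:
    """Resolve a secret reference to client credentials.

    Assumed convention: 'customer-api-credentials' maps to the variables
    CUSTOMER_API_CREDENTIALS_CLIENT_ID and CUSTOMER_API_CREDENTIALS_CLIENT_SECRET.
    """
    prefix = secret_reference.upper().replace("-", "_")
    names = {
        "client_id": f"{prefix}_CLIENT_ID",
        "client_secret": f"{prefix}_CLIENT_SECRET",
    }
    missing = [name for name in names.values() if name not in env]
    if missing:
        # Fail loudly before any request is made with partial credentials.
        raise SecretNotFoundError(f"missing secret values: {', '.join(missing)}")
    return {field: env[name] for field, name in names.items()}
```

Passing the lookup source as a parameter keeps the resolver easy to test and makes it straightforward to swap in a different backend later.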

Standardizing the Right Things

Not every part of ingestion should be configurable, and not every part should be customized.

The framework works best when it standardizes the concerns that are common across most sources.

Pagination

Most APIs use one of a small number of pagination styles, usually cursor-based, offset-based, or token-based. Because those patterns are common, pagination belongs in the framework rather than in each connector.
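As an illustration of framework-owned pagination, here is a minimal sketch of a cursor loop that drives a connector's `fetch(params)` callable. The helper name `paginate_by_cursor`, the request parameter names `page_size` and `page_token`, and the `max_pages` safety limit are assumptions; only `cursor_field` and `page_size` come from the example configuration.

```python
from typing import Callable, Iterator


def paginate_by_cursor(fetch: Callable[[dict], dict],
                       cursor_field: str = "nextPageToken",
                       page_size: int = 1000,
                       max_pages: int = 10_000) -> Iterator[dict]:
    """Yield raw pages until the response stops returning a cursor."""
    params = {"page_size": page_size}
    for _ in range(max_pages):
        # Pass a copy so the connector cannot mutate the loop state.
        page = fetch(dict(params))
        yield page
        cursor = page.get(cursor_field)
        if not cursor:
            return  # no cursor means this was the last page
        params["page_token"] = cursor
    # Guard against a source that keeps returning a cursor forever.
    raise RuntimeError(f"stopped after {max_pages} pages; cursor never ran out")
```

Because the loop only depends on a callable and a field name, the same helper can serve any source whose responses expose a next-page cursor.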

Retry Handling

Retry behavior should also be standardized. Transient failures, such as throttling (HTTP 429) and temporary service errors (HTTP 5xx), usually deserve automatic retries. Permanent client-side failures, such as most other 4xx responses, should typically fail fast. Centralizing this logic reduces inconsistency and improves predictability.
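A centralized retry policy could look roughly like the sketch below, which retries transient failures with exponential backoff and lets permanent ones fail immediately. The `HttpError` type, the `call_with_retry` helper, and the exact set of retryable status codes are assumptions for illustration; `max_attempts` mirrors the example configuration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_retry(operation: Callable[[], T],
                    max_attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except HttpError as exc:
            # Permanent errors and exhausted budgets propagate to the caller.
            if exc.status_code not in RETRYABLE_STATUS or attempt >= max_attempts:
                raise
            # Delay doubles each attempt: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1
```

With the `requests` library, the equivalent status is available as `exc.response.status_code` on the `requests.HTTPError` raised by `raise_for_status()`. Adding random jitter to the delay is a common refinement that this sketch leaves out.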

Rate Limiting

Request pacing is another concern that should not be reimplemented per connector. Framework-level rate limiting helps protect upstream systems and reduces the likelihood of unnecessary throttling.
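Framework-level pacing can be as simple as enforcing a minimum interval between requests. The sketch below is one possible shape for that, assuming a single-threaded connector; the `RateLimiter` class and its `acquire` method are hypothetical names, and `requests_per_second` mirrors the example configuration.

```python
import time
from typing import Callable


class RateLimiter:
    """Spaces calls so that at most requests_per_second are issued."""

    def __init__(self, requests_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        self._min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        # Earliest time the next request may start; no restriction initially.
        self._next_allowed = float("-inf")

    def acquire(self) -> float:
        """Block until a request may be sent; return the seconds waited."""
        now = self._clock()
        start = max(now, self._next_allowed)
        wait = start - now
        if wait > 0:
            self._sleep(wait)
        self._next_allowed = start + self._min_interval
        return wait
```

A connector framework would call `acquire()` immediately before each HTTP request. Connectors that issue requests from several threads would need a lock around `acquire()`, which this sketch omits.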

Validation and Dead-Letter Routing

Data quality handling is often inconsistent in connector-heavy platforms. Standard validation and dead-letter handling make ingestion outcomes easier to monitor and troubleshoot.
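The split between valid records and dead letters can be sketched in plain Python. This version works on a list of dictionaries rather than a Spark DataFrame so that it stays self-contained; the function name `split_valid_and_dead` and the shape of the dead-letter entries are assumptions, while `required_columns` mirrors the example configuration. A Spark implementation would express the same rule with DataFrame filters.

```python
from typing import Iterable, List, Tuple


def split_valid_and_dead(records: Iterable[dict],
                         required_columns: List[str]) -> Tuple[List[dict], List[dict]]:
    """Separate records that satisfy the schema from those that do not."""
    valid, dead = [], []
    for record in records:
        # A column counts as missing when it is absent or null.
        missing = [col for col in required_columns if record.get(col) is None]
        if missing:
            # Keep the original record plus a reason to aid troubleshooting.
            dead.append({
                "record": record,
                "reason": f"missing required columns: {', '.join(missing)}",
            })
        else:
            valid.append(record)
    return valid, dead
```

Storing a reason alongside each rejected record means the dead-letter location can be queried by failure type rather than inspected row by row.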

Onboarding a New Source

Once the framework is in place, adding a new source becomes much simpler.

A typical connector implementation may look like this:

Python
 
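from pyspark.sql import DataFrame

# BaseApiConnector is the base class shown earlier.
class CustomerAccountsConnector(BaseApiConnector):
    def fetch(self, params: dict) -> dict:
        # Request one page; the framework supplies the pagination params.
        connection = self.config["connection"]
        endpoint = f"{connection['base_url']}/accounts"
        response = self.session.get(
            endpoint, params=params, timeout=connection.get("timeout_seconds", 30)
        )
        response.raise_for_status()
        return response.json()

    def parse(self, raw_data: dict) -> DataFrame:
        # An empty record list would need an explicit schema for createDataFrame.
        records = raw_data.get("records", [])
        return self.spark.createDataFrame(records)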


That is often all that is needed for a standard API integration.

The connector focuses only on extracting and parsing the source response. The framework handles the operational lifecycle around it.

This is where the real value starts to show. The benefit is not just fewer lines of code. It is that every new source behaves in a familiar way.

What Improves in Practice

The biggest gain from a reusable ingestion framework is predictability.

When all connectors follow the same execution model:

  • Support becomes easier because failure patterns are more consistent
  • Onboarding improves because engineers learn one framework instead of many connector styles
  • Maintenance effort drops because shared concerns are fixed once in the framework
  • Source onboarding becomes faster because teams are not rebuilding the same plumbing repeatedly

The framework also creates a cleaner boundary between ingestion and transformation.

Its job is to land validated raw or near-raw data reliably. Transformations belong downstream, where they can evolve independently without complicating ingestion logic.

That separation makes both layers easier to manage.

What This Framework Is Not For

One of the most important design decisions in framework development is deciding what not to support.

This pattern is a strong fit for batch and near-batch ingestion, especially for API- and file-oriented integrations. It is not the right solution for every workload.

For example, it is usually not the best fit for:

  • Complex transformations tightly coupled with extraction
  • Very high-throughput streaming workloads
  • Use cases better served by dedicated streaming or CDC platforms

Those are not shortcomings. They are intentional boundaries.

A framework becomes more effective when its purpose is clear.

Final Thoughts

A good ingestion framework is not just about code reuse. It is about operational consistency.

If every new source requires its own retry model, its own pagination implementation, and its own failure-handling logic, the platform will become harder to support with every additional connector. Standardizing those behaviors through a reusable framework creates a more scalable operating model.

The most valuable outcome is not technical elegance. It is reducing variability.

When source onboarding becomes more repeatable, support becomes more predictable, and connector behavior becomes easier to understand, the entire platform becomes easier to scale.

That is what a reusable ingestion framework should really deliver.

