Building a Reusable Framework to Standardize API Ingestion in an On-Prem Lakehouse
A reusable ingestion framework standardizes connector behavior, reducing maintenance and making onboarding easier to scale.
In many enterprise lakehouse environments, the biggest ingestion challenge is not data volume; it is inconsistency.
As platforms grow, data starts arriving from many different systems through REST APIs, SOAP services, SFTP drops, database extracts, queues, and other interfaces. In many teams, these integrations are built one by one to solve immediate business needs. Over time, that creates a fragmented connector landscape where every source behaves a little differently.
One connector may implement retries one way. Another may use a different authentication pattern. A third may handle pagination, validation, and failures entirely differently. The result is a platform that works, but becomes harder to operate, extend, and support as the number of sources increases.
That is the problem this framework was designed to solve.
The Core Problem
Custom connectors are easy to justify in the short term. Each one is tailored to its source and can be delivered quickly. But as the number of integrations grows, so does the operational burden.
Failures become harder to troubleshoot because each connector has its own behavior. Onboarding new engineers takes longer because they must learn multiple implementation styles. Adding new sources becomes slower because common concerns, such as retry handling, pagination, authentication integration, validation, and dead-letter routing, are repeatedly rebuilt.
At that point, the issue is no longer ingestion itself. It is the lack of a consistent ingestion model.
The Design Objective
The goal was to create a reusable ingestion framework that made new source onboarding simple, operational behavior consistent, and maintenance easier over time.
The framework needed to do four things well:
- Standardize common operational concerns
- Minimize source-specific code
- Remain easy to debug
- Support a wide range of batch and near-batch integrations without becoming overly abstract
The solution was a connector framework built around a base class, declarative configuration, and clearly defined lifecycle hooks.
In this model, engineers only need to implement two source-specific methods:
- fetch() to call the source and retrieve raw data
- parse() to convert that raw payload into a DataFrame with a defined schema
Everything else is managed by the framework.
The Connector Pattern
The base connector owns the cross-cutting concerns that should not be reimplemented for every source.
That includes:
- Session creation
- Authentication resolution
- Pagination orchestration
- Retry handling
- Rate-limit enforcement
- Validation
- Writing valid records
- Routing invalid records to dead-letter storage
A simplified version of the pattern looks like this:
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession
import requests
import yaml


class BaseApiConnector(ABC):
    def __init__(self, config_path: str, spark: SparkSession):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.spark = spark
        self.session = self._build_session()

    def _build_session(self) -> requests.Session:
        session = requests.Session()
        session.headers.update(self.config.get("headers", {}))
        auth = self._resolve_auth()
        if auth:
            session.auth = auth
        return session

    @abstractmethod
    def fetch(self, params: dict) -> dict:
        """Call the source and return raw response data."""
        ...

    @abstractmethod
    def parse(self, raw_data: dict) -> DataFrame:
        """Convert raw response into a DataFrame."""
        ...

    def run(self):
        # _resolve_auth, _paginated_fetch, _validate, _write, and
        # _write_dead_letters are framework-owned helpers, omitted in
        # this simplified version.
        all_data = self._paginated_fetch()
        df = self.parse(all_data)
        validated, dead = self._validate(df)
        self._write(validated)
        self._write_dead_letters(dead)
The key idea is the separation of responsibility.
The framework owns common ingestion behavior. The connector implementation only owns source-specific logic.
That keeps the design simple without forcing every connector to solve the same operational problems repeatedly.
Why Declarative Configuration Matters
A reusable framework only works if source behavior can be defined without constantly changing code.
Each source is therefore described through configuration. A typical configuration includes:
- Source metadata
- Connection settings
- Authentication reference
- Pagination strategy
- Schema expectations
- Retry overrides
- Rate-limit settings
For example:
source:
  name: customer-api
  type: rest_api
  schedule: "every few hours"
connection:
  base_url: https://example-api.company.com
  auth:
    type: oauth2_client_credentials
    secret_reference: customer-api-credentials
  timeout_seconds: 30
pagination:
  type: cursor
  cursor_field: "nextPageToken"
  page_size: 1000
schema:
  required_columns: [id, name, status, created_at]
  output_path: /bronze/domain/entity
  format: parquet
retry:
  max_attempts: 3
  backoff_strategy: exponential
rate_limit:
  requests_per_second: 5
This approach has two major advantages.
First, it reduces the amount of code required to onboard a new source. Second, it makes source behavior more transparent. Engineers can understand how a connector behaves by reading its configuration rather than tracing through custom implementations.
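Because behavior lives in configuration, a framework can also check that configuration before any request is sent. The following is a minimal sketch, assuming the top-level sections used in the example configuration. The function name `validate_config`, the set of supported pagination types, and the specific rules are hypothetical and not part of the original framework.

```python
# Hypothetical pre-flight check; section names follow the example configuration.
REQUIRED_SECTIONS = ("source", "connection", "pagination", "schema")
SUPPORTED_PAGINATION = {"cursor", "offset", "token"}


def validate_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config passed."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")

    # A connector cannot build a request without a base URL.
    if not config.get("connection", {}).get("base_url"):
        problems.append("connection.base_url is required")

    pagination_type = config.get("pagination", {}).get("type")
    if pagination_type is not None and pagination_type not in SUPPORTED_PAGINATION:
        problems.append(f"unsupported pagination.type: {pagination_type}")

    # Validation needs at least one column to check against.
    if not config.get("schema", {}).get("required_columns"):
        problems.append("schema.required_columns must list at least one column")
    return problems
```

Returning every problem at once, rather than raising on the first, lets an engineer fix a new source's configuration in a single pass.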
Sensitive values should not be stored directly in configuration. Instead, configuration should reference a centralized secret management mechanism and resolve credentials securely at runtime.
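One way to keep credentials out of configuration is to resolve the `secret_reference` value through a small lookup at runtime. The sketch below is an assumption, not the framework's actual mechanism: `resolve_secret`, `SecretNotFoundError`, and the environment-variable naming convention are hypothetical stand-ins for whatever secret manager the platform uses.

```python
import os


class SecretNotFoundError(KeyError):
    """Raised when a referenced secret cannot be resolved."""


def resolve_secret(reference: str, env=None) -> str:
    """Resolve a secret reference to its value at runtime.

    Hypothetical convention: 'customer-api-credentials' is looked up as the
    environment variable CUSTOMER_API_CREDENTIALS. A real deployment would
    call its secret manager here instead.
    """
    env = os.environ if env is None else env
    key = reference.upper().replace("-", "_")
    try:
        return env[key]
    except KeyError:
        # Report the reference name only, never a secret value.
        raise SecretNotFoundError(
            f"no secret available for reference '{reference}'"
        ) from None
```

The configuration file then carries only the reference name, so it can be reviewed and versioned without exposing credentials.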
Standardizing the Right Things
Not every part of ingestion should be configurable, and not every part should be customized.
The framework works best when it standardizes the concerns that are common across most sources.
Pagination
Most APIs use a limited number of pagination styles, usually cursor-based, offset-based, or token-based pagination. Because those patterns are common, pagination belongs in the framework rather than in each connector.
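As an illustration, cursor-based pagination can be written once as a loop around a connector's `fetch()` method. This is a sketch under stated assumptions: the function name `paginate_by_cursor`, the request parameter names `pageSize` and `pageToken`, and the `max_pages` safety limit are hypothetical, while `cursor_field` and `page_size` mirror the example configuration.

```python
from typing import Callable, Iterator


def paginate_by_cursor(
    fetch: Callable[[dict], dict],
    cursor_field: str = "nextPageToken",
    page_size: int = 1000,
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield raw pages until the source stops returning a cursor."""
    params = {"pageSize": page_size}  # hypothetical parameter names
    for _ in range(max_pages):
        page = fetch(params)
        yield page
        cursor = page.get(cursor_field)
        if not cursor:
            return  # no cursor in the response means this was the last page
        params = {"pageSize": page_size, "pageToken": cursor}
    # Guard against a source that never stops returning cursors.
    raise RuntimeError(f"pagination did not finish within {max_pages} pages")
```

Offset-based pagination would follow the same shape, with the loop advancing an offset instead of passing a cursor back to the source.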
Retry Handling
Retry behavior should also be standardized. Transient failures such as throttling and temporary service errors usually deserve automatic retries. Permanent client-side failures should typically fail fast. Centralizing this logic reduces inconsistency and improves predictability.
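A minimal sketch of that policy is shown below: transient failures are retried with exponential backoff, and everything else fails on the first attempt. The names `call_with_retry` and `HttpError`, the set of retryable status codes, and the 10% jitter are assumptions made for illustration; `max_attempts` mirrors the example configuration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with exponential backoff; fail fast otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except HttpError as err:
            out_of_attempts = attempt == max_attempts
            if err.status not in RETRYABLE_STATUS or out_of_attempts:
                raise
            # Delays grow as 1x, 2x, 4x ... of base_delay, plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the policy testable without real waiting, which matters when the same logic backs every connector.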
Rate Limiting
Request pacing is another concern that should not be reimplemented per connector. Framework-level rate limiting helps protect upstream systems and reduces the likelihood of unnecessary throttling.
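Request pacing can be as simple as enforcing a minimum interval between calls. The class below is a sketch, not the framework's actual limiter: the `RateLimiter` name and its injectable `clock` and `sleep` parameters are hypothetical, and `requests_per_second` mirrors the example configuration.

```python
import time
from typing import Callable


class RateLimiter:
    """Space calls evenly so at most `requests_per_second` start each second."""

    def __init__(
        self,
        requests_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        self._min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def acquire(self) -> None:
        """Block until the next request is allowed to start."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
```

A base connector could call `acquire()` immediately before each request, so individual connectors never need to think about pacing.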
Validation and Dead-Letter Routing
Data quality handling is often inconsistent in connector-heavy platforms. Standard validation and dead-letter handling make ingestion outcomes easier to monitor and troubleshoot.
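The split between valid records and dead letters can be sketched as follows. To stay self-contained, this example works on plain dictionaries rather than the DataFrames the base connector uses; the function name `split_valid_and_dead`, the rule that a required column must be present and non-null, and the shape of a dead-letter entry are assumptions for illustration.

```python
def split_valid_and_dead(records: list[dict], required_columns: list[str]):
    """Split records into (valid, dead_letters) based on required columns.

    A record is valid when every required column is present and not None.
    Each dead letter keeps the original record plus the reason it failed.
    """
    valid, dead_letters = [], []
    for record in records:
        missing = [col for col in required_columns if record.get(col) is None]
        if missing:
            dead_letters.append(
                {
                    "record": record,
                    "error": f"missing required columns: {', '.join(missing)}",
                }
            )
        else:
            valid.append(record)
    return valid, dead_letters
```

Storing the failure reason next to the rejected record is what makes dead-letter storage useful for troubleshooting rather than just a discard pile.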
Onboarding a New Source
Once the framework is in place, adding a new source becomes much simpler.
A typical connector implementation may look like this:
class CustomerAccountsConnector(BaseApiConnector):
    def fetch(self, params: dict) -> dict:
        endpoint = f"{self.config['connection']['base_url']}/accounts"
        # requests applies no timeout by default, so pass the configured one.
        timeout = self.config["connection"].get("timeout_seconds", 30)
        response = self.session.get(endpoint, params=params, timeout=timeout)
        response.raise_for_status()
        return response.json()

    def parse(self, raw_data: dict) -> DataFrame:
        records = raw_data.get("records", [])
        return self.spark.createDataFrame(records)
That is often all that is needed for a standard API integration.
The connector focuses only on extracting and parsing the source response. The framework handles the operational lifecycle around it.
This is where the real value starts to show. The benefit is not just fewer lines of code. It is that every new source behaves in a familiar way.
What Improves in Practice
The biggest gain from a reusable ingestion framework is predictability.
When all connectors follow the same execution model:
- Support becomes easier because failure patterns are more consistent
- Onboarding improves because engineers learn one framework instead of many connector styles
- Maintenance effort drops because shared concerns are fixed once in the framework
- Source onboarding becomes faster because teams are not rebuilding the same plumbing repeatedly
The framework also creates a cleaner boundary between ingestion and transformation.
Its job is to land validated raw or near-raw data reliably. Transformations belong downstream, where they can evolve independently without complicating ingestion logic.
That separation makes both layers easier to manage.
What This Framework Is Not For
One of the most important design decisions in framework development is deciding what not to support.
This pattern is a strong fit for batch and near-batch ingestion, especially for API- and file-oriented integrations. It is not the right solution for every workload.
For example, it is usually not the best fit for:
- Complex transformations tightly coupled with extraction
- Very high-throughput streaming workloads
- Use cases better served by dedicated streaming or CDC platforms
Those are not shortcomings. They are intentional boundaries.
A framework becomes more effective when its purpose is clear.
Final Thoughts
A good ingestion framework is not just about code reuse. It is about operational consistency.
If every new source requires its own retry model, its own pagination implementation, and its own failure-handling logic, the platform will become harder to support with every additional connector. Standardizing those behaviors through a reusable framework creates a more scalable operating model.
The most valuable outcome is not technical elegance. It is reducing variability.
When source onboarding becomes more repeatable, support becomes more predictable, and connector behavior becomes easier to understand, the entire platform becomes easier to scale.
That is what a reusable ingestion framework should really deliver.
Opinions expressed by DZone contributors are their own.