Production LLM Data Extraction Pipeline With LaunchDarkly and Vercel AI Gateway

Build a pipeline that extracts structured fields from raw transcripts (sentiment scores, urgency signals, buying intent) and feeds them straight into your ML models.

By Scarlett Attensil · Mar. 16, 26 · Tutorial

Every conversation your organization has contains signals your ML models need. Customer calls reveal buying intent. Support tickets expose product friction. Interview transcripts capture technical depth. The problem is that those signals are buried in thousands of words of unstructured text.

Tools like Gong, Chorus, and other conversation intelligence platforms are excellent for their designed purpose. But when you need to extract specific features for your ML models, with a schema you control completely, you need something different.

This tutorial walks through building a data extraction pipeline that turns messy transcripts into structured JSON, using AI configuration management to control models, prompts, and schemas without redeploying code. The architecture separates what you extract from how you extract it, so you can iterate on schemas in minutes instead of sprints.

The Core Problem

Here's what you typically have:

Plain Text
 
"Yeah, so we've been looking at different solutions. The other vendor's
pricing was reasonable but their timeline was concerning. We need this
rolled out before Q3..."


Here's what your models actually need:

JSON
 
{
  "alternatives_mentioned": true,
  "pricing_sentiment": 0.3,
  "timeline_mentioned": true,
  "urgency_score": 0.8,
  "decision_timeframe": "Q3"
}


The gap between these two representations is where most teams get stuck. They hardcode extraction schemas into application code, which means every new field requires a PR, a review, a deploy, and a prayer that nothing breaks.
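To make the target shape concrete, here is a minimal TypeScript sketch of the record above together with a runtime guard. The names `TranscriptFeatures` and `isTranscriptFeatures` are hypothetical, and the 0.0-1.0 range for the score fields is an assumption based on the example values rather than something any SDK enforces.

```typescript
// Hypothetical type mirroring the example JSON above; not part of any SDK.
interface TranscriptFeatures {
  alternatives_mentioned: boolean;
  pricing_sentiment: number; // assumed range: 0.0-1.0
  timeline_mentioned: boolean;
  urgency_score: number; // assumed range: 0.0-1.0
  decision_timeframe: string;
}

// True only for finite numbers inside the assumed 0.0-1.0 score range.
function isUnitScore(value: unknown): value is number {
  return typeof value === "number" && value >= 0 && value <= 1;
}

// Runtime guard: LLM output is untrusted JSON, so check it before use.
function isTranscriptFeatures(value: unknown): value is TranscriptFeatures {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.alternatives_mentioned === "boolean" &&
    isUnitScore(v.pricing_sentiment) &&
    typeof v.timeline_mentioned === "boolean" &&
    isUnitScore(v.urgency_score) &&
    typeof v.decision_timeframe === "string"
  );
}

const parsed: unknown = JSON.parse(
  '{"alternatives_mentioned":true,"pricing_sentiment":0.3,' +
    '"timeline_mentioned":true,"urgency_score":0.8,"decision_timeframe":"Q3"}'
);
const looksValid = isTranscriptFeatures(parsed);
```

A guard like this is exactly the kind of code that hardcodes the schema, which is why the rest of the article moves the field definitions into configuration.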

Architecture Overview

The pipeline separates concerns across three layers:

┌─────────────┐      ┌──────────────────┐      ┌─────────────┐
│  Your Text  │ ---> │  AI Gateway      │ ---> │     LLM     │
└─────────────┘      │  (unified API)   │      │(GPT/Claude/ │
                     └──────────────────┘      │  Gemini)    │
                              ↑                └─────────────┘
                              │                       │
                     ┌──────────────────┐             ↓
                     │  AI Config       │      ┌─────────────┐
                     │  (model, prompts,│      │  Structured │
                     │  tool schemas)   │      │    JSON     │
                     └──────────────────┘      └─────────────┘


The AI model automatically selects the most appropriate extraction schema based on the transcript content. You define multiple tool schemas — each tailored to a different document type — and the model picks the right one at runtime.

What lives in configuration (not code):

  • Model selection (swap between providers without redeploying)
  • System and user prompts
  • Extraction tool schemas with 40–60+ fields each
  • Temperature and other inference parameters
  • Targeting rules for different use cases (e.g., GDPR-compliant schemas for EU customers)

What stays in your code:

  • Reading input files
  • Passing transcript text to the API
  • Writing output CSV/JSON

This separation matters because discovery is iterative. When you realize that customer_question_count predicts engagement better than talk time, or that one model handles technical jargon better than another, you can update extraction logic immediately without touching application code.
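The "writing output CSV" part that stays in code can remain schema-agnostic. The sketch below is one possible way to do that, not the example repository's implementation: `toCsv` and `escapeCsvCell` are hypothetical helpers that derive the column list from whatever fields the extraction returned, so a field added in configuration appears as a new column without a code change.

```typescript
type FeatureValue = string | number | boolean | null;
type FeatureRow = Record<string, FeatureValue>;

// Quote a cell only when it contains a comma, quote, or line break.
function escapeCsvCell(value: FeatureValue | undefined): string {
  if (value === null || value === undefined) return "";
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Columns are the union of keys across all rows, in first-seen order.
// Rows that lack a column get an empty cell.
function toCsv(rows: FeatureRow[]): string {
  const columns: string[] = [];
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (!columns.includes(key)) columns.push(key);
    }
  }
  const lines = rows.map((row) =>
    columns.map((column) => escapeCsvCell(row[column])).join(",")
  );
  return [columns.join(","), ...lines].join("\n");
}

const csv = toCsv([
  { urgency_score: 0.8, decision_timeframe: "Q3" },
  {
    urgency_score: 0.2,
    decision_timeframe: "next year, maybe",
    customer_question_count: 4,
  },
]);
```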

Designing the Extraction Tools

Rather than a single monolithic schema, the pipeline uses six specialized extraction tools. The LLM receives all six as function definitions and selects the most appropriate one based on the transcript's content.

  • Prospecting tool (43 fields): First contact, cold outreach, gatekeeper conversations. Fields include gatekeeper_encountered, callback_scheduled, interest_level.
  • Discovery tool (48 fields): Qualification calls, needs assessment. Fields include budget_confirmed, authority_level, qualification_score.
  • Demo tool (58 fields): Product demonstrations, feature walkthroughs. Fields include customer_wow_moments, demo_effectiveness_score, trial_requested.
  • Proposal tool (53 fields): Pricing discussions, contract negotiations. Fields include close_probability, discount_requested, blockers_to_close.
  • Technical tool (63 fields): Architecture reviews, technical deep-dives. Fields include technical_fit_score, technical_risk_score, scalability_concerns.
  • Customer success tool (53 fields): QBRs, renewal discussions, expansion opportunities. Fields include account_health_score, renewal_likelihood, churn_risk.

Each tool shares a core set of ~40 fields (sentiment scores, engagement metrics, text statistics) with tool-specific fields layered on top. This design lets the model make intelligent routing decisions while keeping field definitions consistent.
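The layering of shared and tool-specific fields could be expressed roughly as follows. This is a sketch under assumptions: `FieldSpec`, `ToolSchema`, and `buildTool` are hypothetical names, the field lists are heavily abbreviated, and `overall_sentiment` is an illustrative stand-in for the core sentiment fields rather than a field taken from the real schema.

```typescript
// Hypothetical types; the real schemas carry 40-60+ fields per tool.
interface FieldSpec {
  type: "boolean" | "number" | "string";
  description: string;
}
type FieldMap = Record<string, FieldSpec>;

interface ToolSchema {
  name: string;
  description: string;
  parameters: { type: "object"; properties: FieldMap };
}

// Abbreviated core set shared by every tool (illustrative fields only).
const coreFields: FieldMap = {
  overall_sentiment: { type: "number", description: "Overall sentiment (0.0-1.0)" },
  customer_question_count: { type: "number", description: "Questions asked by the customer" },
};

// Layer tool-specific fields on top of the core set, refusing to let a
// tool silently redefine a core field and break cross-tool consistency.
function buildTool(name: string, description: string, specific: FieldMap): ToolSchema {
  for (const key of Object.keys(specific)) {
    if (Object.keys(coreFields).includes(key)) {
      throw new Error(`Field "${key}" in ${name} would redefine a core field`);
    }
  }
  return {
    name,
    description,
    parameters: { type: "object", properties: { ...coreFields, ...specific } },
  };
}

const prospectingTool = buildTool(
  "extract_prospecting_features",
  "Extract features from prospecting calls",
  {
    gatekeeper_encountered: { type: "boolean", description: "Whether a gatekeeper was encountered" },
    interest_level: { type: "number", description: "Prospect interest level (0.0-1.0)" },
  }
);
```

Keeping the core set in one place means a change to a shared field's description reaches all six tools at once.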

Core Implementation

The implementation uses LaunchDarkly's AI SDK for configuration management and Vercel's AI Gateway for unified LLM access. Here's the simplified extraction logic:

TypeScript
 
import * as ld from "@launchdarkly/node-server-sdk";
import { initAi } from "@launchdarkly/server-sdk-ai";
import { VercelProvider } from "@launchdarkly/server-sdk-ai-vercel";

export class ExtractionClient {
  private ldClient!: ld.LDClient; // assigned in initialize()
  private aiClient: any;

  async initialize(): Promise<void> {
    this.ldClient = ld.init(process.env.LAUNCHDARKLY_SDK_KEY!);
    await this.ldClient.waitForInitialization();
    this.aiClient = initAi(this.ldClient);
  }

  async extract(configKey: string, context: ld.LDContext,
                transcript: string): Promise<any> {
    // Fetch AI config: model, prompts, and all 6 tool schemas
    const aiConfig = await this.aiClient.completionConfig(
      configKey, context, { enabled: false }
    );

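    const tools = aiConfig.model?.parameters?.tools || [];
    if (tools.length === 0) {
      throw new Error(`AI config "${configKey}" defines no extraction tools`);
    }
    // Simplified: the first tool's parameters serve as the response schema
    const jsonSchema = tools[0].parameters;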

    // Create unified gateway client
    const { createOpenAI } = await import("@ai-sdk/openai");
    const gateway = createOpenAI({
      baseURL: "https://ai-gateway.vercel.sh/v1",
      apiKey: process.env.VERCEL_OIDC_TOKEN
             || process.env.AI_GATEWAY_API_KEY,
    });

    // Model name comes from config, not code
    const model = gateway.chat(
      `${aiConfig.provider.name}/${aiConfig.model.name}`
    );

    // LLM selects the appropriate tool based on content
    const provider = new VercelProvider(model,
                                        aiConfig.model.parameters);
    const response = await provider.invokeStructuredModel(
      [
        { role: "system",
          content: aiConfig.config.messages[0].content },
        { role: "user",
          content: `Transcript:\n\n${transcript}` }
      ],
      jsonSchema
    );

    return response.data;
  }
}


The key detail: aiConfig.provider.name and aiConfig.model.name come from the configuration layer, not from hardcoded values. Swapping from GPT-5.2 to Claude Opus 4.5 is a UI change, not a code change.
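One caveat with building the model identifier from a template string is that a missing provider or model name silently yields `undefined/undefined`. A small guard, sketched below, avoids that. `ModelRouting` and `resolveModelId` are hypothetical names, the interface is a trimmed-down assumption about the two config fields involved, and the identifiers in the usage lines are placeholders rather than real gateway model IDs.

```typescript
// Hypothetical, trimmed-down view of the config fields that name the model.
interface ModelRouting {
  provider?: { name?: string };
  model?: { name?: string };
}

// Build the "provider/model" identifier, falling back to a caller-supplied
// default when either part is missing or blank.
function resolveModelId(config: ModelRouting, fallback: string): string {
  const provider = config.provider?.name?.trim();
  const model = config.model?.name?.trim();
  return provider && model ? `${provider}/${model}` : fallback;
}

// Placeholder identifiers for illustration only.
const fromConfig = resolveModelId(
  { provider: { name: "example-provider" }, model: { name: "example-model" } },
  "fallback-provider/fallback-model"
);
const fromFallback = resolveModelId({}, "fallback-provider/fallback-model");
```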

Schema Definition and Customization

Schemas are defined in a JSON file that maps to tool function definitions:

JSON
 
{
  "variation_a_prospecting": {
    "function": {
      "name": "extract_prospecting_features",
      "description": "Extract features from prospecting calls",
      "parameters": {
        "type": "object",
        "properties": {
          "gatekeeper_encountered": {
            "type": "boolean",
            "description": "Whether a gatekeeper was encountered"
          },
          "interest_level": {
            "type": "number",
            "description": "Prospect interest level (0.0-1.0)"
          },
          "competitor_switching_intent": {
            "type": "boolean",
            "description": "Intent to switch from a competitor"
          }
        }
      }
    }
  }
}


A bootstrap script reads this file and creates the AI configuration. After the initial setup, you can edit schemas either by modifying the JSON and re-running the bootstrap or by editing directly in the LaunchDarkly UI for immediate iteration.
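Before a schema file like this is pushed into configuration, it can help to validate and flatten it. The sketch below shows one way to do that in TypeScript. It is not the repository's bootstrap script, `toToolDefinitions` and the interfaces are hypothetical, and the step that actually creates the AI config in LaunchDarkly is deliberately left out.

```typescript
// Shape of one entry in the schema file shown above.
interface SchemaFileEntry {
  function: {
    name: string;
    description: string;
    parameters: { type: string; properties: Record<string, unknown> };
  };
}
type SchemaFile = Record<string, SchemaFileEntry>;

interface ToolDefinition {
  variation: string;
  name: string;
  description: string;
  fieldCount: number;
  parameters: SchemaFileEntry["function"]["parameters"];
}

// Flatten the file into a list of tool definitions, rejecting duplicate
// tool names and non-object parameter blocks up front.
function toToolDefinitions(file: SchemaFile): ToolDefinition[] {
  const seen = new Set<string>();
  return Object.entries(file).map(([variation, entry]) => {
    const { name, description, parameters } = entry.function;
    if (seen.has(name)) throw new Error(`Duplicate tool name: ${name}`);
    seen.add(name);
    if (parameters.type !== "object") {
      throw new Error(`${name}: parameters.type must be "object"`);
    }
    const fieldCount = Object.keys(parameters.properties).length;
    return { variation, name, description, fieldCount, parameters };
  });
}

const schemaFile: SchemaFile = {
  variation_a_prospecting: {
    function: {
      name: "extract_prospecting_features",
      description: "Extract features from prospecting calls",
      parameters: {
        type: "object",
        properties: {
          gatekeeper_encountered: { type: "boolean" },
          interest_level: { type: "number" },
          competitor_switching_intent: { type: "boolean" },
        },
      },
    },
  },
};
const definitions = toToolDefinitions(schemaFile);
```

Reporting `fieldCount` per tool gives a quick sanity check that an edit did not accidentally drop fields.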

Running the Pipeline

Shell
 
# Clone the example repository
git clone https://github.com/launchdarkly-labs/scarlett-feature-extraction.git
cd scarlett-feature-extraction

# Install dependencies
npm install @launchdarkly/node-server-sdk @launchdarkly/server-sdk-ai \
  @launchdarkly/server-sdk-ai-vercel ai @ai-sdk/openai

# Configure environment
cp .env.example .env
# Add LAUNCHDARKLY_SDK_KEY, LD_API_KEY, LD_PROJECT_KEY,
# and AI_GATEWAY_API_KEY (or VERCEL_OIDC_TOKEN) for the gateway client

# Bootstrap the AI config (one-time)
python bootstrap/create_unified_config.py

# Start extracting
npm run dev
# Open http://localhost:3000, upload transcripts, download CSV


Beyond Sales Calls

The same architecture applies to any unstructured data extraction problem. The pipeline code never changes — just the configuration.

  • Support ticket analysis – Extract urgency scores, issue categories, product areas, and customer effort scores. Route urgent tickets to detailed schemas and low-priority ones to streamlined extraction. Downstream, predict escalation likelihood and estimate resolution time.
  • Interview transcript processing – Extract technical competency signals, communication clarity, and culture-fit indicators. Different roles use different schemas via targeting rules. Use extracted features to predict candidate success probability and reduce hiring bias through standardized signals.
  • Earnings call transcripts – Extract forward-looking statements, financial metrics, and competitive positioning. Capture management sentiment and guidance changes for models that predict stock price movements or detect financial health indicators.
  • Legal document analysis – Extract contract terms, risk clauses, obligations, and deadlines. Route NDAs, MSAs, and employment contracts to specialized schemas. Build models that assess contract risk scores and flag compliance issues.

For any use case involving sensitive data, add a PII detection step before extraction. Scan for emails, phone numbers, SSNs, and names, then either redact or skip extraction based on compliance requirements. Geographic targeting lets you route EU transcripts to privacy-safe schemas automatically.
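A redaction pass of this kind might look like the sketch below. Everything here is an assumption for illustration: `redactPii` and `piiRules` are hypothetical, the patterns are simplified and US-centric, and regular expressions cannot reliably detect personal names, which would need a named-entity model or a dedicated PII service.

```typescript
interface PiiRule {
  label: string;
  pattern: RegExp;
}

// Illustrative, US-centric patterns; not exhaustive and not production-grade.
const piiRules: PiiRule[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  {
    label: "PHONE",
    pattern: /(?:\+?1[ .-]?)?(?:\(\d{3}\)|\b\d{3})[ .-]?\d{3}[ .-]?\d{4}\b/g,
  },
];

interface RedactionResult {
  text: string;
  counts: Record<string, number>;
}

// Replace each match with a bracketed label and count hits per category,
// so the caller can choose between extracting the redacted text or skipping.
function redactPii(input: string): RedactionResult {
  const counts: Record<string, number> = {};
  let text = input;
  for (const { label, pattern } of piiRules) {
    text = text.replace(pattern, () => {
      counts[label] = (counts[label] ?? 0) + 1;
      return `[${label}]`;
    });
  }
  return { text, counts };
}

const redacted = redactPii(
  "Reach me at jane.doe@example.com or (555) 123-4567. SSN 123-45-6789."
);
// redacted.text is "Reach me at [EMAIL] or [PHONE]. SSN [SSN]."
```

The per-category counts support the "redact or skip" decision: a compliance rule could, for example, skip extraction entirely whenever an SSN is found.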

When To Use This Approach

This pipeline is designed for teams processing 100 to 10,000 documents monthly, where schemas need frequent iteration and different document types require different treatment. It's particularly effective when you're bootstrapping training data for ML models.

Skip this approach if you're processing millions of documents (traditional NLP is more cost-effective at that scale), your schema is fixed and proven, you need sub-second latency, or your documents follow strict templates that don't need LLM interpretation.

What's Next

The extracted features from this pipeline become inputs for predictive models. The challenge with real-world data is sparse outcomes: most deals don't close, most candidates aren't hired, most tickets don't escalate. In a follow-up piece, I'll demonstrate a zero-inflated regression approach that handles this sparsity effectively.

Start with your messiest transcripts. That's where you'll learn what features really matter.

The complete example is available on GitHub at https://github.com/launchdarkly-labs/scarlett-feature-extraction.

Further Reading

  • CI/CD for AI Configs – Automate config deployments with version control
  • Multi-agent systems with LangGraph – Build complex AI workflows

Published at DZone with permission of Scarlett Attensil. See the original article here.
