Escaping the "Excel Trap": Building an AI-Assisted ETL Pipeline Without a Data Team
Escape Excel silos. Use GitHub Copilot to generate Python pipelines that transform static spreadsheets into dynamic dashboards without manual coding.
Join the DZone community and get the full member experience.
Join For FreeBusiness data often lives in hundreds of disconnected Excel files, making it invisible to decision-makers. Here is a pattern for Citizen Data Engineering using Python, GitHub Copilot, and Qlik Sense to unify data silos without writing a single line of manual code.
In the enterprise world, the most common database isn't Oracle or PostgreSQL — it’s Excel.
We see it everywhere: Governance teams track technology standards in spreadsheets, Sales teams track leads in .xlsx files, and HR tracks utilization in shared folders. In a recent case study at Fujitsu, a technical standardization team found themselves managing 400+ separate Excel files (TSI Checklists) containing critical data on software adoption.
The data was valuable, but the format was useless for strategic analysis. To answer a simple question like "Which department is using high-risk software?" someone had to manually open dozens of files.
Traditionally, solving this requires a Data Engineering team to build an ETL (Extract, Transform, Load) pipeline. But resources are scarce.
This article explores a "Field-Driven" data pattern where a non-programmer used GitHub Copilot as a "Virtual Data Engineer," generating Python scripts to clean, merge, and load this data into Qlik Sense for visualization.
The Problem: The "Excel Silo" Architecture
The starting state of most legacy business processes looks like this:
- Fragmentation: Data is split across hundreds of files (e.g., one file per month or per department).
- Inconsistency: Humans enter data differently. One cell says "US," another "U.S.A.," and another "United States."
- Fragility: Excel formulas (like VLOOKUP) break easily when rows are moved.
The goal is to move from static files to dynamic intelligence.
The Solution: AI-Assisted "Citizen" ETL
The proposed architecture bypasses the need for a dedicated engineering team by leveraging an LLM (GitHub Copilot) to bridge the gap between business logic and Python syntax.
The Workflow:
- Ingest: Python (pandas) reads 400+ Excel files from a directory.
- Clean: Scripts normalize column names, handle missing values (NaN), and fix typos.
- Transform: Convert wide tables (human-readable) to long tables (machine-readable).
- Visualize: Qlik Sense loads the clean dataset for interactive dashboarding.
1. The "Prompt-First" Development Cycle
In the Fujitsu case study, the developer was a domain expert, not a Python coder. Instead of writing code, they wrote Intent.
The development loop shifted from Write Code -> Run -> Fail to Prompt -> Review -> Run.
The Prompt Strategy:
"I have a folder of Excel files. I need a Python script to read all of them, merge them into a single DataFrame, and add a column for the filename so I know the source."
The Copilot-Generated Pattern:
import pandas as pd
import glob
import os
def merge_excel_files(folder_path):
# Create a list to hold all dataframes
all_data = []
# Get all excel files in the directory
files = glob.glob(os.path.join(folder_path, "*.xlsx"))
for filename in files:
try:
df = pd.read_excel(filename)
# Feature Engineering: Track the source file
df['source_file'] = os.path.basename(filename)
all_data.append(df)
except Exception as e:
print(f"Error reading {filename}: {e}")
# Concatenate all data into one DataFrame
merged_df = pd.concat(all_data, ignore_index=True)
return merged_df
# Execution
path = r'C:\Data\TSI_Checklists'
final_df = merge_excel_files(path)
print(f"Merged {len(final_df)} rows successfully.")
2. Auto-Correction and Data Cleansing
One of the biggest hurdles in Excel data is "dirty data." A human might leave a cell blank or type a date as text.
In this pattern, the "Citizen Developer" used Copilot to implement robust error handling. When the script failed due to a missing column, the developer didn't check StackOverflow — they simply pasted the error into the IDE chat.
Prompt:
"I am getting a KeyError: 'Product Name' because some files have 'ProductName' without a space. Fix the script to standardize column names."
The Resulting "Self-Healing" Logic:
def standardize_columns(df):
# Map varying column names to a standard schema
column_mapping = {
'ProductName': 'Product Name',
'Prod_Name': 'Product Name',
'dept_id': 'Department ID'
}
df.rename(columns=column_mapping, inplace=True)
# Drop rows where critical data is missing
df.dropna(subset=['Product Name'], inplace=True)
return df
This iterative process allowed the team to build advanced anomaly detection, listing files that failed processing so they could be fixed at the source.
Visualizing the Pipeline
The architecture transforms unstructured inputs into a structured analytical model.
The Visualization Layer: Qlik Sense
Once the data was unified into a single clean dataset, it was loaded into Qlik Sense. Because the data was now machine readable (long format), the BI tool could perform aggregations that were impossible in Excel.
Key capabilities included:
- Cross-department filtering: Selecting high-risk products and instantly seeing which departments use them.
- Trend analysis: Visualizing adoption over time — impossible when data is trapped in monthly files.
- Drill-down: Clicking a bar chart to see the specific Excel file and row ID that contributed to the number.
The ROI: Metrics That Matter
The results of this "Citizen Data Engineering" experiment were significant. By removing the dependency on a central IT delivery team, the business achieved:
- Development Velocity: The author produced 350 steps of production-ready Python code in roughly 40 hours.
- Productivity: This equates to 8.8 steps per hour — comparable to professional developers (benchmarked at ~14.6 steps/hour for similar tasks).
- Quality: Data quality improved significantly because the Python script enforced validation rules that manual entry could not.
Conclusion
The barrier to entry for building robust data pipelines has collapsed. You no longer need to know the syntax of pandas.melt() or how to handle Python exceptions from memory.
By combining GitHub Copilot (for syntax and logic generation) with Qlik Sense (for visualization), domain experts can escape the "Excel Trap." They can build their own ETL pipelines, turning hundreds of static files into a living, breathing decision-support system.
The future of data utilization isn't just about better tools for data engineers; it's about giving engineering powers to the people who understand the data best.
Opinions expressed by DZone contributors are their own.
Comments