AWS Glue Crawlers: Common Pitfalls, Schema Challenges, and Best Practices

Learn key challenges and best practices for using AWS Glue crawlers, from handling CSV schema issues to schema evolution, partitions, and ETL jobs.

By Saradha Nagarajan · Sohag Maitra · Sep. 25, 25 · Analysis

AWS Glue is a powerful serverless data integration service that simplifies data discovery, preparation, and transformation. However, as with any tool, real-world use reveals quirks and corner cases that the documentation does not clearly call out.

This article covers key challenges observed from hands-on experience building data pipelines with Glue crawlers: CSV files, schema evolution, partitioning, and crawler update settings.

1. The CSV Conundrum: Schema Inconsistencies Across Files

One of the most common and frustrating issues when working with AWS Glue crawlers is handling multiple files that don't share the same schema. While Glue is designed to infer schema automatically, it doesn't always handle structural inconsistencies well with CSVs.

In practice, when a crawler scans an S3 path containing multiple CSV files with differing column structures, it often creates multiple tables in the AWS Glue Data Catalog. These tables are typically named after the subfolder paths.

Even more confusing is that the correct table may still be created, but it's not guaranteed to contain the relevant data. In some cases, the crawler might skip files that don't match the inferred schema from the first file it processes. This leads to partial or missing data in Athena queries, making debugging time-consuming.

Why does this happen?

CSV is a flexible but loosely structured format. Unlike Parquet, it lacks embedded schema metadata. Glue infers the schema by sampling a subset of files, and if the sampled files differ significantly, Glue assumes they belong to different data sets.

Best Practices

  • Normalize the CSV files before uploading to S3 by ensuring all the files under the same folder share the same structure as well as data types.
  • Convert to Parquet wherever possible. Parquet files carry their own schema metadata, which keeps the Glue Data Catalog tables consistent.

This highlights the importance of data hygiene before ingestion.
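As a rough illustration of the first best practice, the following sketch groups the CSV files in a local staging folder by their header row before upload. The function names and the local-folder layout are assumptions made for this example. It compares column names only, so data type checks would need to be added separately.

```python
import csv
from collections import defaultdict
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file as a tuple of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return tuple(next(csv.reader(handle), ()))


def group_by_header(folder):
    """Map each distinct header found in `folder` to the files that use it."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.csv")):
        groups[read_header(path)].append(path.name)
    return dict(groups)


def is_consistent(folder):
    """True when all CSV files in the folder share one header (or none exist)."""
    return len(group_by_header(folder)) <= 1
```

If `group_by_header` returns more than one entry, the files could be split into separate S3 prefixes (one per header) or normalized before the crawler ever sees them.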

2. How the S3 Layer and Crawler Settings Influence Crawler Behavior

While it might seem intuitive to drop files into a bucket and let Glue handle the rest, the reality is more nuanced, especially when dealing with partitioned data or multiple folders. Organize the S3 folders by schema, as mixing files with different structures under the same path is a recipe for disaster.

Best Practices

  • For partitioned data, use a consistent folder naming convention (e.g., s3://bucket/data/year=2025/month=08/) and ensure all files within the partitions follow the same schema.
  • Set crawler configurations based on the use case:
    • For first-time loads, use “Crawl all folders.”
    • If “Crawl all folders” is enabled, ensure schema consistency; otherwise, the crawler may create multiple tables when it runs.
    • For ongoing ingestion, use “Crawl only new subfolders” with partition update settings enabled.
    • Run crawlers automatically, either on a schedule or in response to S3 events. This can be achieved using EventBridge, Lambda, Step Functions, or SNS/SQS.
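The folder naming convention above can be checked before a crawler runs. This is a minimal sketch that parses Hive-style `name=value` segments from S3 object keys and flags keys that do not follow the expected partition order. The helper names and sample keys are hypothetical.

```python
def parse_partitions(key):
    """Extract Hive-style `name=value` folder segments from an S3 object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            parts[name] = value
    return parts


def inconsistent_keys(keys, expected):
    """Return the keys whose partition names differ from the expected order."""
    return [k for k in keys if list(parse_partitions(k)) != list(expected)]


if __name__ == "__main__":
    sample = [
        "data/year=2025/month=08/part-0.csv",
        "data/2025/08/part-1.csv",  # no name=value folders
    ]
    print(inconsistent_keys(sample, ["year", "month"]))
```

The key list would typically come from an S3 listing. It is passed in as a plain list here so the check stays independent of any AWS call.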

Example Flow

S3 → EventBridge → Glue Crawler

  1. Enable event notifications on your S3 bucket for ObjectCreated events.
  2. Send these events to EventBridge.
  3. Create a rule in EventBridge that matches the S3 event and triggers the Glue crawler.
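The flow above can be sketched as follows. This example assumes the EventBridge rule invokes a small Lambda function that starts the crawler, which is one common wiring and not necessarily the authors' setup. The bucket, prefix, and crawler names are placeholders. The pattern follows the `Object Created` event shape that S3 sends to EventBridge.

```python
def s3_object_created_pattern(bucket, prefix):
    """Event pattern matching S3 'Object Created' events under a key prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


def lambda_handler(event, context, glue=None, crawler_name="sales_crawler"):
    """Start the crawler; an already-running crawler is not treated as an error.

    `crawler_name` is a placeholder, and `glue` can be injected for testing.
    """
    if glue is None:
        import boto3  # imported lazily so the pattern helper needs no AWS SDK

        glue = boto3.client("glue")
    try:
        glue.start_crawler(Name=crawler_name)
        return {"started": True}
    except glue.exceptions.CrawlerRunningException:
        # Several files often land together; one crawler run covers them all.
        return {"started": False}
```

Swallowing `CrawlerRunningException` is a design choice for bursty uploads. A stricter pipeline might queue the request and retry after the current run finishes.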

Avoid Schema Drift Across Folders

By aligning your folder structure and crawler settings, or by using Glue ETL jobs to normalize the schema, you can significantly reduce the unpredictability of Glue crawlers and ensure the data catalog remains accurate and queryable.

3. Partitioning and Incremental Crawling: Keeping Your Catalog in Sync

AWS Glue supports partitioned tables, but keeping those partitions up to date requires careful crawler configuration and operational discipline. When new partitioned data is added to your S3 source, Glue crawlers don't automatically update the Data Catalog unless they are explicitly configured to do so. Even if the new data follows the same schema and folder structure, it may not appear in Athena unless the crawler is rerun with the right settings.

Key settings to enable:

  • Crawl only new subfolders: focuses the crawl on incremental changes, reducing scan time and cost.
  • Ignore the change and don't update the schema: prevents the crawler from altering the schema, so any schema updates must be applied manually.
  • Update all new and existing partitions with metadata from the table: keeps partition metadata fresh and aligned with the table definition.
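For crawlers managed through the API instead of the console, the three settings above appear to map to the `RecrawlPolicy`, `SchemaChangePolicy`, and `Configuration` arguments of the boto3 `update_crawler` call. The sketch below builds those arguments; the mapping is my reading of the Glue API and is worth confirming against the current documentation, and the crawler name is a placeholder.

```python
import json


def incremental_crawler_settings(crawler_name):
    """Keyword arguments for glue.update_crawler that enable incremental crawls."""
    return {
        "Name": crawler_name,
        # "Crawl only new subfolders"
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # "Ignore the change": log schema changes instead of applying them
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
        # "Update all new and existing partitions with metadata from the table"
        "Configuration": json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
            },
        }),
    }


def apply_settings(crawler_name):
    """Apply the settings to an existing crawler (requires boto3 and credentials)."""
    import boto3

    boto3.client("glue").update_crawler(**incremental_crawler_settings(crawler_name))
```

Keeping the settings in a pure function makes them easy to review and version alongside the rest of the pipeline code.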

Rerun the crawler in the following situations:

  • After adding new partitioned data.
  • After schema changes to update the structure.
  • After modifying table properties.

A best practice is to automate the crawler rerun whenever the source receives new data, either periodically with AWS Step Functions or through S3 event notifications routed via EventBridge.

4. Schema Evolution: Update Without Breaking

In dynamic data environments, schema changes are inevitable. New columns may be added, data types may evolve, or partitioning strategies may shift. AWS Glue provides several ways to handle schema evolution, but choosing the right approach is key to avoiding data loss and unnecessary table duplication.

Three Ways to Handle Schema Changes

  1. Update the schema using an ETL job
    • This is the most controlled and reliable method.
    • Use an AWS Glue job to read the new data and write it back with the updated schema.
    • Enable the enableUpdateCatalog option in the Glue job script to update the Data Catalog during the job run.
    • This approach ensures that schema changes are intentional and version-controlled.
  2. Use crawlers (with caution)
    • Configure the crawler to update an existing table.
    • Set the crawler to target the existing Data Catalog table.
    • Enable schema updates in the crawler settings.
    • This works well for minor changes but may still create new tables if the schema deviation is significant.
  3. Manually update the table schema
    • Navigate to the table in the AWS Glue Console.
    • Edit the schema manually and save.
    • This is quick but error-prone and not scalable for frequent changes.
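The first option (updating the schema from an ETL job) might look roughly like the sketch below inside a Glue job script. It uses the `getSink` pattern with `enableUpdateCatalog`; the function name, the `glueparquet` default, and the parameter names are choices made for this example. `glue_context` is expected to be the job's `GlueContext` and `frame` a `DynamicFrame`.

```python
def write_and_update_catalog(glue_context, frame, s3_path, database, table,
                             partition_keys=(), fmt="glueparquet"):
    """Write a DynamicFrame to S3 and update the Data Catalog in the same run."""
    sink = glue_context.getSink(
        connection_type="s3",
        path=s3_path,
        enableUpdateCatalog=True,             # let the job update the catalog
        updateBehavior="UPDATE_IN_DATABASE",  # apply schema changes to the table
        partitionKeys=list(partition_keys),
    )
    sink.setFormat(fmt)
    sink.setCatalogInfo(catalogDatabase=database, catalogTableName=table)
    sink.writeFrame(frame)
    return sink
```

Because the schema change travels with the job code, it can be reviewed and version-controlled like any other change to the pipeline.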

A best practice would be to enforce schema validation upstream and document schema versions and changes to maintain data lineage and auditability.

5. When Crawlers Fail Silently

One of the more frustrating experiences with AWS Glue crawlers is when they appear to run successfully, but the expected data doesn't show up in Athena. There are no errors, no warnings, and yet the new data or partitions are simply missing.

The situation often arises when:

  • The crawler is set to ignore the schema changes or not update existing partitions.
  • The new data is not picked up due to subtle schema mismatches or folder name inconsistencies.

How to fix it:

  • Rerun the crawler with the following settings:
    • Enable "Crawl only new subfolders."
    • Check "Update all new and existing partitions with metadata from the table."
  • Verify folder structure and schema consistency.
  • Check the crawler logs.
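Since a silent failure produces no error to search for, one way to detect it is to compare the partitions present in S3 with those registered in the Data Catalog. The sketch below does that comparison on plain lists; the helper names are hypothetical. `fetch_catalog_values` relies on the boto3 `get_partitions` paginator and needs AWS credentials, so it is kept separate from the pure comparison logic.

```python
def partition_values(key, partition_keys):
    """Return the partition values found in an S3 key, or None if any is missing."""
    folders = key.split("/")[:-1]  # ignore the file name itself
    found = dict(seg.split("=", 1) for seg in folders if "=" in seg)
    if all(name in found for name in partition_keys):
        return tuple(found[name] for name in partition_keys)
    return None


def missing_partitions(s3_keys, catalog_values, partition_keys):
    """Partitions that exist in S3 but are absent from the Data Catalog."""
    in_s3 = {partition_values(k, partition_keys) for k in s3_keys}
    in_s3.discard(None)  # keys that do not follow the partition layout
    return sorted(in_s3 - {tuple(v) for v in catalog_values})


def fetch_catalog_values(database, table):
    """Collect the partition values registered for a table (requires boto3)."""
    import boto3

    pages = boto3.client("glue").get_paginator("get_partitions").paginate(
        DatabaseName=database, TableName=table
    )
    return [p["Values"] for page in pages for p in page["Partitions"]]
```

A non-empty result from `missing_partitions` after a "successful" crawler run points at the settings or folder-naming issues listed above.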

6. Ditch the Crawler for More Control

While AWS Glue crawlers offer a convenient way to automate schema discovery and Data Catalog updates, they are not always the best tool for the job. It's often better to manage tables directly through Glue ETL jobs.

The reason to consider skipping crawlers is the unpredictable behavior of creating multiple tables, skipping partitions, or silently failing when schema inconsistencies arise.

Another reason is the limited control crawlers offer, along with the additional operational overhead they add to the data pipeline.

Glue ETL jobs give you full control over how the data is read, transformed, and written, along with how the Data Catalog is updated.

Benefits of Using Glue ETL Jobs Instead of Crawlers

  • Deterministic schema management explicitly using DynamicFrame or DataFrame transformations.
  • Use bookmarks to process only the new data incrementally.
  • Partition control
  • Catalog updates on your own terms.
  • Better error handling.

When to Use the Glue Job Approach

  • When the schema of the data evolves frequently.
  • When you need to enforce strict schema validation.
  • When you want to avoid the risk of Glue creating multiple unexpected tables.
  • When you need more reliability and auditability.
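As one example of catalog updates without a crawler, a job or script can register a new partition directly through the Glue API. The sketch below reuses the table's storage descriptor and overrides only the location. The function names are hypothetical; `get_table` and `batch_create_partition` are the boto3 Glue client calls assumed here, and running `register_partition` requires AWS credentials.

```python
import copy


def build_partition_input(table_storage, values, location):
    """Build a PartitionInput that reuses the table's storage descriptor."""
    storage = copy.deepcopy(table_storage)  # leave the table definition untouched
    storage["Location"] = location
    return {"Values": [str(v) for v in values], "StorageDescriptor": storage}


def register_partition(database, table, values, location):
    """Add one partition to an existing catalog table (requires boto3)."""
    import boto3

    glue = boto3.client("glue")
    table_storage = glue.get_table(DatabaseName=database, Name=table)[
        "Table"
    ]["StorageDescriptor"]
    response = glue.batch_create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInputList=[build_partition_input(table_storage, values, location)],
    )
    return response.get("Errors", [])
```

Inheriting the storage descriptor from the table keeps every partition on the same schema and format, which is the deterministic behavior that motivates skipping the crawler in the first place.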

Conclusion

AWS Glue crawlers are a powerful tool for automating schema discovery and maintaining the Data Catalog, but they come with their own quirks. As we have explored, real-world usage often reveals edge cases that aren't immediately obvious, especially when dealing with CSV files, evolving schemas, partitioned data, and mixed folder structures.

The key takeaway is that Glue crawlers work best when the data is clean, consistent, and well-structured. But in dynamic environments where schema changes are frequent and data arrives in unpredictable formats, relying solely on crawlers can lead to confusion, silent failures, and fragmented tables.

By understanding how crawlers behave and when to bypass them in favor of Glue ETL jobs, you can build more robust, predictable, and scalable pipelines. Whether you are managing a data lake or feeding downstream AI/ML systems, the reliability of the metadata layer is critical.

Ultimately, the goal isn't to avoid Glue crawlers altogether, but to use them wisely with the right configurations, folder structures, and fallback strategies in place. With these lessons, you will be better equipped to navigate the nuances of AWS Glue and build a data platform that's flexible and resilient.

