Stop Poisoning Your Models: How I Built a CV Dataset Quality Toolkit I Can Reuse Forever

CV data issues keep recurring. I built cv-quality — a toolkit to audit datasets, catch annotation errors, find mislabeled samples, and streamline labeling.

Sai Teja Erukude

May. 22, 26 · Tutorial

Most people focus heavily on model improvements while treating data quality as a secondary concern.

They spend hours tuning hyperparameters, testing new architectures, and following the latest research, only to see performance stall at the same frustrating accuracy ceiling. More training rarely fixes it. More augmentation often does not either. Even swapping one strong architecture for another may not change much.

The real issue is often in the data. Duplicate bounding boxes, incorrect labels, boxes too small to provide meaningful signal, and heavily imbalanced class distributions can quietly limit model performance long before architecture becomes the bottleneck.

In many machine learning projects, the model is not the first thing holding results back. The data is.

The Solve-Once, Apply-Several-Times Problem

I work across a wide range of domains — astronomical imagery, corn and other food-related datasets, medical imagery, and more. These domains look nothing alike, but the data quality problems are shockingly similar: mislabeled examples, class imbalance, annotation inconsistencies, and the endless challenge of knowing which unlabeled samples to annotate next.

I kept running into the same pattern. For each project, I would end up writing a new set of scripts: one notebook to inspect class distributions, another to catch annotation outliers, another to surface suspicious labels. It was the opposite of DRY — I was DST: Doing the Same Thing.

That is what pushed me toward a reusable approach. I strongly believe in the SOAST principle — Solve Once, Apply Several Times. If the same problem keeps appearing across projects, it should be turned into a proper solution, not rebuilt from scratch every time.

So I built one: cv-quality.

What Is cv-quality?

cv-quality is a Python toolkit I built specifically for computer vision dataset quality workflows. It handles four of the most painful, recurring problems I run into:

Dataset statistics & class imbalance analysis
Annotation quality checks (out-of-bounds boxes, duplicates, tiny annotations)
Label quality scoring & mislabel detection using Confident Learning and kNN
Active learning loop orchestration — knowing which samples to annotate next

It supports COCO JSON and ImageNet-style datasets natively, and because the core modules work on numpy arrays, I can plug in Pascal VOC, YOLO, Roboflow exports, or anything else with minimal glue code.

Let me walk you through how I actually use it.

Installation

    PowerShell
   
# Core (no ML framework required)
pip install cv-quality

# With PyTorch backend (for active learning)
pip install "cv-quality[torch]"

# With TensorFlow backend
pip install "cv-quality[tensorflow]"

# Everything
pip install "cv-quality[all,dev]"

The package installs as cv-quality but is imported as cvquality:

    Python
   
import cvquality

Step 1: Understanding What's Actually in My Dataset

Before I touch any model, I now always run a dataset audit. The DatasetStats module gives me class counts, bounding box distributions, Gini coefficient for imbalance, Shannon entropy, a co-occurrence matrix, and long-tail category analysis — all in one shot.

    Python
   
from cvquality.io import COCODataset
from cvquality.stats import DatasetStats

ds = COCODataset("annotations/instances_train2017.json")
stats = DatasetStats(ds)
print(stats.summary())

    JSON
   
 

{
  'num_images': 118287,
  'num_categories': 80,
  'class_imbalance': {'gini': 0.42, 'entropy': 5.1},
  ...
}

Which Categories Are Underrepresented?

    Python
   
print(stats.tail_categories(percentile=10))

    JSON
   
   ['toaster', 'hair drier', 'parking meter', ...]

That Gini coefficient alone tells me a story. If it's creeping above 0.4, I know I have an imbalance problem that'll bite me downstream. Instead of discovering this after training, I now catch it before I write a single line of model code.

This was the first time I looked at COCO's own training set and thought — huh, no wonder my detector struggled with toasters.
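For intuition about what the two imbalance numbers measure, here is a minimal NumPy sketch. It is not cv-quality's implementation: the function names (`gini`, `shannon_entropy`, `tail_classes`) and the toy counts are hypothetical, and reporting entropy in bits is an assumption.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of class counts: 0 = balanced, (n-1)/n = one class holds everything."""
    x = np.sort(np.asarray(counts, dtype=float))  # ascending order
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n)

def shannon_entropy(counts, base=2.0):
    """Entropy of the class distribution; higher means closer to uniform."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()  # drop empty classes, normalize to probabilities
    return float(-(p * np.log(p)).sum() / np.log(base))

def tail_classes(counts_by_name, percentile=10):
    """Names of classes whose count falls at or below the given percentile of counts."""
    names = list(counts_by_name)
    counts = np.array([counts_by_name[name] for name in names], dtype=float)
    cutoff = np.percentile(counts, percentile)
    return [name for name, c in zip(names, counts) if c <= cutoff]

# Hypothetical toy counts, not real COCO numbers
toy = {"person": 100, "car": 90, "dog": 80, "toaster": 1}
print(gini(list(toy.values())), shannon_entropy(list(toy.values())))
print(tail_classes(toy, percentile=10))
```

A perfectly balanced dataset gives a Gini of 0 and an entropy of log2(number of classes), so both numbers are easiest to read relative to those reference points.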

Step 2: Annotation Integrity Checks

Annotators are human. Annotation tools have bugs. Exports can corrupt coordinates. I've personally seen bounding boxes that extend outside the image frame, near-duplicate boxes overlapping a single object, and boxes with area less than a square pixel.

AnnotationChecker finds all of these:

    Python
   
from cvquality.quality import AnnotationChecker

checker = AnnotationChecker(ds, min_bbox_area=4.0, max_overlap_iou=0.85)
summary = checker.summary()
print(f"Total issues: {summary['total_issues']}")

    JSON
   
   {'total_issues': 312, 'by_type': {'out_of_bounds': 5, 'near_duplicate': 307}, ...}

307 near-duplicate annotations in a dataset I thought was clean. That's the kind of thing that silently inflates your training loss and confuses your model during NMS. Now this runs at the start of every new project. Non-negotiable.
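The three checks are simple to reason about on their own. The sketch below shows one way they could be written for a single image, assuming COCO-style `[x, y, width, height]` boxes in pixels. The function names and the `"tiny"` issue label are hypothetical, and the library's actual rules (for example, whether duplicates must share a class) may differ.

```python
def iou_xywh(a, b):
    """Intersection-over-union of two boxes in [x, y, width, height] format."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def check_image(boxes, labels, img_w, img_h, min_area=4.0, max_iou=0.85):
    """Return a list of (issue_type, box_index) tuples for one image."""
    issues = []
    for i, (x, y, w, h) in enumerate(boxes):
        # Box extends past any edge of the image
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            issues.append(("out_of_bounds", i))
        # Box is too small to carry useful signal
        if w * h < min_area:
            issues.append(("tiny", i))
    # Assumption: only same-class pairs count as near-duplicates
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if labels[i] == labels[j] and iou_xywh(boxes[i], boxes[j]) > max_iou:
                issues.append(("near_duplicate", j))
    return issues

boxes = [[10, 10, 50, 50], [11, 10, 50, 50], [90, 90, 20, 20], [5, 5, 1, 1]]
labels = [1, 1, 2, 3]
print(check_image(boxes, labels, img_w=100, img_h=100))
```

The pairwise loop is quadratic in the number of boxes per image, which is usually acceptable because images rarely carry more than a few hundred annotations.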

Step 3: Label Quality Scoring with Confident Learning

This is where things get really interesting. Label errors, meaning images assigned the wrong class, are notoriously hard to find manually. You can stare at a dataset for hours and miss them.

I use Confident Learning, a statistical technique that compares your model's out-of-fold predicted probabilities against the given labels to estimate which labels are likely wrong.
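"Out-of-fold" means each sample's probabilities come from a model that never saw that sample during training. The sketch below illustrates the idea with a hand-rolled K-fold split and a toy nearest-centroid classifier standing in for a real model. Everything here (`centroid_probs`, `out_of_fold_probs`, the synthetic blobs) is hypothetical and only meant to show the data flow.

```python
import numpy as np

def centroid_probs(train_x, train_y, test_x, num_classes):
    """Toy classifier: softmax over negative squared distance to each class centroid.
    Assumes every class has at least one training sample."""
    centroids = np.stack([train_x[train_y == k].mean(axis=0) for k in range(num_classes)])
    d2 = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def out_of_fold_probs(features, labels, num_classes, n_folds=5, seed=0):
    """Predict every sample with a model fitted on the other folds only."""
    n = len(labels)
    order = np.random.default_rng(seed).permutation(n)
    pred_probs = np.zeros((n, num_classes))
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)  # all indices except the held-out fold
        pred_probs[held_out] = centroid_probs(
            features[train], labels[train], features[held_out], num_classes
        )
    return pred_probs

# Synthetic two-class data for demonstration only
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
pred_probs = out_of_fold_probs(features, labels, num_classes=2)
print(pred_probs.shape)
```

In a real project the toy classifier would be replaced by the actual network, retrained once per fold.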

    Python
   
from cvquality.quality import LabelQualityScorer
import numpy as np

# pred_probs: (N, K) out-of-fold predictions from your trained model
lq = LabelQualityScorer(pred_probs, labels)
issues = lq.ranked_issues(top_k=50)   # worst labels first
print(lq.summary())

    JSON
   
   {'estimated_error_rate': 0.032, 'flagged_count': 47, ...}

A 3.2% estimated label error rate. That's 47 images the model is actively learning the wrong thing from. Doesn't sound like much until you realize those labels can disproportionately hurt rare classes — exactly the ones you're already struggling with.

I review the top-ranked issues manually. About 80% of the time, the flags are legitimate. The remaining false positives are usually edge cases worth knowing about anyway.
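To make the idea less of a black box, here is a simplified sketch of the core Confident Learning rule: compute a per-class confidence threshold, then flag samples that confidently belong to a class other than their given label. It is a reduced illustration, not the full published method and not cv-quality's code. The name `find_label_issues` and the toy arrays are hypothetical, and it assumes every class appears at least once in `labels`.

```python
import numpy as np

def find_label_issues(pred_probs, labels):
    """Return indices of likely label errors, least self-confident first."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    n, k = pred_probs.shape
    # Threshold for class c: mean predicted probability of c among samples labeled c
    thresholds = np.array([pred_probs[labels == c, c].mean() for c in range(k)])
    above = pred_probs >= thresholds
    # Most probable class among those that clear their own threshold
    confident = np.where(above, pred_probs, -np.inf).argmax(axis=1)
    flagged = above.any(axis=1) & (confident != labels)
    # Rank flagged samples by the probability assigned to their given label
    self_conf = pred_probs[np.arange(n), labels]
    idx = np.flatnonzero(flagged)
    return idx[np.argsort(self_conf[idx])]

toy_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]])
toy_labels = np.array([0, 0, 1, 1, 0])  # the last sample looks mislabeled
print(find_label_issues(toy_probs, toy_labels))
```

Per-class thresholds matter because a model is rarely equally confident across classes. A fixed global cutoff would over-flag the classes the model finds hard.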

Step 4: Mislabel Detection via kNN

Sometimes I don't have out-of-fold predictions yet — especially at the start of a project when I haven't trained anything. For those situations, I use the kNN-based mislabel detector, which works purely on embeddings.

The idea: if a sample's embedding is surrounded by neighbors from a different class, something is probably off.

    Python
   
from cvquality.quality import MislabelDetector

# embeddings: (N, D) from a pretrained backbone (e.g., ResNet features)
md = MislabelDetector(embeddings, labels, n_neighbors=15)
candidates = md.rank_candidates(top_k=100)

    JSON
   
   [{'index': 2341, 'given_label': 3, 'suggested_label': 7, 'quality_score': 0.12}, ...]

I've had cases where the suggested_label was obviously correct — a sample labeled as "car" that was clearly a "truck" to any human eye but had slipped through the annotation process. The quality score gave me a ranked list to work through efficiently rather than eyeballing thousands of images.
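The neighbor-voting idea fits in a few lines of NumPy. The sketch below uses brute-force distances and a plain majority vote. It is an illustration under those assumptions rather than the library's algorithm, and `knn_label_scores` is a hypothetical name.

```python
import numpy as np

def knn_label_scores(embeddings, labels, n_neighbors=15):
    """Return (quality, suggested): the fraction of neighbors that agree with each
    sample's given label, and the majority label among its neighbors."""
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    num_classes = int(y.max()) + 1
    # Pairwise squared Euclidean distances; O(N^2) memory, fine for small N only
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :n_neighbors]
    votes = np.zeros((len(y), num_classes))
    for c in range(num_classes):
        votes[:, c] = (y[nn] == c).mean(axis=1)
    quality = votes[np.arange(len(y)), y]
    suggested = votes.argmax(axis=1)
    return quality, suggested

# Two tight clusters, with one sample in the first cluster given the wrong label
x = np.array([[0.0, i * 0.01] for i in range(10)] + [[10.0, i * 0.01] for i in range(10)])
y = np.array([0] * 10 + [1] * 10)
y[3] = 1
quality, suggested = knn_label_scores(x, y, n_neighbors=5)
worst = int(np.argmin(quality))
print(worst, quality[worst], suggested[worst])
```

For datasets beyond a few tens of thousands of samples, an approximate nearest-neighbor index would replace the dense distance matrix.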

Step 5: Active Learning — Spending My Annotation Budget Wisely

Active learning is one of those topics that looks intimidating in papers but is surprisingly practical once you have the scaffolding. The insight is simple: not all unlabeled data is equally valuable to label. You want to label the samples your model is most uncertain about — or the ones that are most different from what it's already seen.

cv-quality includes three families of active learning strategies:

Uncertainty: entropy, margin, least-confidence, BALD
Diversity: CoreSet, cluster-margin, MinMax
Error-Localization: gradient norm, spatial entropy
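The uncertainty family is the easiest to illustrate because each score is a one-line function of the predicted probabilities. The sketch below covers entropy, least-confidence, and margin. BALD is left out because it needs several stochastic forward passes. The function names are hypothetical and this is not the library's code.

```python
import numpy as np

def uncertainty_scores(pred_probs, method="entropy"):
    """Higher score = more uncertain. pred_probs has shape (N, K) with K >= 2."""
    p = np.asarray(pred_probs, dtype=float)
    if method == "entropy":
        return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)
    if method == "least_confidence":
        return 1.0 - p.max(axis=1)
    if method == "margin":
        top2 = np.sort(p, axis=1)[:, -2:]  # second-largest, largest
        return 1.0 - (top2[:, 1] - top2[:, 0])
    raise ValueError(f"unknown method: {method}")

def query(pred_probs, budget, method="entropy"):
    """Indices of the `budget` most uncertain samples, most uncertain first."""
    scores = uncertainty_scores(pred_probs, method)
    return np.argsort(-scores)[:budget]

probs = np.array([[0.98, 0.01, 0.01], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]])
print(query(probs, budget=2))  # -> [1 2]
```

All three scores agree on this toy example. They tend to diverge when there are many classes, where margin looks only at the top two while entropy weighs the whole distribution.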

And it wraps them in a loop orchestrator that manages the train-query-label-retrain cycle:

    Python
   
 

from cvquality.active_learning import ActiveLearningLoop, UncertaintyStrategy
from cvquality.active_learning.backends import PyTorchBackend
from cvquality.active_learning.loop import LoopConfig
import torchvision.models as M

model = M.resnet18(weights=M.ResNet18_Weights.DEFAULT)
backend = PyTorchBackend(model, device="cuda")
strategy = UncertaintyStrategy("entropy")
loop = ActiveLearningLoop(
    backend, strategy, images, labels,
    config=LoopConfig(budget_per_round=200, max_rounds=5),
)
history = loop.run()
print(loop.summary())
  

In practice, I've found that annotating 200 strategically chosen samples per round outperforms annotating 1000 random samples. This matters a lot when annotation is expensive — medical imagery, satellite data, anything requiring domain experts.
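Diversity-based selection, the other half of the "most different from what it's already seen" idea, can be sketched with the greedy k-center rule commonly used for CoreSet-style sampling: repeatedly pick the unlabeled point farthest from everything selected so far. The function name and toy data are hypothetical, and the library's CoreSet strategy may be implemented differently.

```python
import numpy as np

def k_center_greedy(embeddings, labeled_idx, budget):
    """Pick `budget` indices that greedily maximize distance to the covered set."""
    x = np.asarray(embeddings, dtype=float)
    # Distance from every point to its nearest already-labeled point
    min_dist = np.full(len(x), np.inf)
    for i in labeled_idx:
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[i], axis=1))
    picked = []
    for _ in range(budget):
        j = int(np.argmax(min_dist))  # farthest point from the covered set
        picked.append(j)
        # The new pick now covers its neighborhood
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[j], axis=1))
    return picked

points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [20.0]])
print(k_center_greedy(points, labeled_idx=[0], budget=2))  # -> [5, 3]
```

Already-labeled and already-picked points have distance zero, so they are never selected again as long as other points remain.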

The COCO Full-Pipeline Recipe

For COCO-format datasets, I can run the entire pipeline — stats, annotation checks, label quality, and reporting — with a single recipe:

    Python
   
 

from cvquality.recipes import COCORecipe

recipe = COCORecipe(
    "annotations/instances_train2017.json",
    image_dir="/data/coco/train2017",
    report_dir="./reports",
    dataset_name="COCO-2017-train",
)
result = recipe.run()

# Writes reports/instances_train2017_report.json + .html
  

I get a full HTML report I can share with teammates or clients. No more "trust me, the data is clean" — now I have a document that proves it (or reveals exactly what we need to fix).

The CLI for Quick Audits

When I just want a fast sanity check without writing any Python:

    PowerShell
   
   # Dataset statistics
cvquality stats annotations/instances_val2017.json

# Annotation checks
cvquality check annotations/instances_val2017.json --min-bbox-area 4 --max-iou 0.85

# Full HTML + JSON report
cvquality report annotations/instances_val2017.json --output-dir ./reports --name "COCO-val"

# ImageNet-style folder
cvquality imagenet /data/imagenet/val --output-dir ./reports

I've added cvquality check to my data ingestion pipelines as a gate. If it finds more than a threshold of issues, the pipeline raises an alert before any training job even starts.
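A gate like this can be a small pure function over the summary dictionary. The sketch below assumes the dictionary has the shape printed in Step 2 (`total_issues` plus a `by_type` mapping). The function name, the default threshold, and the notion of "blocking" issue types are hypothetical choices, not features of the library.

```python
def quality_gate(summary, max_issues=100, blocking_types=("out_of_bounds",)):
    """Return (passed, reasons) for an annotation-check summary dict."""
    by_type = summary.get("by_type", {})
    reasons = []
    total = summary.get("total_issues", 0)
    if total > max_issues:
        reasons.append(f"{total} issues exceeds limit of {max_issues}")
    # Some issue types are treated as unacceptable at any count
    for issue_type in blocking_types:
        count = by_type.get(issue_type, 0)
        if count > 0:
            reasons.append(f"{count} blocking issue(s) of type {issue_type}")
    return (not reasons), reasons

# Summary values copied from the Step 2 output above
summary = {"total_issues": 312, "by_type": {"out_of_bounds": 5, "near_duplicate": 307}}
passed, reasons = quality_gate(summary)
print(passed, reasons)
```

In a pipeline, a failed gate would typically raise an exception or exit with a non-zero status so the scheduler stops before launching the training job.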

Format Agnosticism: It Works with Everything

One thing I was careful about when designing this: COCO and ImageNet are common, but not universal. Pascal VOC, YOLO txt format, Roboflow exports, custom CSVs — these are all real formats in real projects.

The stats, quality, and active learning modules work on numpy arrays. That means:

    Python
   
 

from cvquality.quality import LabelQualityScorer, MislabelDetector
from cvquality.active_learning.strategies import UncertaintyStrategy

# Your own loader: Pascal VOC, YOLO, CSV, anything
embeddings  = my_loader.get_embeddings()   # (N, D)
labels      = my_loader.get_labels()       # (N,)
pred_probs  = my_model.predict(images)     # (N, K) class probabilities, ideally out-of-fold

lq       = LabelQualityScorer(pred_probs, labels)
md       = MislabelDetector(embeddings, labels)
strategy = UncertaintyStrategy("entropy")
indices  = strategy.query(pred_probs, budget=100)
  

Load your data however you want. Pass arrays. Done.
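As an example of the glue code involved, here is a sketch of a loader for YOLO txt labels, where each line is `class_id x_center y_center width height` with coordinates normalized to the image size. It converts them to pixel-space `[x, y, width, height]` arrays. The function name is hypothetical, and how those arrays are then handed to cv-quality is left open.

```python
import numpy as np

def parse_yolo_labels(text, img_w, img_h):
    """Parse YOLO-format label text into (class_ids, boxes_xywh_pixels)."""
    class_ids, boxes = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank lines and rows that are not plain boxes
        cls = int(parts[0])
        xc, yc, w, h = (float(v) for v in parts[1:])
        bw, bh = w * img_w, h * img_h
        # Convert center-based normalized coords to top-left pixel coords
        boxes.append([xc * img_w - bw / 2, yc * img_h - bh / 2, bw, bh])
        class_ids.append(cls)
    return np.array(class_ids, dtype=int), np.array(boxes, dtype=float).reshape(-1, 4)

sample = "0 0.5 0.5 0.2 0.4\n\n2 0.25 0.75 0.5 0.5\n"
ids, boxes = parse_yolo_labels(sample, img_w=100, img_h=200)
print(ids, boxes)
```

The `reshape(-1, 4)` keeps the output shape consistent for images with no annotations, which avoids special-casing empty label files downstream.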

The SOAST Payoff

Since releasing cv-quality, I've run it on six different projects. Each time, it took me about 15 minutes to audit a dataset that used to take days of ad-hoc scripting. More importantly, every single audit found something — mislabels, annotation artifacts, imbalance I hadn't noticed.

That's the SOAST payoff. Build the tool properly once. Apply it everywhere. Let the tool find what human eyes miss.

What's Next?

I'm planning to extend cv-quality with:

Segmentation mask checks — polygon/RLE integrity for COCO segmentation tasks
Built-in Pascal VOC and YOLO readers — so you don't need to write converters
HuggingFace Datasets integration — for teams using the HF ecosystem
Drift detection — flagging when a new batch of data looks statistically different from your training distribution

If you work in computer vision and data quality has bitten you before — which, if you've been in this field more than six months, it has — give cv-quality a try.

    PowerShell
   
   pip install cv-quality

PyPI: https://pypi.org/project/cv-quality/

GitHub: https://github.com/SaiTeja-Erukude/cv-quality

The real model improvement secret isn't a better architecture. It's better data.
