Code Quality Had 5 Pillars. AI Broke 3 and Created 2 We Can’t Measure

AI-generated code broke three of the five classical non-functional quality pillars — readability, maintainability, and security — while creating two new dimensions

Abgar Simonean

May. 12, 26 · Analysis


If you've been writing production software for more than a few years, you grew up with a gut sense of what "good code" meant beyond "it works." You could look at a pull request and feel whether the code was clean, or if the logic was going to be a nightmare to debug in six months.

We formalized that gut sense into five things: readability, maintainability, security hygiene, documentation, and structural simplicity. We built tools to measure them, argued about them in code reviews, and underneath all of it sat an assumption so obvious that nobody bothered to say it out loud — a human wrote this, and that human can explain why every line is there.

That assumption doesn't hold anymore. About a third of new code in production repositories is now AI-assisted, and the fraction is growing fast. I've spent the last several months digging into the empirical research that's piled up since 2024, and the picture it paints isn't the one most of us expected. AI-generated code effectively broke the way we think about code quality itself.

Three of those five pillars have cracked. And two entirely new quality dimensions have shown up that we have no idea how to measure yet.

What We Used to Agree On

Before I get into what changed, it's worth being explicit about what the old model actually was, because most of us internalized it without ever writing it down.

Readability was the idea that another developer could open your file and follow the logic without playing detective. Good names, clear control flow, predictable formatting. If you read the code, you could reconstruct the reasoning. Checkstyle, ESLint, and a thousand IDE plugins enforced it.

Maintainability was about what happened six months later when someone else needed to change your code. Cyclomatic complexity, the Maintainability Index, Halstead metrics — all of them tried to quantify how expensive a piece of code would be to modify. SonarQube and Code Climate tracked this across entire repositories, and teams used it to spot trouble early.
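To make the first of those metrics concrete, here is a minimal sketch of how cyclomatic complexity can be approximated for Python source with the standard `ast` module. The set of node types counted as decision points is a simplifying assumption, and `SAMPLE` is an invented snippet; production tools use more careful rules.

```python
import ast

# Node types treated as decision points (a simplified, assumed set).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


# Invented example function, used only to exercise the counter.
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

sample_score = cyclomatic_complexity(SAMPLE)
print(sample_score)
```

The counter sees two `if` statements, one `for` loop, and one `and`, so the sample scores 5. Nothing in the calculation depends on who, or what, wrote the function.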

Security hygiene meant avoiding the known mistakes — SQL injection, hardcoded secrets, insecure deserialization. SAST tools like Checkmarx, Fortify, and CodeQL matched your code against libraries of known-bad patterns mapped to CWE and OWASP. If a developer was trained to avoid these patterns, they mostly did.

Documentation ranged from Javadoc coverage to well-named methods that made comments unnecessary. The metric was blunt (comment density, doc coverage percentage), but it worked well enough. The developer who wrote the code understood the domain context and captured at least some of it.

Structural simplicity was DRY, single responsibility, short methods, and low coupling. Code smell detectors and duplication analyzers enforced it. Simpler code was cheaper to maintain, and that was reason enough.

None of this was perfect, but it formed a coherent system. Every piece of it assumed a human author with domain knowledge, architectural awareness, and the ability to defend their choices in a review.

What the Data Shows

I want to be precise here, because most of the conversation around AI code quality has been vibes and anecdotes. The research that's come out in the last year gives us actual numbers, and some of them are genuinely surprising.

Readability Isn't What It Looks Like Anymore

One finding stood out: a study presented at MSR 2026 ("Do AI Agents Really Improve Code Readability?") analyzed 403 commits from real-world repositories in which AI agents specifically tried to improve readability. The researchers measured the code before and after each commit using standard metrics.

In 56.1% of those readability-focused commits, the Maintainability Index went down. Cyclomatic complexity went up in 42.7% of them. The agents mostly went after logic complexity and documentation — not trivial stuff like formatting — and yet the structural result was worse.
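For readers who have not looked at the Maintainability Index in a while, the sketch below uses one widely cited classic formulation (171 minus weighted terms for Halstead volume, cyclomatic complexity, and lines of code). The study may have used a different variant, and the before/after numbers here are invented purely for illustration.

```python
import math


def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: int,
                          lines_of_code: int) -> float:
    """Classic unnormalized Maintainability Index; higher is better."""
    return (171
            - 5.2 * math.log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * math.log(lines_of_code))


# Hypothetical measurements for one file around a "readability" commit.
before = maintainability_index(halstead_volume=1000,
                               cyclomatic_complexity=8,
                               lines_of_code=60)
after = maintainability_index(halstead_volume=1400,
                              cyclomatic_complexity=11,
                              lines_of_code=85)

print(f"before={before:.1f} after={after:.1f}")
```

Under this formulation, a commit that adds lines and branches lowers the index even when the result looks tidier. That is consistent with the pattern the study reports, though it is not a reproduction of its method.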

I've started calling this cosmetic readability. The code looks great on the surface: it is well-formatted, the names are consistent, and there are comments in the right places. Dig deeper, though, and the control flow is more tangled than what a human would have written for the same task. The PR passes visual inspection because it looks professional.

Maintainability Debt Is Accumulating in the Background

A large-scale study across more than 500,000 code samples in Python and Java found that AI-generated code carries 1.7x more issues than human-written code overall, with maintainability errors running 1.64x higher and logic errors 1.75x more frequent.

A separate controlled comparison — 642 AI solutions vs. 107 expert-written ones — found 34% higher cyclomatic complexity and 2.1 times more code duplication in the AI output. The AI code also ran 15–40% slower and ate 25% more memory.

None of this shows up in the test suite. Everything looks fine, yet 66% of developers report spending time fixing "almost right" AI code: code that works but needs human correction for design, edge cases, or integration issues. Industry surveys project that 75% of tech leaders will face moderate-to-severe technical debt by 2026 as a result of AI-driven development velocity.

The old maintainability metrics still work in the mechanical sense — cyclomatic complexity doesn't care who wrote the code. Still, when a third of your codebase carries 1.64x the maintenance burden per unit, and the volume keeps climbing, the aggregate debt outruns any review process you can realistically staff.
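The arithmetic behind that claim can be sketched in a few lines. This assumes, as a simplification, that burden scales linearly with the share of AI-assisted code and that the 1.64x figure applies uniformly; `blended_burden` is a hypothetical helper, not a metric from the cited studies.

```python
def blended_burden(ai_share: float, ai_multiplier: float = 1.64) -> float:
    """Average maintenance burden per unit of code, with human code as 1.0."""
    return (1.0 - ai_share) * 1.0 + ai_share * ai_multiplier


# How the codebase-wide average moves as the AI-assisted share grows.
for share in (1 / 3, 0.5, 0.75):
    print(f"AI share {share:.0%}: {blended_burden(share):.2f}x baseline")
```

At a one-third share the blended figure is roughly 1.21x the all-human baseline, and it keeps rising with the share. Because total code volume is growing too, the absolute amount of work grows faster than this ratio alone suggests.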

Security Isn't Improving, and the Reason Is Structural

Now for security, the scariest of the three. Veracode tested 80 coding tasks across more than 100 LLMs and found that 45% of the generated code introduced vulnerabilities from the OWASP Top 10. Their Spring 2026 follow-up, run after GPT-5.1, GPT-5.2, Gemini 3, and the other newest models had shipped, showed security performance flat: statistically indistinguishable from models released two years earlier.

In my opinion, this isn't a bug that will get patched in the next release. It looks like a training problem, and it will likely stay in most LLMs for a long time.

LLMs learn to generate code that passes tests, compiles, and satisfies the prompt. That's the feedback signal they're optimized on, and under that signal secure and insecure code are functionally equivalent. A parameterized SQL query and a string-concatenated one both return the same rows for well-behaved input. The model has no training signal (yet) that distinguishes them because during evaluation, both "work."
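A small sketch with Python's built-in `sqlite3` module shows why a functional test cannot tell the two apart. The table, the function names, and the hostile input are all made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "admin"), ("bob", "viewer")])
    return conn


def find_user_concat(conn, name):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_param(conn, name):
    # Safer: the driver binds the value, so it is never parsed as SQL.
    query = "SELECT name, role FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()

# A typical functional test uses benign input, and both versions agree.
same_on_benign = find_user_concat(conn, "bob") == find_user_param(conn, "bob")

# Only hostile input separates them.
hostile = "x' OR '1'='1"
leaked = find_user_concat(conn, hostile)   # returns every row
blocked = find_user_param(conn, hostile)   # returns no rows

print(same_on_benign, len(leaked), len(blocked))
```

A test suite that only checks the benign case rewards both functions equally, and that is the signal the model was trained on.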

That's why the per-vulnerability numbers are so uneven. SQL injection, which follows a recognizable syntactic pattern the model has seen flagged thousands of times, gets handled reasonably well. But cross-site scripting — where you need to trace user input through multiple function calls, transformations, and template renderings — only lands securely about 12–13% of the time. Log injection fails at an 88% rate.

The model is good at local pattern matching and bad at global context reasoning. The conclusion from Veracode's data is that bigger models don't perform meaningfully better than smaller ones on security.
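Log injection is a useful illustration of why these failures are easy to miss. In the sketch below, which uses only the standard `logging` module, the logging call itself looks harmless; the problem only appears when you know the value came from a user. The logger name, the `neutralize` helper, and the attacker string are assumptions made for the example.

```python
import io
import logging


def build_logger(stream) -> logging.Logger:
    """Create an isolated logger that writes plain lines to `stream`."""
    logger = logging.getLogger("login-audit-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def neutralize(value: str) -> str:
    # Escape CR/LF so user input cannot start a new log record.
    return value.replace("\r", "\\r").replace("\n", "\\n")


stream = io.StringIO()
log = build_logger(stream)

# Attacker-controlled username containing an embedded newline.
username = "mallory\nINFO login ok for admin"

log.info("login failed for %s", username)              # forges a second line
log.info("login failed for %s", neutralize(username))  # stays on one line

lines = stream.getvalue().splitlines()
for line in lines:
    print(line)
```

The first call produces two lines in the log, the second of which is a forged success entry. Nothing about the call site is syntactically suspicious, which is the kind of non-local reasoning the per-vulnerability numbers suggest models struggle with.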

Java developers should pay particular attention here: Java had a security failure rate above 70%, likely because its long history as a server-side language filled the training data with examples written before modern security practices took hold, which the models faithfully reproduce.

What Didn't Break

Documentation has actually improved in one narrow sense. AI-generated code tends to include more comments and docstrings than the average human developer bothers to write. Models learned from well-documented open-source libraries, and they reproduce that pattern.

Small catch, though: AI comments describe what the code does — which you can already see by reading it. They almost never capture why it exists, what edge case it guards against, or what architectural decision led to this approach.

Structural simplicity holds up at the function level. Individual AI-generated functions tend to follow conventions from their training data. At the project level, however, duplication runs 2.1x higher because the model solves similar problems independently across modules instead of extracting shared abstractions.
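Project-level duplication of this kind is often invisible to text-based diffing because the copies use different names. One possible way to surface it, sketched here under the assumption that "similar" means "same syntax tree once identifiers are erased", is to fingerprint normalized function ASTs. The module contents are invented, and real duplication analyzers use more tolerant matching.

```python
import ast
import copy
import hashlib
from collections import defaultdict


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers with a placeholder so only structure remains."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node


def fingerprint(func: ast.FunctionDef) -> str:
    """Hash a function's structure, ignoring the names it uses."""
    normalized = _Normalizer().visit(copy.deepcopy(func))
    return hashlib.sha256(ast.dump(normalized).encode("utf-8")).hexdigest()


def find_structural_duplicates(modules):
    """Group functions across modules that share the same fingerprint."""
    groups = defaultdict(list)
    for module_name, source in modules.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                qualified_name = f"{module_name}.{node.name}"
                groups[fingerprint(node)].append(qualified_name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical modules where the same logic was generated twice.
MODULES = {
    "billing": (
        "def total_with_tax(amount, rate):\n"
        "    result = amount + amount * rate\n"
        "    return round(result, 2)\n"
        "\n"
        "def label(order_id):\n"
        "    return 'order-' + str(order_id)\n"
    ),
    "shipping": (
        "def cost_with_fee(base, fee):\n"
        "    value = base + base * fee\n"
        "    return round(value, 2)\n"
    ),
}

duplicates = find_structural_duplicates(MODULES)
print(duplicates)
```

Here `billing.total_with_tax` and `shipping.cost_with_fee` are flagged as the same function under different names, which is the shared abstraction that was never extracted.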

Two New Quality Dimensions Nobody Is Measuring

Intent Fidelity

Does the generated code match what the developer actually meant?

This sounds like functional correctness, but it's different. Code can pass every test and still implement the wrong business logic, miss an edge case the developer would have caught, or satisfy the literal prompt while missing the implicit requirements that any experienced colleague would have understood from context.

LLMs don't know your authorization model. They don't know your tenant boundaries or your data classification rules. They don't know that your staging environment has different security configurations than production. They generate code that satisfies the visible specification and skips everything invisible — the input validation specific to your context, the access checks for your RBAC model, the subtle safety differences between environments.
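A toy example makes the gap visible. Suppose the prompt was "return the invoice with this id" in a multi-tenant system. Everything below, including the `Invoice` type, the tenant names, and both lookup functions, is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    tenant_id: str
    amount: float


# Hypothetical in-memory store standing in for a database.
INVOICES = {
    1: Invoice(1, "acme", 120.0),
    2: Invoice(2, "globex", 75.5),
}


def get_invoice_literal(invoice_id: int) -> Invoice:
    """Satisfies the literal prompt: look up an invoice by id."""
    return INVOICES[invoice_id]


def get_invoice_intended(invoice_id: int, caller_tenant: str) -> Invoice:
    """Adds the unstated rule: callers only see their own tenant's data."""
    invoice = INVOICES[invoice_id]
    if invoice.tenant_id != caller_tenant:
        raise PermissionError("invoice belongs to another tenant")
    return invoice


def cross_tenant_read_blocked() -> bool:
    """Check whether an 'acme' caller is stopped from reading invoice 2."""
    try:
        get_invoice_intended(2, caller_tenant="acme")
    except PermissionError:
        return True
    return False


# The literal version passes the obvious functional test...
literal_amount = get_invoice_literal(2).amount
# ...but only the intended version enforces the tenant boundary.
boundary_enforced = cross_tenant_read_blocked()

print(literal_amount, boundary_enforced)
```

Both functions return the right invoice for the right caller. Only someone who knows the tenant rule exists would think to write the test that tells them apart.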

No existing tool measures this. The question that actually matters: "Is this what the developer meant, given everything they know about the system that never made it into the prompt?"

There's an emerging category of "intent-aware" verification tools that try to track the original prompt, repo history, and documentation to verify alignment between what was asked and what was generated. But these are early-stage, and the metric itself — how do you score intent fidelity? — isn't defined yet.

Architectural Coherence Across AI Contributions

Individual AI-generated PRs can be perfectly fine in isolation. Each one passes tests, follows conventions, and looks clean. But zoom out across hundreds of these PRs over weeks and months, and the codebase starts to drift.

This isn't a new problem; architectural drift predates AI by decades. But when GitHub's Octoverse report shows 82 million monthly code pushes with 41% being AI-assisted, the drift is happening at a pace that human review can't realistically police.

Tools like CodeScene, which combine code metrics with git history to map knowledge distribution and predict defect risk, come the closest to catching this, but they're lagging indicators. They tell you the architecture has drifted after it already happened, not before the PR gets merged.

Where This Is Going

The industry is starting to converge on something that looks like a layered quality stack, though nobody has given it a clean name yet.

Layer 1 is the old guard: deterministic static analysis. SonarQube, CodeQL, Semgrep. Rule-based, predictable, low false-positive rates. SonarQube's 6,500+ rules across 35 languages aren't glamorous, but they catch known patterns more reliably than any AI reviewer. This layer isn't going anywhere, and it shouldn't.

Layer 2 is AI-native code review: tools like CodeRabbit, Qodo, and GitHub Copilot Code Review that use LLMs to catch logic errors, suggest improvements, and flag issues that rule-based tools miss. They complement Layer 1; they don't replace it. AI catches the nuanced stuff and static analysis catches the known stuff, so we need both.

Layer 3 is the newest and most interesting: reasoning-based security scanning. Anthropic's Claude Code Security is the clearest example. Instead of matching against a rule set, it uses advanced models to read codebases the way a human security researcher would — tracing data flow, reading git commit history to find incomplete patches, building hypotheses about how components interact. Anthropic pointed Opus 4.6 at real open-source projects and found over 500 high-severity vulnerabilities that had survived expert review and years of fuzzing.

Then there's the deeper question: can we actually fix the training objective problem, or is it a hard ceiling?

Two things suggest the industry is at least taking it seriously. First, Anthropic put $1.5 million into the Python Software Foundation, targeting supply chain security, automated PyPI malware analysis, and building datasets of known malicious packages, which amounts to investing in safer training data at the source. Second, the message of Veracode's Spring 2026 report was direct: until models are trained on data that prioritizes secure code examples, don't expect security to improve. Speed to working code drives adoption right now, and security has, sadly, never been what sells an AI coding assistant.

What You Should Actually Do

If you're a senior engineer or team lead shipping AI-assisted code today, the practical picture is this: the old metrics still matter, but they're not enough anymore.

Keep tracking cyclomatic complexity trends, static analysis density, vulnerability counts by CWE, and duplication rates. These are your early warning system for the debt that AI-generated code is silently piling up.

Start tracking code ownership distribution — how many people on your team can confidently modify the AI-generated modules? Track PR scope — are AI-generated changes touching more files per PR than human-authored ones? And watch for architectural consistency: are naming conventions, abstraction patterns, and module boundaries holding across AI contributions?
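The PR scope question in particular is cheap to answer once PRs carry some marker of AI assistance. A minimal sketch follows, assuming a hypothetical `ai_assisted` flag (for instance from a PR template checkbox) and invented sample data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PullRequest:
    number: int
    ai_assisted: bool   # assumed label; how you capture it is up to your team
    files_changed: int


def mean_files_changed(prs, ai_assisted: bool) -> float:
    """Average number of files touched per PR for one authorship group."""
    counts = [pr.files_changed for pr in prs if pr.ai_assisted == ai_assisted]
    return mean(counts) if counts else 0.0


# Invented sample; in practice this would come from your Git host's API.
PRS = [
    PullRequest(101, True, 14),
    PullRequest(102, False, 4),
    PullRequest(103, True, 10),
    PullRequest(104, False, 6),
]

ai_mean = mean_files_changed(PRS, ai_assisted=True)
human_mean = mean_files_changed(PRS, ai_assisted=False)
scope_ratio = ai_mean / human_mean

print(f"AI: {ai_mean}, human: {human_mean}, ratio: {scope_ratio:.1f}")
```

A ratio that stays well above 1 over several weeks is a prompt to look closer, not a verdict. Wide PRs are harder to review regardless of who wrote them.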

Push for intent verification tooling as it matures: some way to validate that generated code matches what the developer actually needed, not just what they typed. Push, too, for cumulative architectural drift scores that operate at the system level, not just function by function.

The Bottom Line

We built the five-pillar quality model over twenty years, and it served us well. It assumed human authorship, human intent, and a volume of code that humans could realistically review. All three assumptions have changed.

Readability is cosmetically inverted: it looks better but measures worse. Maintainability is a slow-motion debt crisis accelerated by volume. Security is stuck because the training objective doesn't reward it. And two genuinely new dimensions, intent fidelity and architectural coherence, have appeared with no mature metrics or tooling.

Model scaling won't fix this; the data is clear on that point. What will help is layered defense: deterministic analysis at the base, AI-assisted review in the middle, reasoning-based scanning at the top, and concentrated human judgment where it matters most, on the architectural and intent questions that no tool can answer yet.

Code is being produced faster, so our idea of what makes it good needs to catch up.

