Critical Thinking In The Age Of AI-Generated Code
Learn how AI-generated code impacts maintainability, productivity, and quality. Explore code reviewing, testing strategies, and best practices for long-term success.
One of the first rules I learned about writing code is that you have to understand it line by line. Verify and validate instead of assuming and hoping for the best. Nowadays, the use of AI code assistants tends to make software developers forget this important rule.
One of the biggest misconceptions about AI code assistants is that they simply “speed up” development. While they do increase the volume of code written, this doesn’t necessarily translate into better software. The more important question is: Does AI-generated code improve long-term maintainability?
Code assistants took on an unprecedented role in software development in 2024. According to Stack Overflow’s 2024 Developer Survey, 63% of Professional Developers reported using AI in their workflow, with another 14% planning to adopt it soon. Developers overwhelmingly cited “increased productivity” as the main benefit — but does that productivity translate to better software?
As research published in February 2025 suggests, the answer is more complicated. While AI-assisted coding increases the total lines of code written, the key to long-term velocity isn’t just writing more code — it’s writing better, maintainable code.
The industry has long emphasized principles like DRY (Don’t Repeat Yourself) and modular design, which reduce duplication and improve system integrity. However, data from 211 million changed lines of code between 2020 and 2024 shows signs of eroding code quality. More specifically, key findings from the report include the following:
- AI-assisted coding increased the volume of code written but at the cost of code quality and maintainability.
- Copy-pasted code surged — 2024 was the first year where duplicated lines exceeded refactored (“moved”) lines. Developers refactored less — refactored code dropped by 39.9% YoY, while duplicated blocks grew 10x in two years.
- Defect rates increased, as confirmed by Google’s 2024 DORA report, which found a 7.2% drop in delivery stability for every 25% AI adoption increase.
- Short-term gains, long-term costs — teams focused more on shipping features rather than maintaining modular, reusable code, leading to higher churn and tech debt.
This article gives a sample of critical thinking activities that we can perform today if we want our AI-generated code to remain correct and maintainable. I will focus on two critical thinking activities: code reviewing and testing.
Code Reviewing
Besides understanding our code, reviewing AI-generated code is an invaluable skill nowadays. Tools like GitHub Copilot and DeepCode can code-review better than a junior software developer. Depending on the complexity of the codebase, they can save us time in code reviewing and pinpoint cases that we may have missed, but they are not flawless. We still need to verify that the AI assistant's code review did not produce any false positives or false negatives, that it did not miss anything important, and that the assistant got the context right. The hybrid approach seems to be the most effective one: let AI handle the grunt work and rely on developers for the critical analysis.
One way or another, practical, hands-on code reviewing is still essential for verification. For as long as this necessity for human verification holds true, I believe it's a good reason why large language models (LLMs) will not render software professionals obsolete. Here are a few benefits that human code reviews bring to AI-generated code.
1. Guard Against Deception
LLM-generated code may present a polished facade with well-chosen identifiers, seemingly helpful comments, explicit type declarations, and a coherent structure. This can create a misleading impression of reliability. This "looks correct but behaves incorrectly" problem can be alleviated by human reviewers. Our reviews can check for logical correctness, edge cases, and unintended consequences. As an example, a human eye is needed to check that AI-generated code meets our actual requirements.
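As a minimal, hypothetical sketch (the function name and scenario are invented for illustration), the first version below has all the hallmarks of polished AI output yet crashes on an empty list; the second is what a human review might turn it into:

```python
def calculate_average_order_value(order_totals: list[float]) -> float:
    """Return the average value of the given orders."""
    # Looks polished: clear name, type hints, docstring. Yet it raises
    # ZeroDivisionError for an empty list, an edge case the tidy surface hides.
    return sum(order_totals) / len(order_totals)


def calculate_average_order_value_reviewed(order_totals: list[float]) -> float:
    """Return the average order value, or 0.0 when there are no orders."""
    if not order_totals:  # the edge case made explicit during review
        return 0.0
    return sum(order_totals) / len(order_totals)
```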
2. Guard Against Duplication and Decay
Code reviews can uncover duplicated code and enforce code reuse. Refactoring for maintainability stands as a key pillar of the code review process. It's an indispensable tool in the reviewer's arsenal, ensuring long-term code health. With copy-pasted code surging, according to research reports, this part of the code review is not just useful but absolutely necessary.
3. Securing the Stack
Code reviews can catch security vulnerabilities by identifying insecure coding practices that AI-generated or human-written code might introduce. Common vulnerabilities include SQL injection (unsanitized database queries), hardcoded credentials (exposing secrets in code), insecure authentication (weak password handling or missing authorization checks), cross-site scripting (XSS) (improper input sanitization in web apps), and insecure API exposure (leaking sensitive data in API responses).
Automated SAST tools (e.g., Semgrep, Checkmarx) are indispensable, but human oversight remains essential. A strong security-focused code review prevents exploitable vulnerabilities from reaching production and strengthens overall system resilience. The OWASP Top 10 for LLM Applications may be worth checking.
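As a minimal sketch of the kind of issue such a review should flag, assuming a simple SQLite-backed users table (the table and column names are invented for illustration):

```python
import sqlite3


def find_user_insecure(conn: sqlite3.Connection, username: str):
    # A pattern code assistants can produce: user input concatenated into SQL.
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchone()  # vulnerable to SQL injection


def find_user_secure(conn: sqlite3.Connection, username: str):
    # Reviewed version: a parameterized query keeps data separate from the SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    ).fetchone()
```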
4. A Learning Opportunity
After all, code reviewing AI-generated code is an excellent opportunity to educate ourselves while improving our code-reviewing skills. Keep in mind that, to date, AI-generated code optimizes for patterns in its training data. This may not be aligned with coding first principles.
AI-generated code may follow templated solutions rather than custom designs. It may include unnecessary defensive code or overly generic implementations. We need to check that it has chosen the most appropriate solution for each code block generated. Another common problem is that LLMs may hallucinate: they may reference libraries, functions, or helper methods that simply don't exist.
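For instance, the commented-out call below is deliberately fictitious; `weighted_median` does not exist in Python's `statistics` module, which is exactly the kind of claim a review must verify against the real documentation:

```python
import statistics

data = [3, 1, 4, 1, 5]

# A hallucinated call: the module is real, but this function does not exist,
# so the line would fail with an AttributeError at runtime.
# result = statistics.weighted_median(data, weights=[1, 1, 2, 1, 1])

# Verified alternative that uses a function the module actually provides.
result = statistics.median(data)
print(result)  # 3
```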
We must avoid treating AI-generated code as inherently superior or inferior. Who knows what AI-generated code will be like tomorrow? But until then, code-reviewing AI-generated code today keeps us in touch with its current state whilst improving our code-reviewing skills.
Testing For AI-Generated Code
Any type of testing at any testing level can be appropriate for AI-generated code. This article, however, will focus on the testing activities that can address the worrying findings mentioned in the introduction.
Copy-pasted code can be problematic for many reasons. Here are some of them (a short refactoring sketch follows the list):
- Contextual errors. Code that works perfectly in one part of the application might fail in another due to differences in context, such as varying data inputs or environmental factors.
- Hidden dependencies. Copied code might rely on variables, functions, or external resources that are present in the original location but missing in the new location, causing runtime errors.
- Maintenance nightmares. When changes are needed, developers have to hunt down and modify every instance of the copied code. Small changes to the copied code, to try and make it fit the new location, can easily introduce logic errors. This increases the risk of errors and makes maintenance difficult.
- Unpredictability. A bug is fixed or a new feature is added in one instance of the copied code. The same changes, however, are not applied to other instances of the copied code. This leads to inconsistencies and unpredictable behavior.
- Multiplying problems. If the copied code has a security vulnerability or a performance issue, then we are potentially multiplying such problems.
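A minimal sketch of the remedy, with invented function names: the duplicated validation below is exactly what reviews and static analysis should push into a single shared helper, so that a future fix lands in one place.

```python
# Duplicated logic, as it often appears when a snippet is pasted into several handlers.
def register_user(email: str) -> None:
    if "@" not in email or email != email.strip():
        raise ValueError("invalid email")
    # ... registration logic ...


def invite_user(email: str) -> None:
    if "@" not in email or email != email.strip():
        raise ValueError("invalid email")
    # ... invitation logic ...


# Refactored: the validation lives in one shared helper that both callers reuse.
def validate_email(email: str) -> None:
    if "@" not in email or email != email.strip():
        raise ValueError("invalid email")


def register_user_refactored(email: str) -> None:
    validate_email(email)
    # ... registration logic ...


def invite_user_refactored(email: str) -> None:
    validate_email(email)
    # ... invitation logic ...
```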
Static analysis tools (e.g., SonarQube, ESLint) can detect duplicate code, security vulnerabilities, and unused functions. Mutation testing and fuzz testing can help in the case of superficial tests that pass most of the time but fail under certain conditions.
Exploratory testing will uncover unexpected behaviors, usability issues, and logic gaps. AI-generated code may introduce inefficiencies, causing slowdowns and resource overuse that can be identified by performance testing. When the number of bugs increases to the point that our delivery stability is jeopardised, regression testing is necessary. Security testing can identify potential security vulnerabilities.
Testing to Catch Bugs Early
A stitch in time saves nine. This is true for copy-pasted code, too. Having conducted code reviews focusing on potential AI-introduced vulnerabilities and duplication, we can also perform the following testing activities:
- Static analysis. It detects duplicated code, unused functions, unreachable branches, and basic security flaws. We can also integrate it in continuous integration to fail builds when thresholds (e.g., >5% duplication) are exceeded.
- Security static testing. We can target, for example, hardcoded credentials, insecure API use and insufficient input sanitization. This can be especially vital when LLMs tend to reuse insecure patterns from training data.
- Unit testing enforcement. It's generally a good idea to require tests to be submitted with any AI-generated pull request. We could also use a test coverage gate (e.g., 90%+) to catch coverage gaps introduced by the generated code. A minimal example follows this list.
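As a minimal sketch of such enforcement in practice, assuming a hypothetical AI-generated `apply_discount` function and PyTest (the coverage gate itself would be wired into CI, e.g., via a coverage plugin):

```python
# test_discounts.py: tests we might require alongside an AI-generated change;
# the function and the business rules are illustrative only.
import pytest


def apply_discount(price: float, percent: float) -> float:
    """Return the discounted price; percent must be between 0 and 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return round(price * (1 - percent / 100), 2)


@pytest.mark.parametrize(
    "price, percent, expected",
    [(100.0, 0, 100.0), (100.0, 25, 75.0), (19.99, 100, 0.0)],
)
def test_apply_discount_happy_path(price, percent, expected):
    assert apply_discount(price, percent) == expected


def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```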
Testing to Get Feedback
Good test coverage is only one part of the story. We have to do our best to verify that we've tested edge cases and real end-user workflows. We need a way to guide ourselves on what to test next and how. Valuable feedback can be obtained from the test results of activities like:
- Mutation testing. We can mutate logic (e.g., flip conditionals) to verify that unit tests truly validate behavior. This can expose AI-generated tests that are superficial or copy-pasted.
- Fuzzing. This testing activity injects malformed, unexpected, or random inputs. It is vital for catching brittle input handling or missing validation (common in LLM code). A minimal property-based sketch follows this list.
- Exploratory testing. The good old manual testing from an experienced QA engineer. Guided sessions can be used based on experience, intuition, and feedback from other stakeholders. Misaligned assumptions in AI code, incomplete edge case handling, unexpected UX flows or state transitions, and usability issues are a sample of the areas to be covered.
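A minimal property-based sketch in the spirit of fuzzing, using the Hypothesis library (`pip install hypothesis`); the `parse_amount` function is a hypothetical target of the kind an assistant might generate:

```python
from hypothesis import given, strategies as st


def parse_amount(raw: str) -> float:
    """Parse a user-supplied amount such as ' 12.50 ' or '1,250' into a float."""
    cleaned = raw.strip().replace(",", "")
    if not cleaned:
        raise ValueError("empty amount")
    return float(cleaned)


@given(st.text())
def test_parse_amount_never_fails_unexpectedly(raw):
    # Random text should either parse or raise ValueError, nothing else;
    # any other exception reveals brittle input handling.
    try:
        parse_amount(raw)
    except ValueError:
        pass
```

If Hypothesis finds an input that triggers anything other than ValueError, it shrinks and reports it, which is exactly the kind of feedback we want about brittle generated code.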
Testing to Find More Bugs
Once we've found where to focus, we can dive deeper to find more bugs. This is where we can address increasing defect rates. The DORA report indicates that the increased use of AI in code generation correlates with a decrease in delivery stability. This means more bugs are likely to slip into production. Regression testing directly counters this by acting as a safety net after any code change, especially those introduced by AI.
While AI might aim to improve or add functionality, there's a risk that its output could unintentionally disrupt existing parts of the system. Regression testing helps us verify that these new AI-driven changes haven't broken anything that was working before.
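A minimal, self-contained sketch of such a safety net with PyTest; the session logic below is an invented stand-in for an existing module that AI-generated changes might touch (the `regression` marker would be registered in `pytest.ini`):

```python
import time

import pytest

_SESSIONS: dict[str, float] = {}


def create_session(user_id: int, ttl_seconds: int = 3600) -> str:
    """Stand-in for existing, already-working session code."""
    token = f"user-{user_id}-{time.monotonic()}"
    _SESSIONS[token] = time.monotonic() + ttl_seconds
    return token


def is_session_valid(token: str) -> bool:
    return _SESSIONS.get(token, 0.0) > time.monotonic()


@pytest.mark.regression
def test_fresh_session_is_still_valid():
    assert is_session_valid(create_session(user_id=42))


@pytest.mark.regression
def test_expired_session_is_still_rejected():
    assert not is_session_valid(create_session(user_id=42, ttl_seconds=0))
```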
A well-designed regression test suite includes tests that cover all the critical functionalities of the application. Also, it specifically targets areas that are likely to be affected by the AI's modifications. For example, if AI has generated a new user authentication module, regression tests would not only verify that the new module works correctly but also re-test the existing user profile management and session handling. Apart from regression testing, other testing activities could include:
- Performance testing. This is a testing activity that usually starts early in the development phase. AI-generated code often takes the “obvious” path — not the optimal one. For example, it might sort data multiple times, make unnecessary API calls, or use memory-heavy operations. If slow database queries or inefficient loops are baked into the design, they’re much harder to refactor later. Especially in APIs, microservices, and front-end-heavy apps, small slowdowns can add up, hurting UX and scalability. Catching performance inefficiencies early saves engineering hours later, keeps systems scalable, and ensures that the team isn’t slowed down. So, start performance testing early and measure continuously. A small timing sketch follows this list.
- Security testing. This is an activity that simulates real-world attacks. It doesn’t just read code; it runs the application and tests how it behaves under attack conditions. Tools like OWASP ZAP and Burp Suite intercept and modify traffic between client and server. They can fuzz inputs to test for vulnerabilities, crawl and map the app’s attack surface, and attempt unauthorized access, code injection, data exfiltration, and more.
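A minimal sketch of the kind of inefficiency worth timing early; the "top three" task and the numbers are contrived, but the pattern of repeating expensive work is common in generated code:

```python
import timeit

data = list(range(50_000, 0, -1))


def top_three_naive(values):
    # A pattern an assistant might produce: the full sort is repeated per item.
    return [sorted(values, reverse=True)[i] for i in range(3)]


def top_three_efficient(values):
    # Sort once (or use heapq.nlargest) and slice.
    return sorted(values, reverse=True)[:3]


assert top_three_naive(data) == top_three_efficient(data)
print("naive:    ", timeit.timeit(lambda: top_three_naive(data), number=20))
print("efficient:", timeit.timeit(lambda: top_three_efficient(data), number=20))
```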
To summarise, the following testing matrix for AI-generated code may be worth exploring.
| Testing Scope | Test Type | What It Covers | Risk Areas Addressed | Tooling Examples |
|---|---|---|---|---|
| Catch Bugs Early | Static Code Analysis | Duplicated code, unused variables/functions, basic security issues | Maintainability, code bloat, shallow logic | SonarQube, ESLint, Semgrep |
| Catch Bugs Early | Unit Testing | Functional correctness of individual components | Logic bugs, broken branches, and edge cases | PyTest, JUnit, Jest |
| Catch Bugs Early | Security Linting | Hardcoded secrets, insecure dependencies, unsafe libraries | Vulnerabilities in AI-suggested patterns | Bandit, Semgrep, Trivy |
| Get Feedback | Mutation Testing | Validates strength of existing unit tests | Superficial tests from AI, missed logic paths | Stryker, Pitest |
| Get Feedback | Fuzz Testing | Random/malformed input validation | Input handling, validation errors | Jazzer, AFL, Fuzzit |
| Get Feedback | Exploratory Testing | Dynamic interaction with the app for UX, logic, and flow issues | Unexpected UI behavior, usability flaws | Manual / session-based |
| Find More Bugs | Performance Testing | Response time, memory/CPU usage, throughput | Inefficiencies in LLM code, bottlenecks | JMeter, k6, Locust |
| Find More Bugs | Regression Testing | Verifies unchanged behavior after new code, improvements, or bug fixes | Delivery stability, accidental breakage | Selenium, Cypress, Robot Framework |
| Find More Bugs | Security Testing (DAST) | Simulated attacks, auth bypass, injection flaws | Critical vulnerabilities from auto-generated code | OWASP ZAP, Burp Suite, Nikto |
Wrapping Up
AI assistance is a reality for writing code and testing. As reports indicate an increase in duplicated code and bugs, we presented a list of testing activities to catch bugs early, get feedback on what to test next, and find more bugs. The testing activities presented are by no means exhaustive, but I hope that they can provide a framework to expand upon.
As AI models continue to evolve and train on more data, their capabilities will probably improve. But for now, they are just a powerful ally, not a standalone solution. In the age of LLMs, the real skill isn’t just writing code — it’s understanding, thinking critically, and testing thoroughly so that it works.