Does AI-Generated Code Need To Be Tested Even More?

AI-powered tools make it easy to program applications. But you must apply the same rigor in testing and QA used for human-written code.

Peter Schrammel

Nov. 24, 23 · Opinion

AI-powered tools for writing code, such as GitHub Copilot, are increasingly popular in software development. These tools promise to boost productivity, but some also claim that they democratize programming by allowing non-programmers to write applications.

But how do we actually know whether the code written by an AI tool is fit for purpose?

In the following, you are going to learn what “fit for purpose” even means and what tools you can use to assess it. We will see that AI-powered tools for writing code cannot guarantee anything regarding the functional correctness and security of the code they suggest. However, we will also see that there are actually AI tools that can support you in answering the above question.

Stepping Back in Time

It’s worthwhile to begin by stepping back in history a bit because this question is not unfamiliar at all: we’ve always had to ask ourselves, "How do we actually know that code written by humans is fit for purpose?" People have been scratching their heads for decades to find solutions for this fundamental problem in software engineering.

From the earliest days of programmable computer systems, engineers had nasty surprises when programs did not do what they intended. Back then, trial and error cycles to get programs right were very expensive. Low-level code needed to be handcrafted and punched into cards. The main countermeasure to unnecessary cycles was code review. Code review means that an expert reads the code and tries to understand what it does in order to flag mistakes and suggest improvements - a successful technique that continues to be widely practiced today. However, as programs grow in size and complexity, the effectiveness of reviews drops dramatically while the cost of conducting them thoroughly soars.

Soon the question came up of how to actually tell in a more rigorous way whether a program is doing what it is supposed to be doing. The challenge is how to express what the program is supposed to be doing. Somehow it needs to be communicated to the machine what the user actually wants. This is a highly challenging problem that is still waiting to be fully solved today.

The problem is common to product engineering across all disciplines and is broken up into two steps, which are usually formulated as the questions:

Have we built the right product?
Have we built the product right?

Validation and Verification

Assessing whether the right product has been built is known as validation. Ultimately users validate whether the product fulfills its intended purpose. Verification starts from a requirements specification, which serves as a means of communication between the builders of the product and the user or customer. Specifications are supposed to be understood by both sides: the user (a domain expert) and the engineer (who may not be a domain expert). Verification amounts to assessing whether the implementation of the product conforms to the requirements specification. Validation is clearly the harder problem. What is a “right” product is highly subjective and hard to automate.

The good news is that the verification part can, in principle, be fully automated, up to computational and complexity-theoretical limits. Simply put, it can be mathematically proven that an implementation satisfies the specification. This discipline is known as formal methods or formal verification and relies on logic-based formalisms to write specifications and automated reasoning to perform the proofs.

As promising as this sounds, the main problem is, again, who writes the specifications. It requires someone who is a domain expert and an expert in writing formal specifications - such people are hard to find and very expensive. Even when you have found such a person and you have succeeded in verifying your implementation, you are still left with the validation part of the problem; i.e., whether the specification actually describes the right product from the user’s point of view.

It has been commonly observed that specifications are often “more wrong” than the implementation because it is extremely difficult to get formal specifications right. Scaling automated reasoning to large systems also remains a huge challenge. In practice, formal verification has found its place for comparatively small, complex, and critical (safety, security, financial) software, from embedded control, cryptography, and operating system kernels to smart contracts.

A Different Perspective

In the 1970s there was another idea to approach the problem from a different angle, called N-version programming. The basic idea is that since it’s so hard to get programs and even their specifications right, let multiple independent teams implement the system and then vote on the output. The underlying assumption is that different teams make different mistakes; they may also have different interpretations of the requirements specification. So, on the whole, the result is expected to be “more correct” than a single implementation in terms of verification and maybe even validation. However, the assumption turned out not to hold: later research showed that even independent teams tend to make the same mistakes, so their versions fail on the same inputs and the vote does not catch the error.
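To make the voting idea and its weakness concrete, here is a minimal sketch in Python. The "teams" and their absolute-value functions are hypothetical stand-ins invented for illustration; a real N-version system would vote over far more complex outputs.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError(f"no majority among outputs {outputs}")
    return value


# Hypothetical implementations of absolute value by different teams.
def abs_team_a(x: int) -> int:
    return x if x >= 0 else -x


def abs_team_b(x: int) -> int:
    return max(x, -x)


def abs_team_c(x: int) -> int:
    # Bug: negative inputs are returned unchanged.
    return x


def abs_team_d(x: int) -> int:
    # The same bug, made independently by another team.
    return x


# One faulty version is outvoted by two correct ones.
print(majority_vote([abs_team_a, abs_team_b, abs_team_c], -5))  # 5

# Two versions sharing a mistake outvote the correct one.
print(majority_vote([abs_team_a, abs_team_c, abs_team_d], -5))  # -5
```

The second call shows why the scheme depends on errors being uncorrelated: once a majority shares the same mistake, voting confirms the wrong answer instead of masking it.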

A legacy of this approach is that verification can be viewed as 2-version programming: requirements specification and implementation are two sides of the same coin. They describe the same system in different ways using different formalisms and points of view. Also, they are often written by different people or teams. Neither is the specification in any sense “more correct” than the implementation, nor vice versa. This way of thinking can guide us to a realistic view of what can be achieved in practice.

So, why do we even bother having both specifications and implementations? The benefit comes from 2-version programming: comparing two descriptions of the same system allows us to gain confidence where they agree with each other and find bugs where they disagree, enabling us to reflect on both descriptions and ultimately arrive at “more correct” descriptions.

Testing Is a Verification Technique

Now, some readers may interject: We don’t have specifications, so we can’t do that. How does this concern us? You may not have a solid requirements specification, but you may actually test your software. True, we haven’t talked about testing yet. What is testing actually doing in the context of our discussion?

Testing is a verification technique. Tests check your implementation and the assertions in your tests are - you guessed it - the specification. How often has it happened to you that the bug was not in the implementation, but in the tests? This is not surprising, since testing is just 2-version programming. So, having a good testing practice with end-to-end tests for the high-level requirements and thorough unit testing for the low-level ones indeed increases confidence in delivering the right product. And what about model-based engineering? Yes, just 2-version programming with the same properties.
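The view of tests as a second version can be sketched in a few lines of Python. The leap-year function and the table of expected values are hypothetical examples; the point is only that a disagreement tells you that one of the two descriptions is wrong, not which one.

```python
from typing import Dict, List


def is_leap_year(year: int) -> bool:
    """Version 1: the implementation (Gregorian leap-year rule)."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def disagreements(expected: Dict[int, bool]) -> List[int]:
    """Return the inputs where implementation and test expectations differ."""
    return [year for year, want in expected.items() if is_leap_year(year) != want]


# Version 2: the expectations a test author wrote down.
# The entry for 1900 is deliberately wrong (1900 was not a leap year).
test_expectations = {2024: True, 2023: False, 2000: True, 1900: True}

print(disagreements(test_expectations))  # [1900]
```

Here the implementation is right and the test is wrong, yet a failing test run looks exactly the same as it would for an implementation bug. Resolving the disagreement requires someone who knows the domain to decide which description to fix.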

So Do We Need To Test AI-Written Code More?

Assessing whether an application written by an AI tool is actually fit for purpose is as difficult as doing the same for human-written code – although generative AI code creation tools absolutely require review by the developer. The hard part is answering the validation question: the person doing this assessment must be an expert in the application domain.

A tool for automatically writing tests helps you in this process, as the tests that it creates give you a behavioral input-output view of the code. For example, I am a co-author of an automated test-writing solution called Diffblue Cover. It doesn’t make guesses - it bluntly tells you what the code is actually doing and thus helps you assess the validation and verification questions. The tests serve as a baseline for regression testing when you make changes to code going forward - independently of whether it was written by a human or a machine.
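The idea of a behavioral baseline can be illustrated with a small hand-written sketch in Python. It is not how Diffblue Cover works internally, and the shipping-cost functions and helper names are hypothetical; it only shows what "record what the code actually does, then detect changes" means.

```python
from typing import Callable, Dict, Iterable, Tuple


def record_baseline(func: Callable[[int], int], inputs: Iterable[int]) -> Dict[int, int]:
    """Record the observed output for each input, without judging correctness."""
    return {x: func(x) for x in inputs}


def regressions(func: Callable[[int], int], baseline: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Map each input whose output changed to (recorded output, new output)."""
    changed = {}
    for x, recorded in baseline.items():
        actual = func(x)
        if actual != recorded:
            changed[x] = (recorded, actual)
    return changed


def shipping_cost(weight_kg: int) -> int:
    """Hypothetical code under test: flat fee plus per-kilogram rate, in cents."""
    return 500 + 120 * weight_kg


def shipping_cost_changed(weight_kg: int) -> int:
    """A later edit that silently changes the result for zero weight."""
    if weight_kg <= 0:
        return 0
    return 500 + 120 * weight_kg


baseline = record_baseline(shipping_cost, [0, 1, 10])
print(baseline)                                      # {0: 500, 1: 620, 10: 1700}
print(regressions(shipping_cost_changed, baseline))  # {0: (500, 0)}
```

The recorded values describe behavior, not intent. Whether a shipment of zero kilograms should cost 500 or 0 is a validation question that a domain expert still has to answer when reading the baseline.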

The chances are quite good that non-programmers who are experts in their application domain will benefit from AI tools for programming applications. However, they need to be aware that implementations written by tools such as GitHub Copilot are not correct by construction - functional correctness and security properties are not built into the underlying models they use. Even if the models are trained on “correct” and secure code (and we now know how little “correct” means, since no enterprise developer would push AI-generated code to production without first reviewing it carefully), these properties are not preserved by training and model evaluation.

AI to Rescue AI

AI-powered tools make it easy to program applications that do something. However, the application code must be expected to suffer from similar issues as human-written code. Thus, you need to employ the same rigor in testing and QA that is used for human-written code. Confidence in these processes can be increased in combination with an AI tool for automatically writing tests that can guarantee the tests it produces describe the actual behavior of the code.

Published at DZone with permission of Peter Schrammel. See the original article here.

Opinions expressed by DZone contributors are their own.