GitHub's AI coding assistant Copilot has now been available for two years. GitHub has conducted an extensive study of code quality which shows that code written with AI assistance passes more unit tests and contains fewer errors. Reviewers also rated it as more readable and reliable.
GitHub tasked 202 Python developers, each with at least five years of experience, with writing an API endpoint for a web server, specifically for a restaurant rating system. Of these, 104 were allowed to use GitHub Copilot, while 98 worked without AI assistance. All submissions were run through ten unit tests to check functionality, and the results were clear: 60.8% of the programs written with Copilot passed all ten tests, compared to only 39.2% of those written without AI help.
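The study does not publish the actual task code, so the following is only a minimal sketch of what such an endpoint and one of its unit tests might look like; the FastAPI framework, the routes, and the data model are assumptions made for illustration:

```python
# Purely illustrative: the study's real task and tests are not public.
from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient
from pydantic import BaseModel, Field

app = FastAPI()
ratings: dict[str, list[int]] = {}  # restaurant name -> submitted star ratings

class Rating(BaseModel):
    restaurant: str
    stars: int = Field(ge=1, le=5)  # reject ratings outside 1-5 stars

@app.post("/ratings")
def add_rating(rating: Rating):
    ratings.setdefault(rating.restaurant, []).append(rating.stars)
    return {"restaurant": rating.restaurant, "count": len(ratings[rating.restaurant])}

@app.get("/ratings/{restaurant}")
def average_rating(restaurant: str):
    if restaurant not in ratings:
        raise HTTPException(status_code=404, detail="unknown restaurant")
    stars = ratings[restaurant]
    return {"restaurant": restaurant, "average": sum(stars) / len(stars)}

def test_average_is_computed_over_all_ratings():
    client = TestClient(app)
    client.post("/ratings", json={"restaurant": "Luigi", "stars": 4})
    client.post("/ratings", json={"restaurant": "Luigi", "stars": 2})
    response = client.get("/ratings/Luigi")
    assert response.status_code == 200
    assert response.json()["average"] == 3.0
```

In the study, functional correctness was judged purely by such tests passing; code quality was assessed separately in a second stage.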
Twenty-five developers whose code had passed all ten tests were then selected to conduct blind, anonymized reviews of the programs, with each program being reviewed ten times. The errors counted at this stage were not functional defects but qualitative issues of consistency and readability: inconsistent naming, unclear identifiers, overly long lines, deeply nested loops, missing comments, repeated expressions (violating "don't repeat yourself", DRY), or poorly divided functions.
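The reviewed code is likewise not public, but a hypothetical before-and-after pair, with invented names, shows the kind of issue a reviewer would have flagged:

```python
# Hypothetical example of the style issues listed above; not from the study.

# Flagged version: unclear identifiers and a repeated expression (DRY violation).
def f(d):
    if sum(d["s"]) / len(d["s"]) >= 4:
        return "good: " + str(sum(d["s"]) / len(d["s"]))
    return "bad: " + str(sum(d["s"]) / len(d["s"]))

# Revised version: descriptive names, and the average is computed exactly once.
def describe_rating(restaurant: dict) -> str:
    average = sum(restaurant["stars"]) / len(restaurant["stars"])
    verdict = "good" if average >= 4 else "bad"
    return f"{verdict}: {average}"
```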
Here, too, the AI performed well, though not as decisively. On average, reviewers found 4.63 errors in programs written with AI help and 5.35 in those written without. The number of code lines per error was also higher with AI assistance: 18.2 lines per error, compared to 16 without. This contradicts older studies that feared AI would produce bloated code and violations of the DRY principle.
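As a quick check of that metric using the averages reported above (a minimal sketch; the percentage is derived from the article's numbers, not quoted from the study):

```python
# Lines of code per review error, as reported in the study.
lines_per_error_with_ai = 18.2
lines_per_error_without_ai = 16.0

# Relative advantage of the AI-assisted programs: roughly 13.7% more
# lines of code per error found in review.
advantage = lines_per_error_with_ai / lines_per_error_without_ai - 1
print(f"{advantage:.1%}")  # prints 13.7%
```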
AI-supported programs thus also contained fewer faulty lines of code, and other recent studies support the assumption that AI improves quality. Reviewers were additionally asked for softer judgments on how readable, reliable, maintainable, and concise the code was. Here, Copilot showed an advantage of 3 to 5 percent, although the study does not clearly explain how these values were determined.
The study also examined the participants' commits, which were more frequent and smaller with AI assistance. The authors conclude: “Our hypothesis is that because developers need less time to make the code functional, it allows them to focus more on refining the quality.”