GPT-4o fixes made code 37% MORE vulnerable


This study has one finding I can't stop thinking about.

First finding: 53% of AI-generated code contains at least one security vulnerability. Concerning but not shocking.

Second finding: researchers asked GPT-4o to fix the vulnerabilities through 5 revision rounds. Result: 37% MORE vulnerabilities than the original. It just confidently introduces new problems while "fixing" old ones.

This has implications way beyond security. If AI code gets worse with iterative AI revision, human review is non-negotiable. But reviewers are already overwhelmed. Remember that thread about the coworker who "uses AI to reply to PR review comments"? We're heading toward AI code reviewed by AI: a closed loop with no human verifying anything.

Same pattern with testing. Unit tests are the first thing people ask AI to generate. But if the model doesn't understand business logic well enough to write secure code, does it understand it well enough to write meaningful tests? Multiple teams report AI-generated tests "execute without errors but validate nothing meaningful." Great, it runs. What does it prove though lol
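To make "validates nothing meaningful" concrete, here's a hypothetical sketch (the function and tests are invented for illustration, not from the study) of the difference between a test that merely executes and one that encodes the business rule:

```python
def apply_discount(price: float, percent: float) -> float:
    """Apply a percentage discount to a price."""
    return price * (1 - percent / 100)

def test_discount_runs():
    # The "it runs" test: passes for almost any implementation,
    # including one that returns the wrong number.
    result = apply_discount(100.0, 10.0)
    assert result is not None

def test_discount_math():
    # The meaningful test: pins down the actual business rule.
    assert apply_discount(100.0, 10.0) == 90.0
    assert apply_discount(50.0, 0.0) == 50.0
```

Both tests go green, but only the second one would catch a model that "fixed" the function into `price - percent`. That gap is invisible if no human reads the assertions.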

The counterpoint: the State of Testing 2026 report shows that people who actually use AI tools daily are 4x less concerned about quality. Maybe the answer is AI for generation, human for validation.

But I keep coming back to that 37% number. The model doesn't know what it doesn't know...

Where does your team draw the line on AI code going to production without human review?

Stop babysitting flaky tests

Zerocheck runs E2E tests on every PR with recordings, screenshots, and step traces.

Get a demo