AI Found a Bug Hidden for 23 Years. Now What?
How AI coding agents went from finding trivial bugs to uncovering critical kernel vulnerabilities overnight, and what Cursor 3 tells us about where development tools are heading.
Tob
Backend Developer
Picture this. You hand an AI agent the entire Linux kernel source code. You tell it to find security vulnerabilities. You go make coffee.
When you come back, it has found multiple remotely exploitable heap buffer overflows, including one that has been sitting in the NFS driver for 23 years.
This is not a thought experiment. Nicholas Carlini, a research scientist at Anthropic, reported this at the [un]prompted AI security conference in March 2026. He described using Claude Code essentially out of the box, pointing it at the kernel and asking where the bugs were. The results were not theoretical, low-severity issues. They were remotely exploitable vulnerabilities in one of the most heavily audited codebases on the planet.
TL;DR: AI coding agents have crossed a threshold. They went from producing obvious noise to finding critical vulnerabilities that human researchers missed for decades. The security research landscape is shifting faster than anyone expected, and AI coding tools like Cursor 3 are making these capabilities accessible to everyday developers.
The 23-Year-Old Bug
Carlini was surprised by how little engineering it took. He used a simple script that looped through every file in the kernel source tree, fed each file to Claude Code with a capture-the-flag prompt, and collected the results. The script told Claude Code it was playing a hacking competition and to find the most serious vulnerability in the given file.
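The loop Carlini describes can be sketched in a few lines of Python. This is a hypothetical reconstruction, not his actual script: the CTF prompt wording, the `*.c` file filter, and the assumption that the Claude Code CLI is on `PATH` as `claude` (its `-p` flag prints a single response and exits) are all mine.

```python
import subprocess
from pathlib import Path

# Hypothetical capture-the-flag framing, loosely paraphrasing the
# prompt described in the talk (exact wording is an assumption).
CTF_PROMPT = (
    "You are competing in a capture-the-flag hacking competition. "
    "Find the most serious security vulnerability in this file:\n\n{source}"
)

def build_prompt(path: Path) -> str:
    """Wrap one source file in the capture-the-flag framing."""
    return CTF_PROMPT.format(source=path.read_text(errors="replace"))

def scan_tree(root: Path, pattern: str = "*.c") -> dict[str, str]:
    """Loop over every matching file and collect one report per file."""
    reports = {}
    for path in sorted(root.rglob(pattern)):
        # Headless invocation of the Claude Code CLI: `-p` prints the
        # model's response and exits (assumes `claude` is installed).
        result = subprocess.run(
            ["claude", "-p", build_prompt(path)],
            capture_output=True, text=True,
        )
        reports[str(path)] = result.stdout
    return reports
```

The point of the sketch is how little machinery is involved: no static analysis, no fuzzing harness, just a file walk and a prompt. Running it over an entire kernel tree is slow and expensive, but the agent, as Carlini notes, never gets bored.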
One of the bugs it found lived in the NFS (Network File System) driver. The attack is technical but the impact is stark. An attacker can spin up two cooperating NFS clients and manipulate lock owners in a way that forces the server to encode a response larger than its buffer, leaking kernel memory across the network. Any machine running an NFS server is potentially exposed.
What makes this remarkable is not just the severity. It is the specificity. The bug required understanding intricate NFS protocol behavior. That kind of vulnerability does not get found by scanning for common antipatterns. The model had to reason about protocol state machines, edge cases in the lock handshake, and buffer allocation decisions across thousands of lines of kernel code.
Why This Is Happening Now
Thomas Ptacek published a detailed analysis this week titled "Vulnerability Research Is Cooked." His argument cuts to the core of why LLMs are so effective at this problem.
Exploitation research is, at its foundation, a search problem. You are looking for bug classes (stale pointers, integer mishandling, type confusion) and solving constraints for reachability and exploitability. You are testing success/failure cases against a target. You need to understand how vast bodies of code connect.
This is precisely what LLMs are built for. Before you provide a single token of context, a frontier model already encodes correlations across enormous bodies of source code. It knows how the Linux KVM hypervisor connects to the hrtimer subsystem. It knows the documented library of bug classes that underpin all modern exploit development. It never gets bored and will search forever if you tell it to.
The pattern-matching + brute-force combination is devastating. The model has the knowledge. The agent has the persistence. Together they represent something fundamentally new in the vulnerability research toolkit.
The Linux Kernel Maintainers Are Seeing It
The shift is not theoretical. Greg Kroah-Hartman, Linux kernel maintainer, described the change in a recent interview. Months ago, they were getting what they called AI slop: generated security reports that were obviously wrong. It was almost funny.
Then, about a month ago, the quality jumped. Now they receive real, correct vulnerability reports generated by AI tools on a daily basis. So many that they have had to bring in additional maintainers to handle the volume.
The numbers tell the story. Two years ago, the kernel security list received 2 to 3 reports per week. Over the past year that rose to around 10 per week. Since the start of 2026, it is between 5 and 10 per day. Friday and Tuesday are the worst.
Willy Tarreau, lead software developer of HAProxy, confirmed the same pattern on the kernel security list. They are now seeing duplicate reports, the same bug found by different people using different AI tools. This is not a trickle. It is a flood.
Cursor 3: AI Agents Go Parallel
While security researchers grapple with what this means for vulnerability disclosure, the tooling around AI-assisted development is evolving fast.
Cursor released version 3 this week with a major new interface called the Agents Window. The core idea is simple but the implications are significant. You can now run many AI agents in parallel across repos and environments: locally, in worktrees, in the cloud, and on remote SSH machines.
The design mode lets you annotate and target UI elements directly in the browser, giving agents precise visual context. Agent tabs in the editor let you view multiple chats side by side or in a grid. Composer 2 delivers frontier-level coding performance at standard and fast pricing tiers.
For enterprise teams, self-hosted cloud agents keep code and tool execution entirely within your network. Your codebase, build outputs, and secrets stay on internal infrastructure while the agents run on machines you control. This is a direct response to the security concerns that have held back wider AI agent adoption in corporate environments.
Cursor also introduced automations: always-on agents that run on schedules or trigger from Slack, Linear, GitHub, PagerDuty, and webhooks. If this sounds familiar, it should. The line between an AI coding assistant and an autonomous software engineer is getting blurry fast.
What This Means Going Forward
The immediate implication is that the economics of vulnerability research are changing. Finding bugs has always required specialized expertise, significant time, and expensive manual effort. AI agents collapse all three dimensions simultaneously.
Ptacek predicts that within months, the standard approach to vulnerability research will be pointing an agent at a source tree and typing "find me zero days." That is a stark reframing of what security research looks like.
For developers, the lesson is more immediate. The same capabilities that found a 23-year-old kernel bug are available in tools like Cursor, Claude Code, and their competitors. Whether you are writing business logic or auditing infrastructure, AI agents can now examine codebases at a scale and depth that was not possible before.
The open questions are not about capability anymore. They are about process. How do security teams triage an AI-generated flood of valid vulnerabilities? How do disclosure frameworks work when the finder is an agent? How do we prioritize fixes when everything looks critical?
Those are good problems to have. But they are problems that the industry needs to solve quickly, because the flood is not slowing down.
Sources: Vulnerability Research Is Cooked - Sock Puppet, Simon Willison - ai-security-research tag, Cursor Changelog, mtlynch.io - Claude Code Found Linux Vulnerability