At AI Scale, Code Quality Is a Security Property.

Eran Yahav, CTO of Tabnine


Most security conversations about AI-generated code focus on the obvious: does the agent introduce known vulnerabilities? Does it pull in a dependency with a CVE? Does it leak secrets?

These are real problems. They are also the easy ones. SAST tools catch many of them. Dependency scanners catch more. The industry has twenty years of tooling for these failure modes.

The harder problem is subtler. When an agent generates ten times the volume of code, the codebase itself changes shape. More modules. More abstraction layers. More patterns, inconsistently applied. More surface area that no single engineer fully understands. The codebase becomes sprawling, and sprawl is where vulnerabilities hide — not as flagged CVEs, but as complexity that defeats review.

This is the argument I want to make: at AI-generated-code volumes, code quality is not an aesthetic preference. It is a security control.

The sprawl problem

A human engineer writing a new service will, most of the time, look at how existing services in the codebase are structured. Not because they were told to — because it’s faster. They absorb patterns by proximity. They ask a teammate. They copy an existing module and modify it. The result is imperfect consistency, but it is consistency.

A coding agent does not do this by default. It generates from its training distribution, conditioned on whatever context it receives. If that context is a file and a prompt, the agent produces code that is locally correct and globally unconstrained. It will introduce a new error-handling pattern in a codebase that already has one. It will create a service with a different directory structure than every other service. It will nest logic four levels deep in a codebase where the team norm is two.

None of these are bugs. A linter will not flag them. A SAST scanner will not flag them. They are quality problems, and at human-generated-code volumes they are manageable: a reviewer catches the inconsistency and requests changes, and the engineer fixes it.

At AI-generated-code volumes, this process breaks. The reviewer is now looking at five times more code, generated faster than they can absorb. The inconsistencies compound. The codebase drifts from its own patterns. Six months later, you have a system where the same operation is implemented three different ways, where module boundaries are unclear, where the dependency graph has grown edges no one intended.

This is not technical debt in the conventional sense. It is attack surface.

Why sprawl is a security problem

The connection between code quality and security is not metaphorical. It is mechanical.

Complexity defeats review. Security vulnerabilities that survive to production almost always survive because a reviewer did not catch them. The primary predictor of review effectiveness is comprehensibility: can the reviewer build a mental model of what the code does? Complex, inconsistent, sprawling code is harder to review, and harder-to-review code ships more vulnerabilities. Nagappan, Ball, and Zeller showed that code complexity metrics can predict post-release defects, although the metrics that predict best vary from project to project. The mechanism is straightforward: complexity erodes the reviewer’s ability to reason about the code, and at AI-generated volumes complexity grows faster than the reviewer’s capacity.

Inconsistency creates gaps. When the same operation is implemented three ways, security properties that apply to one implementation may not apply to the others. Input validation in the canonical pattern gets omitted in the copy-paste variant. Authorization checks in the original service are absent from the structurally different clone. Consistency is a security property because it means a fix in one place is a fix everywhere.
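
As a minimal sketch of this gap, assume a codebase in which the same operation (turning a user-supplied name into a storage path) exists twice. Every name here, including `UPLOAD_ROOT` and both functions, is hypothetical and chosen only for illustration. The canonical version validates its input; the structurally different variant does the same job without the check.

```python
from pathlib import PurePosixPath

UPLOAD_ROOT = PurePosixPath("/srv/uploads")  # hypothetical storage root


def resolve_upload_canonical(name: str) -> PurePosixPath:
    """The established pattern: reject names that could escape the root."""
    candidate = PurePosixPath(name)
    if candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"unsafe upload name: {name!r}")
    return UPLOAD_ROOT / candidate


def resolve_upload_variant(name: str) -> PurePosixPath:
    """A second implementation of the same operation, with no validation."""
    return UPLOAD_ROOT / name


print(resolve_upload_canonical("reports/q1.pdf"))  # /srv/uploads/reports/q1.pdf
print(resolve_upload_variant("/etc/passwd"))       # /etc/passwd
```

Neither function looks wrong in isolation. A fix or audit applied to the canonical function leaves the variant untouched, which is the sense in which consistency acts as a security property.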

Sprawl obscures ownership. When module boundaries are unclear and service structures are inconsistent, it becomes harder to answer “who is responsible for this code?” Unowned code is unreviewed code. Unreviewed code is the highest-risk code in any system.

Unnecessary abstraction creates hiding places. An agent that generates an extra layer of indirection — because its training data favors abstraction — creates code that is harder to audit. Every unnecessary interface, wrapper, or delegation is a place where a vulnerability can hide behind apparent structure.

These are not theoretical risks. They are the failure modes that experienced security engineers already worry about in human-written codebases. AI-generated code amplifies every one of them.

Why current tools don’t solve this

The standard quality toolchain — linters, formatters, style checkers, complexity analyzers — was built for human-generated code at human-generated volumes. It has two structural limitations at AI scale.

It operates post-hoc. The agent writes the code. The linter flags it. The developer fixes it. Or the developer suppresses the warning. Or the developer is operating in an agentic flow and the lint output scrolls past in a terminal. Post-hoc quality enforcement works when a human is attentive at every step. It degrades when the generation-review loop accelerates.

It enforces generic rules, not organizational ones. A linter knows that cyclomatic complexity above 10 is generally undesirable. It does not know that your team’s norm is 6, derived from three years of incident data showing that modules above 6 have twice the defect rate. It does not know that your architecture requires all external-facing services to follow a specific structural pattern. It does not know that your team abandoned a particular design approach after it caused an outage. Generic rules catch generic problems. Organizational quality — the quality that actually prevents your vulnerabilities — requires organizational knowledge.

Our approach: quality enforcement at generation time

This is where governance at generation intersects with organizational context.

The idea is straightforward: enforce quality constraints before the code is written, not after. And derive those constraints from the organization’s own patterns, not from generic rule sets.

We are building this in three layers.

Organizational pattern enforcement. The enforcement layer needs a model of how the codebase is actually structured — not a style guide someone wrote two years ago, but a computed representation of service patterns, module layouts, and error-handling approaches as they exist in the code today. The agent generates within those constraints. When it creates a new service, the output conforms to the structural patterns already present, because the generation is conditioned on them. This is pattern inference applied at generation time — closer to a type system than a linter.
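
To make "a computed representation" concrete, here is a deliberately small sketch. It is not the Context Engine or any Tabnine API. It assumes each existing service can be summarized as a set of top-level entries, infers the layout most services share, and compares a proposed service against it. The function names, the `min_share` parameter, and the sample data are all hypothetical.

```python
from collections import Counter


def infer_layout(services: dict[str, set[str]], min_share: float = 0.8) -> set[str]:
    """Return the entries present in at least `min_share` of existing services."""
    if not services:
        return set()
    counts = Counter(entry for entries in services.values() for entry in entries)
    threshold = min_share * len(services)
    return {entry for entry, n in counts.items() if n >= threshold}


def layout_violations(proposed: set[str], expected: set[str]) -> dict[str, list[str]]:
    """Report where a proposed service layout departs from the inferred one."""
    return {
        "missing": sorted(expected - proposed),
        "unexpected": sorted(proposed - expected),
    }


# Hypothetical snapshot of the codebase as it exists today.
existing = {
    "billing": {"api/", "domain/", "errors.py", "tests/"},
    "orders": {"api/", "domain/", "errors.py", "tests/"},
    "search": {"api/", "domain/", "errors.py", "tests/", "indexer/"},
}
expected = infer_layout(existing)
report = layout_violations({"handlers/", "domain/", "tests/"}, expected)
print(report)  # {'missing': ['api/', 'errors.py'], 'unexpected': ['handlers/']}
```

The point of the sketch is that the expected layout is derived from the code rather than from a written style guide. A real system would presumably model far more than directory entries.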

Complexity budgets. Organizations should be able to set thresholds — maximum cyclomatic complexity, maximum nesting depth, maximum module size — that the agent respects at generation time. These are not novel metrics. What is novel is enforcing them before the code exists, rather than flagging them after. An agent that is constrained to produce code within a complexity budget produces code that is reviewable. Reviewable code is more secure code.
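
One way to picture a complexity budget is as a gate that a candidate must pass before it is accepted. The sketch below uses Python's standard `ast` module and is an assumption-laden illustration, not the mechanism described above: the text does not specify how an agent is constrained during generation. The `ComplexityBudget` class, its default values (6 and 2, borrowed from the examples in this article), and the simplified metrics are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexityBudget:
    max_cyclomatic: int = 6  # example team norm from the text
    max_nesting: int = 2     # example team norm from the text


_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)
_NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def cyclomatic(func: ast.AST) -> int:
    """Approximate cyclomatic complexity: 1 + decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1  # each extra and/or operand adds a path
    return score


def nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Deepest chain of nested block statements (elif chains count as nested)."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, _NESTING_NODES) else depth
        deepest = max(deepest, nesting_depth(child, child_depth))
    return deepest


def over_budget(source: str, budget: ComplexityBudget) -> list[str]:
    """List every function in `source` that exceeds the budget."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            cc, depth = cyclomatic(node), nesting_depth(node)
            if cc > budget.max_cyclomatic:
                findings.append(f"{node.name}: cyclomatic {cc} > {budget.max_cyclomatic}")
            if depth > budget.max_nesting:
                findings.append(f"{node.name}: nesting {depth} > {budget.max_nesting}")
    return findings


# A hypothetical candidate that nests four levels deep.
CANDIDATE = '''
def route(order, user):
    if user.active:
        for item in order.items:
            if item.stock > 0:
                if item.price > 100 and user.verified:
                    return "review"
    return "ok"
'''
print(over_budget(CANDIDATE, ComplexityBudget()))  # ['route: nesting 4 > 2']
```

In an agentic flow, a non-empty result could be fed back as a constraint for regeneration instead of being shown to a human as a warning.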

Architectural boundary enforcement. The Context Engine models service dependencies and ownership boundaries. Quality enforcement extends this to structural rules: which services can depend on which, which patterns are permitted in which modules, which abstractions are required at which boundaries. The agent cannot introduce a dependency that violates the architecture because the constraint exists at generation time.
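
A structural rule of the form "which services can depend on which" can be sketched as an allow-list checked against the imports of a candidate. This is an illustration under stated assumptions, not the Context Engine's data model: the `ALLOWED_DEPENDENCIES` table, the service names, and the convention that each service is a top-level Python package are all hypothetical.

```python
import ast

# Hypothetical architecture rules: service -> internal packages it may import.
ALLOWED_DEPENDENCIES = {
    "checkout": {"payments", "catalog", "shared"},
    "catalog": {"shared"},
}


def imported_packages(source: str) -> set[str]:
    """Top-level package names imported (absolutely) by the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def boundary_violations(service: str, source: str, internal: set[str]) -> list[str]:
    """Internal packages that `service` imports but is not allowed to depend on."""
    allowed = ALLOWED_DEPENDENCIES.get(service, set()) | {service}
    used = imported_packages(source) & internal  # ignore stdlib and third-party
    return sorted(used - allowed)


# A hypothetical candidate file for the catalog service.
CANDIDATE = """
import json
from shared.logging import get_logger
from payments.client import charge
"""
internal = {"checkout", "catalog", "payments", "shared"}
print(boundary_violations("catalog", CANDIDATE, internal))  # ['payments']
```

The same candidate placed in the hypothetical `checkout` service would pass, because that service is allowed to depend on `payments`.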

I want to be honest about where this stands. Organizational pattern enforcement is working in early form — the Context Engine provides the knowledge, and generation-time governance can act on it. Complexity budgets are in development. Full architectural boundary enforcement is directional — we have the data model from the Context Engine, and we are building the enforcement layer on top of it.

This is not a GA announcement. It is a preview of the direction and early results.

Why pre-generation quality matters more than post-generation scanning

There is a pragmatic argument for why quality enforcement must move to generation time, beyond the theoretical case.

Post-generation quality tools create a fix-review cycle. The agent writes code. The tool flags problems. The agent (or developer) fixes them. The tool checks again. This loop has cost — in time, in compute, in developer attention. More importantly, it has a structural problem: an agent that generates low-quality code and then iterates to fix it often produces code that is technically compliant but structurally worse than code that was generated correctly in the first place. Patching complexity out of code that was generated complex is harder than generating simple code from the start.

Pre-generation enforcement avoids this. The agent operates within constraints from the beginning. The output is not patched to comply — it is generated to comply. The resulting code is simpler, more consistent, and more reviewable.

This is analogous to the security principle of secure-by-design versus bolt-on security. Quality-by-generation is more effective than quality-by-iteration, for the same reasons.
