When AI helps and when it hinders your development team
AI development productivity is not what the headlines suggest. The METR study found experienced developers were 19% slower with AI tools, yet believed they were faster. Here is an honest, evidence-based assessment of where AI genuinely accelerates development and where it introduces hidden costs and risks.

Everyone says AI makes developers faster. The vendor pitch decks show 40-55% productivity gains. Your board has probably asked why you are not already using Copilot. And I understand the pressure - when competitors claim they are shipping twice as fast with AI, standing still feels reckless.
But the evidence tells a more complicated story. One that I think technical leaders need to hear before they commit budget and reshape their engineering practices around assumptions that may not hold.
The study that should give every CTO pause
In mid-2025, METR (Model Evaluation and Threat Research) published a randomised controlled trial that landed like a grenade in the AI development productivity conversation. They studied 16 experienced open-source developers completing 246 tasks on their own repositories - codebases they knew intimately, averaging 10 years old with over a million lines of code.
The headline finding: developers using AI tools (Cursor Pro with Claude 3.5/3.7 Sonnet) were 19% slower than when working without AI.
That number alone is worth sitting with. But the more revealing finding was the perception gap. Before starting, developers predicted AI would make them 24% faster. After finishing, they still believed it had made them 20% faster. The tools felt productive even when they were not.
This is not an isolated data point. While Google's internal RCT showed a more positive result - a 21% speed improvement on enterprise-grade tasks with around 100 engineers - the METR study specifically examined experienced developers on complex, mature codebases. The kind of work that most established companies actually do.
Where AI development productivity is real
I am not anti-AI. I use AI tools daily in my own work. The question is not whether AI helps - it is where, and under what conditions.
Boilerplate and routine code
AI is genuinely excellent at generating repetitive code patterns. CRUD operations, data transfer objects, API endpoint scaffolding, configuration files. Tasks where the pattern is well-established and the developer's time is better spent elsewhere. In my experience, this is where the productivity claims hold up. The code is predictable, the patterns are well-represented in training data, and there is little room for subtle errors.
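To make "boilerplate" concrete, here is the sort of pattern I mean - a hypothetical data transfer object with serialisation helpers. The names are invented, but this is exactly the kind of code AI tools complete almost verbatim, because the pattern appears everywhere in their training data.

```python
# A hypothetical DTO of the kind AI tools autocomplete reliably.
# Field names and types here are invented for illustration.
from dataclasses import dataclass, asdict


@dataclass
class CustomerDTO:
    id: int
    name: str
    email: str
    is_active: bool = True

    def to_dict(self) -> dict:
        """Serialise to a plain dict, e.g. for a JSON API response."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "CustomerDTO":
        """Build a DTO from an incoming payload, ignoring unknown keys."""
        known = {"id", "name", "email", "is_active"}
        return cls(**{k: v for k, v in data.items() if k in known})
```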
Test generation
Writing unit tests for existing code is one of AI's strongest use cases. The function signature, parameters, and return types provide clear constraints. AI can generate the happy path, edge cases, and null checks in seconds. I have seen teams double their test coverage in weeks after introducing AI-assisted test generation, with code quality improving as a result.
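As a rough sketch of what this looks like in practice - the function and tests below are hypothetical, but the shape is typical: a happy path, a couple of edge cases, and an invalid input, all derivable from the signature and docstring alone (pytest assumed).

```python
# Hypothetical function under test, plus the style of tests AI tools
# typically generate from its signature and docstring.
import pytest


def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent, rounded to 2 decimal places."""
    if price < 0 or not 0 <= percent <= 100:
        raise ValueError("price must be >= 0 and percent in [0, 100]")
    return round(price * (1 - percent / 100), 2)


def test_happy_path():
    assert apply_discount(100.0, 20) == 80.0


def test_zero_discount_is_identity():
    assert apply_discount(49.99, 0) == 49.99


def test_full_discount_is_free():
    assert apply_discount(80.0, 100) == 0.0


def test_invalid_percent_raises():
    with pytest.raises(ValueError):
        apply_discount(50.0, 150)
```

The caveat is that generated tests only encode what the signature implies. They will happily assert the wrong behaviour if the implementation is already wrong, so a human still needs to confirm the expected values.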
Documentation
Explaining what code does is a task AI handles well, particularly for inline documentation, README files, and API documentation. The code itself provides the source of truth, and AI translates it into human-readable descriptions. This frees developers from one of the tasks they most commonly skip.
Prototyping and exploration
When you need a working proof of concept quickly, AI can generate functional prototypes in hours rather than days. For exploring unfamiliar APIs, libraries, or frameworks, AI acts as a knowledgeable pair programmer who has read every Stack Overflow answer. This is valuable, provided you treat the output as disposable exploration rather than production code.
Code completion for routine tasks
The autocomplete-style suggestions from tools like Copilot and Cursor genuinely reduce keystrokes for everyday coding. Filling in function bodies where the intent is clear from the signature, completing repetitive patterns, suggesting common idioms. This is where most developers feel the speed improvement, and the data supports a modest but real gain for these straightforward tasks.
Where AI hinders your team
This is the section most AI evangelists skip. But if you are responsible for shipping reliable software, this is the section that matters most.
Complex architectural decisions
AI has no understanding of your business context, your team's capabilities, or your operational constraints. It will confidently suggest architectural patterns that are technically valid but entirely wrong for your situation. I have reviewed code where AI suggested microservices for a two-person team, recommended eventual consistency where strong consistency was a regulatory requirement, and proposed technology stacks that nobody on the team could maintain.
Architecture is about tradeoffs, and tradeoffs require context that AI simply does not have.
Security-sensitive code
This is where AI can actively cause harm. Veracode's analysis of more than 100 large language models found security flaws in 45% of the code they produced. AI-generated code tends towards insecure defaults - it will connect to databases without parameterised queries, generate authentication flows with subtle vulnerabilities, and suggest cryptographic patterns that look correct but are not.
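To illustrate the parameterised query point, here is a minimal sketch using Python's built-in sqlite3 (the table and input are invented). The first query is the insecure pattern AI-generated code frequently defaults to; the second binds the value so the driver treats it as data rather than SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
user_input = "alice@example.com' OR '1'='1"  # hostile input

# Insecure pattern AI output often defaults to: string interpolation
# lets the input rewrite the query (SQL injection).
unsafe_query = f"SELECT id FROM users WHERE email = '{user_input}'"
conn.execute(unsafe_query)  # matches every row, not just the intended one

# Parameterised version: the value is bound, not spliced into the SQL.
safe_rows = conn.execute(
    "SELECT id FROM users WHERE email = ?", (user_input,)
).fetchall()
```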
Worse, there is an emerging threat called "slopsquatting." Research examining 576,000 AI-generated code samples found that nearly 20% of package dependencies referenced by AI do not actually exist. Malicious actors are now registering these hallucinated package names, meaning AI is not just suggesting bad code - it is potentially directing your developers to install malware.
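One lightweight mitigation is to check that every dependency an AI suggests actually exists on the registry before anyone installs it. A minimal sketch against PyPI's public JSON API - the package names in the list are placeholders:

```python
# Check that AI-suggested package names resolve on PyPI before
# installing them. Names below are placeholders for illustration.
import urllib.error
import urllib.request


def exists_on_pypi(package: str) -> bool:
    """Return True if the package has an entry in PyPI's JSON API."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # 404 (not on PyPI) or network failure


suggested = ["requests", "some-hallucinated-package-name"]
for name in suggested:
    status = "found" if exists_on_pypi(name) else "NOT FOUND - check by hand"
    print(f"{name}: {status}")
```

Existence is not proof of safety - a squatted package will pass this check - but it catches names that simply do not resolve and forces a manual look at anything unfamiliar.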
Novel problem solving
The METR study found that AI tools "tended to perform worse in complex environments." This is consistent with what I see in practice. When the problem is genuinely novel - when you are building something that does not closely match patterns in the training data - AI becomes a hindrance rather than a help. Developers spend time crafting prompts, reviewing generated code, testing it, finding it does not quite work, and revising. In the METR study, developers accepted less than 44% of AI-generated code, meaning more than half the AI output was wasted effort.
Maintaining consistency across large codebases
AI models have limited context windows. Your codebase has conventions, patterns, and implicit architectural decisions built up over years. AI will generate code that is locally correct but inconsistent with your established patterns. It will use a different naming convention, a different error handling approach, or a different logging pattern from the rest of your codebase. Over time, this inconsistency compounds into maintenance burden.
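A small illustration of the drift I mean, with invented names: the first function follows a hypothetical codebase convention of logging through the module logger and wrapping failures in a domain error; the second is the kind of locally plausible suggestion that prints instead and silently swallows the failure.

```python
import logging

logger = logging.getLogger(__name__)


class PaymentError(Exception):
    """Domain error the rest of this (hypothetical) codebase expects."""


def gateway_charge(amount_pence: int) -> None:
    """Stand-in for a payment gateway call; fails for the illustration."""
    raise TimeoutError("gateway timed out")


# Established convention: module logger, wrap in the domain error,
# re-raise so callers can handle the failure.
def charge_card(amount_pence: int) -> None:
    try:
        gateway_charge(amount_pence)
    except TimeoutError as exc:
        logger.error("charge failed: %s", exc)
        raise PaymentError("card charge failed") from exc


# Locally plausible AI suggestion: works in isolation, but prints
# instead of logging and swallows the exception entirely.
def charge_card_ai_suggested(amount_pence: int) -> None:
    try:
        gateway_charge(amount_pence)
    except Exception as e:
        print("Error:", e)
```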
The hidden costs nobody talks about
The productivity metrics that vendors quote typically measure code output. Lines written, pull requests merged, tasks completed. They rarely measure what matters: working software shipped to production with acceptable quality.
Review time explosion
Studies show that teams with heavy AI use saw pull request sizes increase by up to 150%, while PR review time increased by approximately 91%. AI makes it easy to generate large volumes of code, but every line still needs human review. You have not saved time if your developers are now spending their afternoons reviewing AI-generated code instead of writing their own.
The confidence problem
The METR study's most concerning finding was not the 19% slowdown - it was that developers believed they were faster even when they were not. This creates a dangerous feedback loop. Teams adopt AI tools, feel more productive, ship code that has not been adequately reviewed because "the AI wrote it," and accumulate technical debt they do not realise they are building.
Stack Overflow's 2025 Developer Survey found that only 33% of developers trust AI-generated code, yet 85% regularly use AI tools. That gap between usage and trust should concern any technical leader.
Hallucinated knowledge
AI does not know what it does not know. It will generate plausible-looking code that references APIs that do not exist, uses deprecated methods, or implements algorithms incorrectly in ways that pass superficial review. In one study, 43% of hallucinated package names appeared repeatedly across multiple prompts, meaning the AI is consistently wrong in the same way. This is harder to catch than random errors.
How to adopt AI responsibly
None of this means you should avoid AI tools. It means you should adopt them with the same rigour you would apply to any other engineering decision.
Establish guardrails before you start
Define where AI is permitted and where it is not. Security-sensitive code, authentication flows, payment processing, and data handling should require human-written code with AI assistance limited to review and suggestion. Create an AI usage policy that your team understands and follows.
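To make a policy like that enforceable rather than aspirational, a simple automated check helps. The sketch below fails a build when a change touches paths reserved for human-written code - the path prefixes and the way changed files are supplied are assumptions; wire it into whichever CI you already run.

```python
# Minimal guardrail sketch: exit non-zero if a change touches paths the
# policy reserves for human-written code. Prefixes and the input format
# are assumptions for illustration.
import sys

SENSITIVE_PREFIXES = (
    "src/auth/",
    "src/payments/",
    "src/crypto/",
)


def flagged(changed_files: list[str]) -> list[str]:
    return [f for f in changed_files if f.startswith(SENSITIVE_PREFIXES)]


if __name__ == "__main__":
    # Expects one changed file path per line on stdin, e.g.
    #   git diff --name-only main... | python check_paths.py
    hits = flagged([line.strip() for line in sys.stdin if line.strip()])
    if hits:
        print("Policy: these paths require human-written code and extra review:")
        for path in hits:
            print(f"  {path}")
        sys.exit(1)
```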
Strengthen your code review practices
AI-generated code needs more review, not less. If your current review process is casual, fix that before introducing AI tools. Consider requiring explicit labels on AI-generated code in pull requests so reviewers know to apply additional scrutiny. Treat AI output the same way you would treat code from a junior developer - it might be correct, but verify it.
Measure actual productivity, not activity
Do not measure lines of code or pull requests merged. Measure what matters: cycle time from requirement to production, defect rates, time spent on rework, and customer-facing incidents. If those metrics are not improving, your AI tools are generating activity, not productivity.
The distinction matters. Teams that merged 98% more pull requests but saw a 9% rise in bug counts have not become more productive. They have become busier.
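Measuring this does not require new tooling - your issue tracker and incident log already hold the raw data. A minimal sketch, assuming you can export start and deploy timestamps per piece of work and a defect count for the same period:

```python
# Minimal sketch: median cycle time and defect rate from exported data.
# Field names and the sample values are assumptions for illustration.
from datetime import datetime
from statistics import median

work_items = [
    {"started": "2025-03-01", "deployed": "2025-03-06"},
    {"started": "2025-03-03", "deployed": "2025-03-12"},
    {"started": "2025-03-05", "deployed": "2025-03-09"},
]
defects_this_period = 4
items_shipped = len(work_items)


def days_between(start: str, end: str) -> int:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).days


cycle_times = [days_between(w["started"], w["deployed"]) for w in work_items]
print(f"median cycle time: {median(cycle_times)} days")
print(f"defects per item shipped: {defects_this_period / items_shipped:.2f}")
```

Track the same two numbers before and after adoption; if the median is not falling and the defect rate is not at least stable, the tools are generating activity, not productivity.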
Start with the right tasks
Begin with boilerplate generation, test writing, and documentation. These are the areas where evidence supports genuine productivity gains and the risk of AI-generated errors is lowest. Expand cautiously into more complex tasks only after you have data showing improvement, not just feelings of improvement.
Invest in developer judgement
The developers who benefit most from AI are the ones who can evaluate its output critically. Google's study found that senior developers derived greater benefit from AI tools than juniors. This makes intuitive sense - you need to know what good code looks like before you can judge whether AI has produced it. Invest in your team's fundamental skills, not just their prompt engineering.
The honest position
I use AI tools every day. They make certain tasks faster and reduce the drudgery of repetitive work. I have also seen them generate confidently wrong architectural advice, introduce subtle bugs that took days to find, and create a false sense of progress on projects that were actually falling behind.
The organisations getting the most from AI are not the ones adopting it fastest. They are the ones adopting it most thoughtfully - with clear boundaries, strong review practices, and honest measurement of what is actually improving.
If you are evaluating whether your team is ready for AI-augmented development, I have put together an AI Development Readiness Guide that provides a structured assessment framework. It covers technical infrastructure, people and skills, and process dimensions, with a scoring system that tells you where to start.
Frequently asked questions
Should I ban AI coding tools from my development team?
No. Banning AI tools entirely puts you at a competitive disadvantage for the tasks where AI genuinely helps - boilerplate generation, test writing, documentation. The evidence supports real productivity gains for routine, well-defined tasks. What you should do is establish clear policies about where AI is appropriate and where it is not, and ensure your review processes are robust enough to catch AI-generated errors.
How do I measure whether AI tools are actually helping my team?
Do not rely on developer self-reporting - the METR study showed developers consistently overestimated AI's benefit. Instead, track objective metrics: cycle time from requirement to production, defect rates in production, time spent on code review, and rework rates. Compare these metrics before and after AI adoption. If cycle time is not decreasing and defect rates are not stable or improving, the tools are not delivering the value you expect.
Will AI tools replace developers on my team?
No. Current AI tools augment developers rather than replace them. The evidence consistently shows that AI performs best on routine, well-constrained tasks and struggles with complex, context-dependent work. You still need experienced developers to make architectural decisions, evaluate AI output, handle security-sensitive code, and maintain consistency across your codebase. The teams getting the best results are using AI to handle drudgery so their developers can focus on the work that requires human judgement.
What is the biggest risk of AI adoption in development teams?
False confidence. When developers believe they are more productive than they actually are - as the METR study demonstrated - they may reduce the scrutiny they apply to code, skip review steps, or take on more work than they can deliver at the required quality level. This creates technical debt that is invisible until it causes problems. Mitigate this by maintaining rigorous review standards regardless of how code was generated.
How much should I budget for AI development tools?
Current pricing for tools like GitHub Copilot and Cursor ranges from GBP 12-20 per developer per month. The tool cost is trivial compared to developer salaries. The real cost is the time investment: establishing policies, training developers on effective use, strengthening review processes, and measuring outcomes. Budget for the organisational change, not just the licence fees.