You've probably felt this. The first week you wired an AI assistant into your editor, you shipped twice as much. By month three, you were back to your old pace — except now you were debugging weirder bugs.
I've been using AI assistants in my daily workflow for about two years across four projects. The pattern keeps showing up: the productivity gains are real but front-loaded, and they erode unless you change how you work. Most of that erosion comes from one specific, fixable problem.
The Problem: Plausible Code That Doesn't Actually Work
The bug I see most often isn't an obvious syntax error. It's when generated code calls a function, method, or config option that looks exactly like something the library would have — but doesn't.
Last month I was building a CSV import feature and the assistant happily produced this:
```python
import pandas as pd

# Read CSV with progress reporting — looks reasonable, right?
df = pd.read_csv(
    "users.csv",
    on_progress=lambda pct: print(f"Loading: {pct}%"),  # this kwarg does not exist
    chunksize=10_000,
)
```

`on_progress` is not a real parameter on `pd.read_csv`. The code was syntactically valid Python, my linter didn't complain, and the failure mode was... silent. The kwarg got swallowed and the import ran without any progress reporting. I only noticed because a user pinged me saying the loading bar wasn't moving.
This is the core issue. AI-generated code is plausible in a specific, dangerous way: it pattern-matches the shape of real APIs, which is exactly what makes it hard to spot in review.
Root Cause: How Hallucinations Slip Through
Three things conspire here:
- Pattern-matching beats correctness. The model has seen thousands of `pd.read_csv` calls. It has also seen progress callbacks on other I/O functions. Stitching them together produces code that looks right without being right.
- Type checkers often can't save you. Many libraries use `**kwargs`, dynamic dispatch, or duck typing. Static analysis won't flag a non-existent keyword argument that flows through `**kwargs` (a sketch of that failure mode follows this list).
- Reviewer fatigue. When the surrounding code is correct and the function name is real, your eyes glide over the made-up parameter. After 200 lines of mostly-good output, you stop reading carefully.
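To make the second point concrete, here's a minimal sketch of how an invented kwarg can vanish without an error. The wrapper and its name are hypothetical; the pattern (a convenience function that forwards only the options it recognizes) is what matters.

```python
# Hypothetical wrapper: load_csv is made up for illustration.
import pandas as pd

ALLOWED = {"chunksize", "dtype", "usecols"}


def load_csv(path, **kwargs):
    # Forward only the options this wrapper knows about; anything else,
    # including a hallucinated on_progress=..., is silently dropped.
    read_opts = {k: v for k, v in kwargs.items() if k in ALLOWED}
    return pd.read_csv(path, **read_opts)


# No TypeError, no progress reporting. The kwarg just vanishes.
df = load_csv("users.csv", on_progress=lambda pct: print(pct))
```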
The deeper issue is a workflow one. If you're prompting for a feature and pasting the result, you've outsourced generation but kept full responsibility for verification — and verification is harder on code you didn't write, because you don't have the mental model the author would have.
The Fix: Force Verification Into the Loop
Here's the workflow I switched to after enough of these bites. The core idea: don't accept code unless something other than your eyes has touched it.
Step 1: Generate the test first
Before generating the implementation, write (or generate) a test that exercises the specific behavior you want. This pins the behavior to something runnable.
```python
# tests/test_import.py
from myapp.importer import load_users


def test_load_users_reports_progress():
    progress_log = []

    # The whole point of the feature: progress callbacks fire
    result = load_users(
        "tests/fixtures/users.csv",
        on_progress=lambda pct: progress_log.append(pct),
    )

    assert len(result) > 0
    assert progress_log, "expected at least one progress update"
    assert progress_log[-1] == 100
```

If the implementation hallucinates an API, the test fails immediately with a real error message — usually `TypeError: unexpected keyword argument`. Way cheaper than debugging in production.
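For context, here is roughly what an implementation that satisfies this test could look like. A sketch only: `load_users` and its signature come from the test above, while the chunked read and the line-count progress math are my assumptions about one reasonable approach.

```python
# myapp/importer.py (sketch)
import pandas as pd


def load_users(path, on_progress=None, chunksize=10_000):
    # Count data rows up front so progress can be a real percentage.
    with open(path) as f:
        total_rows = max(sum(1 for _ in f) - 1, 1)  # minus the header row

    frames = []
    rows_seen = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        frames.append(chunk)
        rows_seen += len(chunk)
        if on_progress is not None:
            on_progress(min(100, round(100 * rows_seen / total_rows)))
    return pd.concat(frames, ignore_index=True)
```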
Step 2: Run code, don't just read it
Add a pre-commit hook that blocks commits when tests fail. Yes, this is obvious. Yes, most teams I've worked with don't actually enforce it.
```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pytest-fast
        name: pytest (fast suite)
        entry: pytest -x -m "not slow"  # -x: stop on first failure
        language: system
        pass_filenames: false
        always_run: true
```

The point isn't catching every bug. It's catching the plausible-but-wrong ones the moment they hit your branch, before they pile up into a multi-hour debugging session two weeks later.
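The `-m "not slow"` filter assumes you tag expensive tests with a marker. A minimal sketch of what that looks like (the `slow` marker is a project convention, not something pytest ships with, so register it in your pytest configuration to avoid the unknown-marker warning):

```python
# tests/test_import_slow.py
import pytest


# Anything expensive gets tagged so the pre-commit hook can skip it.
@pytest.mark.slow
def test_load_users_handles_a_large_file(tmp_path):
    ...  # the slow, full-size version of the fast test above
```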
Step 3: Pin the dependency surface
A surprising amount of hallucination happens because the model assumes a different version of a library than you have installed. Lock your versions and tell the assistant which version you're on:
```toml
# pyproject.toml
[project]
dependencies = [
    "pandas==2.2.3",   # exact pin, not >=
    "pydantic==2.9.2",
]
```

When you prompt, include the version. "Using pandas 2.2.3, write a CSV importer with progress reporting" gets you closer to reality than the same prompt without the version, because the model will at least try to constrain its API recall.
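A small, optional helper that pairs well with this: print the exact installed versions so you can paste them straight into a prompt. The package list here is just an example; swap in whatever your project depends on.

```python
# versions.py: copy the output into your prompt.
from importlib.metadata import version

for pkg in ("pandas", "pydantic"):
    print(f"{pkg}=={version(pkg)}")  # e.g. pandas==2.2.3
```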
Step 4: Prefer narrow prompts over broad ones
Long, multi-feature prompts produce code where errors compound. I get better results asking for one function at a time, with clear inputs and outputs:
```text
Function signature:

    def parse_user_row(row: dict) -> User: ...

Requirements:
- Strip whitespace from email
- Reject rows where email is missing or invalid
- Return User(email=..., name=..., created_at=...)
- Raise InvalidRowError on bad data, do not log

Use only the standard library and pydantic 2.9.
```

Narrow scope, explicit constraints, named version. My hallucination rate drops noticeably with this format.
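For comparison, this is roughly the shape of answer that prompt tends to get back. A sketch, not a canonical solution: `User`, `InvalidRowError`, and the field names come from the prompt above, while the validation details (the `@` check, relying on pydantic to parse timestamps) are my assumptions.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError


class InvalidRowError(Exception):
    pass


class User(BaseModel):
    email: str
    name: str
    created_at: datetime


def parse_user_row(row: dict) -> User:
    email = (row.get("email") or "").strip()
    if "@" not in email:
        raise InvalidRowError(f"missing or invalid email: {email!r}")
    try:
        # pydantic validates types and parses ISO-8601 timestamps here
        return User(email=email, name=row.get("name", ""), created_at=row.get("created_at"))
    except ValidationError as exc:
        raise InvalidRowError(str(exc)) from exc
```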
Prevention: Build Habits, Not Heroics
A few things I now do reflexively:
- Read the imports first. If the generated code imports something you didn't ask for, that's a yellow flag. Verify the import path exists in your installed version before reading further.
- Distrust convenience parameters. When a function call has a kwarg that feels suspiciously just right for your problem, look it up in the docs or ask the installed library directly (see the snippet after this list). That's the highest-probability hallucination spot.
- Treat "looks correct" as a smell. If you read 30 lines of generated code and have zero questions, you didn't read carefully. There should always be at least one thing to verify.
- Keep your test runtime fast. If your full suite takes eight minutes, you'll skip running it. Sub-30-second feedback loops are what actually keep this workflow honest.
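That second habit is cheaper than it sounds. For most plain Python functions you can ask the installed library whether the parameter exists at all, though heavily decorated functions sometimes hide their signatures:

```python
# Does the installed pandas actually accept the kwarg the assistant used?
import inspect

import pandas as pd

params = inspect.signature(pd.read_csv).parameters
print("on_progress" in params)  # False: hallucinated
print("chunksize" in params)    # True: real
```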
So, More Work or Less?
After two years, my honest answer is: roughly the same amount of work, but distributed differently. Less typing, more reading. Less greenfield design, more verification. The people I see losing time to AI tools are the ones who didn't shift the verification load anywhere — they just trusted the output and inherited a slower debugging tail.
The tooling won't fix this for you. The workflow will.
