Set Up Harness Engineering on Your Python Repo in 30 Minutes

May 18, 2026

Five sentences to take with you

You can build a working outer harness around any coding agent in under an hour using four plain-text files and a handful of pip installs.
The recipe: install aider-chat, add an AGENTS.md as a feedforward guide, wire pre-commit with Ruff + mypy + Bandit + pytest as deterministic sensors, add a code-review script as an inferential sensor, and keep a progress.md as cross-session memory.
Each step maps to one of the five harness primitives (filesystem, code execution, sandbox, memory, context management).
The recipe directly mirrors what Anthropic, OpenAI, LangChain, and Stripe each ship in production, compressed to a single-developer scope.
Everything lives as plain files in your repo, locks you to no vendor, and survives any future model migration.

The fastest way to stop arguing about harness engineering is to build one.

A thirty-minute setup is enough to put every concept from the first two pieces this week into your own repo. By the end of this article your Python project will have an inner harness, a feedforward guide layer, deterministic sensors, one inferential sensor, and a memory file the agent reads and writes between sessions. A complete five-primitive harness, in plain text, owned by you.

You’ll need a Python repo, git, Python 3.10+, an API key for a model provider, and a few pip installs. Nothing else.

Step 0: set up a repo to work in

If you want to follow along on a throwaway codebase, here’s a working starter:

mkdir agent-harness-demo && cd agent-harness-demo
git init
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
echo ".venv/" > .gitignore
echo "print('hello, harness')" > app.py
echo "# Agent Harness Demo" > README.md
git add . && git commit -m "initial commit"

That’s the starting point. We add five things to it.

Step 1: install the inner harness with `aider-chat`

Why this is harness engineering: the inner harness is the agent loop itself. You’re choosing the runtime that the rest of the harness will wrap. Aider is the fastest “pip install and it works” option for a Python repo today.

pip install aider-chat

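That single command installs a working coding agent that understands your git repo, edits files in place, and commits its changes. Give it a test command (`--test-cmd "pytest -x"` together with `--auto-test`) and it will also run your tests after each edit and try to fix whatever fails. It’s a complete inner harness in one package.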

Point it at your repo and a model. You can use any provider supported by LiteLLM (Anthropic, OpenAI, Gemini, OpenRouter, local models via Ollama):

export ANTHROPIC_API_KEY=sk-ant-...
aider --model anthropic/claude-sonnet-4-6

Aider opens a chat in your terminal. You describe what you want; it proposes edits; you accept or refuse them; it commits. Out of the box, more harness than most teams run. The next four steps make it much better.

If you need browser automation and a server runtime, OpenHands is the alternative; install via the OpenHands Software Agent SDK repo. The rest of this article works with either.

Step 2: write `AGENTS.md` as the feedforward guide

Why this is harness engineering: the AGENTS.md file is a guide in Birgitta Böckeler’s sense, a piece of feedforward control that steers the agent before it acts. Without one, the agent infers conventions from the code on every session, which means it never quite gets them right.

AGENTS.md is an emerging convention (championed by the OpenAI Codex team and the Microsoft Skills Framework) for project-level instructions an agent reads at the start of every session. Create one in your repo root.

A minimal but real AGENTS.md for a Python project:

# AGENTS.md

## What this project is
A small data-pipeline utility. Reads CSVs from `data/`, transforms them, writes parquet to `out/`.

## How to run it
- Tests: `pytest -xvs`
- Lint: `ruff check .`
- Type check: `mypy src/`
- Build: `python -m build`

## Conventions
- Python 3.11+. Type-annotate every function.
- Use `pathlib.Path`, not `os.path`.
- All I/O goes through `src/io/` modules. Do not read files from `src/transform/`.
- Tests live in `tests/`, mirroring the `src/` tree. Every public function has a test.
- We use `ruff` (not `black`/`flake8`). Configured in `pyproject.toml`.
- We never commit `out/` or `data/`. Both are in `.gitignore`.

## Things that have bitten us before
- Mixing `pandas` and `polars` in the same module. Don't.
- Adding new top-level dependencies without updating `pyproject.toml`.
- Skipping the `--check` flag when running `ruff format` in CI.

## How to verify your work
1. Run `pytest -xvs` and confirm all tests pass.
2. Run `ruff check .` and confirm zero warnings.
3. Run `mypy src/` and confirm zero errors.
4. Commit with a descriptive message.
5. Update `progress.md` with what you changed and why.

## Don't
- Don't add new dependencies without asking.
- Don't change `src/schema.py` without updating the migration notes in `progress.md`.
- Don't disable or delete tests to make them pass.

Three sections matter most: “Conventions” (what good looks like), “Things that have bitten us before” (negative feedforward, which is at least as valuable as positive), and “How to verify your work” (the agent’s own pre-flight checklist).

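Many modern coding agents, OpenHands included, read AGENTS.md automatically. Aider may need it loaded explicitly: `aider --read AGENTS.md` adds it to the chat as a read-only file. If your agent offers neither, paste its contents into the system prompt manually.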
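Since three sections carry most of the weight, it can be worth checking that they never get deleted by accident. Here is a minimal sketch of such a check. The function names are hypothetical, and the section titles are simply the ones from the example file above, so adjust them to your own AGENTS.md.

```python
"""Check that AGENTS.md still contains the sections the agent relies on."""
from pathlib import Path
from typing import List, Sequence

# Titles copied from the example AGENTS.md in this article (an assumption
# about your file; rename to match your own headings).
REQUIRED_SECTIONS = (
    "Conventions",
    "Things that have bitten us before",
    "How to verify your work",
)


def missing_sections(markdown: str, required: Sequence[str] = REQUIRED_SECTIONS) -> List[str]:
    """Return the required titles that have no matching '## ' heading."""
    headings = {
        line[3:].strip().lower()
        for line in markdown.splitlines()
        if line.startswith("## ")
    }
    return [title for title in required if title.lower() not in headings]


def check_agents_md(path: Path = Path("AGENTS.md")) -> List[str]:
    """Report missing sections; a missing file counts as missing everything."""
    if not path.exists():
        return list(REQUIRED_SECTIONS)
    return missing_sections(path.read_text(encoding="utf-8"))
```

Calling `check_agents_md()` from the repo root returns an empty list when all three headings are present, which makes it easy to wrap in a test or a local hook.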

Step 3: wire pre-commit as the deterministic sensors

Why this is harness engineering: these are the cheap, fast, reliable sensors that catch structural issues before they become inferential problems. Computational sensors are the floor of every working harness.

Install pre-commit and the toolchain:

pip install pre-commit ruff pytest mypy bandit

Create .pre-commit-config.yaml:

repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.7.0
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.13.0
    hooks:
      - id: mypy
        additional_dependencies: ['types-requests']

  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.10
    hooks:
      - id: bandit
        args: ['-c', 'pyproject.toml']
        additional_dependencies: ['bandit[toml]']

  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: pytest -xvs
        language: system
        pass_filenames: false
        always_run: true

Install the hooks once:

pre-commit install

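Now every commit triggers four sensors: Ruff (lint + format), mypy (types), Bandit (security), and pytest (the test suite). A failed sensor blocks the commit, and the agent sees the failure output and can self-correct. Two caveats for the starter repo: the Bandit hook reads its config from pyproject.toml, so that file needs to exist with a `[tool.bandit]` table, and pytest exits non-zero when it collects no tests, so add at least one test before your first commit. Also check that your agent actually runs git hooks: Aider commits with `--no-verify` by default, so start it with `--git-commit-verify` to keep the sensors in the loop.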

This is a high-impact step in the recipe. Most failures that look like “the agent is dumb” are actually “the agent committed code that didn’t pass linting and nobody noticed.” Wire the sensors and those preventable errors get caught at the commit boundary instead of in review.
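To make the gating behaviour concrete, here is a rough sketch of the same idea outside of git: run the sensors in order and stop at the first one that fails. This is an illustration, not how pre-commit is implemented. The `run_sensors` name and the exact command arguments are assumptions chosen to mirror the tools above.

```python
"""Run sensors in order and stop at the first failure."""
import subprocess
from typing import Optional, Sequence, Tuple

Sensor = Tuple[str, Sequence[str]]

# Commands mirror the tools from this step; the arguments are assumptions.
DEFAULT_SENSORS = [
    ("ruff", ["ruff", "check", "."]),
    ("mypy", ["mypy", "src/"]),
    ("bandit", ["bandit", "-q", "-r", "src/"]),
    ("pytest", ["pytest", "-x", "-q"]),
]


def run_sensors(sensors: Sequence[Sensor] = DEFAULT_SENSORS) -> Optional[Tuple[str, str]]:
    """Return (sensor name, combined output) for the first failure, or None if all pass."""
    for name, command in sensors:
        try:
            result = subprocess.run(list(command), capture_output=True, text=True)
        except FileNotFoundError:
            # A sensor that is not installed counts as a failure, not a silent pass.
            return name, f"command not found: {command[0]}"
        if result.returncode != 0:
            return name, (result.stdout + result.stderr).strip()
    return None
```

A function like this is handy as a single test command for the agent or as a CI step: the returned output is exactly the text the agent needs to see in order to correct itself.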

Step 4: add the code-review skill as your one inferential sensor

Why this is harness engineering: deterministic sensors catch syntax and structure. They don’t catch semantic mistakes (a function that lints clean but does the wrong thing). The inferential sensor is what catches those, by paying a small cost on every commit to have a model read the diff and look for issues.

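The cheapest version of this is a script that runs after the agent finishes a change but before the commit lands. It uses the Anthropic Python SDK (pip install anthropic) and reads ANTHROPIC_API_KEY from the environment. Drop the following at scripts/review.py: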

"""Inferential sensor: have a model review the staged diff."""
import subprocess
import sys
from pathlib import Path

from anthropic import Anthropic

PROMPT = """You are a senior Python engineer reviewing a diff before it lands.

The project's AGENTS.md is below. Read it carefully.

<agents_md>
{agents_md}
</agents_md>

Now review the following diff. Identify any issue that would violate the conventions
or "things that have bitten us before" sections of AGENTS.md, any logic bug you can
spot, and any missing test that should exist for the new code.

If you find no issues, respond with exactly: APPROVED.
If you find issues, list them concisely as numbered findings.

<diff>
{diff}
</diff>
"""

def main() -> int:
    # Review only the staged changes, i.e. exactly what the commit would contain.
    diff = subprocess.check_output(
        ["git", "diff", "--cached"], text=True, encoding="utf-8", errors="replace"
    )
    if not diff.strip():
        print("No staged changes; nothing to review.")
        return 0

    # Fall back to an empty guide so the script still runs without AGENTS.md.
    guide = Path("AGENTS.md")
    agents_md = guide.read_text(encoding="utf-8") if guide.exists() else ""

    # Anthropic() picks up ANTHROPIC_API_KEY from the environment.
    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content": PROMPT.format(agents_md=agents_md, diff=diff)}],
    )
    output = response.content[0].text.strip()
    print(output)

    # A non-zero exit code is what makes pre-commit block the commit.
    return 0 if output.startswith("APPROVED") else 1

if __name__ == "__main__":
    sys.exit(main())

Append it to the end of .pre-commit-config.yaml as one more local hook, so it runs after the deterministic sensors:

  - repo: local
    hooks:
      - id: code-review
        name: inferential code review
        entry: python scripts/review.py
        language: system
        pass_filenames: false
        stages: [pre-commit]

Now every commit gets a model-as-judge review. The reviewer reads your AGENTS.md and the staged diff, and either approves or blocks with structured findings. The agent sees the findings as a structured error message (the same shape as a Ruff or mypy error) and can iterate.

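This is the most expensive sensor in the set, so it runs last. Note that pre-commit runs every hook by default, even after one has failed. Add `fail_fast: true` at the top level of .pre-commit-config.yaml so that when Ruff catches a problem, the inferential reviewer never gets called. Cheap sensors first, expensive sensors only on what survives.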
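One weak spot in the review script is the verdict check: a prefix match on APPROVED would also pass a reply like “APPROVED, but see the issues below.” Here is a sketch of a stricter parse that could replace it. The `Review` class and `parse_review` function are hypothetical names, and the numbered-list format is only what the prompt asks for, not something the model is guaranteed to return.

```python
"""Parse the reviewer's reply into a verdict and a list of findings."""
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Review:
    approved: bool
    findings: List[str] = field(default_factory=list)


# Matches lines like "1. Missing test for foo()" or "2) Uses os.path".
_FINDING = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def parse_review(output: str) -> Review:
    """Approve only when the whole reply is 'APPROVED', with or without a trailing period."""
    text = output.strip()
    if text.rstrip(".") == "APPROVED":
        return Review(approved=True)
    findings = [m.group(1) for line in text.splitlines() if (m := _FINDING.match(line))]
    # Fall back to the raw reply so a badly formatted response still blocks with a message.
    return Review(approved=False, findings=findings or [text or "empty reviewer reply"])
```

The design choice is to fail closed: anything that is not an unambiguous approval blocks the commit and hands the agent something to read.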

Step 5: add `progress.md` as the cross-session memory

Why this is harness engineering: this is your memory primitive. It’s what lets a session that hits the context window stop cleanly and the next session pick up without re-deriving everything.

A starter progress.md template:

# Progress

## Current focus
What feature or fix the agent is working on right now.

## Decisions made (chronological)
- 2026-05-16: Switched from pandas to polars for the transform pipeline. Reason: 4x speedup on 1M-row inputs.
- 2026-05-15: Added a `--dry-run` flag to the CLI. Default off.

## Known issues / TODO
- [ ] `transform.normalize_dates()` doesn't handle ISO 8601 with timezone offsets.
- [ ] Memory usage spikes on files >500MB. Likely need streaming.

## What "done" looks like for the current task
A bullet list the agent (or you) can check off.

## Session log
Append a one-liner at the end of every session. Most recent on top.
- 2026-05-16, 14:30: Finished `--dry-run` flag. Added test. Committed.
- 2026-05-16, 13:45: Started `--dry-run` work. Identified that we need a context-manager pattern.

State in AGENTS.md that the agent must read progress.md at session start and append to it at session end. Anthropic’s quickstart uses claude-progress.txt for the same purpose; the filename is up to you.

This is the harness-engineering primitive that makes multi-session agency possible. Without it, every session starts blind. With it, the agent always knows what was decided, what’s pending, and what counts as “done.”
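If you would rather not rely on the agent formatting its log line correctly, a small helper can do it. The sketch below assumes the template above: a `## Session log` heading with the newest entry on top. `add_log_entry` and `log_session` are hypothetical names.

```python
"""Add a one-line entry to the top of the session log in progress.md."""
from datetime import datetime
from pathlib import Path

LOG_HEADING = "## Session log"  # heading used in the template above


def add_log_entry(text: str, note: str, now: datetime) -> str:
    """Return `text` with a new entry inserted as the first bullet under the log heading."""
    entry = "- " + now.strftime("%Y-%m-%d, %H:%M") + ": " + note
    lines = text.splitlines()
    if LOG_HEADING not in lines:
        # No log section yet: create one at the end of the file.
        return "\n".join(lines + ["", LOG_HEADING, entry]) + "\n"
    insert_at = lines.index(LOG_HEADING) + 1
    # Skip the section's intro prose so the entry lands just above the newest bullet.
    while insert_at < len(lines) and not lines[insert_at].startswith(("- ", "## ")):
        insert_at += 1
    lines.insert(insert_at, entry)
    return "\n".join(lines) + "\n"


def log_session(note: str, path: Path = Path("progress.md")) -> None:
    """Read progress.md (or start a new one), add the entry, and write it back."""
    current = path.read_text(encoding="utf-8") if path.exists() else "# Progress\n"
    path.write_text(add_log_entry(current, note, datetime.now()), encoding="utf-8")
```

Keeping the text transformation separate from the file I/O makes the ordering rule easy to test without touching the real progress.md.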

What you just built, in harness primitives

Five steps, each mapped to one of the five primitives from last Thursday’s piece:

aider-chat (Step 1) is the inner harness: code execution + sandbox, with git-aware file edits and test runs.
AGENTS.md (Step 2) is the feedforward guide: project-level instructions the agent reads on every session.
Pre-commit hooks (Step 3) are the deterministic sensors: lint, types, security, tests gating every commit.
scripts/review.py (Step 4) is the inferential sensor: a model-as-judge reviewer that reads the diff against your AGENTS.md.
progress.md (Step 5) is the memory primitive: cross-session state, in plain text, that the agent reads at start and writes at end.

Everything lives in your repo as plain files. None of it locks you to a vendor. All five primitives are present. Total install time: roughly thirty minutes, depending on how much customization you do on AGENTS.md.
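A quick way to confirm the pieces are in place is to audit the repo for the four files this recipe adds. This is a minimal sketch; the file names are the ones used in Steps 2 to 5, and `missing_harness_files` is a hypothetical helper name.

```python
"""List which harness files from this recipe are missing from a repo."""
from pathlib import Path
from typing import Dict

# File names as used in this article; rename to match your own repo.
HARNESS_FILES = {
    "AGENTS.md": "feedforward guide",
    ".pre-commit-config.yaml": "deterministic sensors",
    "scripts/review.py": "inferential sensor",
    "progress.md": "cross-session memory",
}


def missing_harness_files(repo: Path) -> Dict[str, str]:
    """Map each missing file to the harness role it was supposed to fill."""
    return {
        name: role
        for name, role in HARNESS_FILES.items()
        if not (repo / name).is_file()
    }
```

Running `missing_harness_files(Path("."))` from the repo root should return an empty dict once all five steps are done (the inner harness itself is a package, not a file, so it is not part of this check).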

Optional: the evaluation harness

For teams who want the next layer, install inspect-ai (the open-source eval framework from the UK AI Security Institute, formerly the AI Safety Institute):

pip install inspect-ai

inspect-ai lets you define benchmark tasks against your harness and run them at CI time. The minimal setup is a folder of “task” files (each one a small prompt + expected behaviour) that get run against your agent on every push. When you start adding sensors to your harness, you’ll want evals to confirm the sensors actually catch what they’re supposed to catch.
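The idea of checking that sensors catch what they should can be sketched without any framework. The example below does not use inspect-ai’s API; it is a plain-Python illustration of a seeded-defect check. All names are hypothetical, and `syntax_sensor` is a stand-in for a real sensor such as Ruff or the review script.

```python
"""Seeded-defect check: does a sensor flag the cases it is supposed to flag?"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    name: str
    source: str
    should_flag: bool


def syntax_sensor(source: str) -> bool:
    """Stand-in sensor: flags source that does not compile. Swap in a real one."""
    try:
        compile(source, "<case>", "exec")
    except SyntaxError:
        return True
    return False


def evaluate(sensor: Callable[[str], bool], cases: List[Case]) -> List[str]:
    """Return the names of cases where the sensor's verdict differs from the expectation."""
    return [case.name for case in cases if sensor(case.source) != case.should_flag]


CASES = [
    Case("clean function", "def add(a, b):\n    return a + b\n", should_flag=False),
    Case("unclosed paren", "print('hello'\n", should_flag=True),
    Case("logic bug", "def add(a, b):\n    return a - b\n", should_flag=True),
]
```

Here `evaluate(syntax_sensor, CASES)` reports the “logic bug” case as a miss: the code compiles, so a purely structural sensor lets it through. That gap is the reason the recipe pairs deterministic sensors with an inferential one.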

This goes beyond a 30-minute setup, but it’s the natural next step. Tomorrow’s piece covers the failure mode where teams skip it.

Tomorrow: the five ways this quietly breaks

You now have a working harness. Tomorrow’s piece looks at the five most common ways harnesses fail in practice: skipping the guide layer, no sandbox, sensors that never fire, no compaction or progress files, and treating the harness as a one-time setup. Each of these will eventually visit any harness that ships. Knowing them in advance is the cheap version.

References and Further Reading

Aider and pip install aider-chat. The fastest install for a coding agent on a Python repo.
OpenHands and the OpenHands Software Agent SDK. Heavier alternative with browser automation and a server runtime.
Pre-commit framework. The standard way to wire deterministic sensors at the commit boundary.
Ruff, mypy, Bandit, pytest. The default Python sensor toolchain.
inspect-ai. UK AISI’s open-source agent evaluation framework.
LiteLLM. The unified provider API Aider uses, so you can swap models with one flag.
Anthropic, Effective harnesses for long-running agents. The reference design that the progress.md pattern mirrors.
Böckeler, Harness engineering for coding agent users. The guides-and-sensors mental model used throughout this piece.
The Microsoft / OpenAI AGENTS.md convention and the Microsoft Skills Framework. Related conventions worth knowing.
