Every coding agent you’ve ever used (Claude Code, OpenClawd, Cursor’s agent mode, Devin, Hermes, the autonomous research assistants proliferating across Hacker News, and so on) runs on the same machinery: the Agentic Harness.
In this post, we’ll build a basic Harness from scratch to understand every moving part. By the end, you’ll have a working bare bones agent and a clear mental model of what agents like Claude Code are actually doing under the hood.
We’ll look at how to build a more sophisticated production-ready harness in a future post.
You can also find the companion notebook in the LLMs for Data Science GitHub repository.
In the early 1950s, war raged over the Korean peninsula, with American-made F-86 fighters being outmatched by the technically superior Soviet-made MiG-15. Col. John Boyd was the US Air Force fighter pilot in charge of trying to [...]. His main insight has proven useful ever since: you can make the combined US pilot + F-86 system better than the Korean pilot + MiG-15 system by improving the way US pilots react under stress.
He created a simple algorithm, the OODA loop, that lets pilots react effectively to any situation by following four steps in sequence:
Observe - Look around, gather data.
Orient - Analyze the situation and put the data into context.
Decide - Plan the next best action.
Act - Do it!
After each Action, you Observe the outcomes, and the loop begins anew.
This simple approach has proven successful in the military, law enforcement, business, and medicine: essentially, wherever critical decisions need to be made quickly and decisively.
Models and Agents
As it turns out, the OODA loop is also the fundamental piece needed to turn Large Language Models (LLMs) into agents.
LLMs are, fundamentally, text-producing machines: they take text in and spit text out. That’s their greatest power and their greatest limitation. If you haven’t been hiding under a rock for the past couple of years, you’ve noticed how useful that is for summarization, drafting, Q&A, and so on. However, an LLM on its own can’t actually do anything in the real or virtual world. It can describe how to fix a bug in your code, but it can’t open the file, edit it, or run the tests to verify its fix.
LLMs can, however, process information very quickly. What if we could have an LLM run a simulated OODA loop within a system capable of executing the action? In other words:
The LLM is the pilot, and we build the fighter jet around it.
The scaffolding around the LLM that closes the loop is known as an Agentic Harness. Like a fighter jet, the Agentic Harness takes the pilot’s commands, executes them in the real world, feeds back the results, and waits for the pilot to decide what to do next.
However, LLMs aren’t (yet?) quite as reliable as highly trained fighter pilots, so the harness should be designed as a deterministic controller that treats the LLM as a fallible suggestion engine: it asks “what should we do next?”, checks whether the answer is well-formed and safe, runs it, and feeds the result back into the next iteration.
The high-level agentic control loop inside every Agentic Harness follows the same four steps outlined above, repeatedly:
Observes via the tool results,
Orients by assembling context from memory and recent observations,
Decides by choosing the next tool call
Acts by executing it
The harness runs this loop until the goal is satisfied or the budget runs out. Everything else you might have heard of (memory systems, multi-agent orchestration, verification) is just an add-on to this basic skeleton.
A barebones agent might look something like this: the agent calls a tool, waits for the output, and decides what to do next, repeating until it reaches its goal.
Tools
An agent’s tools are the eyes, ears, and arms that the LLM can use to observe and interact with the outside world. The specific implementation details are fairly arbitrary: a tool can be a Python function, an API call, a shell script, etc. The only requirement is that it can be called directly from the scaffolding surrounding the LLM. In general, it’s common to implement tools as Python callables.
The most important part, however, is to ensure that the tool is properly documented, with a detailed description of what it does, the arguments it expects, and the outputs it produces. This is what will allow the agent to know when to call it.
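To make this concrete, here is a minimal sketch of what the two tools that appear in the trace at the end of this post, `lookup` and `calculator`, might look like as plain documented functions. The contents of `_FACTS` and the AST-based evaluator are assumptions made for illustration; the post only states that the lookup tool is a simple dictionary lookup.

```python
import ast
import operator

# Hypothetical knowledge base; the keys and values are illustrative.
_FACTS = {
    "capital of france": "Paris",
    "paris population": "2.1 million",
}


def lookup(query: str) -> str:
    """Return a stored fact for `query`, or a 'no result' message."""
    return _FACTS.get(query.strip().lower(), f"(no result for {query!r})")


# Whitelist of arithmetic operators the calculator is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression such as '(2.1 / 68) * 100'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    # Parse to an AST instead of calling eval(), so arbitrary code cannot run.
    return _eval(ast.parse(expression, mode="eval").body)


print(lookup("capital of France"))
print(calculator("(2.1 / 68) * 100"))
```

The docstrings matter as much as the bodies: they are the raw material for the description the agent will eventually read.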
We wrap our tool implementations in a dataclass that contains all the necessary information. The actual function being called is stored in the fn field:
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str
    params: dict[str, type]   # argument name -> expected Python type
    fn: Callable[..., Any]    # the function that actually does the work

    def schema_str(self) -> str:
        """One-line signature and description of the tool."""
        args = ", ".join(f"{k}: {v.__name__}" for k, v in self.params.items())
        return f"{self.name}({args}) — {self.description}"

    def validate(self, args: dict) -> str | None:
        """Return an error message if args are invalid, or None if they are fine."""
        missing = set(self.params) - set(args)
        if missing:
            return f"Missing args: {sorted(missing)}"
        for k, v in args.items():
            if k not in self.params:
                return f"Unknown arg: {k}"
            if not isinstance(v, self.params[k]):
                return f"Arg {k!r} expected {self.params[k].__name__}"
        return None

Prompts
Now that we have defined our tools, we must tell the agent about them. We do this through the system prompt. The system prompt is the very first thing the LLM sees when it begins a new session, and it shapes the model’s behavior throughout all interactions with the user. It provides the LLM with detailed instructions on how it should behave, what tools are available to it, and how to structure its output. If you’re curious about what the system prompts of state-of-the-art models look like, you can check them out here.
For the purposes of this post, a more straightforward version will do nicely.
SYSTEM_PROMPT = """You are a helpful agent that solves problems by calling tools.
You MUST respond with a single JSON object in one of these two shapes:
1. Tool call:
{"thought": "...", "action": {"type": "tool_call", "name": "<tool>", "args": {...}}}
2. Final answer:
{"thought": "...", "action": {"type": "final", "answer": "..."}}
Available tools:
<<TOOL_LIST>>
Rules:
- Output JSON only, no markdown, no backticks, no prose.
- Use a tool when you need external information or computation.
- Produce a final answer only when you have enough evidence.
"""

In this prompt, we illustrate the two types of outputs we expect, a tool_call and a final answer, and instruct the model to use the tools it has available to gather the necessary information before generating the final answer. We also request a structured JSON output, as that will make it easier to verify and validate the tool call requests.
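The `<<TOOL_LIST>>` placeholder has to be filled in before the prompt is sent. The agent loop later in this post calls `build_system_prompt()` for this, but its body is not shown, so the following is only one plausible sketch. The `TOOLS` registry contents, the abbreviated template, the placeholder lambdas, and the `" - "` separator are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:  # trimmed copy of the Tool dataclass defined earlier
    name: str
    description: str
    params: dict[str, type]
    fn: Callable[..., Any]

    def schema_str(self) -> str:
        args = ", ".join(f"{k}: {v.__name__}" for k, v in self.params.items())
        return f"{self.name}({args}) - {self.description}"


# Abbreviated stand-in for the full SYSTEM_PROMPT shown above.
SYSTEM_PROMPT_TEMPLATE = """You are a helpful agent that solves problems by calling tools.
Available tools:
<<TOOL_LIST>>
"""

# Hypothetical registry: tool name -> Tool. The lambdas are placeholders.
TOOLS = {
    "lookup": Tool("lookup", "Look up a fact by query.",
                   {"query": str}, lambda query: "Paris"),
    "calculator": Tool("calculator", "Evaluate an arithmetic expression.",
                       {"expression": str}, lambda expression: 0.0),
}


def build_system_prompt(template: str = SYSTEM_PROMPT_TEMPLATE) -> str:
    """Fill the <<TOOL_LIST>> placeholder with one line per registered tool."""
    tool_lines = "\n".join(f"- {t.schema_str()}" for t in TOOLS.values())
    # str.replace leaves the literal JSON braces in the prompt untouched,
    # whereas str.format would try to interpret them as fields.
    return template.replace("<<TOOL_LIST>>", tool_lines)


print(build_system_prompt())
```

Using a sentinel like `<<TOOL_LIST>>` rather than `{}` formatting is convenient here because the prompt itself contains JSON examples full of curly braces.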
Validating Tool Calls
Despite significant effort by all major model providers to align models, LLMs are still not 100% reliable. Models have been known to hallucinate, fake their alignment, and even attempt to “jailbreak” themselves. In other words, it’s always best to:
Never take the LLM at its word.
So, one of the most important steps in developing an agentic loop is to verify that the output of the model is well formed and follows the restrictions and conventions we want it to follow. We implement this with a simple function that performs some straightforward checks on the output to make sure that it can be trusted before we attempt to call whatever tool the model has decided to use:
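import json
from dataclasses import dataclass

# Minimal result container, inferred from how it is used below.
@dataclass
class ValidationResult:
    ok: bool
    parsed: dict | None = None
    error: str | None = None

# TOOLS is the module-level registry mapping tool names to Tool objects.
def validate_proposal(raw_text: str) -> ValidationResult:
    # The output must be parseable JSON.
    try:
        data = json.loads(raw_text.strip())
    except json.JSONDecodeError as e:
        return ValidationResult(ok=False, error=f"Malformed JSON: {e}")
    if not isinstance(data, dict) or "action" not in data:
        return ValidationResult(ok=False, error="Missing 'action' field.")
    action = data["action"]
    # Guard against a non-object action, which would break the .get() calls below.
    if not isinstance(action, dict):
        return ValidationResult(ok=False, error="'action' must be an object.")
    if action.get("type") == "final":
        if not action.get("answer"):
            return ValidationResult(ok=False, error="Final missing 'answer'.")
        return ValidationResult(ok=True, parsed=data)
    if action.get("type") == "tool_call":
        name = action.get("name")
        if not isinstance(name, str) or name not in TOOLS:
            return ValidationResult(ok=False,
                                    error=f"Unknown tool {name!r}. Known: {sorted(TOOLS)}")
        args = action.get("args", {})
        if not isinstance(args, dict):
            return ValidationResult(ok=False, error="'args' must be an object.")
        # Delegate the argument checks to the tool itself.
        tool_err = TOOLS[name].validate(args)
        if tool_err:
            return ValidationResult(ok=False, error=tool_err)
        return ValidationResult(ok=True, parsed=data)
    return ValidationResult(ok=False, error="Unknown action type")

Internal State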
Even a toy agent like the one we’re building needs to keep track of its own internal state so that it knows what its goal is (what it’s trying to achieve), a budget of how many iterations, tool calls, etc., it can use, and a memory that keeps track of what it knows about the world. For traceability purposes, we also include a trace of all the steps it took. For convenience, these are implemented as dataclasses:
from dataclasses import dataclass, field
from typing import Any

# Step and Budget are defined first because AgentState refers to them.
@dataclass
class Step:
    index: int
    thought: str
    action: dict
    observation: Any
    latency_ms: int

@dataclass
class Budget:
    max_steps: int = 10
    max_tool_calls: int = 15
    steps_used: int = 0
    tool_calls_used: int = 0

    def has_room(self) -> bool:
        """True while both the step and tool-call limits have headroom."""
        return (self.steps_used < self.max_steps
                and self.tool_calls_used < self.max_tool_calls)

@dataclass
class AgentState:
    goal: str
    trace: list[Step] = field(default_factory=list)
    memory: dict[str, Any] = field(default_factory=dict)
    budget: Budget = field(default_factory=Budget)
    status: str = "running"
    final_answer: str | None = None
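To see how the budget bounds a run, here is a small self-contained sketch that repeats the Budget dataclass and drains it in a loop. The tight limit of three steps is an arbitrary value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Budget:  # same fields and logic as the Budget dataclass above
    max_steps: int = 10
    max_tool_calls: int = 15
    steps_used: int = 0
    tool_calls_used: int = 0

    def has_room(self) -> bool:
        return (self.steps_used < self.max_steps
                and self.tool_calls_used < self.max_tool_calls)


budget = Budget(max_steps=3)  # deliberately tight, for illustration only
iterations = 0
while budget.has_room():
    budget.steps_used += 1    # one step is charged per loop iteration
    iterations += 1

print(iterations, budget.has_room())  # 3 False
```

Because both counters live in one object, the loop condition stays a single `has_room()` call no matter how many limits are added later.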
The Agentic Loop
Finally, all the pieces are in place and we can build our Fighter Jet, the Agentic Loop at the heart of our Harness. Designing a production-ready harness requires careful thinking about the specific application the agent will be working in, what actions it should be allowed to perform (and which it shouldn’t), and how best to design the custom set of tools needed to make it effective.
The agentic loop design we choose for our implementation is a slight elaboration of the naive loop sketched earlier, with validation, logging, and budget constraints added. In this version, we:
Check the budget; if exhausted, Stop
Build the prompt from current state
Call the LLM
Validate the proposed action
If invalid: append an error observation to the trace so the LLM can see and recover on the next turn
If valid: execute the tool or set the final answer
Update state and loop
The Python code for a complete (yet basic) functional implementation of a Harness that is able to invoke tools, recover from errors, generate traces, etc. is fairly concise and straightforward:
import time

def run_agent(goal: str, llm_provider, budget: Budget | None = None) -> AgentState:
    state = AgentState(goal=goal, budget=budget or Budget())
    system = build_system_prompt()
    consecutive_errors = 0

    while state.status == "running" and state.budget.has_room():
        state.budget.steps_used += 1

        # Build the prompt from the current state and call the LLM.
        user = build_user_prompt(state)
        t0 = time.time()
        raw = llm_provider.complete(system=system, user=user)
        latency = int((time.time() - t0) * 1000)

        # Validate the proposal before acting on it.
        result = validate_proposal(raw)
        if not result.ok:
            # Record the error so the LLM can see it and recover on the next turn.
            consecutive_errors += 1
            state.trace.append(Step(
                index=state.budget.steps_used,
                thought="(parser)",
                action={"type": "invalid", "raw": raw[:200]},
                observation=f"VALIDATION_ERROR: {result.error}",
                latency_ms=latency,
            ))
            if consecutive_errors >= 3:
                state.status = "failed"
            continue
        consecutive_errors = 0

        proposal = result.parsed
        action = proposal["action"]
        if action["type"] == "final":
            state.status = "done"
            state.final_answer = action["answer"]
            obs = "(final)"  # without this, obs would be undefined or stale below
        else:
            tool = TOOLS[action["name"]]
            try:
                obs = tool.fn(**action.get("args", {}))
                state.budget.tool_calls_used += 1
            except Exception as e:
                # Tool failures become observations rather than crashes.
                obs = f"TOOL_ERROR: {type(e).__name__}: {e}"

        state.trace.append(Step(
            index=state.budget.steps_used,
            thought=proposal.get("thought", ""),
            action=action,
            observation=obs,
            latency_ms=latency,
        ))

    # Leaving the loop while still "running" means the budget ran out.
    if state.status == "running":
        state.status = "budget_exhausted"
    return state

This is a fairly generic implementation where the hard work is done by the various pieces we already described. In a future post we’ll see how we can significantly improve on this design to build a production-ready harness capable of real-world work.
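The loop above relies on two pieces whose bodies are not shown in this post: `build_user_prompt` and the `llm_provider` object. The sketch below is one hypothetical way to fill them in, not the actual implementation. The prompt layout, the `ScriptedLLM` class, and the `SimpleNamespace` stand-in for AgentState are all assumptions; only the `complete(system=..., user=...)` call shape comes from the loop itself.

```python
import json
from types import SimpleNamespace


def build_user_prompt(state) -> str:
    """Render the goal and every prior step so the model sees its own history."""
    lines = [f"Goal: {state.goal}", "", "History:"]
    if not state.trace:
        lines.append("(no steps yet)")
    for step in state.trace:
        lines.append(f"Step {step.index}")
        lines.append(f"  action: {json.dumps(step.action)}")
        lines.append(f"  observation: {step.observation}")
    lines.append("")
    lines.append("Respond with the next JSON action.")
    return "\n".join(lines)


class ScriptedLLM:
    """Test double that replays canned responses instead of calling a model."""

    def __init__(self, responses: list[str]):
        self._responses = iter(responses)

    def complete(self, system: str, user: str) -> str:
        return next(self._responses)


# Stand-in for AgentState with one recorded step, purely for illustration.
state = SimpleNamespace(
    goal="Find the capital of France",
    trace=[SimpleNamespace(
        index=1,
        action={"type": "tool_call", "name": "lookup",
                "args": {"query": "capital of France"}},
        observation="Paris",
    )],
)

prompt = build_user_prompt(state)
llm = ScriptedLLM(['{"thought": "done", "action": {"type": "final", "answer": "Paris"}}'])
reply = llm.complete(system="(system prompt)", user=prompt)
print(prompt)
print(reply)
```

A scripted provider like this is handy for exercising the harness deterministically, including the validation-error path, before spending tokens on a real model.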
In the meantime, let’s try it out and ask our newborn agent to achieve a simple goal:
Find the capital of France, look up its population, and compute what percentage that is of France’s total population of 68 million.
The agent works through a few steps before providing the correct answer:
The capital of France is Paris, which has a population of 2.1 million. This represents approximately 3.09% of France’s total population of 68 million.
The trace it produces is also informative:
--- Step 1 (3896 ms) ---
thought: I need to find the capital of France first. I'll use the lookup tool to get this information.
action: {
"type": "tool_call",
"name": "lookup",
"args": {
"query": "capital of France"
}
}
obs: Paris
--- Step 2 (2026 ms) ---
thought: Now I know the capital is Paris. Next, I need to look up the population of Paris.
action: {
"type": "tool_call",
"name": "lookup",
"args": {
"query": "population of Paris"
}
}
obs: (no result for 'population of Paris')
--- Step 3 (2638 ms) ---
thought: The lookup for 'population of Paris' didn't return a result. Let me try a more specific query about Paris's population with the country context.
action: {
"type": "tool_call",
"name": "lookup",
"args": {
"query": "Paris France population"
}
}
obs: (no result for 'Paris France population')
--- Step 4 (2591 ms) ---
thought: The lookup tool hasn't been able to find Paris's population with different queries. Let me try a different approach and look up 'Paris population 2024' or 'Paris city population' to get more specific results.
action: {
"type": "tool_call",
"name": "lookup",
"args": {
"query": "Paris city population"
}
}
obs: (no result for 'Paris city population')
--- Step 5 (2352 ms) ---
thought: The lookup tool hasn't been successful with multiple queries about Paris's population. Let me try one more approach with a simpler, more direct query format.
action: {
"type": "tool_call",
"name": "lookup",
"args": {
"query": "Paris population"
}
}
obs: 2.1 million
--- Step 6 (2713 ms) ---
thought: Great! I now have all the information I need: Paris has a population of 2.1 million, and France's total population is 68 million. I need to calculate what percentage 2.1 million is of 68 million using the calculator tool.
action: {
"type": "tool_call",
"name": "calculator",
"args": {
"expression": "(2.1 / 68) * 100"
}
}
obs: 3.088235294117647
--- Step 7 (2605 ms) ---
thought: I have successfully found that Paris has a population of 2.1 million and calculated that this represents approximately 3.09% of France's total population of 68 million. I can now provide the final answer.
action: {
"type": "final",
"answer": "The capital of France is Paris, which has a population of 2.1 million. This represents approximately 3.09% of France's total population of 68 million."
}
obs: (final)

We purposefully made the agent’s life harder than it needed to be by making the lookup tool a simple dictionary lookup, so that the agent has to struggle through multiple attempts at getting the information it needs, properly handling “errors” along the way.
We hope you enjoyed the very first LLMs post on the Data for Science Substack and look forward to hearing your thoughts.
You can find all the code for the analysis in this post in our brand new companion GitHub Repository https://github.com/DataForScience/LLMs
And, of course, don’t forget to share this post with others who might be interested, and encourage them to subscribe so that they have access to the entire backlog of posts and are the first to know when a new article is posted.






