OpenClaw Skill Architecture
From token bloat to engineered pipelines
The current state of OpenClaw skill execution is functional but expensive. The path forward is structured, checkpointed, and dramatically more efficient.
Overview
OpenClaw is an AI agent framework built on top of Claude. Skills are the unit of work: discrete, reusable task definitions that tell the agent what to do and how. In theory, a well-written skill should be reliable, resumable, and cheap to run. In practice, most skills today are none of those things.
The problem isn't the model. It's the architecture around it. Skills are written as monolithic natural-language prompts, loaded whole into context on every turn, and executed in a single long conversation that accumulates everything it ever did. By rough estimate, 90% or more of token spend under this execution model is overhead, not output.
The Problem
When you ask OpenClaw to build a skill, it produces a single large SKILL.md file, a dense natural-language prompt describing every step of the task in as much detail as the model thinks it needs to execute reliably. That file isn't read once. It's loaded into context on every single turn of the conversation.
Anatomy of Overhead
The SKILL.md file doesn't travel alone. OpenClaw ships with a collection of system-level markdown files (TOOL.md, AGENT.md, SOUL.md, and others) that define the agent's capabilities, personality, constraints, and tool access. Every one of those files rides along in context on every turn too.
Fixed cost per turn
- System files (TOOL.md, AGENT.md, SOUL.md, ...)
- Full SKILL.md loaded on every turn
- Accumulated tool call results
- Growing conversation history
What actually needs context
- The immediate task at hand
- Input for this specific AI call
- Relevant checkpoint state
- Nothing else
The Compounding Cost
The chat history accumulates. Every tool call the model makes, every result it gets back, all appended to context. A skill that runs for ten turns isn't just paying the overhead cost ten times. It's paying a growing tax as the conversation gets longer. By the later turns, the model is wading through its own prior output to find the instructions it's supposed to be following.
Left unchecked, this eventually hits a hard wall: the context window fills entirely. The run dies, and depending on what was in flight, recovery ranges from annoying to impossible. The natural response is to downgrade to a cheaper model to control costs, but cheaper models have less capacity to hold long contexts together. You pay less per token while getting worse instruction-following on a longer context. Things break in subtle ways.
Sustainable Architecture
The fix isn't to stop using AI. It's to use AI only for the parts that actually require it, and keep everything else out of the context window entirely.
Most of a well-designed skill is deterministic work: fetching data, parsing files, filtering records, writing outputs, logging progress. Python handles all of that without burning a single inference token. AI gets called in only where genuine non-determinism is needed: classifying ambiguous content, synthesizing summaries, making judgment calls that can't be reduced to a rule.
| Task Type | Handled By | Rationale |
|---|---|---|
| Filtering, transforming, counting, joining | Python | Mechanical work, deterministic, zero token cost |
| Reading, writing, sorting, deduplicating | Python | No judgment required; rules suffice |
| Ambiguous data, classification | Model | Cannot be reduced to a rule; requires interpretation |
| Synthesis, natural language judgment | Model | Structured output, scoped prompt, bounded answer |
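The division of labor in the table can be made concrete. Below is a sketch of a purely mechanical step, filtering and deduplicating records in plain Python at zero token cost. The field names (`id`, `body`) are invented for illustration:

```python
import json

def filter_and_dedup(in_path: str, out_path: str) -> int:
    """Mechanical work: rule-based filtering and deduplication, no model call."""
    with open(in_path) as f:
        records = json.load(f)

    seen = set()
    kept = []
    for rec in records:
        # Drop records missing required fields -- a rule, not a judgment call.
        if not rec.get("id") or not rec.get("body"):
            continue
        # Deduplicate on id.
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        kept.append(rec)

    with open(out_path, "w") as f:
        json.dump(kept, f)
    return len(kept)
```

Everything in this function is deterministic: the same input always produces the same output, which is exactly why it belongs in code rather than in a prompt.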
A skill is broken into discrete phases. Each phase has a clear input, a clear output, and writes a checkpoint when it completes. If the skill fails at phase 4 of 7, the next run picks up at phase 4. Nothing reruns. Nothing is lost. The skill file itself becomes a lightweight orchestration spec, not a monolithic prompt. Everything else is Python.
When the model does get called, it receives a minimal, scoped prompt containing only what that specific call needs, not the full dataset, not the full skill file, not the accumulated session history. Each AI call is essentially a fresh, short conversation with a well-defined question and a bounded answer.
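A scoped AI call of this kind might look like the following sketch. `call_model` is a stand-in for whatever inference client the skill actually uses, and the category scheme is hypothetical; the point is that the prompt carries only this one record and demands a bounded answer:

```python
def build_classify_prompt(record: dict, categories: list[str]) -> str:
    """Build a minimal, scoped prompt: only this record and the allowed labels.
    No skill file, no full dataset, no accumulated conversation history."""
    return (
        "Classify the following record into exactly one category.\n"
        f"Categories: {', '.join(categories)}\n"
        f"Record: {record['body']}\n"
        "Answer with the category name only."
    )

def classify(record: dict, categories: list[str], call_model) -> str:
    # call_model is a placeholder for the skill's inference client; each call
    # is effectively a fresh, short conversation with a well-defined question.
    answer = call_model(build_classify_prompt(record, categories)).strip()
    if answer not in categories:
        raise ValueError(f"Model returned unknown category: {answer!r}")
    return answer
```

Because the prompt is short and the answer space is closed, instruction-following stays reliable even on a smaller model.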
What OpenClaw Needs to Understand
OpenClaw is capable of writing skills in this architecture. The gap is conceptual. There are five things it needs to internalize to do it right:
- A skill is a pipeline, not a prompt. Each phase is a function or script with defined inputs and outputs, not a paragraph of natural language instructions.
- Checkpointing is a first-class concern. Every phase writes state to a checkpoint file before it exits. The entrypoint checks for existing checkpoints before doing any work.
- Logging is structured. Progress, errors, and skip reasons all go to a log with timestamps and phase labels: not print statements, not status text in a response, but actual log files that can be inspected after the fact.
- AI calls are scoped and bounded. When the skill needs the model, it constructs a minimal prompt with only the relevant context for that specific call. Each call should be short enough that instruction-following is reliable even on a smaller model.
- Resume is the default behavior. Running a skill twice should be idempotent if the first run completed. Running it after a failure should continue from the last successful checkpoint, not restart.
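Taken together, the first, second, and fifth principles suggest an entrypoint shaped roughly like this. The phase names and checkpoint fields mirror the conventions described here, but the code is an illustrative sketch, not OpenClaw's actual implementation:

```python
import json
from pathlib import Path

PHASES = ["phase_01_fetch", "phase_02_parse", "phase_03_classify"]

def load_checkpoint(path: Path) -> dict:
    """Return the existing checkpoint, or an empty one on a fresh run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"phases": {}}

def run_skill(phase_impls: dict, checkpoint_path: Path) -> None:
    """Consult the checkpoint before doing any work, skip completed phases,
    and resume a failed phase from its recorded offset."""
    ckpt = load_checkpoint(checkpoint_path)
    for name in PHASES:
        state = ckpt["phases"].get(name, {})
        if state.get("status") == "complete":
            continue  # idempotent: completed phases are never rerun
        start = state.get("resume_from_record", 0)
        phase_impls[name](start)  # each impl writes its own checkpoint entry
```

Running this twice after a clean first run does nothing; running it after a mid-phase failure resumes exactly where the failure occurred.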
The Structure, Concretely
A well-architected skill looks like this on disk:
skills/
└── my-skill/
├── SKILL.md # Lightweight orchestration spec only
├── scripts/
│ ├── phase_01_fetch.py
│ ├── phase_02_parse.py
│ ├── phase_03_classify.py # AI-assisted phase
│ ├── phase_04_enrich.py # AI-assisted phase
│ └── phase_05_output.py
├── checkpoint.json # Written after each phase completes
└── run.log # Structured log with timestamps
Checkpoint File
The checkpoint file tracks what has been completed and when. Failed phases record their partial progress so the next run can resume mid-phase without reprocessing completed records.
{
"skill": "my-skill",
"started_at": "2025-03-18T09:14:32Z",
"last_updated": "2025-03-18T09:22:17Z",
"phases": {
"phase_01_fetch": {
"status": "complete",
"completed_at": "2025-03-18T09:16:45Z",
"records_processed": 1842,
"output": "scripts/data/fetched.json"
},
"phase_02_parse": {
"status": "complete",
"completed_at": "2025-03-18T09:19:03Z",
"records_processed": 1842,
"output": "scripts/data/parsed.json"
},
"phase_03_classify": {
"status": "failed",
"failed_at": "2025-03-18T09:22:17Z",
"records_processed": 214,
"error": "API timeout on batch 3",
"resume_from_record": 214,
"output": "scripts/data/classified_partial.json"
}
}
}
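Helpers for maintaining a checkpoint in this shape might look like the following sketch. The field names match the example above; the functions themselves are illustrative, not part of any OpenClaw API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def _now() -> str:
    """ISO 8601 UTC timestamp, matching the checkpoint format."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def update_checkpoint(path: Path, phase: str, entry: dict) -> None:
    """Merge one phase's state into checkpoint.json and bump last_updated."""
    ckpt = json.loads(path.read_text()) if path.exists() else {"phases": {}}
    ckpt["phases"][phase] = entry
    ckpt["last_updated"] = _now()
    path.write_text(json.dumps(ckpt, indent=2))

def mark_complete(path: Path, phase: str, records: int, output: str) -> None:
    update_checkpoint(path, phase, {
        "status": "complete", "completed_at": _now(),
        "records_processed": records, "output": output,
    })

def mark_failed(path: Path, phase: str, records: int,
                error: str, output: str) -> None:
    # A failed phase records how far it got so the next run resumes mid-phase.
    update_checkpoint(path, phase, {
        "status": "failed", "failed_at": _now(),
        "records_processed": records, "error": error,
        "resume_from_record": records, "output": output,
    })
```

The key design choice is that the checkpoint is rewritten atomically per phase transition, so it is always a complete, parseable picture of the run.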
Log File
The log is append-only and human-readable. On the next run, the entrypoint reads the checkpoint, skips phases 1 and 2 entirely, and resumes phase 3 at record 400. No guessing, no restarting from scratch.
2025-03-18T09:14:32Z [INFO] [my-skill] Run started
2025-03-18T09:14:33Z [INFO] [phase_01_fetch] Checkpoint not found, starting fresh
2025-03-18T09:16:45Z [INFO] [phase_01_fetch] Complete - 1842 records fetched
2025-03-18T09:16:45Z [INFO] [phase_02_parse] Checkpoint not found, starting fresh
2025-03-18T09:19:03Z [INFO] [phase_02_parse] Complete - 1842 records parsed
2025-03-18T09:19:03Z [INFO] [phase_03_classify] Checkpoint not found, starting fresh
2025-03-18T09:19:03Z [INFO] [phase_03_classify] Calling model - batch 1 of ~9 (200 records)
2025-03-18T09:20:11Z [INFO] [phase_03_classify] Batch 1 complete
2025-03-18T09:20:11Z [INFO] [phase_03_classify] Calling model - batch 2 of ~9 (200 records)
2025-03-18T09:21:14Z [INFO] [phase_03_classify] Batch 2 complete
2025-03-18T09:21:14Z [INFO] [phase_03_classify] Calling model - batch 3 of ~9 (200 records)
2025-03-18T09:22:17Z [ERROR] [phase_03_classify] API timeout - writing partial checkpoint
2025-03-18T09:22:17Z [INFO] [my-skill] Run halted at phase_03_classify, record 400
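A log in this format can be produced with Python's standard logging module. The sketch below assumes the conventions shown above (UTC ISO 8601 timestamp, level, phase label) and writes to the file only, never the console:

```python
import logging
import time

def make_logger(phase: str, log_path: str = "run.log") -> logging.Logger:
    """File-only logger emitting 'TIMESTAMP [LEVEL] [phase] message' in UTC."""
    logger = logging.getLogger(phase)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # never leak to the root logger / console
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path)  # opens in append mode
        fmt = logging.Formatter(
            f"%(asctime)s [%(levelname)s] [{phase}] %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
        )
        fmt.converter = time.gmtime  # format timestamps in UTC
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

A phase script would call `make_logger("phase_01_fetch")` once and log through it for its whole lifetime; because the handler appends, successive runs extend the same file.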
SKILL.md Template
The SKILL.md file for this architecture doesn't describe how to fetch or parse data. It describes the phases, their order, their inputs and outputs, which phases invoke the model and why, and the checkpointing and logging conventions. It stays short enough that it doesn't meaningfully inflate context on every turn.
# SKILL: [skill-name]
## Purpose
One or two sentences. What this skill does and what it produces.
## Structure
- `scripts/` - Python scripts, one per phase
- `checkpoint.json` - Phase completion state and resume pointers
- `run.log` - Append-only structured log
## Phases
### phase_01_[name]
- **Input:** [source - API, file, prior phase output]
- **Output:** `scripts/data/[filename].json`
- **Notes:** [any relevant detail]
### phase_02_[name]
- **Input:** `scripts/data/[filename].json`
- **Output:** `scripts/data/[filename].json`
- **Notes:** [any relevant detail]
## Checkpointing Rules
- Each phase writes to `checkpoint.json` on completion with ISO 8601 timestamp
- Failed phases write partial progress including last successfully processed record index
- Entrypoint reads checkpoint before executing any phase and skips completed phases
- Re-running a completed skill is a no-op unless checkpoint is manually cleared
## Logging Rules
- All log entries: `TIMESTAMP [LEVEL] [phase-name] message`
- Timestamps are ISO 8601 UTC
- Levels: INFO, WARN, ERROR
- Log to `run.log` only - no print statements, no console output
## AI Call Guidelines
- Use AI when the task cannot be reduced to a rule
- Use Python when the task is mechanical
- Prompts contain only what the model needs for that specific call
- Parse and validate model output before writing to disk
- On malformed response: log error, write partial checkpoint, halt gracefully
## Resume Behavior
- Default behavior: check checkpoint first on any invocation
- Completed phases are skipped entirely
- Failed phases resume from `resume_from_record` index if present
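The "parse and validate model output before writing to disk" guideline amounts to a small gate between the model and the filesystem. The sketch below assumes the model was asked to reply with a JSON object; the helper and its error behavior are illustrative:

```python
import json

def parse_model_json(raw: str, required: tuple[str, ...]) -> dict:
    """Parse and validate a model response expected to be a JSON object.
    Raises ValueError on malformed output so the caller can log the error,
    write a partial checkpoint, and halt gracefully instead of persisting junk."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model response is not valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("Model response is not a JSON object")
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"Model response missing fields: {missing}")
    return data
```

The caller catches `ValueError`, writes the failed-phase checkpoint entry, and exits; nothing unvalidated ever reaches the phase's output file.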
Why It Matters
The token savings are the obvious headline but they're a means, not the end. What this architecture actually buys is reliability at scale, and the ability to run capable models without wincing at the bill.
What the architecture enables
- Overnight runs on large datasets without babysitting
- Failures debugged in minutes from structured logs
- Stronger models on hard calls, cheaper models on easy ones
- No context window ceiling; heavy work stays in Python
- Hand the log to OpenClaw to diagnose a failure directly
What the old approach costs
- 90%+ of token spend is overhead, not output
- Instruction drift as context grows
- Hard context ceiling kills long runs
- No structured recovery path after failure
The structured logs and checkpoint files don't just help the human operator. When something breaks, you can hand OpenClaw the log file and the checkpoint and ask it to diagnose the failure. It has everything it needs in a compact, readable format, with no need to reconstruct what happened from a long conversation history. That feedback loop (run to failure to diagnosis to fix) becomes fast enough to actually iterate on.
The system files and agent configuration are going to be in context no matter what. That's the cost of running OpenClaw. The skill file on top of that should be as lean as possible, and the work the skill orchestrates should happen mostly outside the context window, in code, with the model consulted surgically. That's the difference between a prototype that works in demos and infrastructure that runs in production.