OpenClaw Skill Architecture
From token bloat to engineered pipelines
The current state of OpenClaw skill execution is functional but expensive. The path forward is structured, checkpointed, and dramatically more efficient.
Overview
OpenClaw is an AI agent framework built on top of Claude. Skills are the unit of work: discrete, reusable task definitions that tell the agent what to do and how. In theory, a well-written skill should be reliable, resumable, and cheap to run. In practice, most skills today are none of those things.
The problem isn't the model. It's the architecture around it. Skills are written as monolithic natural-language prompts, loaded whole into context on every turn, and executed in a single long conversation that accumulates everything it ever did. By rough estimate, 90% or more of token spend under this execution model is overhead, not output.
The Problem
When you ask OpenClaw to build a skill, it produces a single large SKILL.md file, a dense natural-language prompt describing every step of the task in as much detail as the model thinks it needs to execute reliably. That file isn't read once. It's loaded into context on every single turn of the conversation.
Anatomy of Overhead
The SKILL.md file doesn't travel alone. OpenClaw ships with a collection of system-level markdown files (TOOL.md, AGENT.md, SOUL.md, and others) that define the agent's capabilities, personality, constraints, and tool access. Every one of those files rides along in context on every turn too.
Fixed cost per turn
- System files (TOOL.md, AGENT.md, SOUL.md, ...)
- Full SKILL.md loaded on every turn
- Accumulated tool call results
- Growing conversation history
What actually needs context
- The immediate task at hand
- Input for this specific AI call
- Relevant checkpoint state
- Nothing else
The Compounding Cost
The chat history accumulates. Every tool call the model makes, every result it gets back, all appended to context. A skill that runs for ten turns isn't just paying the overhead cost ten times. It's paying a growing tax as the conversation gets longer. By the later turns, the model is wading through its own prior output to find the instructions it's supposed to be following.
Left unchecked, this eventually hits a hard wall: the context window fills entirely. The run dies, and depending on what was in flight, recovery ranges from annoying to impossible. The natural response is to downgrade to a cheaper model to control costs, but cheaper models have less capacity to hold long contexts together. You pay less per token while getting worse instruction-following on a longer context. Things break in subtle ways.
Sustainable Architecture
The fix isn't to stop using AI. It's to use AI only for the parts that actually require it, and keep everything else out of the context window entirely.
Most of a well-designed skill is deterministic work: fetching data, parsing files, filtering records, writing outputs, logging progress. Python handles all of that without burning a single inference token. AI gets called in only where genuine non-determinism is needed: classifying ambiguous content, synthesizing summaries, making judgment calls that can't be reduced to a rule.
| Task Type | Handled By | Rationale |
|---|---|---|
| Filtering, transforming, counting, joining | Python | Mechanical work, deterministic, zero token cost |
| Reading, writing, sorting, deduplicating | Python | No judgment required; rules suffice |
| Ambiguous data, classification | Model | Cannot be reduced to a rule; requires interpretation |
| Synthesis, natural language judgment | Model | Structured output, scoped prompt, bounded answer |
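The division of labor in the table can be made concrete. Below is a sketch of a purely mechanical step, filtering and deduplicating records in plain Python at zero token cost. The field names (`id`, `body`) are invented for illustration:

```python
import json

def filter_and_dedup(in_path: str, out_path: str) -> int:
    """Mechanical work: rule-based filtering and deduplication, no model call."""
    with open(in_path) as f:
        records = json.load(f)

    seen = set()
    kept = []
    for rec in records:
        # Drop records missing required fields -- a rule, not a judgment call.
        if not rec.get("id") or not rec.get("body"):
            continue
        # Deduplicate on id.
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        kept.append(rec)

    with open(out_path, "w") as f:
        json.dump(kept, f)
    return len(kept)
```

Everything in this function is deterministic: the same input always produces the same output, which is exactly why it belongs in code rather than in a prompt.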
A skill is broken into discrete phases. Each phase has a clear input, a clear output, and writes a checkpoint when it completes. If the skill fails at phase 4 of 7, the next run picks up at phase 4. Nothing reruns. Nothing is lost. The skill file itself becomes a lightweight orchestration spec, not a monolithic prompt. Everything else is Python.
When the model does get called, it receives a minimal, scoped prompt containing only what that specific call needs, not the full dataset, not the full skill file, not the accumulated session history. Each AI call is essentially a fresh, short conversation with a well-defined question and a bounded answer.
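A scoped AI call of this kind might look like the following sketch. `call_model` is a stand-in for whatever inference client the skill actually uses, and the category scheme is hypothetical; the point is that the prompt carries only this one record and demands a bounded answer:

```python
def build_classify_prompt(record: dict, categories: list[str]) -> str:
    """Build a minimal, scoped prompt: only this record and the allowed labels.
    No skill file, no full dataset, no accumulated conversation history."""
    return (
        "Classify the following record into exactly one category.\n"
        f"Categories: {', '.join(categories)}\n"
        f"Record: {record['body']}\n"
        "Answer with the category name only."
    )

def classify(record: dict, categories: list[str], call_model) -> str:
    # call_model is a placeholder for the skill's inference client; each call
    # is effectively a fresh, short conversation with a well-defined question.
    answer = call_model(build_classify_prompt(record, categories)).strip()
    if answer not in categories:
        raise ValueError(f"Model returned unknown category: {answer!r}")
    return answer
```

Because the prompt is short and the answer space is closed, instruction-following stays reliable even on a smaller model.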
What OpenClaw Needs to Understand
OpenClaw is capable of writing skills in this architecture. The gap is conceptual. There are five things it needs to internalize to do it right:
- A skill is a pipeline, not a prompt. Each phase is a function or script with defined inputs and outputs, not a paragraph of natural language instructions.
- Checkpointing is a first-class concern. Every phase writes state to a checkpoint file before it exits. The entrypoint checks for existing checkpoints before doing any work.
- Logging is structured. Progress, errors, and skip reasons all go to a log with timestamps and phase labels: not print statements, not status text in a response, but actual log files that can be inspected after the fact.
- AI calls are scoped and bounded. When the skill needs the model, it constructs a minimal prompt with only the relevant context for that specific call. Each call should be short enough that instruction-following is reliable even on a smaller model.
- Resume is the default behavior. Running a skill twice should be idempotent if the first run completed. Running it after a failure should continue from the last successful checkpoint, not restart.
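Taken together, the first, second, and fifth principles suggest an entrypoint shaped roughly like this. The phase names and checkpoint fields mirror the conventions described here, but the code is an illustrative sketch, not OpenClaw's actual implementation:

```python
import json
from pathlib import Path

PHASES = ["phase_01_fetch", "phase_02_parse", "phase_03_classify"]

def load_checkpoint(path: Path) -> dict:
    """Return the existing checkpoint, or an empty one on a fresh run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"phases": {}}

def run_skill(phase_impls: dict, checkpoint_path: Path) -> None:
    """Consult the checkpoint before doing any work, skip completed phases,
    and resume a failed phase from its recorded offset."""
    ckpt = load_checkpoint(checkpoint_path)
    for name in PHASES:
        state = ckpt["phases"].get(name, {})
        if state.get("status") == "complete":
            continue  # idempotent: completed phases are never rerun
        start = state.get("resume_from_record", 0)
        phase_impls[name](start)  # each impl writes its own checkpoint entry
```

Running this twice after a clean first run does nothing; running it after a mid-phase failure resumes exactly where the failure occurred.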
The Structure, Concretely
A well-architected skill looks like this on disk:
skills/
└── my-skill/
├── SKILL.md # Lightweight orchestration spec only
├── scripts/
│ ├── phase_01_fetch.py
│ ├── phase_02_parse.py
│ ├── phase_03_classify.py # AI-assisted phase
│ ├── phase_04_enrich.py # AI-assisted phase
│ └── phase_05_output.py
├── checkpoint.json # Written after each phase completes
└── run.log # Structured log with timestamps
Checkpoint File
The checkpoint file tracks what has been completed and when. Failed phases record their partial progress so the next run can resume mid-phase without reprocessing completed records.
{
"skill": "my-skill",
"started_at": "2025-03-18T09:14:32Z",
"last_updated": "2025-03-18T09:22:17Z",
"phases": {
"phase_01_fetch": {
"status": "complete",
"completed_at": "2025-03-18T09:16:45Z",
"records_processed": 1842,
"output": "scripts/data/fetched.json"
},
"phase_02_parse": {
"status": "complete",
"completed_at": "2025-03-18T09:19:03Z",
"records_processed": 1842,
"output": "scripts/data/parsed.json"
},
"phase_03_classify": {
"status": "failed",
"failed_at": "2025-03-18T09:22:17Z",
"records_processed": 214,
"error": "API timeout on batch 3",
"resume_from_record": 214,
"output": "scripts/data/classified_partial.json"
}
}
}
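Helpers for maintaining a checkpoint in this shape might look like the following sketch. The field names match the example above; the functions themselves are illustrative, not part of any OpenClaw API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def _now() -> str:
    """ISO 8601 UTC timestamp, matching the checkpoint format."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def update_checkpoint(path: Path, phase: str, entry: dict) -> None:
    """Merge one phase's state into checkpoint.json and bump last_updated."""
    ckpt = json.loads(path.read_text()) if path.exists() else {"phases": {}}
    ckpt["phases"][phase] = entry
    ckpt["last_updated"] = _now()
    path.write_text(json.dumps(ckpt, indent=2))

def mark_complete(path: Path, phase: str, records: int, output: str) -> None:
    update_checkpoint(path, phase, {
        "status": "complete", "completed_at": _now(),
        "records_processed": records, "output": output,
    })

def mark_failed(path: Path, phase: str, records: int,
                error: str, output: str) -> None:
    # A failed phase records how far it got so the next run resumes mid-phase.
    update_checkpoint(path, phase, {
        "status": "failed", "failed_at": _now(),
        "records_processed": records, "error": error,
        "resume_from_record": records, "output": output,
    })
```

The key design choice is that the checkpoint is rewritten atomically per phase transition, so it is always a complete, parseable picture of the run.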
Log File
The log is append-only and human-readable. On the next run, the entrypoint reads the checkpoint, skips phases 1 and 2 entirely, and resumes phase 3 at record 400. No guessing, no restarting from scratch.
2025-03-18T09:14:32Z [INFO] [my-skill] Run started
2025-03-18T09:14:33Z [INFO] [phase_01_fetch] Checkpoint not found, starting fresh
2025-03-18T09:16:45Z [INFO] [phase_01_fetch] Complete - 1842 records fetched
2025-03-18T09:16:45Z [INFO] [phase_02_parse] Checkpoint not found, starting fresh
2025-03-18T09:19:03Z [INFO] [phase_02_parse] Complete - 1842 records parsed
2025-03-18T09:19:03Z [INFO] [phase_03_classify] Checkpoint not found, starting fresh
2025-03-18T09:19:03Z [INFO] [phase_03_classify] Calling model - batch 1 of ~9 (200 records)
2025-03-18T09:20:11Z [INFO] [phase_03_classify] Batch 1 complete
2025-03-18T09:20:11Z [INFO] [phase_03_classify] Calling model - batch 2 of ~9 (200 records)
2025-03-18T09:21:14Z [INFO] [phase_03_classify] Batch 2 complete
2025-03-18T09:21:14Z [INFO] [phase_03_classify] Calling model - batch 3 of ~9 (200 records)
2025-03-18T09:22:17Z [ERROR] [phase_03_classify] API timeout - writing partial checkpoint
2025-03-18T09:22:17Z [INFO] [my-skill] Run halted at phase_03_classify, record 400
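A log in this format can be produced with Python's standard logging module. The sketch below assumes the conventions shown above (UTC ISO 8601 timestamp, level, phase label) and writes to the file only, never the console:

```python
import logging
import time

def make_logger(phase: str, log_path: str = "run.log") -> logging.Logger:
    """File-only logger emitting 'TIMESTAMP [LEVEL] [phase] message' in UTC."""
    logger = logging.getLogger(phase)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # never leak to the root logger / console
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path)  # opens in append mode
        fmt = logging.Formatter(
            f"%(asctime)s [%(levelname)s] [{phase}] %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
        )
        fmt.converter = time.gmtime  # format timestamps in UTC
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

A phase script would call `make_logger("phase_01_fetch")` once and log through it for its whole lifetime; because the handler appends, successive runs extend the same file.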
SKILL.md Template
The SKILL.md file for this architecture doesn't describe how to fetch or parse data. It describes the phases, their order, their inputs and outputs, which phases invoke the model and why, and the checkpointing and logging conventions. It stays short enough that it doesn't meaningfully inflate context on every turn.
# SKILL: [skill-name]
## Purpose
One or two sentences. What this skill does and what it produces.
## Structure
- `scripts/` - Python scripts, one per phase
- `checkpoint.json` - Phase completion state and resume pointers
- `run.log` - Append-only structured log
## Phases
### phase_01_[name]
- **Input:** [source - API, file, prior phase output]
- **Output:** `scripts/data/[filename].json`
- **Notes:** [any relevant detail]
### phase_02_[name]
- **Input:** `scripts/data/[filename].json`
- **Output:** `scripts/data/[filename].json`
- **Notes:** [any relevant detail]
## Checkpointing Rules
- Each phase writes to `checkpoint.json` on completion with ISO 8601 timestamp
- Failed phases write partial progress including last successfully processed record index
- Entrypoint reads checkpoint before executing any phase and skips completed phases
- Re-running a completed skill is a no-op unless checkpoint is manually cleared
## Logging Rules
- All log entries: `TIMESTAMP [LEVEL] [phase-name] message`
- Timestamps are ISO 8601 UTC
- Levels: INFO, WARN, ERROR
- Log to `run.log` only - no print statements, no console output
## AI Call Guidelines
- Use AI when the task cannot be reduced to a rule
- Use Python when the task is mechanical
- Prompts contain only what the model needs for that specific call
- Parse and validate model output before writing to disk
- On malformed response: log error, write partial checkpoint, halt gracefully
## Resume Behavior
- Default behavior: check checkpoint first on any invocation
- Completed phases are skipped entirely
- Failed phases resume from `resume_from_record` index if present
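The "parse and validate model output before writing to disk" guideline amounts to a small gate between the model and the filesystem. The sketch below assumes the model was asked to reply with a JSON object; the helper and its error behavior are illustrative:

```python
import json

def parse_model_json(raw: str, required: tuple[str, ...]) -> dict:
    """Parse and validate a model response expected to be a JSON object.
    Raises ValueError on malformed output so the caller can log the error,
    write a partial checkpoint, and halt gracefully instead of persisting junk."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model response is not valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("Model response is not a JSON object")
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"Model response missing fields: {missing}")
    return data
```

The caller catches `ValueError`, writes the failed-phase checkpoint entry, and exits; nothing unvalidated ever reaches the phase's output file.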
Why It Matters
The token savings are the obvious headline but they're a means, not the end. What this architecture actually buys is reliability at scale, and the ability to run capable models without wincing at the bill.
What the architecture enables
- Overnight runs on large datasets without babysitting
- Failures debugged in minutes from structured logs
- Stronger models on hard calls, cheaper models on easy ones
- No context window ceiling; heavy work stays in Python
- Hand the log to OpenClaw to diagnose a failure directly
What the old approach costs
- 90%+ of token spend is overhead, not output
- Instruction drift as context grows
- Hard context ceiling kills long runs
- No structured recovery path after failure
The structured logs and checkpoint files don't just help the human operator. When something breaks, you can hand OpenClaw the log file and the checkpoint and ask it to diagnose the failure. It has everything it needs in a compact, readable format, with no need to reconstruct what happened from a long conversation history. That feedback loop (run to failure to diagnosis to fix) becomes fast enough to actually iterate on.
The system files and agent configuration are going to be in context no matter what. That's the cost of running OpenClaw. The skill file on top of that should be as lean as possible, and the work the skill orchestrates should happen mostly outside the context window, in code, with the model consulted surgically. That's the difference between a prototype that works in demos and infrastructure that runs in production.