Chapter 4: Draft First — Save Money With Local LLMs
Not every task needs a frontier model. README sections, function docstrings, email drafts, commit messages — the first draft of anything with a predictable structure is mechanical work. The model doing that work doesn’t need to be Claude Sonnet.
The draft-first skill implements a two-stage pattern: use a local LLM to generate the initial draft, then use Claude for quality control and refinement. The local model does the bulk writing; Claude does the judgement.
The Skill File
```yaml
---
name: draft-first
description: Generate initial drafts using local LLM, then refine with Claude. Use for boilerplate content, documentation stubs, or any content where the first draft is mechanical.
use_when: User needs a first draft of boilerplate content, documentation stubs, README sections, or mechanical writing where quality control matters more than initial generation.
user-invocable: true
tools: Read, mcp__local-llm__local_draft, mcp__local-llm__local_transform, Edit, Write
---
```
The tools field is worth noting — it explicitly lists the MCP tools this skill uses. The mcp__local-llm__* tools connect to a local Ollama instance running a smaller model (in my setup, qwen2.5-coder:7b).
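To make the delegation concrete, here is a minimal sketch of what a tool like local_draft plausibly does under the hood: a non-streaming request to Ollama's documented /api/generate endpoint. The function names and wiring here are my own assumptions for illustration, not the actual MCP server's implementation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt, model="qwen2.5-coder:7b"):
    """Build a non-streaming generation request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def local_draft(prompt, model="qwen2.5-coder:7b", url=OLLAMA_URL):
    """Send the prompt to a local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The point of the sketch is the shape of the call: one cheap HTTP round trip to localhost, no API billing involved.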
The Four Phases
Phase 1: Gather context. Before generating anything, Claude reads any related files — an existing README to match style, a template to follow, examples of the format expected. This context goes into the generation prompt.
Phase 2: Generate with the local model. Claude calls mcp__local-llm__local_draft with the task description, gathered context, and format requirements. For longer content, the skill instructs Claude to break it into logical chunks and generate each section separately.
Phase 3: Claude refinement. Claude reviews the local model’s output for:
- Accuracy — does it reflect the actual code or project context?
- Voice — does it sound like a human, or does it read like raw local-model output?
- Completeness — are sections missing?
- Quality — grammar, clarity, flow
Phase 4: Output. Claude presents the refined draft with a short summary of what was changed and a confidence level.
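The four phases can be sketched as a small pipeline. The two callables here are hypothetical stand-ins for the MCP drafting tool and Claude's review pass, injected so the shape of the hand-off is visible:

```python
def draft_first(task, context, local_generate, claude_refine):
    """Two-stage draft pipeline.

    local_generate and claude_refine are injected callables standing in
    for the local model and the reviewing model (illustrative only).
    """
    # Phase 1: context has already been gathered by the caller (Read tool)
    prompt = f"Task: {task}\n\nContext:\n{context}"
    draft = local_generate(prompt)           # Phase 2: cheap bulk generation
    final, notes = claude_refine(draft)      # Phase 3: accuracy/voice/quality pass
    return {"final": final, "notes": notes}  # Phase 4: refined draft + summary
```

The design choice worth noting is that the expensive model never generates from scratch; it only ever transforms something that already exists.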
The AI Slop Detection Step
The skill includes an explicit banned-phrases list for what to strip from local model output:
Banned phrases:
- "delve", "dive deep", "deep dive"
- "game-changer", "game changing"
- "landscape" (when not literal)
- "unlock" (when not literal)
- "leverage" (use sparingly)
- "cutting-edge", "state-of-the-art"
- "seamless", "seamlessly"
- "robust" (overused)
- "holistic"
Structural slop:
- Excessive bullet points where prose works better
- Numbered lists for non-sequential items
- Overuse of headers in short content
- Generic introductions ("In today's world...")
Small models are prone to all of these. Claude knows to catch them in the review pass.
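Part of that review pass is mechanical enough to sketch in code. A minimal scanner for the phrase portion of the list might look like this; the context-dependent entries ("landscape", "unlock", "leverage") and the structural checks are deliberately left out, since those need actual judgement about usage:

```python
import re

# Unconditionally banned phrases from the skill's list; the
# context-dependent ones are omitted because they need judgement.
BANNED_PHRASES = [
    "delve", "dive deep", "deep dive", "game-changer", "game changing",
    "cutting-edge", "state-of-the-art", "seamless", "seamlessly",
    "robust", "holistic",
]

def flag_slop(text):
    """Return the banned phrases present in text (case-insensitive, whole words)."""
    lowered = text.lower()
    return [p for p in BANNED_PHRASES
            if re.search(r"\b" + re.escape(p) + r"\b", lowered)]
```

A scan like this catches the easy offenders before Claude even reads the draft; anything flagged gets rewritten in the refinement pass.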
What It’s Ideal For
The skill file includes a table of content types and why each one is suited (or not) to delegation:
| Content type | Why delegate |
|---|---|
| README templates | Boilerplate structure is mechanical |
| Function docstrings | Standard format, context-dependent |
| Email drafts | Initial response structure |
| Documentation stubs | Sections can be templated |
| Commit messages | Format is standardised |
| PR descriptions | Template-based content |
And what to keep in Claude:
| Content type | Why keep in Claude |
|---|---|
| Newsletter content | Requires voice matching, nuance |
| Strategic documents | Needs reasoning and judgement |
| Code implementation | Requires understanding context |
| Creative writing | Quality is paramount |
The rule of thumb: if the output has a predictable shape and the value is in the content rather than the structure, the first draft can be delegated.
The Real Cost Argument
The API cost difference between Claude Sonnet and a local model is significant. If you’re writing documentation, commit messages, or email responses regularly, the local model handles the bulk of the tokens and Claude handles the quality pass. Over a month of daily development work, that’s a meaningful reduction.
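To make the argument concrete, here is the back-of-envelope arithmetic. Every number below is a placeholder assumption chosen for illustration, not a published price or a measured workload:

```python
# All numbers are illustrative placeholders, not published prices.
DRAFT_TOKENS = 4_000      # bulk tokens generated by the local model (no API cost)
REVIEW_TOKENS = 1_000     # tokens the frontier model spends on the quality pass
PRICE_PER_MTOK = 15.00    # assumed frontier-model output price, USD per 1M tokens
DOCS_PER_MONTH = 100      # assumed volume of drafted documents

all_frontier = (DRAFT_TOKENS + REVIEW_TOKENS) / 1_000_000 * PRICE_PER_MTOK
draft_first_cost = REVIEW_TOKENS / 1_000_000 * PRICE_PER_MTOK
monthly_saving = (all_frontier - draft_first_cost) * DOCS_PER_MONTH

print(f"per document: ${all_frontier:.3f} vs ${draft_first_cost:.3f}")
print(f"monthly saving at {DOCS_PER_MONTH} docs: ${monthly_saving:.2f}")
```

Under these assumed numbers the quality pass costs a fifth of a full generation; the ratio, not the absolute figures, is the point.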
The skill is also insurance against bad local model output. The constraint is explicit: “ALWAYS review local LLM output before presenting to user. NEVER present unrefined local LLM output as final.” Claude acts as the filter. If the local draft is poor quality, Claude regenerates rather than refines.
Setting Up the Local Model
You’ll need Ollama running locally with a suitable model installed:
```bash
# Install Ollama
brew install ollama

# Pull a coding model
ollama pull qwen2.5-coder:7b

# Start the Ollama server
ollama serve
```
Then you need the local-llm MCP server configured in your Claude Code settings to connect to the Ollama instance. Once that’s in place, the mcp__local-llm__* tools become available.
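Before relying on the skill, it is worth checking the server is actually reachable. This helper (the function names are mine) probes Ollama's /api/tags endpoint, which lists installed models:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url="http://localhost:11434"):
    """Return True if a local Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def list_models(base_url="http://localhost:11434"):
    """Return names of locally installed models, e.g. 'qwen2.5-coder:7b'."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
        return [m["name"] for m in json.loads(resp.read())["models"]]
```

If ollama_is_up returns False, the mcp__local-llm__* tools will fail and the skill falls over at Phase 2.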
How to Customise It
Model choice — the skill assumes whatever model is configured in your local-llm MCP server. Swap in a different Ollama model based on the task type. Coding tasks do well with qwen2.5-coder. General writing does better with a general-purpose model like mistral.
The banned-phrases list — extend it with any slop phrases that consistently appear in your local model’s output.
Quality thresholds — the skill says “flag when draft quality is too low for refinement (regenerate vs refine).” You can tune the criteria for what counts as too low.
Chunk size — for longer documents, you may need to adjust how the skill breaks content into generation chunks. Larger chunks mean more coherent output but slower generation.
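One way to sketch that chunking step: split on blank lines and pack paragraphs up to a character budget. The default threshold here is an assumption to tune, not a value taken from the skill:

```python
def chunk_by_paragraphs(text, max_chars=2_000):
    """Pack blank-line-separated paragraphs into chunks of roughly max_chars.

    A larger max_chars gives fewer, more coherent chunks at the cost of
    slower per-call generation on a small local model.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries rather than a raw character count keeps each chunk self-contained, which matters more for a 7B model than for a frontier one.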
Installing It
```bash
mkdir -p ~/.claude/skills/draft-first
# Copy the SKILL.md from github.com/aplaceforallmystuff
```
This skill depends on having the local-llm MCP server configured. Without it, the skill will fail on Phase 2. The tools field in the frontmatter makes that dependency explicit.
The next chapter covers think-first — the skill that stops Claude from answering too fast on decisions that deserve structured analysis.
Check Your Understanding
Answer all questions correctly to complete this module.
1. What is the core pattern of the draft-first skill?
2. What constraint does draft-first enforce about local LLM output?
3. Which content type should be kept in Claude rather than delegated to a local model?