Crossfire: Multi-Agent Adversarial Refinement
What if Ralph Wiggum had friends?
I built Crossfire to automate the only reliable way I have found to get decent LLM results: make them argue. A single LLM’s answer sounds convincing, but without counterarguments you cannot gauge its strength. The solution is adversarial generation: multiple models create competing solutions while others attack them for weaknesses. Manually juggling browser tabs to have various models critique one another works, but it is a miserable user experience. Crossfire runs the whole process for you.
How it works: generation, review, synthesis
Give Crossfire a task, and it sends that prompt to multiple LLMs simultaneously to generate different candidate solutions. Each candidate then faces multiple reviewer models trying to find factual errors, logic flaws, security holes, or other critical weaknesses. Finally, a synthesizer model takes all candidates and reviews, picks the best parts, and mixes them into a refined result. This generation → review → synthesis loop repeats for as many rounds as you specify. The same multi-round refinement is how I approach writing myself, whether text or code.
Fresh contexts prevent the same model from believing its own improvements are great when they are not. Each round starts clean, which avoids local optima. It is a multi-start evolutionary search with delayed selection, or the Ralph Wiggum method, but with friends providing honest feedback.
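Conceptually, a run looks something like this. This is a minimal sketch, not Crossfire’s actual code: `generate`, `review`, and `synthesize` are hypothetical stand-ins for real model calls, and the stub bodies exist only so the sketch runs.

```python
def generate(model, task):
    # Hypothetical stand-in for an LLM call that drafts a candidate.
    return f"[{model}] candidate for: {task}"

def review(model, task, candidate):
    # Hypothetical stand-in for an LLM call that attacks a candidate.
    return f"[{model}] critique of: {candidate}"

def synthesize(model, task, candidates, reviews):
    # Hypothetical stand-in: merge the strongest parts into one result.
    return f"[{model}] synthesis of {len(candidates)} candidates"

def run(task, generators, reviewers, synthesizer, rounds=3):
    result = None
    for _ in range(rounds):
        # Each round starts from a fresh context for every model.
        candidates = [generate(g, task) for g in generators]
        reviews = [review(r, task, c) for c in candidates for r in reviewers]
        result = synthesize(synthesizer, task, candidates, reviews)
    return result
```

The real pipeline also runs a prompt enrichment step first and stops early when reviewers find no material weaknesses; the sketch omits those details.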
How to use it
A straightforward research task looks like this:
uv run crossfire run \
--mode research \
--instruction "Compare error correction strategies for superconducting vs trapped-ion qubits"
By default, it runs a prompt enrichment step, one generator, three reviewers, and three rounds, each ending with a single synthesizer pass. The enrichment phase is especially useful for one-line instructions.
This is what it looks like for the quantum error correction task:
╭────────────────────────────────────────────── Crossfire ───────────────────────────────────────────────╮
│ Crossfire — research mode │
│ Rounds: 3 | Generators: 1 | Reviewers/candidate: 3 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Enrichment ✓ gpt-4.1-mini 0:00:09
Round 1/3 ✓ 5 tasks 0:03:25
Round 2/3 ✓ 5 tasks 0:02:58
Round 3/3
✓ Generating 1 tasks 0:00:18
✓ Reviewing 3 tasks 0:00:40
✓ Synthesizing claude-opus-4 0:02:50
Rounds ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 3/3 0:11:48
Run Summary
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Rounds completed │ 3 │
├──────────────────────┼─────────┤
│ Compressions applied │ 0 │
├──────────────────────┼─────────┤
│ Models dropped │ 0 │
├──────────────────────┼─────────┤
│ Round failures │ 0 │
├──────────────────────┼─────────┤
│ Total input tokens │ 56332 │
├──────────────────────┼─────────┤
│ Total output tokens │ 39393 │
├──────────────────────┼─────────┤
│ Total cost │ $1.0264 │
└──────────────────────┴─────────┘
Output written to docs/launch-announcement-qec.md
The output is available on GitHub, all from a single sentence of instructions. So, spend a buck, enjoy a coffee, and in the meantime Crossfire generates a solid report on the latest quantum error correction research, citations included.
For a more complex workflow, such as writing this announcement:
uv run crossfire run \
--mode write \
--instruction-file docs/launch-announcement-step-1-write-mode.md \
--context-file README.md \
--num-generators 5 \
--num-reviewers-per-candidate 3 \
--num-rounds 3 \
--output docs/launch-announcement-step-1-output.md
The result of the subsequent editing pass is what you are reading right now:
uv run crossfire run \
--mode edit \
--no-enrich \
--instruction-file docs/launch-announcement-step-2-edit-mode.md \
--context-file docs/launch-announcement-step-1-output.md \
--num-generators 3 \
--num-reviewers-per-candidate 3 \
--num-rounds 3 \
--output docs/launch-announcement-step-2-output.md
Within each round, generation and review each happen in parallel. Crossfire stops early if reviewers find no material weaknesses, since further rounds would not add any value.
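The fan-out within a round can be sketched with a thread pool. Again, this is an illustration of the idea rather than Crossfire’s internals: `call_model` is a hypothetical stand-in for an API call, and the early-stop convention is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(model, prompt):
    # Hypothetical stand-in for a real LLM API call.
    return f"{model}: {prompt}"

def fan_out(models, prompt):
    # Launch all model calls concurrently; results come back in input order.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda m: call_model(m, prompt), models))

def material_weaknesses(reviews):
    # Assumed convention: a satisfied reviewer reports "no material weaknesses",
    # so anything else counts as a finding worth another round.
    return [r for r in reviews if "no material weaknesses" not in r.lower()]
```

If `material_weaknesses` comes back empty for every candidate, there is nothing left to refine and the loop can stop before using its remaining rounds.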
Picking on weaknesses
What does that mean in practice?
This post was written and edited by Crossfire itself using the commands above.
You can see the originals in the docs folder on GitHub.
Five models generated competing drafts: Claude Sonnet 4, DeepSeek V3.2, Gemini 2.5 Pro, GPT-4.1, and Llama 4 Maverick. Below is a sample of what the reviewers said.
Gemma 3 on the DeepSeek V3.2 draft:
The biggest problem is the relentless focus on process. We get a lot of ‘generate, review, synthesize,’ and not enough of what that actually means for the user.
GPT-4.1 on Claude Sonnet 4’s draft:
The ‘Modes’ section is a bit of a laundry list — functional, but not especially compelling. There’s a whiff of ‘here’s every feature’ rather than a sharp, selective focus.
DeepSeek R1 flagged a material constraint violation in the GPT-4.1 draft:
The brief mandates ‘exactly one brief nod.’ The candidate uses it as the hook, then elaborates it into a full paragraph. This isn’t a nod; it’s a sustained metaphor.
And o4-mini on the Llama 4 Maverick draft:
You mention the manual-juggling pain only late; the ‘why’ is buried instead of woven into the loop explanation.
The synthesizer kept the “distinctive voice, memorable hook, and spec-perfect code blocks” from Gemini 2.5 Pro, and “strong voice, effective Ralph Wiggum analogy, and clean integration of code blocks” from GPT-4.1. It discarded the “bland voice, weak emotional resonance, and safe/derivative style” from Llama 4 Maverick, and “forced motivations, uneven pacing, and tell-over-show craft” from Claude Sonnet 4.
The full artefacts of each run (every candidate, every review, every synthesis decision) are archived locally.
Ralph, it turns out, is much better with friends.
This post was written and edited by Crossfire ($0.64, 6 min), and subsequently reviewed and tweaked by Ian.