
2025-12-28

A Warhammer 40K MCP Server

Vibe Coding — A 3-Part Series

This post is part of a short series documenting a year of vibe coding: small, exploratory projects built quickly and with a focus on fun, treating tools like LLMs, CLIs, and game data as creative materials rather than products.

  1. RogueLLMania: Running an LLM in Your Game Loop
    Procedural narration in a roguelike using a local model.
    Read it here

  2. Vibe-Coding a Video-Cutting CLI Tool
    FFmpeg, SuperCollider, and ambient automation.
    Read it here

  3. A Warhammer 40K MCP Server
    Structured tabletop game knowledge for AI agents.
    You are here

Warhammer

I like painting toy soldiers, I like to hang out with friends, and I think Warhammer 40K is pretty neat. Beyond that, there is not much I like about actually playing Warhammer 40K. The rules are complex and require you to intuit consecutive statistical odds. I don't like building "balanced" lists, but if you don't have a decent list you're not going to have a great time playing the game.

After a recent match I went home and typed a few constraints (models I own) into ChatGPT to see if it could maybe create a better list. It confidently returned units that don't exist, points costs from the wrong edition, and weapon combinations that were never legal. I realized it was doing some web tool calls, effectively googling "good Warhammer lists," which, depending on the results and the edition, could return very mixed results.

The model wasn't incompetent; it was doing exactly what language models do: generating plausible text that sounded like a Warhammer army list because it had seen thousands during training. The failure wasn't intelligence. It was grounding. The model had no external source of truth. It was making things up.

For most AI disappointments, the instinct is to reach for a better model, a more detailed prompt, fine-tuning, RAG, retrieval pipelines. All reasonable. But what if the problem wasn't "how do we make the model smarter?" What if it was "how do we give the model better tools to ask questions instead of guessing answers?"

While Warhammer is a complex game, the problem space isn't really all that big. Units cost points and have stat lines, and those stat lines determine effectiveness in the game. In the Warhammer community, people refer to doing calculations around this as MathHammer 1. It is possible to find the most efficient units in the game under different conditions, and a Monte Carlo simulation can give you a baseline of unit effectiveness.
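
To make that concrete, here's a minimal sketch of the kind of Monte Carlo estimate I mean. It ignores real-game details like rerolls, armour penetration, and modifiers, and the profile numbers are hypothetical:

```python
import random

def simulate_attacks(attacks, hit_on, wound_on, save_on, damage, trials=10_000):
    """Estimate average damage for a simplified attack sequence: hit, wound, save."""
    total = 0
    for _ in range(trials):
        for _ in range(attacks):
            if random.randint(1, 6) < hit_on:
                continue  # missed
            if random.randint(1, 6) < wound_on:
                continue  # failed to wound
            if random.randint(1, 6) >= save_on:
                continue  # the target made its save
            total += damage
    return total / trials

# Hypothetical profile: 4 attacks hitting on 3+, wounding on 4+, into a 4+ save, 1 damage each
print(simulate_attacks(attacks=4, hit_on=3, wound_on=4, save_on=4, damage=1))
```

Run enough trials across enough profiles and you get a rough effectiveness baseline per points cost.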

Tools Over Prompts

That reframing changed my approach. Instead of feeding the model raw data and hoping it learned, I could expose structured access to facts. Let it request, not generate.

My first iteration had me working directly in Cursor 2. I figured I'd write some units in a .txt or .md file, use that as context for the Cursor coding agent, and have it write Python scripts as needed. Right as I was about to go find the datasheets online somewhere, I had an idea: this seemed like the perfect case for an MCP server.

Model Context Protocol (MCP) was started by Anthropic as a standard for connecting LLMs to external tools: typed inputs and outputs, deterministic access to knowledge. It's essentially an API design language for AI.

I didn't want to "teach an LLM Warhammer"; I wanted to "expose Warhammer data in a way an LLM can't mess up." That reframing, tools over prompts, constraints over freedom, is the through-line of this entire series. The tools you give a system shape what it can think about. The constraints you impose determine what becomes possible.

The Build: How Vibe-Coding Actually Works

I had a few false starts, or rather lessons learned: instead of trying to shape a messy repo into a working project, I would summarize my learnings from that attempt and start again. The first attempt used Python and FastMCP 3, but built the unit database out of complex IDs and nested linked tables that had to be joined to construct a unit's composition. Luckily it's very easy to validate whether a unit is being returned correctly, so Pytest 4 and a suite of tests did a lot of work, letting an agent run for long periods of time and correct itself as it went.
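
Those tests were mostly simple existence-and-shape checks. A sketch of the kind of thing I mean, where the module path, function name, and field names are illustrative rather than the project's actual API:

```python
# test_datasheets.py -- illustrative shape checks; the import and field names are hypothetical
from wh40k_mcp.service import get_datasheet

def test_known_unit_is_returned():
    unit = get_datasheet("Eliminators")
    assert unit is not None
    assert unit["name"] == "Eliminators"

def test_datasheet_has_expected_fields():
    unit = get_datasheet("Eliminators")
    for field in ("movement", "toughness", "save", "wounds", "points"):
        assert field in unit

def test_unknown_unit_returns_nothing():
    assert get_datasheet("Definitely Not A Real Unit") is None
```

Checks like these are cheap to run after every agent change, which is what made long unattended sessions workable.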

The False Starts

The initial data source was comprehensive but deeply nested. IDs referencing other IDs. Rules interleaved with stats. Metadata scattered across files. Every query required tree traversal. The parsing layer grew thick. Dependencies multiplied.

I noticed the pattern: the problem wasn't the data itself. The problem was the data shape. It was dictating the system's complexity.

So I restarted. Three times.

Each restart followed the same cycle:

  1. Let agents explore against a vague goal
  2. Watch where friction appears
  3. Ask them to summarize what they discovered
  4. Turn that summary into constraints
  5. Start over with those constraints

By the third iteration, the constraints were simple: "Keep it flat. Keep it simple. Make data first."

Changing Databases

I ended up switching to a very human-readable .csv database that was much easier to parse (and several hundred fewer lines of code as well). Before starting the new greenfield codebase, I had an agent summarize our approach, learnings, and things to avoid from the earlier attempts. I used that summary as additional context for the next attempt, which kept a similar test suite and returned schema but sat on top of the much simpler database.

The final architecture is intentionally boring:

  • CSV files loaded once at startup into simple in-memory indexes
  • A thin service layer that queries those indexes
  • A thin MCP tool layer on top

Result: ~75% less code than the first attempt. Easier to reason about. Easier to restart again if needed.
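
In code, the whole data layer is roughly this shape. The file name and column names are illustrative:

```python
import csv
from pathlib import Path

def load_units(csv_path):
    """Load the unit CSV once at startup and build simple in-memory indexes."""
    by_name = {}
    by_faction = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            by_name[row["name"].lower()] = row
            by_faction.setdefault(row["faction"], []).append(row)
    return by_name, by_faction

# Hypothetical layout: data/units.csv with columns like name, faction, movement, toughness, save, wounds, points
UNITS_BY_NAME, UNITS_BY_FACTION = load_units(Path("data") / "units.csv")
```

Flat rows in, two dictionaries out. There is nothing clever to break.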

How It Works (The Tools)

When you ask the MCP server a question:

  1. Natural language question arrives
  2. MCP tool is invoked
  3. Service layer queries precomputed indexes
  4. Structured JSON response is returned

No fuzzy approximation. No "best guess." Exactly what exists in the data, or nothing.

The core tools:

  • search_units — find units by name or keyword
  • get_datasheet — retrieve a unit's stats and rules
  • list_factions — see all available factions
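
A sketch of what the tool layer looks like on top of that data layer, using FastMCP. The tool names match the list above, but the module name and data shapes are simplified stand-ins for the real project:

```python
from difflib import get_close_matches
from fastmcp import FastMCP

# UNITS_BY_NAME / UNITS_BY_FACTION are the indexes built at startup in the earlier sketch,
# imported here from a hypothetical module name.
from data_layer import UNITS_BY_NAME, UNITS_BY_FACTION

mcp = FastMCP("warhammer-40k")

@mcp.tool()
def search_units(query: str, limit: int = 5) -> list[str]:
    """Find unit names by fuzzy match against the loaded datasheets."""
    return get_close_matches(query.lower(), UNITS_BY_NAME.keys(), n=limit, cutoff=0.6)

@mcp.tool()
def get_datasheet(name: str) -> dict | None:
    """Return a unit's stats and rules exactly as they exist in the data, or nothing."""
    return UNITS_BY_NAME.get(name.lower())

@mcp.tool()
def list_factions() -> list[str]:
    """List every faction present in the data."""
    return sorted(UNITS_BY_FACTION.keys())

if __name__ == "__main__":
    mcp.run()
```

The fuzzy match is the only "smart" part, and even that is just difflib: close enough to forgive typos, strict enough to never invent a unit.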

The Magical Moments (and Their Limits)

The Warhammer 40K MCP server works nearly perfectly as far as I can tell, in that it returns datasheets of units formatted consistently. My dreams of agentic list building and asking AI in natural language to resolve combat turned out to be a bit more complex. I used the Anthropic Skills framework 5 and attempted to build a skill for resolving Warhammer 40K 10th edition combat.

I did have multiple magical moments. In one, Claude Code, using the MCP server, understood the Precision rule on Eliminators and asked if I would like to target the leader of the unit I said I was targeting. Nowhere in my project had I described exactly how the Precision rule worked, yet with proper grounding the agent started to understand the types of decisions that matter in Warhammer.

Note: This seems like something RLHF and/or fine-tuning could help with, training models to better recognize and apply game rules when given proper grounding data.

Unfortunately, moments like this were few and far between. Models would often attempt to use 9th, 8th, or even older edition rules, which gets complicated because many rules have the same name but drastically different effects from edition to edition. I also had cases where, instead of using the provided Python scripts for rolling dice and doing combat math, the models would just generate dice rolls that, while somewhat believable, weren't actual simulated rolls.
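
For reference, the dice scripts I wanted them to call were nothing fancy, roughly this shape (illustrative, not the project's exact code):

```python
import random

def roll_dice(count: int, sides: int = 6, seed: int | None = None) -> list[int]:
    """Actually simulate dice: with a seed, the same inputs always produce the same rolls."""
    rng = random.Random(seed)
    return [rng.randint(1, sides) for _ in range(count)]

# e.g. roll_dice(4, seed=42) is reproducible, unlike a model "imagining" four plausible d6 results
```

The whole point is that the numbers come from a random number generator, not from the model's sense of what a dice roll should look like.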

Why Grounding Works

Most AI failures look like intelligence problems, but they aren’t. They’re authority problems.

When a language model gives a wrong answer, it's tempting to say it "didn't understand." But models can only produce something that sounds right. That's not a bug, it's the job. The mistake is asking it to behave like a database.

A model doesn’t know whether it’s recalling a fact, interpolating from examples, or inventing something entirely new. It only knows how to produce text that sounds plausible given the context it’s seen before. That works surprisingly well, until it doesn’t.

The failure happens when we implicitly ask the model to act as an authority without giving it one.

In Warhammer terms, the model wasn't wrong because it was bad at reasoning. It was wrong because it had no way to check whether a unit existed, whether a points cost was current, or whether a rule was legal in the current edition. It had no place to ask, and no mechanism to verify.

This is the same principle we've been applying across all three projects, just from different angles. In RogueLLMania, we gave the model structured XML inputs (<chamber>, <monster_count>, <artifact_title>) and constrained outputs to <description> tags, forcing it to work with facts we provided rather than inventing them. In the vibe-coding video tool, we used deterministic randomness: same seed, same result, so the system's creative choices were bounded by reproducibility. Here, with the MCP server, we're doing the same thing: instead of asking the model to remember Warhammer rules, we give it tools to query them.

Once you give a model explicit authority (structured data, deterministic tools, clear boundaries), its behavior changes. It stops pretending to know and starts asking questions. "What units exist?" "What does this rule say?" "Is this valid?" The model doesn't need to be smarter; it needs better tools to ask instead of guess.

Running Local: The Rubicon Moment

When working on this MCP server I was using Anthropic's Haiku model, but the ideal setup is Ollama 6 running locally. There's a Rubicon moment when you first run an LLM on your own hardware, disconnected from the internet, no API calls, no data leaving your machine. It's the same feeling I had running Stable Diffusion locally for the first time: this computer, in its current configuration, will be able to do this forever.

I am a believer in open-source technology, and running software locally democratizes access to it. The internet could go down, the company could shut down the service, but this computer, directly in front of me, will be able to do this. As I write this now, I could be 40,000 ft in the air with limited to no internet access, yet I could ask a local model what the name of a Warhammer unit is, or simulate combat between two datasheets.

Local models solve the worst aspects of AI: your data can't be scraped and sold to the highest bidder, you're not outsourcing compute to some data center, and honestly, it feels punk. While running a model locally doesn't absolve generative AI of all its sins, the training on stolen works, the upfront energy cost of creation, it does change the relationship. When the tool runs on hardware you own, with data you control, disconnected from anyone's API or business model, the dynamics shift.

Personal Software

This MCP server isn't a product. It's not a SaaS. It's opinionated, local, low-risk, and personal. Built for one user, one hobby, one brain. Local models, local data, no stakes, just play.

Luckily I am not trying to build correct, scalable, or production-ready software. I am building tools for myself, for a hobby with low stakes. What is powerful is the ability to build tools for very specific problems you actually care about. I didn't know the details of building an MCP server, but I understand it at a macro level: you feed context into an LLM. I know Python, but I never took a Computer Science class in my life and didn't know that I wanted a fuzzy string search. I knew the experience I was targeting and iterated my way toward it.

That's the point. Not everything needs to scale. Not everything needs a business model. Some software is worth building just because it makes your own thinking clearer.

Over this three-part series, I've built:

  • A game loop that runs language models locally
  • A video-editing CLI that treats randomness as a first-class tool
  • An MCP server that grounds AI in structured data

None of these are products. All of them changed how I think.

What actually makes these systems useful is the work around them: structured data, deterministic tools, explicit state, and tight constraints. None of that requires new technology or a paradigm shift. I don't think I'm overly bullish on AI, but I'm not cynical either, mostly because the tech is already here. LLMs are already good enough at their specific use case. Usefulness emerges from structure, not more intelligence, and finding that structure just takes iteration, trial and error, and some elbow grease.

Build Things

The reason I'm writing these posts isn't to say "build a Warhammer MCP server." It's to say: build weird, small, useful things for yourself.

The obstacles are mostly in your head:

  • "But won't it break?" Yeah, probably. It's personal.
  • "Is it efficient enough?" For an audience of one, yes.
  • "Should I open-source it?" Only if you want to.
  • "What if I restart it three times?" Then you learn three times.

Vibe coding is about the joy of tinkering. It's about treating tools (Claude Code, FFmpeg, game data, LLMs) as creative materials, not problems to solve. It's about playing around until patterns emerge instead of planning everything up front.

It's also about confidence. Once you've built things this way, you start seeing problems differently. You notice where infrastructure is missing. You notice where structure could replace complexity. You notice where a boring solution is better than a clever one.


This post concludes a three-part series on vibe coding. The thesis: constraints enable creativity. Infrastructure determines possibility. Personal software is worth building. Build weird things.

Footnotes

  1. MathHammer is a community term referring to statistical analysis of Warhammer 40K unit effectiveness and combat outcomes. See UnitCrunch for a popular tool that performs these calculations.

  2. Cursor: https://cursor.sh/

  3. FastMCP: https://github.com/jlowin/fastmcp

  4. Pytest: https://pytest.org/

  5. Anthropic Skills framework: https://docs.anthropic.com/en/docs/build-with-skills

  6. Ollama: https://ollama.ai/
