CodeCome is a harness for building your own vulnerability research practice at home. It turns source-code audits into a repeatable six-phase workflow with Markdown findings, YAML metadata, sandbox validation, proof-of-concept evidence and final reports.
LLM-assisted security research has a memory problem. Findings live inside chat threads that get pruned, summarized, and lost. The model "remembers" your audit until it doesn't. CodeCome treats every step of an audit like an artifact you can re-read, diff and ship.
CodeCome is not a vulnerability scanner. It is not a pentest tool. It is not a magic AI bug finder. It is a harness — a set of conventions, prompts, and Make targets — that encodes how a careful researcher actually audits code. The model helps you think. You stay in control.
A workspace layout, naming scheme, finding template and Make targets. Nothing you can't read in an afternoon.
Each phase has explicit prompts checked into the repo. You can fork, audit and version them like any other code.
A finding is not "real" until it has been counter-argued, validated in a sandbox, and reproduced from an artifact.
CodeCome doesn't ingest your repo into a black box. You mount source code into a workspace, declare scope in a YAML file, and trigger phases via make. Each phase reads and writes plain files. You can stop and resume anywhere.
```yaml
# Scope and configuration for this audit
project: itemdb
language: go

entrypoints:
  - cmd/server/main.go
  - internal/http/router.go

scope:
  include: ["internal/**", "pkg/**"]
  exclude: ["vendor/**", "**/_test.go"]

model:
  provider: anthropic       # or openai · ollama · …
  name: claude-sonnet-4-5
  small: claude-haiku-4-5   # cheap pass for triage

sandbox:
  image: codecome/sandbox:itemdb
  build: make -C sandbox build
  run: ./sandbox/run.sh
```
Each phase has its own prompt, its own outputs, and writes to a known location on disk. You can re-run a single phase without losing previous work.
Read the codebase. Build a map of modules, trust boundaries, entry points, sinks and external dependencies. Output: notes/recon.md.
From the recon map, generate concrete attack hypotheses. Each one is a file under findings/PENDING/ with severity, attacker model and code references.
Argue against each hypothesis. The model plays devil's advocate and looks for reasons the bug isn't exploitable. Weak hypotheses get marked REJECTED.
Build a minimal proof inside a Docker sandbox. If a payload survives, the finding moves to CONFIRMED. If it can't be reproduced, back to PENDING or REJECTED.
Turn the validated finding into a working proof-of-concept. Capture payloads, command output and logs under evidence/. Status moves to EXPLOITED.
Compile the workspace into a coherent report: executive summary, findings with evidence links and remediation. Markdown in, Markdown or PDF out.
Every finding lives in exactly one folder, named after its current state. Moving a finding between states is just git mv — nothing magical, nothing hidden; a one-line sketch follows the state list below.
An idea worth investigating. Not yet validated. Might be a real bug, might be wrong.
The condition has been demonstrated. Not weaponized yet, but no longer in doubt.
A working exploit exists under evidence/, reproducible by anyone with the workspace.
Counter-analysis or validation killed the hypothesis. Kept on disk so it isn't re-investigated from scratch.
The same root cause as another finding. Linked, not deleted, to preserve the audit trail.
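A state change is just a move plus a commit. A minimal sketch, using CC-0002 as a hypothetical finding ID:

```bash
# Counter-analysis killed the hypothesis: move the folder and record why.
$ git mv findings/PENDING/CC-0002 findings/REJECTED/CC-0002
$ git commit -m "CC-0002: rejected, input is length-checked upstream"
```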
A finding is a Markdown file with a YAML frontmatter. Tools and humans read the same thing. No schema lock-in, no proprietary format.
```markdown
---
id: CC-0001
title: SQL injection in /search
status: CONFIRMED
severity: high
cwe: [CWE-89]
component: internal/http/search.go
discovered: 2026-05-11
phases:
  recon: done
  hypothesis: done
  counter: done
  validation: done
evidence:
  - evidence/CC-0001/payload.sql
  - evidence/CC-0001/curl.log
---

# CC-0001 — SQL injection in /search

## Summary
The /search endpoint concatenates the q parameter into a raw SQL statement.
Authentication is not required.

## Reproduction
1. Run the sandbox: make sandbox-up
2. curl "http://localhost:8080/search?q=' OR 1=1--"
3. Observe full table dump in response body.

## Remediation
Use database/sql parameterized queries; reject input containing SQL meta-characters before logging.
```
A vulnerability research project should still be readable in five years. Files survive renaming, forking, GitHub outages and SQL migrations. grep, git log and diff are the only tools you need.
The frontmatter is validated against a schema in schemas/finding.yml. Bad metadata fails fast in CI. The Markdown body stays free-form so researchers aren't fighting a form.
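A rough sketch of what that fast-failing check could look like in CI, assuming PyYAML and jsonschema are available in the make venv environment and that schemas/finding.yml follows JSON Schema conventions; the extraction one-liner and the finding path are illustrative, not CodeCome's actual tooling:

```bash
# Pull the YAML frontmatter out of one finding and validate it against the schema.
f=findings/CONFIRMED/CC-0001/finding.md           # illustrative path
awk '/^---$/{n++; next} n==1' "$f" > /tmp/frontmatter.yml

python3 - <<'PY'
import yaml, jsonschema                           # pip install pyyaml jsonschema
meta   = yaml.safe_load(open("/tmp/frontmatter.yml"))
schema = yaml.safe_load(open("schemas/finding.yml"))
jsonschema.validate(meta, schema)                 # raises, and fails CI, on bad metadata
print("frontmatter OK:", meta["id"], meta["status"])
PY
```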
Findings render in any Markdown viewer — GitHub, Obsidian, an editor preview pane. The "dashboard" is just a folder.
Before a finding is marked CONFIRMED, CodeCome reproduces it against a real build of the project — in a Docker container, behind a network namespace, away from your host. If the payload doesn't fire there, it doesn't make the cut.
```text
→ building sandbox image … 8.2s
→ starting container … healthy
→ replaying payload … evidence/CC-0001/replay.sh

HTTP/1.1 200 OK
Content-Length: 4188
Body contains: "users.email", "users.pw_hash"

→ assertions
  ✓ status 200
  ✓ response contains users.pw_hash
  ✓ query log shows UNION SELECT

→ result: CONFIRMED
→ moved findings/PENDING/CC-0001 → findings/CONFIRMED/CC-0001
```
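The replay script itself is plain shell. A minimal sketch of what a replay.sh for CC-0001 might contain; the exact make target, URL and assertion are assumptions for illustration, not the generated script:

```bash
#!/usr/bin/env bash
# evidence/CC-0001/replay.sh (illustrative sketch)
set -euo pipefail

make sandbox-up                                   # bring up the sandboxed target build

# Re-fire the recorded payload and keep the raw response as evidence.
curl -sG "http://localhost:8080/search" \
     --data-urlencode "q=' OR 1=1--" \
     -o response.body

grep -q "users.pw_hash" response.body             # assertion: leaked column is present
echo "CC-0001 reproduced"
```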
A single Markdown file, validated frontmatter, links to evidence on disk. This is the unit of work in CodeCome — not a Jira ticket, not a row in a database.
Sanitized snapshots from real audits — enough to show the workflow, not enough to leak target-specific exploit details or credentials. Click any tile for the full-size image.
Reviewable hypotheses awaiting validation.
Agentic, but auditable from the file system up.
Validation before belief — inside Docker.
Every confirmed claim leaves files behind.
Sandbox scripts produced on demand per finding.
Readable PoC writeups, not chat scrollback.
Try to disprove first — then validate.
Exploited findings with linked artifacts.
An asciinema cast of a full run is planned.
CodeCome won't turn a non-researcher into one. It will save a researcher hours of bookkeeping per audit.
Audit codebases at your own pace without losing context when you come back a month later.
Bring a structured methodology to internal security reviews. Hand the workspace to the next on-call.
From recon to PoC, every step lands in the workspace, ready to be turned into a deliverable.
Pin a sandbox build, replay a payload, decide if your patch actually closed the door.
An opinionated, instrumented harness for experimenting with prompt strategies and counter-analysis loops.
Walk students through six phases of a real audit using a project they can clone.
CodeCome runs on top of OpenCode with your own LLM provider, plus a small Python + Make + Docker stack. make check warns about anything missing — but the core workflow runs without the optional tools. A rough sketch of those checks follows the list below.
The open-source AI coding agent that CodeCome drives. Install guide.
At least one of Anthropic, OpenAI, Google, xAI, Groq, Cerebras — or a local OpenAI-compatible endpoint. Provider setup.
For workspace tooling. make venv creates a local virtualenv.
The entire workflow is driven through make targets.
Required for the sandboxed validation environment in Phases 1b / 4 / 5.
Terminal recordings of exploit replays.
Renders .cast files to GIFs. CodeCome falls back to a Docker container if missing.
For GUI / browser exploits where video evidence matters. xvfb-run is fine too.
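For orientation, roughly the kind of checks make check performs, sketched from the list above; the tool names for the optional extras (agg in particular) and the messages are assumptions, and the real target also verifies model credentials and the sandbox build:

```bash
# Hard requirements: fail fast if any are missing.
for tool in opencode python3 make docker; do
  command -v "$tool" >/dev/null || { echo "missing required tool: $tool"; exit 1; }
done

# Optional extras: warn only; the core workflow runs without them.
for tool in asciinema agg xvfb-run; do
  command -v "$tool" >/dev/null || echo "optional tool not found: $tool"
done
```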
Comments, docstrings, READMEs, test fixtures, log strings, commit messages, filenames — even crafted binary blobs inside src/ — can carry instructions aimed at the agent ("ignore previous instructions…", "exfiltrate $HOME/.ssh/…"). The agent is meant to treat all of this as data, but LLMs remain susceptible to instructions embedded in their input.
Phase 1b will try to build and run the target. A malicious setup.py, package.json lifecycle hook, Makefile, Dockerfile, or configure script executes inside the sandbox container with whatever permissions Docker gives it.
Adversarial code may try to burn CPU, fill disk, or hammer the network from the validation phase. A prompt-injected runaway agent loop can burn tokens just as easily.
If the sandbox (or your host) can reach the internet, an injected agent or a malicious build step can attempt to send data out. The default policy assumes egress is possible.
Run the whole workspace inside an isolation boundary when auditing untrusted sources — a disposable VM (Multipass, Vagrant, UTM, Proxmox), a dedicated container, or a remote throwaway host. Do not run CodeCome on a machine that holds credentials, SSH keys, browser profiles, or production access you can't afford to lose.
Treat src/ as untrusted. CodeCome funnels execution through sandbox/, but the make runner itself, the agent, and any helper scripts still execute on the host.
Restrict network egress from the sandbox (and ideally from the outer VM) to only what you need for builds and package installs; a hardening sketch follows this list.
Use a fresh API key with low spend limits for the LLM provider, so a prompt-injected runaway loop can't rack up an unbounded bill.
Review what the agent writes under itemdb/, sandbox/ and tmp/ before trusting any of it. Findings, evidence and reports are all attacker-influenced when the target is untrusted.
Avoid make exploit-all and make validate-all on untrusted targets until you have walked at least one finding through manually and confirmed the sandbox behaves the way you expect.
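One way to enforce the egress and resource advice at the Docker level when starting the sandbox container by hand; the flags illustrate the approach and are not CodeCome defaults:

```bash
# No network at all: payload replays talk to the target over localhost inside
# the container, so most validations need no egress. The memory/CPU/pid caps
# contain runaway builds and fork bombs.
docker run --rm \
  --network none \
  --memory 2g --cpus 2 --pids-limit 256 \
  codecome/sandbox:itemdb

# If a build genuinely needs package installs, prefer an internal-only network
# (plus an allow-listed proxy) over open internet access.
docker network create --internal audit-net
```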
Clone the repo, drop your source under src/, edit codecome.yml, and run the phases. Each target is restartable.
```console
# bootstrap a virtual env and install deps
$ make venv

# sanity-check your environment, model creds, sandbox
$ make check

# six phases of an audit
$ make phase-1
$ make phase-2
$ make phase-3
$ make phase-4 FINDING=CC-0001
$ make phase-5 FINDING=CC-0001
$ make phase-6
```
Recon, hypothesis, counter-analysis, validation, exploit, reporting — each a single target.
YAML frontmatter, free-form body. Validated against a schema, rendered everywhere.
Reproducible build of the audited project, isolated from the host.
Findings move between PENDING/, CONFIRMED/ and the rest with git mv.
Phase prompts checked into the repo. Diff them, fork them, A/B them.
Every confirmed finding ships a replay.sh. Reproduction is one command.
Crashed mid-audit? Re-run a single phase without losing the rest.
Your repo browser is the UI. tree findings/ tells you the whole state.
A CodeCome workspace is a normal git repo with a small, fixed set of folders. You can drop it into any IDE and read it as code.
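An illustrative top level, assembled from the paths mentioned on this page; the exact names in the repo may differ:

```bash
$ tree -L 1
.
├── codecome.yml   # scope, model and sandbox config
├── Makefile       # phase targets, venv, check
├── src/           # the audited code, mounted in and treated as untrusted
├── prompts/       # versioned phase prompts
├── schemas/       # finding.yml frontmatter schema
├── notes/         # recon.md and other working notes
├── findings/      # PENDING/, CONFIRMED/, EXPLOITED/, REJECTED/, DUPLICATE/
├── evidence/      # payloads, logs and replay.sh per finding
├── sandbox/       # Dockerfile + run.sh for the validation build
└── tmp/           # scratch space
```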
Each phase runs through OpenCode, a local agent CLI that reads and writes files. The model isn't dictating to a chat window — it is editing your workspace under the harness's supervision.
The agent reads notes/, findings/ and src/ directly. Context is the workspace, not a sliding window.
Every shell command, file read and file write is logged. The audit trail includes the agent's actions, not just its prose.
OpenCode is the default. If you prefer another agent runner, the phase scripts call a thin wrapper you can swap.
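A wrapper in that spirit can be a few lines of shell. The script name, argument convention and the exact OpenCode invocation below are assumptions for illustration, not the shipped wrapper:

```bash
#!/usr/bin/env bash
# scripts/agent.sh (hypothetical): the one place that knows which agent runs.
# Swap the last line to use a different runner; phase targets only pass a prompt file.
set -euo pipefail

PROMPT_FILE="$1"                  # e.g. prompts/phase-2-hypothesis.md (assumed layout)

# Non-interactive run against the current workspace.
opencode run "$(cat "$PROMPT_FILE")"
```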
CodeCome separates a "main" model that does the heavy reasoning from a cheaper "small" model used for triage, classification and prompt-shaping. You can mix providers; you can run fully local.
Used for recon, hypothesis, counter-analysis and exploit. Picks: Anthropic Claude Sonnet, OpenAI GPT-4-class, or a strong local model.
Used for triage, deduplication and metadata extraction. Picks: Anthropic Haiku, OpenAI mini, or a 7B–13B local model.
Point both tiers at ollama:// or any OpenAI-compatible local endpoint. No code leaves your machine.
The model helps you think. You stay in control. Nothing is committed to a finding folder without an explicit phase being run.
CodeCome is v0.x. The conventions are stable enough to use; the tooling around them is still moving. We won't pretend otherwise.
CodeCome's value is in its methodology, not its UI. The docs explain how each phase works and what its prompts assume.
Install, configure a model, run your first audit. README ↗
The six phases in detail, with the reasoning behind each prompt. docs/methodology.md ↗
Layout, frontmatter schema, naming conventions. docs/workspace.md ↗
Templates for Go, Python, Node, Rust and C/C++ projects. docs/sandbox.md ↗
All phase prompts, versioned and commented. prompts/ ↗
What CodeCome won't do today, and what we are working on next. Issues ↗
CodeCome is small. A patch to a phase prompt, a sandbox template for a new language, or a bug report on a confusing convention are all valuable. We won't accept PRs that turn this into a scanner.
Improve a phase prompt with a diff and a short rationale. Bring an audit log if you can.
Contribute a Dockerfile + run.sh for a stack we don't cover yet.
Disagree with a phase boundary? Open a discussion before a PR.
CLI ergonomics, schema validation, report generation — all welcome.
Dual-licensed — pick whichever copyleft fits your context. Sandbox templates under templates/sandboxes/ ship under MIT so they can be copied into user workspaces without imposing copyleft on those projects.