CodeCome is a harness for building your own vulnerability research practice at home. It turns source-code audits into a repeatable six-phase workflow with Markdown findings, YAML metadata, sandbox validation, proof-of-concept evidence and final reports.
LLM-assisted security research has a memory problem. Findings live inside chat threads that get pruned, summarized, and lost. The model "remembers" your audit until it doesn't. CodeCome treats every step of an audit like an artifact you can re-read, diff and ship.
CodeCome is not a vulnerability scanner. It is not a pentest tool. It is not a magic AI bug finder. It is a harness — a set of conventions, prompts, and Make targets — that encodes how a careful researcher actually audits code. The model helps you think. You stay in control.
A workspace layout, naming scheme, finding template and Make targets. Nothing you can't read in an afternoon.
Each phase has explicit prompts checked into the repo. You can fork, audit and version them like any other code.
A finding is not "real" until it has been counter-argued, validated in a sandbox, and reproduced from an artifact.
CodeCome doesn't ingest your repo into a black box. You mount source code into a workspace, declare scope in a YAML file, and trigger phases via make. Each phase reads and writes plain files. You can stop and resume anywhere.
```yaml
# Scope and configuration for this audit
project: itemdb
language: go

entrypoints:
  - cmd/server/main.go
  - internal/http/router.go

scope:
  include: ["internal/**", "pkg/**"]
  exclude: ["vendor/**", "**/_test.go"]

model:
  provider: anthropic       # or openai · ollama · …
  name: claude-sonnet-4-5
  small: claude-haiku-4-5   # cheap pass for triage

sandbox:
  image: codecome/sandbox:itemdb
  build: make -C sandbox build
  run: ./sandbox/run.sh
```
Each phase has its own prompt, its own outputs, and writes to a known location on disk. You can re-run a single phase without losing previous work.
Read the codebase. Build a map of modules, trust boundaries, entry points, sinks and external dependencies. Output: notes/recon.md.
From the recon map, generate concrete attack hypotheses. Each one is a file under findings/PENDING/ with severity, attacker model and code references.
Argue against each hypothesis. The model plays devil's advocate and looks for reasons the bug isn't exploitable. Weak hypotheses get marked REJECTED.
Build a minimal proof inside a Docker sandbox. If a payload survives, the finding moves to CONFIRMED. If it can't be reproduced, back to PENDING or REJECTED.
Turn the validated finding into a working proof-of-concept. Capture payloads, command output and logs under evidence/. Status moves to EXPLOITED.
Compile the workspace into a coherent report: executive summary, findings with evidence links and remediation. Markdown in, Markdown or PDF out.
Every finding lives in exactly one folder, named after its current state. Moving a finding between states is just git mv — nothing magical, nothing hidden; a one-line sketch follows the state list below.
An idea worth investigating. Not yet validated. Might be a real bug, might be wrong.
The condition has been demonstrated. Not weaponized yet, but no longer in doubt.
A working exploit exists under evidence/, reproducible by anyone with the workspace.
Counter-analysis or validation killed the hypothesis. Kept on disk so it isn't re-investigated from scratch.
The same root cause as another finding. Linked, not deleted, to preserve the audit trail.
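A state change is just a move plus a commit. A minimal sketch, using CC-0002 as a hypothetical finding ID:

```bash
# Counter-analysis killed the hypothesis: move the folder and record why.
$ git mv findings/PENDING/CC-0002 findings/REJECTED/CC-0002
$ git commit -m "CC-0002: rejected, input is length-checked upstream"
```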
A finding is a Markdown file with a YAML frontmatter. Tools and humans read the same thing. No schema lock-in, no proprietary format.
```markdown
---
id: CC-0001
title: SQL injection in /search
status: CONFIRMED
severity: high
cwe: [CWE-89]
component: internal/http/search.go
discovered: 2026-05-11
phases:
  recon: done
  hypothesis: done
  counter: done
  validation: done
evidence:
  - evidence/CC-0001/payload.sql
  - evidence/CC-0001/curl.log
---

# CC-0001 — SQL injection in /search

## Summary
The /search endpoint concatenates the q parameter into a raw SQL statement.
Authentication is not required.

## Reproduction
1. Run the sandbox: make sandbox-up
2. curl "http://localhost:8080/search?q=' OR 1=1--"
3. Observe full table dump in response body.

## Remediation
Use database/sql parameterized queries; reject input containing SQL meta-characters before logging.
```
A vulnerability research project should still be readable in five years. Files survive renaming, forking, GitHub outages and SQL migrations. grep, git log and diff are the only tools you need.
The frontmatter is validated against a schema in schemas/finding.yml. Bad metadata fails fast in CI. The Markdown body stays free-form so researchers aren't fighting a form.
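A rough sketch of what that fast-failing check could look like in CI, assuming PyYAML and jsonschema are available in the make venv environment and that schemas/finding.yml follows JSON Schema conventions; the extraction one-liner and the finding path are illustrative, not CodeCome's actual tooling:

```bash
# Pull the YAML frontmatter out of one finding and validate it against the schema.
f=findings/CONFIRMED/CC-0001/finding.md           # illustrative path
awk '/^---$/{n++; next} n==1' "$f" > /tmp/frontmatter.yml

python3 - <<'PY'
import yaml, jsonschema                           # pip install pyyaml jsonschema
meta   = yaml.safe_load(open("/tmp/frontmatter.yml"))
schema = yaml.safe_load(open("schemas/finding.yml"))
jsonschema.validate(meta, schema)                 # raises, and fails CI, on bad metadata
print("frontmatter OK:", meta["id"], meta["status"])
PY
```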
Findings render in any Markdown viewer — GitHub, Obsidian, an editor preview pane. The "dashboard" is just a folder.
Before a finding is marked CONFIRMED, CodeCome reproduces it against a real build of the project — in a Docker container, behind a network namespace, away from your host. If the payload doesn't fire there, it doesn't make the cut.
```text
→ building sandbox image … 8.2s
→ starting container … healthy
→ replaying payload … evidence/CC-0001/replay.sh

HTTP/1.1 200 OK
Content-Length: 4188
Body contains: "users.email", "users.pw_hash"

→ assertions
  ✓ status 200
  ✓ response contains users.pw_hash
  ✓ query log shows UNION SELECT

→ result: CONFIRMED
→ moved findings/PENDING/CC-0001 → findings/CONFIRMED/CC-0001
```
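The replay script itself is plain shell. A minimal sketch of what a replay.sh for CC-0001 might contain; the exact make target, URL and assertion are assumptions for illustration, not the generated script:

```bash
#!/usr/bin/env bash
# evidence/CC-0001/replay.sh (illustrative sketch)
set -euo pipefail

make sandbox-up                                   # bring up the sandboxed target build

# Re-fire the recorded payload and keep the raw response as evidence.
curl -sG "http://localhost:8080/search" \
     --data-urlencode "q=' OR 1=1--" \
     -o response.body

grep -q "users.pw_hash" response.body             # assertion: leaked column is present
echo "CC-0001 reproduced"
```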
A single Markdown file, validated frontmatter, links to evidence on disk. This is the unit of work in CodeCome — not a Jira ticket, not a row in a database.
Sanitized snapshots from real audits — enough to show the workflow, not enough to leak target-specific exploit details or credentials. Click any tile for the full-size image.
Reviewable hypotheses awaiting validation.
Agentic, but auditable from the file system up.
Validation before belief — inside Docker.
Every confirmed claim leaves files behind.
Sandbox scripts produced on demand per finding.
Readable PoC writeups, not chat scrollback.
Try to disprove first — then validate.
Exploited findings with linked artifacts.
An asciinema cast of a full run is planned.
CodeCome won't turn a non-researcher into one. It will save a researcher hours of bookkeeping per audit.
Audit codebases at your own pace without losing context when you come back a month later.
Bring a structured methodology to internal security reviews. Hand the workspace to the next on-call.
From recon to PoC, every step lands in the workspace, ready to be turned into a deliverable.
Pin a sandbox build, replay a payload, decide if your patch actually closed the door.
An opinionated, instrumented harness for experimenting with prompt strategies and counter-analysis loops.
Walk students through six phases of a real audit using a project they can clone.
CodeCome runs on top of OpenCode with your own LLM provider, plus a small Python + Make + Docker stack. make check warns about anything missing — but the core workflow runs without the optional tools. A rough sketch of those checks follows the list below.
The open-source AI coding agent that CodeCome drives. Install guide.
At least one of Anthropic, OpenAI, Google, xAI, Groq, Cerebras — or a local OpenAI-compatible endpoint. Provider setup.
For workspace tooling. make venv creates a local virtualenv.
The entire workflow is driven through make targets.
Required for the sandboxed validation environment in Phases 1b / 4 / 5.
Terminal recordings of exploit replays.
Renders .cast files to GIFs. CodeCome falls back to a Docker container if missing.
For GUI / browser exploits where video evidence matters. xvfb-run is fine too.
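For orientation, roughly the kind of checks make check performs, sketched from the list above; the tool names for the optional extras (agg in particular) and the messages are assumptions, and the real target also verifies model credentials and the sandbox build:

```bash
# Hard requirements: fail fast if any are missing.
for tool in opencode python3 make docker; do
  command -v "$tool" >/dev/null || { echo "missing required tool: $tool"; exit 1; }
done

# Optional extras: warn only; the core workflow runs without them.
for tool in asciinema agg xvfb-run; do
  command -v "$tool" >/dev/null || echo "optional tool not found: $tool"
done
```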
Comments, docstrings, READMEs, test fixtures, log strings, commit messages, filenames — even crafted binary blobs inside src/ — can carry instructions aimed at the agent ("ignore previous instructions…", "exfiltrate $HOME/.ssh/…"). The agent is meant to treat all of this as data, but LLMs remain susceptible to instructions embedded in their input.
Phase 1b will try to build and run the target. A malicious setup.py, package.json lifecycle hook, Makefile, Dockerfile, or configure script executes inside the sandbox container with whatever permissions Docker gives it.
Adversarial code may try to burn CPU, fill disk, or hammer the network from the validation phase. A prompt-injected runaway agent loop can burn tokens just as easily.
If the sandbox (or your host) can reach the internet, an injected agent or a malicious build step can attempt to send data out. The default policy assumes egress is possible.
Run the whole workspace inside an isolation boundary when auditing untrusted sources — a disposable VM (Multipass, Vagrant, UTM, Proxmox), a dedicated container, or a remote throwaway host. Do not run CodeCome on a machine that holds credentials, SSH keys, browser profiles, or production access you can't afford to lose.
Treat src/ as untrusted. CodeCome funnels execution through sandbox/, but the make runner itself, the agent, and any helper scripts still execute on the host.
Restrict network egress from the sandbox (and ideally from the outer VM) to only what you need for builds and package installs; a hardening sketch follows this list.
Use a fresh API key with low spend limits for the LLM provider, so a prompt-injected runaway loop can't rack up an unbounded bill.
Review what the agent writes under itemdb/, sandbox/ and tmp/ before trusting any of it. Findings, evidence and reports are all attacker-influenced when the target is untrusted.
Avoid make exploit-all and make validate-all on untrusted targets until you have walked at least one finding through manually and confirmed the sandbox behaves the way you expect.
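One way to enforce the egress and resource advice at the Docker level when starting the sandbox container by hand; the flags illustrate the approach and are not CodeCome defaults:

```bash
# No network at all: payload replays talk to the target over localhost inside
# the container, so most validations need no egress. The memory/CPU/pid caps
# contain runaway builds and fork bombs.
docker run --rm \
  --network none \
  --memory 2g --cpus 2 --pids-limit 256 \
  codecome/sandbox:itemdb

# If a build genuinely needs package installs, prefer an internal-only network
# (plus an allow-listed proxy) over open internet access.
docker network create --internal audit-net
```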
Clone the repo, drop your source under src/, edit codecome.yml, and run the phases. Each target is restartable.
```console
# bootstrap a virtual env and install deps
$ make venv

# sanity-check your environment, model creds, sandbox
$ make check

# six phases of an audit
$ make phase-1
$ make phase-2
$ make phase-3
$ make phase-4 FINDING=CC-0001
$ make phase-5 FINDING=CC-0001
$ make phase-6
```
Recon, hypothesis, counter-analysis, validation, exploit, reporting — each a single target.
YAML frontmatter, free-form body. Validated against a schema, rendered everywhere.
Reproducible build of the audited project, isolated from the host.
Findings move between PENDING/, CONFIRMED/ and the rest with git mv.
Phase prompts checked into the repo. Diff them, fork them, A/B them.
Every confirmed finding ships a replay.sh. Reproduction is one command.
Crashed mid-audit? Re-run a single phase without losing the rest.
Your repo browser is the UI. tree findings/ tells you the whole state.
A CodeCome workspace is a normal git repo with a small, fixed set of folders. You can drop it into any IDE and read it as code.
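An illustrative top level, assembled from the paths mentioned on this page; the exact names in the repo may differ:

```bash
$ tree -L 1
.
├── codecome.yml   # scope, model and sandbox config
├── Makefile       # phase targets, venv, check
├── src/           # the audited code, mounted in and treated as untrusted
├── prompts/       # versioned phase prompts
├── schemas/       # finding.yml frontmatter schema
├── notes/         # recon.md and other working notes
├── findings/      # PENDING/, CONFIRMED/, EXPLOITED/, REJECTED/, DUPLICATE/
├── evidence/      # payloads, logs and replay.sh per finding
├── sandbox/       # Dockerfile + run.sh for the validation build
└── tmp/           # scratch space
```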
Each phase runs through OpenCode, a local agent CLI that reads and writes files. The model isn't dictating to a chat window — it is editing your workspace under the harness's supervision.
The agent reads notes/, findings/ and src/ directly. Context is the workspace, not a sliding window.
Every shell command, file read and file write is logged. The audit trail includes the agent's actions, not just its prose.
OpenCode is the default. If you prefer another agent runner, the phase scripts call a thin wrapper you can swap.
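A wrapper in that spirit can be a few lines of shell. The script name, argument convention and the exact OpenCode invocation below are assumptions for illustration, not the shipped wrapper:

```bash
#!/usr/bin/env bash
# scripts/agent.sh (hypothetical): the one place that knows which agent runs.
# Swap the last line to use a different runner; phase targets only pass a prompt file.
set -euo pipefail

PROMPT_FILE="$1"                  # e.g. prompts/phase-2-hypothesis.md (assumed layout)

# Non-interactive run against the current workspace.
opencode run "$(cat "$PROMPT_FILE")"
```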
CodeCome separates a "main" model that does the heavy reasoning from a cheaper "small" model used for triage, classification and prompt-shaping. You can mix providers; you can run fully local.
Used for recon, hypothesis, counter-analysis and exploit. Picks: Anthropic Claude Sonnet, OpenAI GPT-4-class, or a strong local model.
Used for triage, deduplication and metadata extraction. Picks: Anthropic Haiku, OpenAI mini, or a 7B–13B local model.
Point both tiers at ollama:// or any OpenAI-compatible local endpoint. No code leaves your machine.
The model helps you think. You stay in control. Nothing is committed to a finding folder without an explicit phase being run.
CodeCome is v0.x. The conventions are stable enough to use; the tooling around them is still moving. We won't pretend otherwise.
CodeCome's value is in its methodology, not its UI. The docs explain how each phase works and what its prompts assume.
Install, configure a model, run your first audit. README ↗
The six phases in detail, with the reasoning behind each prompt. docs/methodology.md ↗
Layout, frontmatter schema, naming conventions. docs/workspace.md ↗
Templates for Go, Python, Node, Rust and C/C++ projects. docs/sandbox.md ↗
All phase prompts, versioned and commented. prompts/ ↗
What CodeCome won't do today, and what we are working on next. Issues ↗
CodeCome is small. A patch to a phase prompt, a sandbox template for a new language, or a bug report on a confusing convention are all valuable. We won't accept PRs that turn this into a scanner.
Improve a phase prompt with a diff and a short rationale. Bring an audit log if you can.
Contribute a Dockerfile + run.sh for a stack we don't cover yet.
Disagree with a phase boundary? Open a discussion before a PR.
CLI ergonomics, schema validation, report generation — all welcome.
Dual-licensed — pick whichever copyleft fits your context. Sandbox templates under templates/sandboxes/ ship under MIT so they can be copied into user workspaces without imposing copyleft on those projects.