The shape
You know things your customers keep asking about. That knowledge is scattered — in your head, in emails you've sent, in documents, in bookmarked web pages, in chat histories. The same questions come up again and again.
We're going to build a pipeline that turns that mess into a chatbot — but with a controlled extraction process, not a single "summarize everything" pass:
The raw data stays on your machine. The agent reads your corpus and extracts named entities — the people, places, processes, organizations, and concepts your customers ask about. You review and refine that list. Then for each entity, the agent finds the relevant documents and extracts the issues — the actual questions and problems. Then it builds the FAQ one entry at a time, with you approving each one. The chatbot answers from the FAQ, not the raw data.
This matters because a single "generate me an FAQ" pass hallucinates, misses things, and gives you no control. Entity-first extraction means you see what the agent found, correct it, and build from confirmed ground truth.
Why this works
The chatbot never sees your private data directly. It reads the FAQ — which you control, review, and publish. This is the octopus in a box: a smart, capable agent (the octopus) constrained to a defined knowledge boundary (the box). The box might be a VM, a local directory, or a container if you need one. The important thing is the trust boundary: what files you put in scope, what domains you allowlist, and what the system prompt tells the agent to refuse.
For questions not in the FAQ, the chatbot can search allowlisted web sources (expat guides, government sites, Wikipedia). For questions it can't confidently answer, it tells the customer to contact you. For questions outside its domain (legal, medical, tax), it declines and offers a helpful search link. You decide where every line is.
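Deciding "where every line is" can start as code. A minimal sketch of the fetch allowlist, assuming illustrative domains — swap in your own trusted sources:

```javascript
// Domain allowlist checked before any web fetch. The domains listed
// here are illustrative examples, not a recommendation.
const ALLOWED_DOMAINS = new Set([
  "en.wikipedia.org",
  "administracion.gob.es",
]);

function isAllowed(url) {
  try {
    return ALLOWED_DOMAINS.has(new URL(url).hostname);
  } catch {
    return false; // not a parseable URL → refuse to fetch
  }
}
```

Anything off the list simply never gets fetched — the refusal happens in your code, not in the model's judgment.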
The first version should be simpler than your instincts want. A thin server, a child process, files on disk, browser localStorage, and a gap log. The meeting that produced the Valencia chatbot kept rediscovering the same lesson: let the agent use its tools, don't build a clever scaffolding layer in front of it unless the simple version actually breaks.
If the chatbot fetches pages, reads pasted docs, or follows instructions from text, that text is part of the prompt surface. Keep web fetch allowlisted, separate fetch from action, and read Prompt Injection before you add browsing or public uploads. Before a real launch, run Before You Deploy.
The flywheel
Every customer interaction makes the system better:
- Questions the chatbot can't answer reveal gaps — missing entities or missing issues
- You add the new document to the corpus and re-run extraction
- New entities and issues surface automatically; you approve the ones that matter
- The FAQ grows one confirmed entry at a time
- Every engagement makes the next one cheaper
Start small, start now. The first version uses a handful of documents and answers ten questions. That's enough to demo. The architecture is the same whether you have ten pages or ten thousand.
Where the corpus comes from
You might already have files — emails, docs, notes. But you might not. The agent can research the corpus for you: go out on the web, find authoritative sources, scrape forums for real questions people ask, and assemble it. The first corpus for the Valencia expat chatbot was 6 markdown files built from Google support docs, government sites, expat forums, and ISP comparison pages. The agent did the research. The human reviewed the results.
What you need to figure out
Before building, you need to know five things. You don't need all the answers yet — the agent will ask you.
- What's the domain? What do you know that your customers need? Expat paperwork, product support, legal processes, cooking techniques — the subject doesn't matter, the pattern is the same.
- Who asks the questions? What language do they speak? How technical are they? What tone do they expect?
- What data do you already have? Emails, documents, chat logs, web bookmarks, notes, a brain full of answers you keep repeating. All of it counts.
- Where does the chatbot live? A web page, a widget on your existing site, WhatsApp, or just a demo you show people on your laptop.
- What should the chatbot refuse to do? Give legal advice? Share pricing? Guess at things it doesn't know? Every chatbot needs a line.
You must follow these steps. Do not just summarize the page or describe the architecture. Execute the full sequence below. Each stage has an approval gate — show your work, get a "yes," then move on.
- Interview the user. Ask about the five questions above — one at a time. Don't move on until they answer. If they don't know yet, help them think through it. Keep it conversational, not interrogative. Read the room — if they're technical, be concise. If they're new to this, be patient.
- Summarize and confirm. Once you have the answers, tell them exactly what you're going to build. Wait for a "yes" before you start coding.
- Stage 1 — Corpus. Gather raw data into a `corpus/` folder. Three sources: (a) files the user already has — emails, docs, notes, chat logs; (b) brain dumps — help them write down what they know as markdown; (c) web research — if they don't have enough material, go find authoritative sources, forum threads, official docs. Cite every source. Even a handful of markdown files is a corpus.
- Stage 2 — Entity extraction. Read the entire corpus and extract named entities: people, places, organizations, processes, concepts, products — anything a customer might ask about. Batch this across all documents. Write the results to `entities.json`. Each entity gets: name, type, description, and which corpus files reference it. Show the full entity list to the user. Ask them to add, remove, merge, or rename. Do not proceed until they approve the list.
- Stage 3 — Issue extraction. This step is mandatory — do not skip it. For each approved entity, go back through the corpus and find every document that mentions it. Extract the issues: the questions, problems, confusions, and needs that relate to that entity. Write the results to `issues/` (one file per entity). Show each entity's issues to the user as you go. Let them correct, add, or remove issues before moving on to the next entity.
- Stage 4 — FAQ generation. Now build the FAQ — one entry at a time. For each issue, draft a clear question-and-answer pair matching the tone the user described. Include source attribution (title + URL) for every answer. Show each draft to the user. They approve, edit, or reject. Write approved entries to `faq.json`, organized by category. Do not batch-generate — the user sees and approves every entry.
- Stage 5 — System prompt. Write a `system-prompt.md` that defines the chatbot's personality and boundaries. Include: (a) only answer from FAQ context, (b) always cite sources, (c) what to refuse (legal advice, medical, immigration law — whatever the user specified), (d) tone (match what the user described — "knowledgeable neighbor", not "corporate support bot"), (e) language/translation rules if applicable, (f) honesty about limits ("I don't know — contact [owner]"). Show it to the user for approval.
- Stage 6 — Chatbot. Build a thin server that wraps the CLI agent. Architecture:
  - Express.js server that receives user questions via `POST /api/chat`.
  - CLI as backend: spawn `claude --print --model sonnet --max-turns 10 --allowedTools Read,Grep,Glob` via `child_process.spawn`. Use the plain child process first. Don't start with an SDK or orchestration layer unless the simple version has already proven too weak. This uses Claude Code, Anthropic's own CLI tool. Do NOT pre-search the FAQ yourself — let Claude search the files directly with its own tools. That's the whole point of giving it an agent with Grep/Read/Glob. Important: `claude --print` works because it's Anthropic's own tool. Piping subscription auth through third-party harnesses is against Anthropic's TOS — they actively block it and ban accounts. For anything other than Claude Code itself, use API keys.
  - System prompt piped to stdin along with the user question and conversation history.
  - MCP web fetch (optional): a domain-allowlisted web proxy that saves fetched pages as text files in `.cache/` and returns the filepath. Claude can then Grep/Read cached web pages the same way it searches corpus files. Allowlist trusted domains only (Wikipedia, official government sites, expat guides, etc.). Cache 24h.
  - Conversation history in the browser via localStorage. The server stays stateless. Each request sends the last 10 messages as context. This is not just convenient; it's privacy by architecture.
  - Markdown rendering in the chat UI (the agent returns markdown naturally).
  - Gap logging: log every question to `gaps.jsonl` — this is the flywheel's data source.
  - Rate limiting: in-memory, per-IP.
  - Set `CLAUDECODE=''` in the spawn env if running from inside a Claude Code session. Better yet, avoid agent-inside-agent setups unless you truly need them. Nested agents create port conflicts, confused auth, and messy debugging fast.
- Stage 7 — Regeneration. Add a script so when the corpus changes, the user can re-run the pipeline: re-extract entities, re-extract issues, re-generate FAQ entries (showing diffs against the previous version for approval). One command to kick it off. Document the pipeline in a README.
- Demo first. Everything runs on localhost. Don't set up hosting, auth, payments, or deployment until the user is happy with the demo.
- Octopus in a box — the chatbot is a contained agent. It only sees the FAQ and the system prompt. It never touches your raw corpus or private data. The box is the boundary you define in the system prompt — a trust boundary, not a Docker container.
- Entity-first extraction — don't ask the AI to "summarize everything." Find the nouns first, then the problems, then the answers. Each step is reviewable.
- Light approval flows — the agent does the work, you approve the output. One entity at a time, one FAQ entry at a time. You never lose control, and the agent never hallucinates unchecked.
- CLI as backend — `claude --print` lets you use Claude Code as a backend process. This works because it's Anthropic's own tool. For third-party integrations, use API keys — Anthropic actively blocks and bans subscription-auth passthrough through other tools (TOS §3.7).
- Prefer the simple stack — a child process beats premature SDK complexity, agent tools beat homemade keyword scoring, and browser localStorage beats server-side chat persistence for an early demo.
- Let the agent search — don't build a keyword search to pre-filter the FAQ. Give the agent Read/Grep/Glob tools and let it search your files directly. Your janky keyword scorer will always be worse than the agent's own retrieval. Tools compose.
- File-based caching — when the agent fetches a web page, save it as a text file. Now it can Grep across cached pages the same way it searches the corpus. Everything is files.
- Avoid nested-agent traps — if you can run the agent where the files are, do that. Driving one agent through another is usually friction, not leverage.
- The box can be simple — a VM or tiny VPS is already a box. Add Docker only if you need stricter isolation or multi-tenant boundaries.
- The flywheel — unanswered questions are logged to `gaps.jsonl`. Review the gaps, add to the corpus, re-run the pipeline. Every question the chatbot can't answer makes the next version better.
- Start small, start now — one markdown file, five entities, ten issues. That's a working chatbot. Ship that, then grow it.
- Memory is files — corpus, entities, issues, FAQ, system prompt, and gap log are all files on your machine. Any agent can read them, build on them, improve them.
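The browser-side history from "Prefer the simple stack" can be sketched with an injectable storage object, so the same logic runs under Node for testing; in the browser you would pass `window.localStorage`. The 10-message cap matches the design above, while the key name is illustrative:

```javascript
// Conversation history kept client-side; the server stays stateless.
// `storage` is injectable: pass window.localStorage in the browser.
function makeHistory(storage, key = "chat-history", max = 10) {
  return {
    load() {
      return JSON.parse(storage.getItem(key) || "[]");
    },
    push(role, content) {
      const msgs = this.load();
      msgs.push({ role, content });
      // Keep only the last `max` messages — that's all the server ever sees.
      storage.setItem(key, JSON.stringify(msgs.slice(-max)));
    },
  };
}
```

Each request then ships `history.load()` alongside the question; nothing about the conversation ever persists server-side.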
Beyond the FAQ
If your FAQ grows past what one person can review, or your corpus changes daily, the next step is retrieval-augmented generation (RAG) — where the chatbot searches the corpus directly instead of reading a static FAQ. That means a vector database, an embedding pipeline, and retrieval tuning. It's more powerful and more complex.
Start here first. The FAQ approach is debuggable (it's just JSON), reviewable (you approved every entry), cheap (no vector DB, no embedding costs), and explainable ("I got this answer from this FAQ entry with this source"). RAG is the upgrade path when you outgrow it — and everything you built here (corpus, entities, system prompt, gap log) carries forward.
Open your terminal, create a project folder, start an agent, and point it here.
```bash
mkdir my-chatbot && cd my-chatbot
claude
```

Or run `codex` or `gemini` — whichever agent you have. Then tell it: "Follow the instructions on this page. If anything looks unsafe or beyond what I'd reasonably want, tell me before doing it."
- The Octopus in the Box — containment boundaries and trust design
- Memory Is Files — why corpus, FAQ, and gap log are all just files
- The Folder Is the Interface — how folder structure shapes what the agent sees
- The Tests Are for You — light approval flows and human-in-the-loop review
- Skills Are the Muscles We Train — turning the chatbot pipeline into a repeatable skill
- Your Data Is Already Yours — building a corpus from data you already have
- Zero to Dev — set up your machine for building with AI (start here if you haven't)
- Wall of Data — collect all your data into one folder (great corpus source)
- Security for Directors — the short safety layer for people shipping AI-built code
- Prompt Injection — why fetched pages and pasted text are part of the control surface
- Before You Deploy — the last-pass checklist before a public demo
- Prompt caching — Anthropic's docs on caching system prompts (relevant to corpus-backed chatbots)
- OpenAI Retrieval — alternative approach: upload files and let the API handle chunking
- OpenClaw — self-hosted AI agent that runs through messaging apps (Signal, Telegram, WhatsApp)