Teaching an AI to Work in a Cage (On Purpose)
TLDR: When your AI partner can run code on your computer, you need a way to let it work without letting it wreck anything. I built a containment system: sandboxed containers for code execution, network filtering so nothing phones home, and a credential vault that the AI literally cannot see — only I can open it with a code from my phone.
The problem
Here’s a scenario that should make you nervous: you have an AI that can execute code on your computer, install software, read your files, and access the internet. You trust it — it’s helpful, it does good work. But what if someone tricks it? What if a malicious website or a cleverly crafted email plants instructions that your AI follows without realizing they’re harmful?
This isn’t hypothetical. Prompt injection — tricking an AI into following hidden instructions — is a real and growing attack vector. And when your AI can run commands on your machine, the stakes are concrete: it could read your passwords, delete your files, or send your private data to an attacker.
So I had a dilemma. I need my AI partner Bob to run code — that’s a huge part of his value. He builds software, runs tests, installs packages, manages systems. But I also need to make sure that if something goes wrong, the damage is contained.
The answer: let him work, but inside a cage. And keep the truly dangerous stuff behind a lock that only I can open.
```mermaid
graph TD
    A[Bob] --> B{Task type?}
    B -->|Low risk| C[Direct access]
    B -->|Code execution| D[Container]
    B -->|Credentials| E[MFA Vault]
    D --> F[Review Gate]
    F -->|Clean| G[Approved]
    F -->|Suspicious| H[Blocked]
```
Containerized execution: the cage
When Bob needs to build software or run experiments, he doesn’t do it directly on my computer. Instead, he works inside a container — think of it as a separate, isolated mini-computer running inside my computer. The container has its own file system, its own network rules, and strict limits on what it can do.
If something goes wrong inside the container — a bad package, a compromised dependency, a hallucinated command — it can’t escape. It can’t read my personal files, install malware on my system, or access things it shouldn’t. The blast radius is contained.
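For the curious, here’s roughly what launching one of these caged runs looks like. This is a sketch assuming Docker; the image name, network name, mount paths, and resource limits are illustrative, not my exact setup.

```python
# Sketch: building a `docker run` invocation with the cage's restrictions.
# Image name, network name, and limits are illustrative assumptions.

def container_run_cmd(workspace: str, command: str) -> list[str]:
    """Assemble a locked-down container invocation for one build/test run."""
    return [
        "docker", "run", "--rm",
        "--network", "egress-proxy",       # only the filtered network, no open internet
        "--read-only",                     # root filesystem is immutable
        "--tmpfs", "/tmp",                 # scratch space that vanishes with the container
        "--volume", f"{workspace}:/work",  # only the project directory is writable
        "--workdir", "/work",
        "--memory", "4g", "--cpus", "2",   # resource caps stop runaway processes
        "build-sandbox:latest",            # hypothetical sandbox image
        "sh", "-c", command,
    ]

cmd = container_run_cmd("/home/me/projects/demo", "npm test")
```

The point of building the command in one place is that every run gets the same restrictions by construction — there’s no “quick” unrestricted path to forget about.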
But I didn’t want to make the cage so restrictive that Bob can’t do real work. So there’s a graduated trust model:
Reading files and searching the web? That happens on the main system freely — it’s low risk. Building software and running tests? That happens in a container. Modifying system configuration? That requires my explicit approval.
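The tiers above boil down to a small policy table. A sketch in Python — the task labels and tier names are my own shorthand, not anything canonical:

```python
# Sketch of the graduated trust model as a policy table.
# Task names and tiers are illustrative labels for the scheme above.

POLICY = {
    "read_files":    "direct",     # low risk: runs on the main system
    "web_search":    "direct",
    "build":         "container",  # code execution goes in the cage
    "run_tests":     "container",
    "install_pkgs":  "container",
    "system_config": "approval",   # requires explicit human sign-off
}

def venue_for(task: str) -> str:
    """Default-deny: an unrecognized task type needs approval, not direct access."""
    return POLICY.get(task, "approval")
```

The one design choice worth stealing is the default: anything the table doesn’t recognize falls through to the most restrictive tier, not the least.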
When Bob finishes work in a container, the results don’t automatically make it to my system. There’s a mandatory review gate — a script that checks for suspicious files, unexpected changes, and potential security issues. Only after passing that gate do the changes get committed.
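The gate itself doesn’t need to be fancy. A minimal sketch — the red-flag patterns here are illustrative examples, not the full checklist:

```python
# Sketch of a review gate: scan container output for red flags before
# anything is committed. Patterns and filenames are illustrative.
import re
from pathlib import Path

SUSPICIOUS_NAMES = {".ssh", "id_rsa", ".aws", "credentials"}
SUSPICIOUS_CONTENT = re.compile(
    r"curl .*\|\s*(sh|bash)"       # piping a download straight into a shell
    r"|AKIA[0-9A-Z]{16}"           # AWS access key IDs
    r"|-----BEGIN .*PRIVATE KEY"   # embedded private keys
)

def review(workspace: Path) -> list[str]:
    """Return a list of findings; an empty list means the gate passes."""
    findings = []
    for path in workspace.rglob("*"):
        if path.name in SUSPICIOUS_NAMES:
            findings.append(f"suspicious file: {path}")
        elif path.is_file():
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            if SUSPICIOUS_CONTENT.search(text):
                findings.append(f"suspicious content in {path}")
    return findings
```

Any finding blocks the commit and gets surfaced to me; a clean scan is a necessary condition, not a substitute for reading the diff.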
The network leash
Containers can access the internet — they need to download packages and reach APIs. But they can’t access everything. An egress proxy filters all outbound traffic: the container can reach npm, PyPI, and the AWS API, but nothing else. If compromised code tries to send data to an attacker’s server, the connection gets blocked.
Think of it like a building where the doors are locked and there’s a security guard who checks IDs. The container can leave through approved exits, but it can’t just walk out any door it wants.
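At its core, the guard’s decision is just an allowlist check. A sketch — the hostnames are the examples from above, and the real proxy enforces this at the network layer rather than in application code:

```python
# Sketch of the egress filter's core decision: outbound connections are
# allowed only to an explicit list of hosts. Entries are illustrative.

ALLOWED_HOSTS = {
    "registry.npmjs.org",       # npm
    "pypi.org",                 # PyPI metadata
    "files.pythonhosted.org",   # PyPI package downloads
}
ALLOWED_SUFFIXES = (".amazonaws.com",)  # AWS APIs

def egress_allowed(host: str) -> bool:
    host = host.lower().rstrip(".")
    return host in ALLOWED_HOSTS or host.endswith(ALLOWED_SUFFIXES)
```

Everything not on the list is refused, so compromised code trying to phone home hits a wall instead of the internet.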
Secrets the AI cannot see
Some things are too sensitive for the cage model. Bank credentials, investment account access, personal API keys — these need a completely different approach. Even if Bob is working perfectly, I don’t want these credentials to ever exist in his context. Not because I don’t trust him, but because if someone found a way to manipulate him, the credentials would be exposed.
The solution: MFA-gated access. Critical credentials are stored in a cloud vault (AWS Secrets Manager). To access them, a script needs a one-time code from my phone — like the two-factor authentication you use for your bank. Bob can ask me for the code, but he can’t generate it himself. And the credential never passes through his conversation — it exists only in the running script’s memory, and disappears when the script finishes.
So the flow is: Bob determines he needs a credential. He messages me: “I need your verification code to sync bank data.” I decide whether to approve. If I do, I give the code directly to the script, the script fetches the credential, does its work, returns only the results, and the credential is gone. Bob sees the results but never sees the secret.
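Here’s the shape of that flow in code. This is a sketch: `fetch_secret` stands in for the real Secrets Manager call (made with an MFA-derived session) and is injected as a parameter, so the key property is visible — the credential exists only inside the function’s scope, and only results come back out.

```python
# Sketch of the MFA-gated credential flow. `fetch_secret` and `do_sync`
# are stand-ins for the real vault call and the real work; the point is
# the shape: the secret never leaves this function.

def sync_bank_data(mfa_code: str, fetch_secret, do_sync) -> dict:
    """Fetch a credential with the human-supplied code, use it, discard it."""
    secret = fetch_secret(mfa_code)  # exists only in this stack frame
    try:
        return do_sync(secret)       # results contain no credential material
    finally:
        secret = None                # drop the reference when the work is done
```

The conversation with the AI only ever carries the function’s return value, never `secret` — which is the whole trick.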
Why this matters
Every AI assistant with real capabilities faces this problem. The more powerful the AI, the bigger the risk if something goes wrong. Most people solve this by limiting what the AI can do — keep it in a chat box, don’t let it touch anything real.
I went the other direction: give the AI real power, but build the safety infrastructure to contain the risk. Containers for everyday code execution, micro-VMs for sensitive workloads (where even the secret itself can only reach the specific server it’s meant for), and credential isolation where the AI literally cannot see the password. The AI can do real work without me losing sleep over what happens if things go sideways.
The principle is the same one used in banking, military systems, and nuclear power: defense in depth. No single layer is perfect. But stacking multiple layers of protection makes the overall system resilient to any single failure.
The bigger lesson: trust isn’t binary. You don’t either trust your AI completely or not at all. You build systems where trust is graduated, proportional to the risk, and backed by technical enforcement — not just good intentions.

