Building a session retrospective skill for Claude Code

I've been using Claude Code for a while now, and I noticed a pattern: at the end of a productive session, I'd have this vague sense of "we figured out some useful stuff" but no concrete record of what those lessons actually were.

Recently, I learned of a skill called continuous-learning. It automatically extracts reusable patterns and saves them as skills. But I wanted something different. Not automated pattern extraction, but a human-readable summary I could actually share. Something I could look back on, or turn into a blog post.

So I built the session-retrospective skill.

What it does

The skill analyzes the current Claude session and generates a markdown summary covering:

  • What we set out to do
  • Problems encountered and how they were solved
  • Mistakes made and corrections
  • Techniques discovered worth remembering
  • Key takeaways

The output goes straight to console for copy/paste. No files created, no cleanup needed.

How it works

Claude Code stores session history as JSONL files in ~/.claude/projects/<project-dir>/<session-id>.jsonl. Each line is JSON with the message type, content, timestamps, and metadata. A simple bash script locates and outputs the session JSONL:

# PROJECTS_DIR is ~/.claude/projects; SESSION_ID identifies the current session
SESSION_FILE=$(find "$PROJECTS_DIR" -name "${SESSION_ID}.jsonl" -type f | head -1)
cat "$SESSION_FILE"

Once the history is fetched, the actual analysis is left to Claude with structured guidance. I provided a template for the output format and a table of what to look for (problems, decisions, techniques, mistakes). There's a lot of freedom, since synthesizing lessons requires judgment.
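
For reference, the template's sections mirror the list above; roughly this (paraphrased, not the exact wording in the skill):

# Session Retrospective

## What we set out to do
## Problems encountered and how they were solved
## Mistakes made and corrections
## Techniques worth remembering
## Key takeaways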

What's next

The skill works for my current needs. I haven't tested it extensively on very long sessions (the JSONL can get large and might need chunking). For now, it handles my typical single-session bursts fine.

Also, the output format might need iteration based on actual use. The template includes sections for "mistakes made" and "techniques worth remembering" but maybe other categories would be more useful. I'll adjust as I use it more.

The code is here if you want to try it.


Running AI agents in a box because I don't trust them

I built a Docker wrapper for Claude Code and OpenAI Codex. The main reason is simple: I don't trust AI agents running loose on my machine.

Working in cybersecurity, I've developed a healthy paranoia about software that can execute arbitrary commands. AI coding assistants are powerful, but they're also unpredictable: they can run shell commands, modify files, and access the network. I wanted all of that contained.

The setup

Claudecker is my personal tool that wraps Docker to run Claude Code CLI and Codex CLI in an isolated container. Point it at any project directory and it mounts that directory into the container. The AI can do whatever it wants inside the container, but it can't touch the rest of my system.

./claudecker.sh run /path/to/project

Each run starts with a fresh environment. Skills get reinstalled, settings reset to defaults. Only authentication tokens persist across restarts. This "clean slate" approach means I don't accumulate cruft or unexpected state changes.

The paranoid feature: network lockdown

The feature I'm most pleased with is the network lockdown toggle. It uses iptables to control the container's OUTPUT chain policy.

./claudecker.sh lockdown

This drops all outbound traffic except localhost and already-established connections. The AI can still work on code, but it can't phone home, download packages, or exfiltrate anything. I toggle this on when working on sensitive projects.

The implementation is straightforward. Just flipping between DROP and ACCEPT policies. The container needs NET_ADMIN capability for this to work, which is a trade-off I'm comfortable with since it's scoped to network operations.
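
To give a flavor of what the toggle does, here's a simplified sketch (not the actual script; it assumes iptables is available in the container and NET_ADMIN is granted):

# lockdown on: allow loopback and already-established connections, drop everything else
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -P OUTPUT DROP

# lockdown off: accept everything again and clear the exception rules
iptables -P OUTPUT ACCEPT
iptables -F OUTPUT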

Trade-offs from containerization

Isolation comes with friction. I had to solve several problems that wouldn't exist if I just ran the CLI directly on my host.

Browser authentication needs X11

Claude and Codex authentication uses browser-based OAuth flows. Inside a container, there's no browser. I ended up mounting the X11 socket and forwarding the DISPLAY variable:

volumes:
  - /tmp/.X11-unix:/tmp/.X11-unix:rw

On Linux this works if you have DISPLAY set. On macOS you need XQuartz. For headless environments, there's a fallback: manually copying the auth.json file from a machine where I've already logged in.

Thankfully, Claude Code and Codex CLI make this easier by providing a URL to visit, which gives you a code to enter back. This means I rarely need browser authentication, but at least the option is there.

SSH agent forwarding is annoying

Getting SSH keys into the container without copying them required some workarounds. The mounted agent socket's permissions don't always cooperate, so I ended up using socat to proxy the SSH agent socket:

sudo socat UNIX-LISTEN:/tmp/ssh-agent-forwarded,fork,mode=600,user=node \
          UNIX-CONNECT:/ssh-agent &

The container tries direct socket access first, falls back to socat if that fails. Limited sudo permissions ensure the node user can only run specific commands.
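
The sudo restriction is just a narrow sudoers rule; something along these lines (illustrative, not the exact entry):

node ALL=(root) NOPASSWD: /usr/bin/socat

In practice you'd also want to pin the allowed arguments, but even this keeps the node user from running arbitrary commands as root.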

Port forwarding for web apps

If you're developing a web app and want to access it from your host browser, you need to expose ports explicitly:

./claudecker.sh run --port 3000 /path/to/project

This maps the container's port to the host. Without this flag, localhost:3000 inside the container isn't reachable from outside.
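
Under the hood this is plain Docker port mapping; the flag presumably boils down to something like:

# host port on the left, container port on the right (illustrative)
docker run -p 3000:3000 ...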

Project-specific dependencies

Different projects need different tools. A C project needs gcc and cmake. A Python ML project needs different libraries. I didn't want to bloat the base image with everything.

The solution: a .claudecker file in the project directory.

# .claudecker
build-essential
cmake
gcc
python3-dev

On first run, the script hashes the file contents, generates a Dockerfile, and builds a custom image tagged with that hash. Subsequent runs use the cached image. Projects with identical .claudecker files share the same image.

claudecker-custom:a1b2c3d4e5f6

This content-based approach means I'm not rebuilding images unnecessarily, and cleanup is straightforward with clean-custom and clean-all-custom commands.
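
The mechanics are roughly this (a simplified sketch; the real script generates the Dockerfile from a template and handles more edge cases):

# derive the image tag from the dependency list
HASH=$(sha256sum .claudecker | cut -c1-12)
IMAGE="claudecker-custom:${HASH}"

# build only if we haven't seen this exact dependency list before
if ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
  {
    echo "FROM claudecker-base"   # base image name is illustrative
    echo "RUN apt-get update && apt-get install -y $(grep -v '^#' .claudecker | tr '\n' ' ')"
  } > Dockerfile.custom
  docker build -f Dockerfile.custom -t "$IMAGE" .
fi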

Skills system

A recent addition: Claudecker now supports Claude Skills, which are custom prompts that extend Claude's capabilities. I implemented two types:

  • GitHub skills get cloned on container startup. The Humanizer skill, for example, comes from a public repo and helps remove AI-isms from text.
  • Local skills are baked into the Docker image. I keep these in a local-skills/ directory.

The build process copies these into the image, and the entrypoint installs them into Claude's skills directory. This way I can version-control project-specific skills alongside the code.
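
The entrypoint side is essentially a clone-or-copy into Claude's skills directory. A rough sketch (the repo URL and in-image paths are placeholders, and I'm assuming personal skills live under ~/.claude/skills):

SKILLS_DIR="$HOME/.claude/skills"
mkdir -p "$SKILLS_DIR"

# GitHub skills: cloned fresh on every container start
git clone --depth 1 "$HUMANIZER_REPO_URL" "$SKILLS_DIR/humanizer"

# local skills: baked into the image at build time, copied into place here
cp -r /opt/local-skills/. "$SKILLS_DIR/"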

Multi-AI orchestration with PAL MCP

I also integrated PAL MCP Server, which lets Claude Code collaborate with other AI models (Gemini, GPT, Grok, local Ollama models). I export my API keys before running:

export OPENROUTER_API_KEY="your-key"
./claudecker.sh run /path/to/project

Inside Claude Code, I can ask it to use other models for second opinions, code review, or extended reasoning. The MCP server handles the routing.

This obviously requires network access, so it doesn't work in lockdown mode. Trade-offs.

Where's the code?

I planned to release this publicly but decided against it for now. There are rough edges:

  • The docker-compose volumes are duplicated in the docker run command for custom images. If you change one, you have to change the other. I left a warning comment but it's still error-prone.
  • The firewall whitelist script exists but isn't fully tested across different network configurations.
  • Some features assume specific host setups (X11, SSH agent running, etc.) and fail ungracefully when those assumptions don't hold.
  • Error handling is minimal in places.

I use this daily for my own work, but it's not polished enough for others to pick up without reading through the scripts first. Maybe later.

Current limitations

  • Network lockdown state doesn't persist across container restarts. Restart the container and you're back to full network access.
  • Custom image builds happen automatically but failures silently fall back to the base image. You might not notice a package didn't install.
  • X11 forwarding is a security surface I'm not entirely comfortable with, but I haven't found a better solution for browser auth.

What I actually use it for

Most days I run Claude Code in lockdown mode for general coding tasks. When I need it to fetch documentation or install packages, I toggle lockdown off, let it do its thing, then toggle it back on.

For security research projects, the isolation gives me peace of mind. The AI can analyze suspicious code, suggest modifications, even run tests, all without access to my actual filesystem or network.

It's not perfect containment. Docker isn't a security boundary the way a VM is. But it's enough friction that an AI agent can't accidentally (or intentionally) do something I'd regret.

For now, this setup works for my needs. The paranoia tax is a few extra seconds on startup and occasional friction with browser auth. Worth it.

Classifying More With Less: New VGL4NT Update

TLDR:

  • My packed-malware machine learning classifier could previously identify only 10 packers
  • The solution was a customized version of model ensembling: train multiple models and combine their results
  • It works, with the slight caveat of longer training and prediction times, which I can happily live with

I recently presented VGL4NT, my tool that uses machine learning to classify packed malware, at the Blackhat Middle East and Africa meetup. During my talk, I candidly shared one of the tool's limitations: it can only identify 10 packers because of my hardware constraints. If I want it to identify more, I either need more GPU power (which would be costly) or I keep my money and come up with a clever solution. Well, this post is about the latter.

A Simple Solution

The solution I came up with isn't exactly original. It's based on Task Decomposition, which involves training separate models for different categories and combining their predictions. This way, I could double the classification capacity without requiring additional hardware resources.

This was implemented by creating multiple machine learning models, each specializing in recognizing a subset of packers. The real challenge, however, lies in combining the predictions from these models to form a unified output.


Here's how the process works:

The packed malware file is fed into Model 1, which outputs probabilities for Packer 1, Packer 2, and Others. For example, it might produce:

  • Packer 1: 10%
  • Packer 2: 20%
  • Others: 70%

The same file is then fed into Model 2, which outputs probabilities for Packer 3, Packer 4, and Others. For instance:

  • Packer 3: 60%
  • Packer 4: 30%
  • Others: 10%

I then take the 'Others' category with the lowest probability. For our example, the final 'Others' probability would be 10% from Model 2.

The final probabilities are:

  • Packer 1: 10%
  • Packer 2: 20%
  • Packer 3: 60%
  • Packer 4: 30%
  • Others: 10%

Packer 3 has the highest probability in this example, and the file is classified as such.
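
Spelled out, the whole combination rule is just:

final Others = min(Others from Model 1, Others from Model 2)
prediction   = category with the highest value among
               {Packer 1, Packer 2, Packer 3, Packer 4, final Others}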

This simple combination approach preserves the relative confidence ranking across models while leveraging each model's strengths. The beauty of this method is not only its efficiency but also its scalability: I can introduce more models, each specializing in different packers, to further increase the classification capacity.

Now you might wonder why I'd even write about this if the solution is this simple. The funny thing is that I explored multiple approaches to unifying the output. Before this, I had fully implemented a complicated approach, only to realize while writing this blog post that a much simpler approach works well enough for the tool's purpose.

Downsides

I am conscious that this may or may not be the most effective method to tackle this problem. But what is essential is that the current computation is simple and preserves the prediction ranking based on the relative percentages. In essence, the category with the highest confidence score will always come out on top in the final output, which is primarily what users of my tool are interested in.

Aside from this, I am also conscious that increasing the number of categories increases both training and prediction time. I'm not too worried about the training time, since it happens behind the scenes and is invisible to users of the tool. I'm slightly more concerned about prediction time, since every model has to process each submission, and as I add more packer categories the prediction time will keep rising.

These downsides aren't a big problem, though. They can be addressed later if they start getting in the way of the tool's goals. For now, this will do.

Conclusion

I am genuinely happy with my progress with the VGL4NT Malware Packer Classifier. There are other topics I want to tackle, but I'll save those for future blog posts.

In the meantime, I invite you to check out the tool and see the changes yourself. Visit the VGL4NT website to get started. And for a more detailed walkthrough, you can also watch this YouTube video I created.