Planet Cloudflare

Aggregated posts from the Cloudflare-adjacent community

February 20, 2026

The Software Development Lifecycle Is Dead

Boris Tane ·

AI agents didn't make the SDLC faster. They killed it.

I keep hearing people talk about AI as a "10x developer tool." That framing is wrong. It assumes the workflow stays the same and the speed goes up. That's not what's happening. The entire lifecycle, the one we've built careers around, the one that spawned a multi-billion dollar tooling industry, is collapsing in on itself.

And most people haven't noticed yet.

The SDLC you learned is a relic

Here's the classic software development lifecycle most of us were taught:

graph TD
    A[Requirements] --> B[System Design]
    B --> C[Implementation]
    C --> D[Testing]
    D --> E[Code Review]
    E --> F[Deployment]
    F --> G[Monitoring]
    G --> A

Every stage has its own tools, its own rituals, its own cottage industry. Jira for requirements. Figma for design. VS Code for implementation. Jest for testing. GitHub for code review. AWS for deployment. Datadog for monitoring.

Each step is discrete. Sequential. Handoffs everywhere.

Now here's what actually happens when an engineer works with a coding agent:

graph TD
    A[Intent] --> B[Agent]
    B --> C[Code + Tests + Deployment]
    C --> D{Does it work?}
    D -->|No| B
    D -->|Yes| E[Ship]
    style E fill:#d1fae5,stroke:#6ee7b7,color:#065f46

The stages collapsed. They didn't get faster. They merged. The agent doesn't know what step it's on because there are no steps. There's just intent, context, and iteration.

AI-native engineers don't know what the SDLC is

I spent a lot of time speaking with engineers who started their careers after Cursor launched. They don't know what the software development lifecycle is. They don't know what DevOps is or what an SRE is. Not because they're bad engineers. Because they never needed it. They've never sat through sprint planning. They've never estimated story points. They've never waited three days for a PR review.

They just build things.

You describe what you want. The agent writes the code. You look at it. You iterate. You ship. Everything simultaneously.

These engineers aren't worse for skipping the ceremony. They're unencumbered by it. Sprint planning, code review workflows, release trains, estimation rituals. None of it. They skipped the entire orthodoxy and went straight to building.

And honestly? I'm jealous.

Every stage is collapsing

Let me walk through the SDLC and show you what's left of it.

Requirements gathering: fluid, not dictated

Requirements used to be handed down. A PM writes a PRD, engineers estimate it, and the spec gets frozen before a line of code is written. That made sense when building was expensive. When every feature took weeks, you had to decide upfront what to build.

That constraint is gone. When an agent can generate a complete version of a feature in minutes, you don't need to specify every detail in advance. You provide the direction, the agent builds a version, you look at it, you adjust, you try a different approach. You can generate ten versions and pick the best one. Requirements aren't a phase anymore. They're a byproduct of iteration.

Now, what is Jira when the audience isn't humans coordinating across a pipeline? What is Jira when it's agents consuming context? Jira was built to track work through stages that no longer exist. If your "requirements" are just context for an agent, then the ticketing system isn't a project management tool anymore. It's a context store. And it's a terrible one.

System Design: discovered, not dictated

System design still matters. But the way it happens is fundamentally shifting.

Design used to be something you did before writing code. You'd whiteboard the architecture, debate trade-offs, draw boxes and arrows, then go implement it. The gap between the design and the code was days or weeks.

That gap is closing. Design is becoming something you discover by giving the agent the right context, not something you dictate ahead of time. The model has seen more systems, more architectures, more patterns than any individual engineer. When you describe a problem, the agent doesn't just implement your design, it suggests architectures that are often superior to what you'd have come up with on your own. You're having a design conversation in real-time, and the output is working code.

You still need to know when an agent is over-engineering or missing a constraint. But you're collaborating on design, not prescribing it.

Implementation: this is the agent's job now

This one is obvious. The agent writes the code. Whole features. Complete solutions with error handling, types, edge cases.

I don't personally know anyone who still types lines of code. We review what agents write, feed them context, steer direction, and focus on the problems that actually require human judgment.

Testing: simultaneous, not sequential

Agents write tests alongside the code. Not as an afterthought. Not in a separate "testing phase." The test is part of the generation. TDD isn't a methodology anymore, it's just how agents work by default.

The entire QA function as a separate stage is gone. When code and tests are generated together, verified together, and iterated together, there's no handoff. No "throw it over the wall to QA." The agent can do the QA itself.

Code review: give it up

The pull request flow needs to go. I was never a fan, but now it's just a relic of the past.

I know that's uncomfortable. Code review is sacred. It's how you catch bugs, share knowledge, maintain standards. It's also an identity thing. We're engineers, and reviewing code is what engineers do. But clinging to the PR workflow in an agent-driven world isn't rigor. It's an identity crisis.

Think about it. An agent generates 500 PRs a day. Your team can review maybe 10. The review queue backs up. This isn't a bottleneck worth optimising. It's a fake bottleneck, one that only exists because we're forcing a human ritual onto a machine workflow.

graph TD
    A[Agent generates PR] --> B[Waits for human review]
    B --> C{Reviewer available?}
    C -->|No| D[Sits in queue for hours/days]
    C -->|Yes| E[Review + Comments]
    E --> F[Agent addresses feedback]
    F --> B
    D --> B
    style B fill:#fee2e2,stroke:#fca5a5,color:#991b1b
    style D fill:#fee2e2,stroke:#fca5a5,color:#991b1b

This diagram shouldn't exist. The entire flow is wrong.

The review has to be rethought from scratch. Either it becomes part of code generation itself, with the agent verifying its own work against the plan document, running the tests, checking for regressions, and validating against architectural constraints, or a second agent reviews the first agent's output. Adversarial agents plough through the proposed changes and try to break them in every dimension. We already have the tools for this. Human-in-the-loop review becomes exception-based, triggered only when automated verification can't resolve a conflict or when the change touches something genuinely novel.

What does a world without pull requests look like? Agents commit to main. Automated checks, tests, type checks, security scans, behavioral diffs, validate the change. If everything passes, it ships, automatically. If something fails, the agent fixes it. A human only gets involved when the system genuinely doesn't know what to do.

graph TD
    A[Agent generates code] --> B[Agent self-verifies]
    B --> C[Second agent reviews]
    C --> D[Automated checks]
    D --> E{All clear?}
    E -->|Yes| F[Ship]
    E -->|No - resolvable| A
    E -->|No - novel issue| G[Human review]
    G --> A
    style F fill:#d1fae5,stroke:#6ee7b7,color:#065f46

We're spending our review cycles reading diffs that an agent could verify in seconds. That's not quality assurance. That's luddism.
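To make the exception-based flow concrete, here's a minimal sketch in Python. Everything in it is hypothetical: the check names, the `CheckResult` shape, and the three-way ship/retry/escalate decision are invented for illustration, not taken from any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CheckResult:
    name: str
    passed: bool
    novel: bool = False  # True when the failure needs human judgment

def merge_gate(checks: List[Callable[[], CheckResult]]) -> str:
    """Run automated checks and decide: ship, send the change back to
    the agent, or escalate to a human. No standing review queue."""
    for check in checks:
        result = check()
        if not result.passed:
            # Novel failures go to a human; everything else loops back.
            return "escalate" if result.novel else "retry"
    return "ship"

# All checks green: the change ships with no human in the loop.
checks = [
    lambda: CheckResult("tests", passed=True),
    lambda: CheckResult("typecheck", passed=True),
    lambda: CheckResult("security-scan", passed=True),
]
print(merge_gate(checks))  # ship
```

The point isn't the ten lines of code; it's that "review" becomes a routing decision a machine can make in milliseconds, with humans reserved for the `escalate` branch.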

Deployment: decoupled and continuous

Agents are already writing deployment pipelines that are more intricate and more specialised than what most teams would ever bother building by hand. Feature flags, canary releases, progressive rollouts, automatic rollback triggers, the kind of release engineering that used to require a dedicated platform team.

The key shift is that agents naturally decouple deployment from release. Code gets deployed continuously: every change, as soon as it's generated and verified, produces an artifact that lands in production behind a gate. Release is a separate decision, driven by feature flags or traffic rules.
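As a toy illustration of that decoupling, here's a sketch of a percentage-based release gate. Hashing a user ID into a stable bucket is a common feature-flag pattern, not any specific vendor's implementation, and the function names are made up.

```python
import hashlib

def rollout_bucket(user_id: str) -> int:
    """Deterministically map a user to a bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def is_released(user_id: str, percentage: int) -> bool:
    """The release decision. The code is already deployed for everyone;
    only users inside the rollout percentage actually see the feature."""
    return rollout_bucket(user_id) < percentage

# Deployment shipped the artifact; release is just this number.
print(is_released("user-42", 0))    # False: deployed, not released
print(is_released("user-42", 100))  # True: fully released
```

Because the bucket is deterministic, widening the percentage from 5 to 50 to 100 keeps early users in the rollout rather than reshuffling who sees the feature.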

Some teams are already approaching true continuous deployment and release. Code is generated, tests pass, artifacts are built, and the change is live, all in a single automated flow with no human in the loop between intent and production.

Where this goes next is even more interesting. Imagine agents that don't just deploy code but manage the entire release lifecycle, monitoring the rollout, adjusting traffic percentages based on error rates, automatically rolling back if latency spikes, and only notifying a human when something genuinely novel goes wrong. The deployment "stage" doesn't just get automated. It becomes an ongoing, self-adjusting process that never really ends.

graph TD
    A[Agent generates code] --> B[Automated verification]
    B --> C[Artifact produced]
    C --> D[Deploy behind feature flag]
    D --> E[Progressive rollout]
    E --> F{Healthy?}
    F -->|Yes| G[Full release]
    F -->|No| H[Auto-rollback]
    H --> I[Agent investigates]
    I --> A
    style G fill:#d1fae5,stroke:#6ee7b7,color:#065f46
    style H fill:#fee2e2,stroke:#fca5a5,color:#991b1b

Monitoring: the last stage standing, and it needs to evolve

Monitoring is the only stage of the SDLC that survives. And it doesn't just survive, it becomes the foundation everything else rests on.

When agents ship code faster than humans can review it, observability is no longer a nice-to-have dashboarding layer. It's the primary safety mechanism for the entire collapsed lifecycle. Every other safeguard, the design review, the code review, the QA phase, the release sign-off, has been absorbed or eliminated. Monitoring is what's left. It's the last line of defense.

But most observability platforms were built for humans. Alerts, log search, dashboards: all designed for a person to look at, interpret, and act on. That model breaks when the volume of changes outpaces human attention. If an agent ships 500 changes a day and your observability setup requires a human to investigate each anomaly, you've created a new bottleneck. You've just moved it from code review to incident response.

Observability without action is just expensive storage. The future of observability isn't dashboards, it's closed-loop systems where telemetry data becomes context for the agent that shipped the code, so it can detect the regression and fix it.

The observability layer becomes the feedback mechanism that drives the entire loop. Not a stage at the end. The connective tissue of the whole system.
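Here's a deliberately simplified sketch of that feedback loop. The telemetry shape, the 5% threshold, and the action names are all invented for illustration; the point is that the output is context for an agent, not a page for a human.

```python
def error_rate(telemetry):
    """Fraction of requests with a 5xx status."""
    if not telemetry:
        return 0.0
    errors = sum(1 for event in telemetry if event["status"] >= 500)
    return errors / len(telemetry)

def feedback_action(telemetry, threshold=0.05):
    """Route telemetry back into the loop: anomalies become context for
    the agent that shipped the code, healthy signals release the next intent."""
    rate = error_rate(telemetry)
    if rate > threshold:
        return {"action": "investigate", "context": {"error_rate": rate}}
    return {"action": "next_intent"}

healthy = [{"status": 200}] * 99 + [{"status": 500}]
print(feedback_action(healthy))  # 1% error rate is under threshold: next_intent
```

A real system would look at latency distributions and behavioral diffs, not just status codes, but the shape is the same: telemetry in, agent action out.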

graph TD
    A[Intent] --> B[Agent builds, tests, deploys]
    B --> C[Production]
    C --> D[Observability layer]
    D -->|Anomaly detected| E[Agent investigates + fixes]
    E --> B
    D -->|Healthy| F[Next intent]
    F --> A
    style D fill:#dbeafe,stroke:#93c5fd,color:#1e40af

The teams that figure this out first, observability that feeds directly back into the agent loop, not into a human's pager, will ship faster and safer than everyone else. The teams that don't will drown in alerts.

The new lifecycle is a tighter loop

The SDLC was a wide loop. Requirements → Design → Code → Test → Review → Deploy → Monitor. Linear. Sequential. Full of handoffs and waiting.

The new lifecycle is a tight loop.

graph TD
    A[Human Intent + Context] --> B[AI Agent]
    B --> C[Build + Test + Deploy]
    C --> D[Observe]
    D -->|Problem| B
    D -->|Fine| E[Next Intent]
    E --> B
    style B fill:#ede9fe,stroke:#c4b5fd,color:#5b21b6

Intent. Build. Observe. Repeat.

No tickets. No sprints. No story points. No PRs sitting in a queue. No separate QA phase. No release trains.

Just a human with intent and an agent that executes.

So what is left?

Context. That's it.

The quality of what you build with agents is directly proportional to the quality of context you give them. Not the process. Not the ceremony. The context.

The SDLC is dead. The new skill is context engineering. The new safety net is observability.

And most of the industry is still configuring Datadog dashboards no one looks at.


February 11, 2026

Ten things I love about Replicate (and ten things I don't)

Zeke Sikelianos ·

Ten things I love about Replicate

...and ten things I don't.


Replicate was acquired by Cloudflare in December 2025. We're teaming up to make the best platform for building AI-powered software on the internet. This feels like an opportune time to talk about the things that make Replicate special, and the things we now have a chance to improve.

What's to love?

1. Collections

When you go to the collections page on Replicate, you can quickly get a sense of all the different things these models can do. There are collections for text-to-image, text-to-video, text-to-speech, music generation, and more. It's a curated page where humans have added models to collections knowing they're good at specific tasks. If you're new to Replicate or just exploring what's possible, this is a great place to start.

2. Playground

The playground is a page on Replicate where you can run a bunch of models all at once and compare their outputs. Type in something like "dragon fruit", hit enter a few times, and you'll get results fast. You can switch between models while keeping the same prompt, which makes it easy to compare output quality and generation speed side by side. If you see something you like, you can download it or grab the code to reproduce it with the API using the same inputs. This is my favorite place to start before writing any code against a new model.

3. Official models

Official models are maintained by Replicate staff. They're guaranteed to be stable, always on, and fast. When you're choosing a model on Replicate, you generally want to look for the "Official" designation. Nano Banana Pro, for example, has that marker in the top left. If you want a model that's well-supported, won't break on you suddenly, and responds instantly to your requests, look for the official badge.

4. HTTP API

The website is great for discovering and running models, but eventually you want to write code. Replicate's API is actually pretty small. Unlike services with hundreds of endpoints, we've got about 35, covering the basics: creating predictions (the noun we use for the result of running a model), listing them, searching for models, browsing collections.

The API is documented with an OpenAPI schema. It's not a huge file, but if you're using a language model you can throw this schema URL at your agent and it can understand everything it needs to know about interacting with the Replicate API: searching models, comparing them, inspecting their inputs and outputs, running them. There are also official client libraries for Python and JavaScript.
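To give a feel for the surface area, here's a hedged Python sketch of creating a prediction. The `POST /v1/predictions` endpoint and bearer-token header match the documented API, but the transport is injected as a stub so you can see the request shape without a real token; the version ID and prompt are made up.

```python
import json

API_BASE = "https://api.replicate.com/v1"

def create_prediction(fetch, token, version, model_input):
    """Build a POST /predictions request. The transport `fetch` is
    injected so this sketch runs without a live API token."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"version": version, "input": model_input})
    return fetch(f"{API_BASE}/predictions", "POST", headers, body)

# A stub transport that just echoes the request shape back:
def stub_fetch(url, method, headers, body):
    return {"url": url, "method": method, "input": json.loads(body)["input"]}

print(create_prediction(stub_fetch, "fake-token", "hypothetical-version-id",
                        {"prompt": "a watercolor dragon fruit"}))
```

In real code you'd swap `stub_fetch` for an HTTP client, or skip all of this and use the official Python or JavaScript library.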

5. One token, many providers

Creating an API token on Replicate is easy. Go to your account settings, create a token, copy it, and you're off. But what's really nice is that Replicate hosts models from many providers. If you've ever tried to get a Google API key for running models through their API, you know how painful it is: bouncing between AI Studio and another site, creating a key, attaching billing. Same story with Anthropic, where there's the Cloud Console and the Anthropic Console and it's never clear which is which.

With Replicate, one API key gets you access to models from Google, Anthropic, OpenAI, ElevenLabs, and more.

6. Model schemas

Every model on Replicate is built using Cog, an open source tool that lets you wrap a machine learning model in a bit of Python code to standardize its input and output schemas. The result is that every single model on Replicate has a schema defining all of its inputs and outputs: types, default values, whether they're required. It's a design that's friendly to both programmers and automated tools.
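Because every model carries a schema, tooling can introspect inputs generically. Here's a small sketch assuming the `components/schemas/Input` layout that Cog-generated OpenAPI schemas use; the example schema below is invented, not pulled from a real model.

```python
def input_fields(openapi_schema):
    """Flatten a model's input schema into a list of field descriptions."""
    input_schema = openapi_schema["components"]["schemas"]["Input"]
    required = set(input_schema.get("required", []))
    return [
        {
            "name": name,
            "type": spec.get("type"),
            "required": name in required,
            "default": spec.get("default"),
        }
        for name, spec in input_schema["properties"].items()
    ]

# A made-up schema in the Cog/OpenAPI shape:
schema = {
    "components": {"schemas": {"Input": {
        "required": ["prompt"],
        "properties": {
            "prompt": {"type": "string"},
            "num_outputs": {"type": "integer", "default": 1},
        },
    }}}
}
for field in input_fields(schema):
    print(field)
```

This is exactly the kind of walk an agent or a form generator does: no per-model special cases, just one schema convention across 70,000 models.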

7. Sharing

Sharing predictions is like GitHub Gists for AI output. Whenever you run a model on the website, you can hit the share button to make your prediction publicly viewable. You get a URL you can pass around, and anyone who opens it can see all the inputs and outputs that led to that result.

This gets back to the origin of the name "Replicate": the goal was to make machine learning reproducible, so people could share and replicate each other's results instead of digging through PDFs from academic archives. It's a small feature, but it's genuinely useful.

8. Userland

Userland is a term from the Linux world where the core of the system is small and simple, and the majority of innovation happens outside in user space. Replicate works the same way. An example is all-the-replicate-models, an npm package that stuffs the entire public model catalog into a single JSON file. Anyone can write code that interacts with Replicate's API to get metadata about every model on the platform.

If Replicate doesn't yet have a specific feature you want, on the website or via the API, you can take matters into your own hands and build it yourself with the model metadata.

9. MCP

MCP servers are the way language models use tools to talk to APIs. Replicate has an MCP server that you can install locally or access at mcp.replicate.com. It lets you plug Replicate's capabilities into your agent environment without knowing much about how Replicate works. There's also a blog post about the remote MCP server.

This is a huge unlock. You can throw commands at it like "search Replicate for the best image models, compare them by price and speed, run them all, and save the outputs to my computer." The language model inspects the API's capabilities and dynamically figures out what to do. Instead of poking around on the website trying to figure everything out, you throw some context at your language model and let it sort things out. It's quickly becoming the default way people explore Replicate.

10. Generic models

You probably think of Replicate as a place to run sophisticated ML models that require GPUs. But you can also run generic models that are just a bit of Python code. One of my favorite things is wrapping ffmpeg capabilities into a model. ffmpeg is a powerful command-line tool for manipulating video, but it has an arcane syntax that used to send you to Stack Overflow.

Now AI can write those ffmpeg commands. I packaged up a change-video-speed model where you throw a video at it and change the playback speed. No GPU, just a CPU. Cheap and fast. You can use it through the API or drag and drop a file in the browser. It's a reminder that Replicate isn't only about AI models.


💔 Ten things I don't love

And now for the juicy bits. You don't normally hear people at companies talking about the parts of their product they don't like. But we've just been acquired by Cloudflare, and a lot is going to change for the better: more people, more resources, and a chance to rethink how we do things. This feels to me like the right time to reflect on some of the things that make Replicate difficult for our users, with an eye toward improving them now that we have the resources at Cloudflare to pull it off.

1. Cold boots

If you've poked around and found a community model on Replicate, you may have noticed it can take a very long time to start up. Sometimes several minutes before you get a response. That's okay for exploring, but it makes the model unusable for production applications.

The workaround today is deployments: take a model off the shelf, create a customized endpoint for it, and set the minimum instances to one so it's always on. Of course, that means you're footing the bill for the idle GPU time.

Cloudflare is pretty good at making things fast. I'm optimistic that techniques like GPU snapshotting will let us dramatically improve cold boot times, even for giant Docker containers with 10 gigs of model weights.

2. Output files expire

When you run a model via the API, you get an HTTPS URL for your output file. Your instinct is to treat it as a permanent asset. The problem is that API-generated output files expire after one hour. Files created through the website last indefinitely, and sharing a prediction also makes its outputs permanent. But if you're building with the API, that file is going to disappear.

There are workarounds. I wrote a guide showing how to set up a Cloudflare Worker that receives a webhook when a prediction completes and stores the output in R2 or Cloudflare Images. It's not super hard, but it should be easier.
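Until it's easier, the defensive pattern is simple: copy the bytes somewhere durable before the hour is up. A minimal sketch with a stubbed download, writing to local disk where the guide's Worker would write to R2; the URL and filename here are made up.

```python
import os
import tempfile

def persist_output(fetch_bytes, output_url, dest_path):
    """Copy a prediction's output to durable storage before the URL expires.
    `fetch_bytes` is injected so the sketch runs without a live URL."""
    data = fetch_bytes(output_url)
    with open(dest_path, "wb") as f:
        f.write(data)
    return dest_path

# Stub download standing in for an HTTP GET of the output URL:
dest = os.path.join(tempfile.gettempdir(), "prediction-output.png")
persist_output(lambda url: b"fake image bytes",
               "https://example.com/some/output.png", dest)
print(os.path.getsize(dest))  # 16
```

The webhook-driven version is the same idea with better timing: the copy happens the moment the prediction completes, not whenever your code gets around to it.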

3. No OpenAI format

When you run a language model on Replicate, the output you get back is an array of text fragments rather than the format that's become the de facto standard from OpenAI. This matters because many people build applications using OpenAI's SDK and then swap the base URL to point at other providers. Since Replicate's response structure doesn't match, you can't just pivot to another model.

I think this will get better as we integrate with Cloudflare's AI Gateway. Standardized response structures for language models feel inevitable.
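In the meantime, adapting the shape yourself is a few lines. Here's a sketch that joins Replicate's fragment list into an OpenAI-style chat completion dict; the field names follow OpenAI's published response format, and the model name is a placeholder.

```python
# Replicate returns language model output as a list of text fragments:
replicate_output = ["The ", "quick ", "brown ", "fox."]

def to_openai_shape(fragments, model="placeholder-model"):
    """Reshape a Replicate fragment list into an OpenAI-style
    chat completion dict so OpenAI-SDK code can consume it."""
    return {
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "".join(fragments)},
            "finish_reason": "stop",
        }],
    }

print(to_openai_shape(replicate_output)["choices"][0]["message"]["content"])
# The quick brown fox.
```

A real adapter would also map usage stats and streaming chunks, which is precisely the tedium a gateway should absorb for you.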

4. replicate.run()

Our client libraries have a convenience method called replicate.run(). It's right there on the homepage. You specify a model and some input, and it returns the output. Dead simple.

The problem is that once you get beyond a toy demo, you usually want more: the prediction status, how long it took, the prediction ID so you can look it up later. replicate.run() hides all of that. When you're building a real application, you end up rewriting your code to use polling or webhooks. The initial experience is easy, but the transition to production requires starting over.
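For comparison, here's roughly what the production rewrite looks like: a polling loop that keeps the whole prediction object instead of just the output. The getter is injected as a stub so the sketch runs standalone; a real version would call the predictions endpoint, and the terminal status names match the API's.

```python
import time

def wait_for_prediction(get_prediction, prediction_id, poll_seconds=0.1):
    """Poll until a terminal state. Unlike replicate.run(), this keeps
    the full prediction: id, status, metrics, and output."""
    while True:
        prediction = get_prediction(prediction_id)
        if prediction["status"] in ("succeeded", "failed", "canceled"):
            return prediction
        time.sleep(poll_seconds)

# Stub API: one "processing" poll, then success.
states = [{"status": "processing"},
          {"status": "succeeded", "id": "hypothetical-id", "output": ["done"]}]

def stub_get(prediction_id):
    return states.pop(0)

print(wait_for_prediction(stub_get, "hypothetical-id")["status"])  # succeeded
```

None of this is hard, but it's a different program than the one-liner on the homepage, and that's the papercut.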

5. Too many models

There are something like 70,000 models on Replicate. Even with curated collections, you can still feel like you're missing the latest state-of-the-art model buried somewhere in the sea. I still have this feeling, and I work here.

MCP helps with this. Open a session in Claude Code or similar, plug in the Replicate MCP server, and ask things like "show me the last 10 models published with more than 100 runs." If you have model FOMO, MCP is a good antidote.

6. Outdated curation

The collections page is helpful, but it's maintained by humans. That's part of its charm, but also the problem. There's no guarantee that the collections show you the best model for a given use case. Something might exist that nobody thought to add yet.

If you have doubts about whether the collections are showing you everything, pull out your agent, plug in Replicate MCP, give it your specific requirements, and let it search through the full model catalog.

7. No pricing API

Pricing is visible on the website. Different models are priced differently, and official models often have pricing that varies by multiple properties like resolution. It's transparent and easy enough to understand on the web.

But if you want to programmatically compare model prices, there's no API for that. People end up manually visiting web pages or scraping them to get pricing data. This will improve.

8. Redundant APIs

This is one I hold myself personally accountable for. There are two different ways to run a model via the API. The original predictions.create endpoint takes a model version in the request body and works for everything: community models and official models. Then there's models.predictions.create, which has a slightly different structure and only works for official models.

This causes confusion for both users and agents, who find one endpoint and assume it's the only one, then hit errors depending on the model they're trying to run. In the new world, we'll make sure there's one consistent API for running models.

9. MCP context bloat

Early MCP servers, including Replicate's, define one tool for every API operation. That means a ton of JSON gets stuffed into the language model's context window. At one point, Replicate's MCP server was eating up something like half the context window in Claude Code. Not great. It makes people not want to use it, and it makes everything slow.

The fix is Code Mode. Instead of exposing one tool per operation, it exposes just two tools: one for searching the API docs and one for writing code. When you ask it to find the fastest video models, instead of a slow back-and-forth series of individual API calls, it evaluates your goal upfront, writes a custom TypeScript snippet that makes multiple API calls simultaneously, and returns the result. Way faster, no context bloat.


I'm hopeful for the future

None of these gripes are fundamental. They're papercuts. Cold boots are an infrastructure problem. Expiring files are a policy choice. Redundant APIs are tech debt. Context bloat is already getting fixed. Every one of these is solvable, and now that we're part of Cloudflare we have the people and the infrastructure to actually solve them. That's what makes this moment exciting: Replicate's core design is sound, the model ecosystem is thriving, and the rough edges are all things we can smooth out.

🖤 🤝 🧡

Components Will Kill Pages

Brayden Wilmoth ·

Components let users of AI applications experience your brand when pages can't

February 10, 2026

How I Use Claude Code

Boris Tane ·

I've been using Claude Code as my primary development tool for about nine months, and the workflow I've settled into is radically different from what most people do with AI coding tools. Most developers type a prompt, sometimes use plan mode, fix the errors, repeat. The more terminally online are stitching together Ralph loops, MCPs, gas towns (remember those?), etc. The results in both cases are a mess that completely falls apart for anything non-trivial.

The workflow I'm going to describe has one core principle: never let Claude write code until you've reviewed and approved a written plan. This separation of planning and execution is the single most important thing I do. It prevents wasted effort, keeps me in control of architecture decisions, and produces significantly better results, with far lower token usage, than jumping straight to code.

flowchart LR
    R[Research] --> P[Plan]
    P --> A[Annotate]
    A -->|repeat 1-6x| A
    A --> T[Todo List]
    T --> I[Implement]
    I --> F[Feedback & Iterate]

Phase 1: Research

Every meaningful task starts with a deep-read directive. I ask Claude to thoroughly understand the relevant part of the codebase before doing anything else. And I always require the findings to be written into a persistent markdown file, never just a verbal summary in the chat.

read this folder in depth, understand how it works deeply, what it does and all its specificities. when that's done, write a detailed report of your learnings and findings in research.md

study the notification system in great details, understand the intricacies of it and write a detailed research.md document with everything there is to know about how notifications work

go through the task scheduling flow, understand it deeply and look for potential bugs. there definitely are bugs in the system as it sometimes runs tasks that should have been cancelled. keep researching the flow until you find all the bugs, don't stop until all the bugs are found. when you're done, write a detailed report of your findings in research.md

Notice the language: "deeply", "in great details", "intricacies", "go through everything". This isn't fluff. Without these words, Claude will skim. It'll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.

The written artifact (research.md) is critical. It's not about making Claude do homework. It's my review surface. I can read it, verify Claude actually understood the system, and correct misunderstandings before any planning happens. If the research is wrong, the plan will be wrong, and the implementation will be wrong. Garbage in, garbage out.

This is the most expensive failure mode with AI-assisted coding, and it's not wrong syntax or bad logic. It's implementations that work in isolation but break the surrounding system. A function that ignores an existing caching layer. A migration that doesn't account for the ORM's conventions. An API endpoint that duplicates logic that already exists elsewhere. The research phase prevents all of this.

Phase 2: Planning

Once I've reviewed the research, I ask for a detailed implementation plan in a separate markdown file.

I want to build a new feature <name and description> that extends the system to perform <business outcome>. write a detailed plan.md document outlining how to implement this. include code snippets

the list endpoint should support cursor-based pagination instead of offset. write a detailed plan.md for how to achieve this. read source files before suggesting changes, base the plan on the actual codebase

The generated plan always includes a detailed explanation of the approach, code snippets showing the actual changes, file paths that will be modified, and considerations and trade-offs.

I use my own .md plan files rather than Claude Code's built-in plan mode. The built-in plan mode sucks. My markdown file gives me full control. I can edit it in my editor, add inline notes, and it persists as a real artifact in the project.

One trick I use constantly: for well-contained features where I've seen a good implementation in an open source repo, I'll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say "this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach." Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.

But the plan document itself isn't the interesting part. The interesting part is what happens next.

The Annotation Cycle

This is the most distinctive part of my workflow, and the part where I add the most value.

flowchart TD
    W[Claude writes plan.md] --> R[I review in my editor]
    R --> N[I add inline notes]
    N --> S[Send Claude back to the document]
    S --> U[Claude updates plan]
    U --> D{Satisfied?}
    D -->|No| R
    D -->|Yes| T[Request todo list]

After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn't have.

The notes vary wildly in length. Sometimes a note is two words: "not optional" next to a parameter Claude marked as optional. Other times it's a paragraph explaining a business constraint or pasting a code snippet showing the data shape I expect.

Some real examples of notes I'd add:

  • "use drizzle:generate for migrations, not raw SQL" -- domain knowledge Claude doesn't have
  • "no -- this should be a PATCH, not a PUT" -- correcting a wrong assumption
  • "remove this section entirely, we don't need caching here" -- rejecting a proposed approach
  • "the queue consumer already handles retries, so this retry logic is redundant. remove it and just let it fail" -- explaining why something should change
  • "this is wrong, the visibility field needs to be on the list itself, not on individual items. when a list is public, all items are public. restructure the schema section accordingly" -- redirecting an entire section of the plan

Then I send Claude back to the document:

I added a few notes to the document, address all the notes and update the document accordingly. don't implement yet

This cycle repeats 1 to 6 times. The explicit "don't implement yet" guard is essential. Without it, Claude will jump to code the moment it thinks the plan is good enough. It's not good enough until I say it is.
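To make the mechanics concrete, here's a hypothetical fragment of what an annotated plan.md looks like mid-cycle (the endpoint and notes are invented for illustration, echoing the kinds of corrections listed above):

```markdown
## API Endpoints

### Update list visibility
- `PUT /lists/:id/visibility`
  > note: no -- this should be a PATCH, not a PUT. partial update, not replacement.
- Request body: `{ visibility?: "public" | "private" }`
  > note: not optional
```

The notes live right next to the lines they correct, so Claude never has to guess which part of the plan a comment refers to.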

Why This Works So Well

The markdown file acts as shared mutable state between me and Claude. I can think at my own pace, annotate precisely where something is wrong, and re-engage without losing context. I'm not trying to explain everything in a chat message. I'm pointing at the exact spot in the document where the issue is and writing my correction right there.

This is fundamentally different from trying to steer implementation through chat messages. The plan is a structured, complete specification I can review holistically. A chat conversation is something I'd have to scroll through to reconstruct decisions. The plan wins every time.

Three rounds of "I added notes, update the plan" can transform a generic implementation plan into one that fits perfectly into the existing system. Claude is excellent at understanding code, proposing solutions, and writing implementations. But it doesn't know my product priorities, my users' pain points, or the engineering trade-offs I'm willing to make. The annotation cycle is how I inject that judgement.

The Todo List

Before implementation starts, I always request a granular task breakdown:

add a detailed todo list to the plan, with all the phases and individual tasks necessary to complete the plan - don't implement yet

This creates a checklist that serves as a progress tracker during implementation. Claude marks items as completed as it goes, so I can glance at the plan at any point and see exactly where things stand. Especially valuable in sessions that run for hours.
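As a sketch, the todo section Claude appends might look like this (a hypothetical feature breakdown, reusing the `drizzle:generate` convention mentioned earlier):

```markdown
## Todo

### Phase 1: Schema
- [x] Add `visibility` column to the lists table
- [x] Generate migration with `drizzle:generate`

### Phase 2: API
- [ ] Add `PATCH /lists/:id/visibility` endpoint
- [ ] Update the queue consumer
```

The checked boxes are what makes long sessions legible at a glance.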

Phase 3: Implementation

When the plan is ready, I issue the implementation command. I've refined this into a standard prompt I reuse across sessions:

implement it all. when you're done with a task or phase, mark it as completed in the plan document. do not stop until all tasks and phases are completed. do not add unnecessary comments or jsdocs, do not use any or unknown types. continuously run typecheck to make sure you're not introducing new issues.

This single prompt encodes everything that matters:

  • "implement it all": do everything in the plan, don't cherry-pick
  • "mark it as completed in the plan document": the plan is the source of truth for progress
  • "do not stop until all tasks and phases are completed": don't pause for confirmation mid-flow
  • "do not add unnecessary comments or jsdocs": keep the code clean
  • "do not use any or unknown types": maintain strict typing
  • "continuously run typecheck": catch problems early, not at the end

I use this exact phrasing (with minor variations) in virtually every implementation session. By the time I say "implement it all," every decision has been made and validated. The implementation becomes mechanical, not creative. This is deliberate. I want implementation to be boring. The creative work happened in the annotation cycles. Once the plan is right, execution should be straightforward.

Without the planning phase, what typically happens is Claude makes a reasonable-but-wrong assumption early on, builds on top of it for 15 minutes, and then I have to unwind a chain of changes. The "don't implement yet" guard eliminates this entirely.

Feedback During Implementation

Once Claude is executing the plan, my role shifts from architect to supervisor. My prompts become dramatically shorter.

flowchart LR
    I[Claude implements] --> R[I review / test]
    R --> C{Correct?}
    C -->|No| F[Terse correction]
    F --> I
    C -->|Yes| N{More tasks?}
    N -->|Yes| I
    N -->|No| D[Done]

Where a planning note might be a paragraph, an implementation correction is often a single sentence:

  • "You didn't implement the deduplicateByTitle function."
  • "You built the settings page in the main app when it should be in the admin app, move it."

Claude has the full context of the plan and the ongoing session, so terse corrections are enough.

Frontend work is the most iterative part. I test in the browser and fire off rapid corrections:

  • "wider"
  • "still cropped"
  • "there's a 2px gap"

For visual issues, I sometimes attach screenshots. A screenshot of a misaligned table communicates the problem faster than describing it.

I also reference existing code constantly:

  • "this table should look exactly like the users table, same header, same pagination, same row density."

This is far more precise than describing a design from scratch. Most features in a mature codebase are variations on existing patterns. A new settings page should look like the existing settings pages. Pointing to the reference communicates all the implicit requirements without spelling them out. Claude would typically read the reference file(s) before making the correction.

When something goes in the wrong direction, I don't try to patch it. I revert and re-scope by discarding the git changes:

  • "I reverted everything. Now all I want is to make the list view more minimal -- nothing else."

Narrowing scope after a revert almost always produces better results than trying to incrementally fix a bad approach.
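Discarding an agent's changes is plain git; a minimal sketch of the reset, demonstrated in a scratch repo (assuming nothing you want to keep is unstaged or untracked):

```shell
# demo in a scratch repo: simulate a bad agent run, then discard it
tmp=$(mktemp -d) && cd "$tmp" && git init -q
echo "v1" > app.ts
git add . && git -c user.email=me@example.com -c user.name=me commit -qm "baseline"
echo "// bad agent edit" >> app.ts   # the change I want to throw away
echo "stray file" > scratch.txt      # untracked file the agent created
git restore .                        # discard unstaged edits to tracked files
git clean -fdq                       # delete untracked files and directories
git status --short                   # prints nothing: the tree is clean again
```

`git restore .` only touches tracked files, which is why the `git clean` pass is needed for anything new the agent created.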

Staying in the Driver's Seat

Even though I delegate execution to Claude, I never give it total autonomy over what gets built. I do the vast majority of the active steering in the plan.md documents.

This matters because Claude will sometimes propose solutions that are technically correct but wrong for the project. Maybe the approach is over-engineered, or it changes a public API signature that other parts of the system depend on, or it picks a more complex option when a simpler one would do. I have context about the broader system, the product direction, and the engineering culture that Claude doesn't.

flowchart TD
    P[Claude proposes changes] --> E[I evaluate each item]
    E --> A[Accept as-is]
    E --> M[Modify approach]
    E --> S[Skip / remove]
    E --> O[Override technical choice]
    A & M & S & O --> R[Refined implementation scope]

Cherry-picking from proposals: When Claude identifies multiple issues, I go through them one by one: "for the first one, just use Promise.all, don't make it overly complicated; for the third one, extract it into a separate function for readability; ignore the fourth and fifth ones, they're not worth the complexity." I'm making item-level decisions based on my knowledge of what matters right now.

Trimming scope: When the plan includes nice-to-haves, I actively cut them. "remove the download feature from the plan, I don't want to implement this now." This prevents scope creep.

Protecting existing interfaces: I set hard constraints when I know something shouldn't change: "the signatures of these three functions should not change, the caller should adapt, not the library."

Overriding technical choices: Sometimes I have a specific preference Claude wouldn't know about: "use this model instead of that one" or "use this library's built-in method instead of writing a custom one." Fast, direct overrides.

Claude handles the mechanical execution, while I make the judgement calls. The plan captures the big decisions upfront, and selective guidance handles the smaller ones that emerge during implementation.

Single Long Sessions

I run research, planning, and implementation in a single long session rather than splitting them across separate sessions. A single session might start with deep-reading a folder, go through three rounds of plan annotation, then run the full implementation, all in one continuous conversation.

I'm not seeing the performance degradation everyone talks about once a session passes 50% of the context window. If anything, by the time I say "implement it all," Claude has spent the entire session building understanding: reading files during research, refining its mental model during annotation cycles, absorbing my domain-knowledge corrections.

When the context window fills up, Claude's auto-compaction maintains enough context to keep going. And the plan document, the persistent artifact, survives compaction in full fidelity. I can point Claude to it at any point in time.

The Workflow in One Sentence

Read deeply, write a plan, annotate the plan until it's right, then let Claude execute the whole thing without stopping, checking types along the way.

That's it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing. The research prevents Claude from making ignorant changes. The plan prevents it from making wrong changes. The annotation cycle injects my judgement. And the implementation command lets it run without interruption once every decision has been made.

Try my workflow and you'll wonder how you ever shipped anything with coding agents without an annotated plan document sitting between you and the code.


January 26, 2026

Building on Cloudflare with OpenCode

Zeke Sikelianos ·


OpenCode turns Cloudflare into a candy store for developers.


OpenCode is my new favorite tool for hacking. It's an open-source AI agent that can help you research, plan, and build software. In this post I'll show you how to set it up with some good defaults so you can start building on Cloudflare's developer platform right away.

Most importantly, OpenCode is a remedy for FOMO (fear of missing out). The world is moving fast, but OpenCode helps you keep up.

Don't feel like reading? Here's a one-liner to get you going:

npx zeke/sweet-sweet-opencode

This command will kick off an interactive wizard to install OpenCode and set it up with some sensible defaults for Cloudflare development. Check out the source on GitHub if you're (understandably) paranoid about running random stuff.

Cloudflare has so many products 😵‍💫

Cloudflare's developer platform has everything you need to build web apps: Workers for serverless deployment, R2 for object storage, D1 for serverless databases. But there are over 100 other products on the developer platform.

Here are a few examples you might not have heard of:

  • Agents for deploying AI agents that can interact with tools and services.
  • Sandboxes for running untrusted code in isolated environments.
  • Browser Rendering for headless browsers in the cloud.
  • Turnstile for privacy-first bot protection without CAPTCHAs.

What are all these products and how do you use them?

Answering a question like this used to be a daunting task (at least for me), but now we have AI agents to help us research, plan, and build. What was once an overwhelming array of options now feels like a candy store of possibilities.

Step 1: Sign in to Cloudflare

Sign into the Cloudflare dashboard at dash.cloudflare.com. If you don't already have an account, you can easily create one using your Google, Apple, or GitHub account.

Cloudflare is free to start, with very generous limits before you have to start paying for anything.

Step 2: Authenticate

Wrangler is Cloudflare's command-line interface. You can run it with npx, which is included with Node.js. (Install Node.js with brew install node if you don't already have it installed.)

Run this command to authenticate your Cloudflare account:

npx -y wrangler login

This will open a browser window prompting you to authorize Wrangler to access your Cloudflare account.

Step 3: Install OpenCode

There are several ways to install OpenCode, but here we'll use Homebrew:

brew install anomalyco/tap/opencode

Step 4: Add MCP servers

Cloudflare has a handful of official MCP servers that expose various tools to your OpenCode agent. The one we're most interested in is the Cloudflare Documentation MCP server, which exposes a tool for searching Cloudflare's documentation from OpenCode.

There are a couple ways to install MCP servers in OpenCode. Here we'll manually edit your OpenCode configuration file (usually located at ~/.config/opencode/opencode.json) to add a few useful servers:

  • cloudflare-docs: lets your agent search Cloudflare's docs so it can answer product questions with citations and up-to-date details.
  • chrome-devtools: gives your agent a real browser to click around, read pages, and grab content (handy when sites block simple HTTP fetches).
  • replicate-code-mode (optional): adds Replicate's "code" tools; disabled by default and requires REPLICATE_API_TOKEN.
  • cloudflare-api (optional): lets your agent call Cloudflare's API directly; disabled by default and requires CLOUDFLARE_API_TOKEN.
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "chrome-devtools": {
      "type": "local",
      "command": ["npx", "-y", "chrome-devtools-mcp@latest"]
    },
    "cloudflare-docs": {
      "type": "remote",
      "url": "https://docs.mcp.cloudflare.com/mcp",
      "enabled": true
    },
    "replicate-code-mode": {
      "type": "local",
      "enabled": false,
      "command": ["npx", "-y", "replicate-mcp@alpha", "--tools=code"],
      "environment": {
        "REPLICATE_API_TOKEN": "{env:REPLICATE_API_TOKEN}"
      }
    },
    "cloudflare-api": {
      "type": "remote",
      "enabled": false,
      "url": "https://cloudflare-mcp.mattzcarey.workers.dev/mcp",
      "headers": {
        "Authorization": "Bearer {env:CLOUDFLARE_API_TOKEN}"
      }
    }
  }
}

Step 5: Start OpenCode

Now you've got everything set up. It's time to fire up an OpenCode session.

Open your terminal, then navigate to an existing project or create a new directory:

mkdir my-new-app && cd my-new-app

Then start opencode:

opencode

Step 6: Chat!

Now that you've got OpenCode set up, type a query to get the conversation going. Here are some examples:

How do Cloudflare agents and sandboxes work together?

Can I use Cloudflare to process emails?

Do Cloudflare Workers costs depend on response sizes? I want to serve some images (map tiles) from an R2 bucket and I'm concerned about costs.

How many indexes are supported in Workers Analytics Engine? Give an example using the Workers binding API.

From here you'll be able to plan (and build!) your next Cloudflare-hosted app. Happy hacking! 🚀

January 24, 2026

How I Use Clawdbot

Kristian Freeman ·

[Clawdbot](https://clawdbot.com) lets me message an AI from Telegram and have it do stuff for me. Not "here's some information" stuff — actual stuff. Running shell commands, querying my finances, managing my task list, requesting movies for my media server. I've been running Clawdbot on my Mac mini since mid-January and it's become one of those tools I forget isn't normal. I named mine Roman.

![Roman introducing himself in Telegram](/images/roman-telegram.jpg)

## What Clawdbot is

Clawdbot is a gateway between messaging apps and AI agents. You text it, it runs an agent with access to your systems — shell, files, browser, APIs, whatever you hook up. I use Telegram, but Clawdbot supports WhatsApp, Discord, iMessage, others.

The Mac mini M4 runs 24/7 on my Tailscale network. I can message Clawdbot from my phone, my laptop, wherever. Same context, same capabilities.

For models: MiniMax-M2.1 handles general chat (fast, cheap), but Clawdbot escalates to Claude Opus when I need code written or debugged. I'm on the Claude Max plan ($200/mo) which gives me heavy Opus usage without worrying about API costs.

## Clawdbot skills I use

The magic is in "skills" — markdown files that teach Clawdbot how to use specific tools. Here's what I've got running:

**Finances.** I do plain-text accounting with hledger. The skill knows my journal files and how to query them. "How much did I spend on food last month?" — it runs the right hledger command, gives me a number. ~7,800 transactions going back to mid-2023, all queryable via text message.

**Linear.** My task management lives in Linear. "What's on my plate this week?" pulls my assigned issues sorted by priority. I can create tasks, update status, search across projects — all from Telegram.

**NixOS NAS.** My home server runs NixOS. The skill knows the config structure and can SSH in. "Add a new Podman container for X" — it edits the Nix config, commits, runs `nixos-rebuild`. I've modified my server config from my phone while walking around.

**Jellyseerr.** Media requests. "Add the new Lanthimos movie" — searches, finds it, submits the request. Shows up in my library once Radarr grabs it. Stupid simple.

**X Bookmarks.** I bookmark way too much on Twitter. Health stuff, AI papers, programming tips. The skill has a DuckDB database with embeddings — 512+ bookmarks with vector search. "What did I bookmark about sleep optimization?" actually works.

**Tweets.** Stores my past tweets (1200+), analyzes what performs well, hooks into Typefully for drafting. Syncs daily.

**Skill Creator.** Meta, but useful. When I need a new Clawdbot integration, I describe what I want and it scaffolds the skill structure. Saves me from writing boilerplate every time I want to add something.

## Clawdbot automations

**Daily briefing.** Every morning at 9am, Clawdbot sends me a Telegram message with today's calendar (pulled from macOS Calendar via icalBuddy), my open Linear tasks sorted by priority, and anything else that needs attention. Nice way to start the day without opening five apps.

**Tweet sync.** Daily job pulls my latest tweets and appends them to an ndjson file.

**Bookmark sync.** Hourly job fetches new X bookmarks, generates embeddings, updates the search index.

## Real examples from this week

- "Sync my tweets"
- "How much have I spent on rideshares this month?"
- "What's the status of the Containers launch tasks?"
- "Request Bugonia on Jellyseerr"
- "Check my bookmarks for anything about sauna protocols"

Responses come back like any other text. I'm on my phone, I ask a question, I get an answer. Sometimes that answer is "done, I pushed the changes to git."

## What makes Clawdbot useful

**Memory.** There's a memory system where Clawdbot stores facts, preferences, decisions. It knows my account structures, project names, common queries. I don't re-explain context every time.

**Action.** Clawdbot doesn't just answer questions. It runs commands, edits files, hits APIs, pushes to git. "Ship it" means it actually ships.

**Composable.** Each skill is a markdown file. Want to add a new API? Write instructions in a markdown file. Want to share it? Copy the file.

## What I'm adding next

- Email triage (summarize what needs attention, draft responses) — in progress
- Content capture (tweet something good → auto-draft a blog post expansion)
- Health logging (workouts, supplements, sleep scores via message)

## Clawdbot setup

```bash
curl -fsSL https://clawd.bot/install.sh | bash
clawdbot onboard --install-daemon
```

The wizard sets up auth, channels, and optionally installs Clawdbot as a background service. Then:

```bash
clawdbot gateway status
clawdbot status
```

Docs: [docs.clawd.bot](https://docs.clawd.bot)
Source: [github.com/clawdbot/clawdbot](https://github.com/clawdbot/clawdbot)

I'll update this as the setup evolves. The goal is an assistant that handles the boring operational stuff so I can focus on the interesting work. So far, it's working.

How I Use OpenClaw

Kristian Freeman ·

**Update (Jan 30, 2026):** Clawdbot has been renamed to **OpenClaw**. The project was previously known as Moltbot before that. All the functionality described below remains the same — just a new name.

---

[OpenClaw](https://openclaw.ai) lets me message an AI from Telegram and have it do stuff for me. Not "here's some information" stuff — actual stuff. Running shell commands, querying my finances, managing my task list, requesting movies for my media server. I've been running OpenClaw on my Mac mini since mid-January and it's become one of those tools I forget isn't normal. I named mine Roman.

![Roman introducing himself in Telegram](/images/roman-telegram.jpg)

## What OpenClaw is

OpenClaw is a gateway between messaging apps and AI agents. You text it, it runs an agent with access to your systems — shell, files, browser, APIs, whatever you hook up. I use Telegram, but OpenClaw supports WhatsApp, Discord, iMessage, others.

The Mac mini M4 runs 24/7 on my Tailscale network. I can message OpenClaw from my phone, my laptop, wherever. Same context, same capabilities.

For models: MiniMax-M2.1 handles general chat (fast, cheap), but OpenClaw escalates to Claude Opus when I need code written or debugged. I'm on the Claude Max plan ($200/mo) which gives me heavy Opus usage without worrying about API costs.

## OpenClaw skills I use

The magic is in "skills" — markdown files that teach OpenClaw how to use specific tools. Here's what I've got running:

**Finances.** I do plain-text accounting with hledger. The skill knows my journal files and how to query them. "How much did I spend on food last month?" — it runs the right hledger command, gives me a number. ~7,800 transactions going back to mid-2023, all queryable via text message.

**Linear.** My task management lives in Linear. "What's on my plate this week?" pulls my assigned issues sorted by priority. I can create tasks, update status, search across projects — all from Telegram.

**NixOS NAS.** My home server runs NixOS. The skill knows the config structure and can SSH in. "Add a new Podman container for X" — it edits the Nix config, commits, runs `nixos-rebuild`. I've modified my server config from my phone while walking around.

**Jellyseerr.** Media requests. "Add the new Lanthimos movie" — searches, finds it, submits the request. Shows up in my library once Radarr grabs it. Stupid simple.

**X Bookmarks.** I bookmark way too much on Twitter. Health stuff, AI papers, programming tips. The skill has a DuckDB database with embeddings — 512+ bookmarks with vector search. "What did I bookmark about sleep optimization?" actually works.

**Tweets.** Stores my past tweets (1200+), analyzes what performs well, hooks into Typefully for drafting. Syncs daily.

**Skill Creator.** Meta, but useful. When I need a new OpenClaw integration, I describe what I want and it scaffolds the skill structure. Saves me from writing boilerplate every time I want to add something.

## OpenClaw automations

**Daily briefing.** Every morning at 9am, OpenClaw sends me a Telegram message with today's calendar (pulled from macOS Calendar via icalBuddy), my open Linear tasks sorted by priority, and anything else that needs attention. Nice way to start the day without opening five apps.

**Tweet sync.** Daily job pulls my latest tweets and appends them to an ndjson file.

**Bookmark sync.** Hourly job fetches new X bookmarks, generates embeddings, updates the search index.

## Real examples from this week

- "Sync my tweets"
- "How much have I spent on rideshares this month?"
- "What's the status of the Containers launch tasks?"
- "Request Bugonia on Jellyseerr"
- "Check my bookmarks for anything about sauna protocols"

Responses come back like any other text. I'm on my phone, I ask a question, I get an answer. Sometimes that answer is "done, I pushed the changes to git."

## What makes OpenClaw useful

**Memory.** There's a memory system where OpenClaw stores facts, preferences, decisions. It knows my account structures, project names, common queries. I don't re-explain context every time.

**Action.** OpenClaw doesn't just answer questions. It runs commands, edits files, hits APIs, pushes to git. "Ship it" means it actually ships.

**Composable.** Each skill is a markdown file. Want to add a new API? Write instructions in a markdown file. Want to share it? Copy the file.

## What I'm adding next

- Email triage (summarize what needs attention, draft responses) — in progress
- Content capture (tweet something good → auto-draft a blog post expansion)
- Health logging (workouts, supplements, sleep scores via message)

## OpenClaw setup

```bash
curl -fsSL https://openclaw.ai/install.sh | bash
openclaw onboard --install-daemon
```

The wizard sets up auth, channels, and optionally installs OpenClaw as a background service. Then:

```bash
openclaw gateway status
openclaw status
```

Docs: [docs.openclaw.ai](https://docs.openclaw.ai)
Source: [github.com/openclawai/openclaw](https://github.com/openclawai/openclaw)

I'll update this as the setup evolves. The goal is an assistant that handles the boring operational stuff so I can focus on the interesting work. So far, it's working.
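Since each skill is just a markdown file, a minimal one might look something like this (the frontmatter fields, server address, and environment variable here are all hypothetical; check the OpenClaw docs for the real format):

```markdown
---
name: jellyseerr
description: Request movies and shows on the media server
---

When I ask to add a movie, search Jellyseerr for the title, confirm
the match, and submit the request. The server runs at
http://nas.local:5055 (hypothetical address) and the API key is in
the JELLYSEERR_API_KEY environment variable.
```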

January 20, 2026

Building an Instant Messenger

Brayden Wilmoth ·

I built an Instant Messenger application entirely on Cloudflare. It took 1 day, 3 files, and 4 resources... and it's ready to scale from zero to millions. This is how I did it.

January 18, 2026

The speculative and soon to be outdated AI consumability scorecard

Kody with a K's Blog ·


There’s a lot of confusion in the docs and AI space.

That’s due to a lot of factors, including:

  • Everything is new.
  • Everything is changing, and changing at damn near the speed of light (or at least the speed of smell, which is faster than you’d think).
  • There’s no key incumbent setting standards (like Google for SEO or Amazon for shipping packaging).

Given that confusion — as well as the constant questions I get internally at Cloudflare and from my network — I figured I’d take a stab at defining what “AI friendly” means for docs.

Here’s an attempt, or what I’m calling the speculative and soon to be outdated AI consumability scorecard.

The reason I say speculative is because there’s really no centralized or official guidance in the space and — as such — most of the effects of individual actions are hard to evaluate.

Note

My assumption is that you do want your content available to AI crawlers and external developers for AI-related purposes. If you don’t, look at Cloudflare’s AI Crawl Control suite of features.


| Feature | Category | Weight |
| --- | --- | --- |
| robots.txt | AI crawling | 20 |
| Inaccessible preview builds | AI crawling | 25 |
| llms.txt | AI crawling, Content portability | 5 |
| Markdown output | Content portability | 15 |
| Markdown via cURL | Content portability | 5 |
| "Copy page as markdown" button | Content portability | 5 |
| MCP server | Content portability | 20 |
| MCP install links | Content portability | 5 |

My scorecard breaks down functionality into two broad buckets: AI crawling and Content portability.

The AI crawling portion of the scorecard evaluates how accessible your content is to the crawlers from major AI providers (Google, Anthropic, OpenAI, Perplexity, etc.).

By and large, there actually aren't too many things on this list beyond general SEO optimizations. The general takeaway is essentially: optimize for one robot and you optimize for them all.

Google explicitly says as much in its guidance for site owners:

While specific optimization isn’t required for AI Overviews and AI Mode, all existing SEO fundamentals continue to be worthwhile…

They repeat this sentiment further down the page too (in a note):

You don’t need to create new machine readable files, AI text files, or markup to appear in these features. There’s also no special schema.org structured data that you need to add.

A robots.txt file lists a website’s preferences for bot behavior, telling bots which webpages they should and should not access.

In a docs context, you generally want your robots.txt to be a giant PLEASE SCRAPE ME sign for all robots, so it’d probably look like:

robots.txt
User-agent: *
Content-Signal: search=yes,ai-train=yes
Allow: /
Sitemap: https://developers.cloudflare.com/sitemap-index.xml

The slightly new feature here is the Content Signals directive added to your file, explicitly allowing for AI training and AI search use cases. This is a very new standard (also promoted by Cloudflare), so its exact effects are unclear… but it doesn't hurt to pre-emptively opt in.

In the Cloudflare docs, preview builds help us review proposed changes to the docs. We create these whenever someone opens a pull request to our repo, which means we have hundreds, if not thousands, of URLs with different versions of our documentation.

What preview builds don’t help with though is AI crawling. We got a rather unpleasant surprise near the end of 2025, when someone internal flagged to us that ChatGPT was citing preview URLs instead of our production URLs.

Using Cloudflare’s AI Crawl Control, we looked at the traffic to our various subdomains and were unpleasantly surprised again… we found that as much as 80% of the AI crawls were accessing our preview sites instead of our main site. This meant that AI crawlers were prone to returning inaccurate information, which may have seemed like hallucinations.

We solved this problem by adding a custom Worker that covered the routes used by our preview URLs (*.preview.developers.cloudflare.com/robots.txt):

Custom robots.txt for preview URLs
export default {
  async fetch(request, env, ctx) {
    const robotsTxtContent = `User-agent: *\nDisallow: /`;
    return new Response(robotsTxtContent, {
      headers: {
        "Content-Type": "text/plain",
      },
    });
  },
};

What this means in practice is that any and all preview URLs automatically get a restrictive robots.txt to help instruct crawlers.

robots.txt for preview URLs
User-agent: *
Disallow: /
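The same host-based logic can be sketched as a plain function (my generalization for illustration, not Cloudflare's actual Worker): pick the restrictive body whenever the hostname matches the preview-subdomain pattern above.

```javascript
// Sketch: choose a robots.txt body based on the request hostname.
// The preview-suffix check mirrors the route pattern described above.
function robotsTxtFor(hostname) {
  const isPreview = hostname.endsWith(".preview.developers.cloudflare.com");
  return isPreview
    ? "User-agent: *\nDisallow: /" // previews: keep crawlers out
    : "User-agent: *\nAllow: /";   // production: scrape away
}

console.log(robotsTxtFor("abc123.preview.developers.cloudflare.com"));
```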

Not everyone needs to take this approach. Our massive set of preview URLs was due to the Cloudflare defaults (open by default, no auto deletion).

For example:

  • Vercel makes preview builds private by default.
  • Netlify deletes their preview builds automatically, so you may have less of a surface area for crawlers to accidentally find and index previews.
  • Cloudflare also supports Password-gated preview URLs, but we wanted our previews available to any human who wanted to look at them.

The main takeaway here is to figure out whether your own preview URLs are available to AI crawlers and - if so - prevent that from happening. Tools like Cloudflare’s AI Crawl Control are incredibly helpful in this respect (or at least logging that tracks request hosts and user agents).

llms.txt is a file at a well-known path that provides a Markdown list of all your pages.

What’s interesting here is that it’s not a standard, it’s a proposal for a standard.

As such, we have it implemented on the Cloudflare docs… but have seen a lot of discrepancies for how AI crawlers hit it (thanks again, AI Crawl Control!).

Anec-datally, we’ve seen more crawlers hitting this file over time (started with just TikTok Bytespider and GPTAgent), but it’s still far from a common pattern across all crawlers.

It might be a touch overhyped, but it doesn’t hurt to have on your docs site.
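For reference, the proposed format is just a Markdown index: an H1 title, an optional blockquote summary, and lists of links. An illustrative fragment (not the actual Cloudflare file) might look like:

```markdown
# Cloudflare Docs

> Documentation for Cloudflare's developer platform.

## Products
- [Workers](https://developers.cloudflare.com/workers/index.md): serverless compute
- [R2](https://developers.cloudflare.com/r2/index.md): object storage
```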


The Content portability portion of the scorecard evaluates how easy it is to consume your docs content outside your docs site.

Generally, this maps to 2 behaviors:

  • Developers using AI tooling in IDEs or the terminal.
  • Users grabbing content to throw into chatbot interfaces.

Markdown output means that you provide .md equivalents for your standard HTML content.

| HTML | Markdown |
| --- | --- |
| Get started - Cloudflare Workers (HTML) | Get started - Cloudflare Workers (MD) |

The benefits of markdown are that:

  1. It’s simpler to parse and understand.
  2. Models theoretically understand markdown better than other types of content because it’s what they were trained on.
  3. It’s more token efficient (roughly 7x to 10x, depending on the content), meaning that you spend less when a model consumes it as input.

The Cloudflare docs do this through some custom scripting, though you could also use something like Cloudflare Browser Rendering to achieve the same outcome. Some docs platforms also handle this natively for you.
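A rough character-level illustration of the efficiency point (token counts only loosely track text length, and the exact multiple varies by page):

```javascript
// The same two-item link list, rendered as HTML and as markdown.
const html = '<ul><li><a href="/docs/workers/">Workers</a></li><li><a href="/docs/r2/">R2</a></li></ul>';
const markdown = "- [Workers](/docs/workers/)\n- [R2](/docs/r2/)";

console.log(html.length, markdown.length); // the markdown version is far shorter
```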

Why markdown? Isn’t web scraping a solved problem?

There’s an interesting delineation between AI crawlers and individual developers here:

  • AI crawlers are going to crawl your content however it appears (again, as Google explicitly says).
  • Individual developers say they want your docs content specifically in markdown.

Personally, I chalk this up to 3 reasons:

  1. Incentives: AI crawlers just want content, period. Individual developers want specific content and tools that fit within their workflows, so they’re pickier about formats.
  2. Expertise: AI crawlers know how to scrape large volumes of content. Individual developers might not be as versed in common scraping tools (mostly Python libraries) like Beautiful Soup.
  3. Purpose: AI crawlers again are hoovering up any and all content. Developers are working within AI-specific tooling (IDEs, terminals) that happens to consume content. As such, the token efficiency matters, as does having a more standardized / less effortful way of getting the content you need. “I don’t want to ever think about parsing your HTML, just give me markdown” is the vibe here.

A related feature is that your site will respond with markdown if someone uses the Accept: text/markdown header with a request (Mintlify | Bun).

curl https://developers.cloudflare.com/workers/get-started/guide/ -H 'Accept: text/markdown'

Provided you already have markdown output, this feature isn’t too difficult to implement. We have some custom logic for parsing requests in the Cloudflare docs for this feature.
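
As a sketch of the idea (not Cloudflare’s actual implementation): a Worker could inspect the Accept header and map the request to a pre-generated Markdown sibling. The index.md asset layout here is an assumption for illustration:

```typescript
// Hypothetical content-negotiation helper: if the client asks for
// text/markdown, compute the URL of a pre-generated index.md sibling.
// The index.md naming convention is an assumption, not Cloudflare's code.
function markdownUrlFor(request: Request): string | null {
  const accept = request.headers.get('Accept') ?? '';
  if (!accept.includes('text/markdown')) return null;
  const url = new URL(request.url);
  // e.g. /workers/get-started/guide/ -> /workers/get-started/guide/index.md
  const dir = url.pathname.endsWith('/') ? url.pathname : url.pathname + '/';
  return url.origin + dir + 'index.md';
}
```

A Worker’s fetch handler would then serve that asset when the helper returns a URL, and fall through to HTML otherwise.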

In a growing number of docs sites, you’ll now see a dropdown of Page options at the top right.

Within this element, you can copy the current page as markdown and then paste the content into your LLM of choice.

Copy page as markdown button

Much like markdown via cURL, the main work of this feature is making sure you have the markdown output to copy in the first place.

If you use Starlight as a docs framework, they now have a custom plugin to automatically add this functionality.

An MCP server is a way of exposing certain tooling / functions to AI agents or interfaces. Think of it as a REST API, but for LLMs.

In an application context, that means an agent could create / update / delete something in another application easily.

In a docs context, your MCP server lets AI tools search your documentation directly instead of relying on generic web searches (or past training data). It’s another way in which developers are taking a lot of the actions normally performed on your docs site and moving that to a different context (IDE, terminal, chatbot).
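
Setting one up is beyond the scope of this post, but at its core a docs MCP server mostly wraps a search function over your (ideally Markdown) content. A dependency-free sketch of what such a tool handler might do; DocsPage and searchDocs are made-up names, not a real SDK:

```typescript
// Illustrative only: the kind of search a docs MCP tool would expose.
interface DocsPage {
  title: string;
  url: string;
  markdown: string;
}

function searchDocs(index: DocsPage[], query: string, limit = 3): DocsPage[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return index
    .map(page => ({
      page,
      // Naive scoring: count how many query terms appear in the page
      score: terms.filter(t =>
        `${page.title} ${page.markdown}`.toLowerCase().includes(t)
      ).length,
    }))
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(r => r.page);
}
```

A real server would expose this as an MCP tool (and likely use proper full-text or vector search), but the shape of the work is the same.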

If you already have an MCP server set up, you can provide developers direct install links for several flavors of IDEs and other tooling.

We have a subset of these in the Cloudflare docs, though you can see other examples in the OpenAI docs or Better-Auth docs.

MCP install links


The biggest gap is the “fun” conundrum of “I need this content available to everyone... just not AI crawlers.”

The clearest illustration of this is in versioned content, like outdated Wrangler commands:

  • You want that content accessible to people, who might be running the old versions.
  • You also want that content accessible to search crawlers, so people can find the content via Google or internal search.
  • You don’t want that content accessible to AI crawlers.

Another flavor of this appears in certain best practices guides, where you intentionally want to document an anti-pattern (don't do this). Makes sense for humans (and search), but not for AI crawlers to pull into their training data.

There’s really no mechanism to flag a specific set of content as preferred over another.
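
If you wanted to approximate this today, about the best you can do is user-agent filtering at the edge: let search crawlers through everywhere, but turn known AI crawlers away from versioned paths. A hypothetical sketch (the UA list and path pattern are illustrative and far from exhaustive):

```typescript
// Hypothetical sketch of per-path AI-crawler filtering.
// The UA substrings below are real crawler names, but the list is
// illustrative, not complete; the versioned-path regex is an assumption.
const AI_CRAWLERS = ['GPTBot', 'Bytespider', 'ClaudeBot', 'CCBot'];

function shouldBlock(pathname: string, userAgent: string): boolean {
  const isVersioned = /\/v\d+(\.\d+)*\//.test(pathname);
  const isAiCrawler = AI_CRAWLERS.some(ua => userAgent.includes(ua));
  return isVersioned && isAiCrawler;
}
```

It’s a blunt instrument, which is exactly the point: there’s still no way to say “this content is deprecated, prefer the other page” to a crawler.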


I very well might be missing things in this blog, whether they be current functionality or (very near) future standards.

Feel free to drop me a line at [email protected] if you have questions, thoughts, or ideas.

January 15, 2026

Open Graph Images in Astro: Build-Time vs Runtime

Jilles Soeters ·

January 15, 2026 / 9 min read


My recent move to Astro for this blog has given me new things to learn and explore. One thing I never paid much attention to is Open Graph images, or OG images. Those previews you see on social media.

I always assumed you’d just create a custom image for each page. But that’s the old way of thinking. We can now create images programmatically and serve them dynamically. Here’s an example of GitHub’s OG image for Astro:

Astro's Open Graph image showing dynamic GitHub stats

Notice how it has several dynamic elements such as stars, contributors, and the programming language distribution bar.

In this article, we’ll explore two ways you can create OG images in Astro:

  1. At build time
  2. At runtime (dynamically using an API)

While this article is about Astro, the same principles apply across frameworks.

Generating Open Graph Images at Build Time

When you only have a few pages, you might want to hand-craft your OG images. If you want to have a certain brand feel to it, you could ask your designer to make an image for you before you publish a new page or article.

We can do it in code, too! There are various ways, but the most common approach I’ve seen uses two well-maintained packages in a two-step process:

  1. Use Satori to render JSX into SVG
  2. Use resvg to render the SVG into a PNG

Alternatively you can use sharp to generate images in a wide array of image formats, but for our purposes of generating a PNG, the Rust-based resvg is faster.

I was using this approach for this blog until I finished writing this article and realized dynamic generation is often the better choice; more on that later.

As mentioned earlier, you’d use Satori to generate an SVG and resvg to turn it into an image. My build script looked as follows:

scripts/build-og.mjs
import { glob, mkdir, readFile, writeFile } from 'node:fs/promises'; // glob needs Node 22+
import satori from 'satori';
import { Resvg } from '@resvg/resvg-js';
import matter from 'gray-matter';
import getReadingTime from 'reading-time';

const colors = {...};

// Get metadata for each og:image
function parsePost(source) {
  const { data, content } = matter(source);
  return {
    title: data.title,
    slug: data.slug,
    tags: data.tags ?? [],
    readingTime: getReadingTime(content).text,
  };
}

// JSX-like tree (could use actual JSX too)
function createOgTree({ title, tags, readingTime }) {
  return {
    type: 'div',
    props: {
      style: { display: 'flex', flexDirection: 'column', /* ... */ },
      children: [
        { type: 'h1', props: { children: title } },
        { type: 'span', props: { children: readingTime } },
      ],
    },
  };
}

// Step 1 and 2 from above (svg -> png)
async function renderPng(tree, fonts) {
  const svg = await satori(tree, { width: 1200, height: 630, fonts });
  return new Resvg(svg).render().asPng();
}

async function main() {
  const fonts = [{ name: 'MyFont', data: await readFile('font.ttf') }];
  await mkdir('dist/og', { recursive: true });
  for await (const file of glob('src/content/**/*.mdx')) {
    const post = parsePost(await readFile(file, 'utf8'));
    const png = await renderPng(createOgTree(post), fonts);
    await writeFile(`dist/og/${post.slug}.png`, png);
  }
}

await main();
  1. Parse the post to get its content and metadata
  2. Create a JSX-like tree without actually writing JSX
  3. Render into a PNG
  4. Write it into /dist so it’s available at runtime

You could also write these to /public so they’re treated as static assets, but I preferred generating them into dist/ during CI.

The above script ran after each build using an npm script:

package.json
{
  "name": "jillesme",
  "scripts": {
    ...
    "og:build": "node scripts/generate-og.mjs",
    "postbuild": "pnpm og:build"
  }
}
I used Claude Code to write the compiled JSX output directly inside createOgTree instead of importing React and creating a component.

One caveat of using Satori is that it supports only a subset of CSS: each container element needs display: flex, and not every CSS property is available. There is also additional work required to import fonts.

If you want real page renders, exactly as the browser would produce them, you can use the next method:

Using Playwright to Take Screenshots

Instead of writing JSX, what if you could automatically create a screenshot of each page and use that as the OG image?

It’s easier than it sounds. We need to add a single dependency: playwright. Then we can update our script to iterate over our posts, open them in a real browser, and take a screenshot.

scripts/build-og.ts
...
const baseUrl =
  process.argv.find(a => a.startsWith('--base-url='))?.split('=')[1]
  ?? 'http://localhost:4321';

async function screenshotPages(posts) {
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.setViewportSize({ width: 1200, height: 630 });
  for (const { slug } of posts) {
    await page.goto(`${baseUrl}/${slug}/`, { waitUntil: 'domcontentloaded' });
    await page.screenshot({ path: `dist/og/${slug}.png` });
  }
  await browser.close();
}

Running this will result in:

$ pnpm og:build --screenshot
> jillesme-astro@0.0.1 og:build /Users/jilles/Code/jilles.me
> node scripts/generate-og.mjs --screenshot
Screenshot mode: capturing 34 pages from http://localhost:4321
Screenshotting http://localhost:4321/badassify-your-terminal-and-shell/
Screenshotting http://localhost:4321/thinking-in-networks-cloudflare-storage/
Screenshotting http://localhost:4321/setting-up-spring-jdbc-and-sqlite-with-write-ahead-logging-mode/
...

This will take longer but has the additional benefits of rendering an OG image that is a real representation of your webpage. No limitations in CSS or font rendering:

Playwright Render of Another Article

The above example used 1200x630 for the browser size, but you could also use 600x315 for more condensed images or set a zoom factor in Playwright.

I don’t use this approach for my websites, but I think it’s valuable to know it is an option.

When to use Build Time Generation

Before writing this article, my understanding was that build-time generation is great if you have fewer than a few hundred pages. After spending a lot of time thinking about it, I no longer believe that’s true.

Instead, build-time generation is great for the following scenarios:

  1. You want real screenshots. This takes too much time to do on demand.
  2. Your Open Graph images can’t easily be rendered by Satori.
  3. Your Open Graph images require heavy computation. I had this happen when we wanted to blur a frame of a 20 MB GIF. The Worker ran out of memory.

Since OG images aren’t fetched on every page view but only when shared on various social platforms, generating them at runtime is often the better option.

Generating Open Graph Images at Runtime

An effective approach to creating OG images is to use an API: a request to /og/post-name.png hits a GET handler that returns an image response instead of a static file.

Two popular options are:

  1. @vercel/og uses Satori and Resvg to create a PNG
  2. workers-og uses Satori and Resvg WASM to create a PNG

Since my blog runs on Cloudflare Workers (free plan, still!), I originally used workers-og, but the same principles apply to @vercel/og.

I’m leaving the following workers-og section as is because I want this to be about the learning, not about promoting another npm package. After writing this post, I spent a few evenings creating cf-workers-og, which ships the latest Satori and resvg. It also handles HTML parsing more robustly and, more importantly, has Vite integration.

Adding workers-og to Astro

Installing it is easy, but testing it locally requires extra effort.

$ pnpm i workers-og

And then we create an API page in Astro:

src/pages/og/image/[slug].png.ts
import type { APIRoute } from 'astro';
import { ImageResponse } from 'workers-og';

// Don't render during build
export const prerender = false;

const colors = {};

export const GET: APIRoute = async ({ params }) => {
  const slug = params.slug || 'untitled';
  // Here you could fetch dynamic data such as GitHub stars
  const starCount = await getStarsBySlug(slug);
  const html = `
    <div style="display: flex;">
      This has ${starCount} stars!
    </div>
  `;
  return new ImageResponse(html, {
    width: 1200,
    height: 630,
    headers: {
      'Cache-Control': 'public, max-age=60',
    },
  });
};
  1. Turn prerendering off so it doesn’t run during the build step.
  2. Load your dynamic data inside this API
  3. Optionally, set caching headers to keep computationally expensive generation from becoming a DoS vector.

And that’s all there is to it! Or so I thought: you will likely run into the following error in your local development environment:

Cannot find package 'a' imported from /Users/jilles/Code/jilles.me/node_modules/.pnpm/workers-og@0.0.27/node_modules/workers-og/dist/yoga-ZMNYPE6Z.wasm

This confused me at first. What is Yoga? Why a?

I spent some time investigating and learned that yoga.wasm is a WebAssembly port of Facebook’s Yoga layout engine, used by React Native and… Satori!

a in this case is a minified module Vite is trying to import from node_modules, but it’s not a node module. It’s a WASM import namespace.

This works on Cloudflare Workers, but not during development with Vite. So how do we test it? We build our Astro site and run wrangler dev instead of astro dev.

$ pnpm run build
$ pnpm dlx wrangler dev
Starting local server...
[wrangler:info] Ready on http://localhost:8787

Now we can see our result locally before deploying:

Astro's Open Graph image showing a dynamic title

Alternative: Adding cf-workers-og

If you don’t want to run into the issues above, follow these steps instead:

$ pnpm i cf-workers-og

Create the API route in similar fashion:

src/pages/og/image/[slug].png.ts
import type { APIRoute } from 'astro';
import { ImageResponse, GoogleFont, cache } from 'cf-workers-og/html';

export const prerender = false;

export const GET: APIRoute = async ({ params, locals }) => {
  cache.setExecutionContext(locals.runtime.ctx);
  return ImageResponse.create(<div style="display: flex;"></div>, {
    width: 1200,
    height: 630,
    fonts: [
      new GoogleFont('Inter', { weight: 400 }),
    ],
    headers: {
      'Cache-Control': 'public, max-age=60',
    },
  });
};
  1. Import is different (cf-workers-og/html)
  2. Set execution context for the cache, so that your fonts get cached (avoid refetching fonts and hitting rate limits)
  3. ImageResponse.create() instead of new ImageResponse
  4. Pass a Google Font

This works locally (e.g. http://localhost:4321) and in the Cloudflare Workers environment. A nicer developer experience in my opinion.

When to Use Runtime Generation

Unless you need browser screenshots or heavy computation, runtime generation is usually the better option.

There is no build-time overhead. OG images are requested by social platforms and heavily cached, so runtime generation typically runs once per cache window (per platform). And you get the option to embed any sort of dynamic data inside the Open Graph image.

I’ve become a big fan.

Outro

While writing this article I learned a lot about Open Graph images. I found opengraph.xyz an excellent resource to test out Open Graph images in production.

I learned that Satori uses Facebook’s Yoga layout engine, and how WASM modules work in Vite and Cloudflare Workers environments. It led me to work on cf-workers-og, which in turn taught me about patching dependencies using pnpm.

January 09, 2026

I Was Thinking in Databases. I Should Have Been Thinking in Networks: A Mental Model Shift for Cloudflare Storage

Jilles Soeters ·

January 9, 2026 / 7 min read


It took me longer than I’d like to admit to understand Cloudflare’s storage products. It’s not that the docs aren’t sufficient; it’s that I was lacking the right mental model.

Once I started thinking in terms of a global network, it all clicked. Not just how storage products work, but why Cloudflare products are designed the way they are. I hope this article does the same for you.

KV (Key/Value)

Key-value stores have been around since the late ’70s. Ken Thompson worked on dbm, which mapped keys to values. Today we might think of Redis as a popular KV store.

But to think that Cloudflare KV is “basically Redis” would be a mistake. A mistake I have made myself.

You’d use Redis for rate-limiting, for example. You wouldn’t use KV for rate-limiting due to its eventual consistency model.

Storing a value in Cloudflare KV is as easy as:

import { env } from 'cloudflare:workers';

await env.KV.put('feature:dark-mode', 'true', {
  expirationTtl: 86400 // 24 hours
});

This stores the key feature:dark-mode with the string value 'true' in one of Cloudflare’s central stores.

When someone tries to read that value, it is fetched from the central store if the data is not cached (cold). The result is then cached at the edge by the requesting colocation, allowing subsequent reads to be served from the cache (hot), under 50 ms for 95% of the world.

What does that look like? Below you can see a simulation. A request will go out to the nearest central storage for the first time, then get cached at the requesting colocation. All users close to that location will get that value served for the remaining TTL.

(Interactive simulation: writes and cold reads travel to central storage; cache hits are served from the requesting colo.)

Aha! Now eventual consistency makes sense. When you write a new value, it will take several seconds to update all locations. It will be consistent across the network, eventually.
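
A toy model (nothing like KV’s real implementation, just the mental model) makes the staleness window concrete: writes update the central store but never invalidate colo caches, so a cached read can return the old value until its TTL lapses:

```typescript
// Toy model of KV's consistency behavior: one central store plus
// per-colo caches with a TTL. Writes do NOT invalidate caches, which is
// why a colo can briefly serve a stale value after a write.
class ToyKV {
  private central = new Map<string, string>();
  private caches = new Map<string, Map<string, { value: string; expires: number }>>();

  put(key: string, value: string): void {
    this.central.set(key, value); // caches are left untouched
  }

  get(colo: string, key: string, now: number, ttl = 60): string | undefined {
    let cache = this.caches.get(colo);
    if (!cache) {
      cache = new Map();
      this.caches.set(colo, cache);
    }
    const hit = cache.get(key);
    if (hit && hit.expires > now) return hit.value; // hot read (possibly stale)
    const value = this.central.get(key); // cold read from central storage
    if (value !== undefined) cache.set(key, { value, expires: now + ttl });
    return value;
  }
}
```

Run the sequence write v1, read, write v2, read: the second read still returns v1 from the colo cache until the TTL expires. That window is the “eventually” in eventual consistency.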

This insight cleared up for me why KV is not the correct product for rate-limiting. Instead, you’d want to use Rate Limiting or Durable Objects.

D1 (SQLite-like)

SQLite is the most used database engine in the world. While D1 isn’t exactly SQLite, it uses the same query engine.

D1 is Cloudflare’s serverless database which allows you to store relational data. It comes with useful features like time-travel and read replication (in beta at time of writing).

There is one important constraint you need to be aware of: D1 databases have a 10GB limit. You can have many smaller databases (up to 50,000, horizontal scale), but they can’t grow past 10GB (vertical scale).

For all my personal projects, D1 has been more than enough. Yet if I were building a business today, I’d consider Hyperdrive or Durable Objects instead. More on this later.

When you create a D1 database, it is placed close to you, or close to the location hint you provided. Then you can query it with familiar SQLite syntax.

worker.ts
import { env } from 'cloudflare:workers';

// Write: Insert new order
await env.DB.prepare(
  `INSERT INTO orders (user_id, product)
   VALUES (?, ?)`
).bind('user-123', 'Blue Belt').run();

// Read: Get order status
const order = await env.DB.prepare(
  `SELECT * FROM orders WHERE user_id = ?`
).bind('user-123').first();

Users all over the world will reach that database for both reads and writes. This will be fast for your users close to the location hint I mentioned earlier, but slower for people on the other side of the planet.

This slowness is addressed by read replication, which keeps read-only replicas of your primary database synchronized across Cloudflare’s edge network, allowing read queries to be served from locations closer to users.

To use read replication, you have to enable the D1 Sessions API. This routes all queries from a user/browser session to the same D1 instance.

worker.ts
import { env } from 'cloudflare:workers';

// Resume the session from the client's last-seen bookmark, if any
const bookmark = request.headers.get('X-D1-Bookmark') ?? 'first-unconstrained';
const session = env.DB.withSession(bookmark);

// Write: Insert new order
await session.prepare(
  `INSERT INTO orders (user_id, product)
   VALUES (?, ?)`
).bind('user-123', 'Blue Belt').run();

// Read: Get order status
const order = await session.prepare(
  `SELECT * FROM orders WHERE user_id = ?`
).bind('user-123').first();

// Hand the bookmark back so the next request continues the same session
response.headers.set('X-D1-Bookmark', session.getBookmark());

It’s visualized below; toggle read replication ON/OFF.

(Interactive simulation: writes always go to the primary database; with read replication (beta) toggled on, reads are answered by nearby replicas.)

I’ve been using D1 successfully for several personal and internal projects with Drizzle ORM. If a database starts growing past 8GB, I will migrate to the next product we’ll discuss:

Hyperdrive

One of the most powerful products you might not have heard of is Hyperdrive. It is a great alternative for when you need globally fast applications but D1’s size limit is a deal-breaker.

Like the other products, Hyperdrive makes sense in the context of Cloudflare’s network. Instead of your application directly connecting to your database, Cloudflare will keep a connection pool warm close to your physical database, so you can skip the entire TCP/SSL set-up for every request.

Even better is query caching, which leverages the network to cache read queries, with a similar mechanism to KV from earlier. This means that your database in US east can feel local to a user in Europe.

import { env } from 'cloudflare:workers';
import postgres from 'postgres';

const sql = postgres(env.HYPERDRIVE.connectionString);

const users = await sql`
  SELECT * FROM users WHERE active = true
`;

How does this look within the network? Even though the database is in Eastern North America, you will see read queries from Europe and Asia stay local.

(Interactive simulation: the database sits in ENAM; queries from other regions hit the nearby edge connection pool and, when cached, never leave their region.)

If you don’t have an external database, there is another great alternative for building stateful applications: the illustrious Durable Object.

I recently published a YouTube video where we create a PlanetScale database and publish a Next.js application using Hyperdrive. You can see Hyperdrive in action on https://thecoffeecluster.com.

Durable Objects

An extraordinary data product inside of Cloudflare’s catalog is Durable Objects. There is nothing like it. Because of that, it’s less intuitive to understand.

Like the products we discussed before, Durable Objects make sense when you understand them in the context of Cloudflare’s network.

They allow you to store arbitrary state. Imagine an object in JavaScript that only exists in memory, except it’s persisted to Cloudflare’s edge and has first class support for WebSockets and Alarms.

Each Durable Object instance exists in a single location at a time, in most cases close to the Worker that created it. Once the Durable Object instance is created, it stays there.
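
That routing rule is the magic: idFromName deterministically maps a name to one instance, so every caller, wherever they are, converges on the same object. A toy sketch of the idea (in-memory only, nothing like the real runtime):

```typescript
// Toy model of a Durable Object namespace: the same name always resolves
// to the same instance, which is what gives DOs their single point of
// coordination. Purely illustrative; the real runtime is distributed.
class ToyNamespace<T> {
  private instances = new Map<string, T>();

  constructor(private factory: () => T) {}

  get(name: string): T {
    let instance = this.instances.get(name);
    if (!instance) {
      instance = this.factory();
      this.instances.set(name, instance); // created once, then reused
    }
    return instance; // same name -> same instance, for every caller
  }
}
```

Two requests for the same event name reach the same object (and thus the same state); a different name gets its own isolated instance.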

worker.ts
import { DurableObject } from "cloudflare:workers";

export interface Env {
  BOOKING: DurableObjectNamespace<SeatBooking>;
}

export class SeatBooking extends DurableObject<Env> {
  async bookSeat(
    seatId: string,
    userId: string
  ): Promise<{ success: boolean; message: string }> {
    ...
  }
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const eventId = url.searchParams.get("event") ?? "default";
    const id = env.BOOKING.idFromName(eventId);
    const booking = env.BOOKING.get(id);
    const { seatId, userId } = await request.json<{
      seatId: string;
      userId: string;
    }>();
    const result = await booking.bookSeat(seatId, userId);
    return Response.json(result, {
      status: result.success ? 200 : 409,
    });
  },
};
The example was taken from the phenomenal Rules of Durable Objects. I removed some implementation details for brevity. You can see the full implementation in the link above.

What would this look like inside the network? See below. You can click “New Instance” to simulate creating a new Durable Object (DO).

(Interactive simulation: each Durable Object instance lives in a single location, and incoming requests from everywhere route to it.)

It’s important to state how powerful Durable Objects are. D1, Queues, Workflows and Agents are all examples of Cloudflare products built on top of Durable Objects.

R2 Object Storage

Your application has users. In many cases these users need to upload data. Whether it is a profile picture or a PDF. It needs to go somewhere.

Many of us are familiar with S3. Cloudflare R2 is similar, but has no egress fees. That means you don’t pay for people downloading items stored in your bucket.

R2 uses a similar model to D1 with a single write-primary, but data is replicated within the same region for redundancy. When you want to download a file from R2, a metadata request is made to the location of the bucket.

Downloading the content is fast if it is cached in a tiered read cache, which is the “Read Cache” in the diagram below:

(Interactive simulation: uploads go to the storage location in ENAM; downloads are served from storage or, when possible, from the tiered read cache.)

Outro

Did it click? Region: earth makes sense when you can visualize the network. I hope this article helped with that.

There are two honorable mentions I didn’t talk about: Vectorize and Analytics Engine. The truth is that I haven’t used them enough yet, but I will.

If you have any questions or want to give feedback, don’t hesitate to reach out to me on Twitter/X.

January 07, 2026

the context is the work (what the day-to-day looks like now)

Sunil Pai ·

7 January 2026

(corollary- pr descriptions are becoming the apprenticeship surface for remote teams and coding agents)

in my last post, where good ideas come from (for coding agents), I argued that agents don’t magically make engineering easy; they make implementation cheap, and then they amplify whatever you were already doing to keep work aligned with reality: constraints, context, an oracle, and a loop. this is the zoomed-in sequel.

after you’ve used coding agents for a bit, you start to notice something that feels a little backwards:

  • code got cheaper.
  • but review often feels harder.

not because the diffs are bigger (they’re often smaller). and definitely not because agents write unreadable code (they often write very clean code).

I think it’s because the hard part of the job moved (“shifted left”, one might say).

tl;dr: the scarce part of the workflow seems to be becoming context: intent, assumptions, constraints, trade-offs, and verification. and the place that ends up holding a lot of that is the pull request description. not just the diff.


a normal day now: one pr, one change, a lot of thinking

let’s reuse the same running example from the last post: webhook ingestion.

you have a handler that validates a signature, stores an event, and enqueues a job. and prod keeps reminding you that the world is adversarial:

  • partners retry aggressively → duplicates
  • downstream fails halfway → partial side effects
  • p99 latency is creeping up → every “fix” risks making tails worse

you open a pr titled something like:

make webhook ingestion reliable (idempotency + retries)

you ask an agent to help.

it comes back fast with a plausible diff:

  • adds an idempotency key
  • prevents duplicate enqueue
  • adds a retry wrapper in the worker
  • improves logging

tests pass. the code is clean.

a few years ago, you might have merged this (or at least felt good about it).

today, the thing you feel is: unease.

because the question isn’t “does this compile?” and it isn’t even “does this pass tests?”; it’s:

  • what does “idempotent” mean here, exactly? (partner id + event id? payload hash? something else?)
  • where do retries live? (edge handler vs background worker)
  • what is the p99 budget? (and did we just tax it)
  • what’s the failure model? (fail closed? fail open? dlq?)
  • what did we not change? (data model? queues? partner contract?)

and here’s the part that I keep bumping into: the agent can generate the diff in minutes, but you can still spend hours making sure it’s the right diff.

at least in my experience, that time doesn’t feel like overhead anymore, it feels like where the work actually went.


the inversion: the pr description starts carrying the engineering

the diff answers: what changed.

the pr description is supposed to answer:

  • what did we mean?
  • what constraints shaped this?
  • how do we know it’s correct?
  • what should reviewers focus on?

in the agent era, that list feels like it’s becoming the center of gravity.

a clean diff is only evidence of taste, or the lack of it. (taste matters, but it’s not the same thing.)

when codegen is cheap, “engineering” shifts toward:

  • turning tribal knowledge into written constraints
  • making trade-offs explicit
  • choosing and running verification loops
  • drawing scope boundaries

and the pr description becomes the place where all of that gets serialized.

I think this is why reviews can feel harder: the implementation “matters” less; you’re reviewing an interpretation, and the history that brought you there.


why “paste the transcript” is usually not the move

a predictable reaction to this shift is: “ok, so we should attach the entire agent chat log to the pr.”

I get the instinct, but in practice I don’t think it works that well.

full transcripts have the same problem as raw debug logs:

  • they are high volume
  • they mix signal and noise
  • they contain false starts and vibe-driven tangents
  • they’re optimized for getting to an answer, not making the answer auditable

reviewers need the plot points! they need to follow the story as you’ve told it (indeed, as you discovered it). so you want structured provenance to help them follow that story:

  • what the goal was
  • what constraints were non-negotiable
  • what decision points existed
  • what you chose, and why
  • what you ran to validate it

(I’m sure a bunch of tools will try to automate this; ai-generated pr summaries, walkthroughs, session logs, etc. those are useful, but most of them are currently better at what changed than what constraints mattered. and that second part is still where correctness tends to live.)


this was always the interview signal

this is the part that made me feel less crazy: we’ve been here before.

think about coding interviews; the ones where the prompt is intentionally under-specified.

yes, you’re “solving a problem.” but the real signal isn’t whether you can type out a solution under time pressure. the real signal is usually:

  • what clarifying questions you ask
  • what constraints you surface
  • what assumptions you state explicitly
  • how you test the edges mentally before writing code

good interviewers, they watch how you construct the problem, and actually consider the code that you write to be a side effect of that process.

and agents, aha, they didn’t invent this skill! they just make the gap obvious when you skip it (I do wonder if having interview banks and tips in the training dataset influenced this, but that’s a digression for another day.)

in day-to-day work with coding agents, the first five minutes often look like the best kind of interview:

before we write anything: what does “correct” mean here, and what would make this wrong?


the lost apprenticeship (and where it went)

there’s another shift here that I think we’re still metabolizing: remote work.

pre-remote (or even just pre-async), juniors learned the job by proximity:

  • you sat next to seniors (like, literally, peering over their shoulder. it’s how even I learnt javascript 20 years ago)
  • you overheard how they reasoned
  • you watched how they decomposed work
  • you saw what they worried about before they wrote code
  • you learned the “how we do things here” stuff without anyone having to name it

that apprenticeship surface was always real, even if informal, and it was so fkin critical.

now, in remote teams, and especially with agents, most of that ambient learning doesn’t travel the same way.

you can’t really overhear judgment.

if you want it to transmit, you have to serialize it.

and where does that serialization most naturally happen?

  • in pr descriptions
  • in review comments
  • in small design notes
  • in decision logs

this is the part I care about for juniors: your future job is probably not to out-type the agent. it’s to learn the questions that make the work converge.


“knowing what questions to ask” becomes the job

I think juniors often feel like their job is “ship tickets.”

in an agent-heavy world, the more durable job description might be:

turn ambiguity into constraints, and make correctness legible.

some questions that almost always matter (and that you can learn to ask on purpose):

  • goal: what outcome are we trying to produce? (not the mechanism)
  • non-goals: what are we explicitly not doing?
  • invariants: what would make this change wrong even if tests pass?
  • failure modes: how does this fail in prod? what do we do then?
  • verification: what did we run? what would convince us this works?
  • rollout/rollback: how do we deploy safely, and how do we undo it?

I don’t think these are “soft skills.” they’re the mechanics of building correct systems.

seniors often have these questions in their head as scar tissue. agents make that scar tissue more visible because if you don’t write it down, the model will happily fill the gaps with a plausible completion.


the pr description as interface

the pr body feels like it’s becoming the interface between:

  • fast, probabilistic agent work
  • slow, accountable human judgment

and also between:

  • seniors and juniors
  • your team and the rest of the org
  • present you and future you

what I think would work is a three-layer pr description. think of it as optimizing for three reader modes.

layer 1: executive intent (30 seconds)

answer:

  • what changed?
  • why now?
  • what’s the user/system-visible outcome?

layer 2: reviewer guidance (3–7 minutes)

answer:

  • where should I look?
  • what invariants matter?
  • what trade-offs did you make?
  • what did you deliberately not change?

layer 3: provenance + replay (only if needed)

answer:

  • what context did you use?
  • what decisions were made?
  • what commands/tests were run?

(this is the part that belongs in a collapsed <details> block. it’s there when you need it, but it doesn’t turn the pr into a novel.)


a concrete format (with one filled example)

here’s a format that reads well (imo). it’s intentionally similar to the “context packet” template from the last post, but aimed at review rather than prompting.

suggested pr body

  • goal:
  • non-goals:
  • constraints / invariants:
  • approach:
  • what changed (walkthrough):
  • verification:
  • risks & rollback:
  • context manifest (for audit / archaeology, put this inside a <details> block)
    • prompt summary (not transcript): what we asked the agent to do, in intent/constraints form
    • repo anchors used: the handful of files/docs that defined “truth”
    • decision points: the 2–4 moments where options existed and we chose one
    • tools invoked: tests, linters, benchmarks, and outcomes

filled example (webhook reliability)

goal: prevent duplicate downstream effects when partners retry the same webhook delivery.

non-goals: no new datastore; no partner-facing contract changes; no retries inside the http handler; no large refactors.

constraints / invariants:

  • idempotent under partner retries
  • p99 handler latency must not regress
  • no payload logging (pii)
  • retries must be bounded + jittered (worker-side)

approach: implement ingestion idempotency keyed by (partner_id, event_id); keep edge path fast; move retries to worker; add dlq path for poison events.

what changed (walkthrough):

  • added idempotency guard at ingestion to prevent duplicate enqueue
  • reused existing retry policy helper for worker retries
  • added dlq handling for permanent failures
  • added metrics: duplicate rate, retry count, dlq depth

verification:

  • added tests for duplicate deliveries → only one enqueue
  • added tests for transient downstream failure → bounded retries
  • ran: unit tests + integration tests

risks & rollback:

  • risk: incorrect idempotency key choice could drop legitimate events
  • rollout: feature flag at 1% for one partner, monitor duplicate rate + p99
  • rollback: disable flag

context manifest
  • prompt summary: implement ingestion idempotency only; keep diff small; reuse retry policy; do not add new abstractions; add tests for duplicates and retries.
  • repo anchors used: existing retry module; queue abstraction; error taxonomy types; logging/metrics helpers; prior ingestion endpoint pr.
  • decision points:
    • key choice: (partner_id, event_id) vs payload hash → chose ids because hashes aren’t stable identifiers across real retries
    • retry placement: worker vs handler → chose worker to protect p99
  • tools invoked: unit tests, integration tests (green)

how this relates to the “context packet” from seven-ways

in the last post I described a context packet as an input artifact: a small, curated bundle that stops the agent from guessing what “truth” is.

this post is about the corresponding output artifact:

  • the context packet tells the agent what truth is.
  • the pr description tells humans how we enforced it.

same discipline, different direction.


zooming out: prs are the smallest unit of org sense-making

once you take this seriously, it’s hard not to notice the scale effect.

pr descriptions are the “small” version of a bigger job:

  • explaining what your team is doing
  • explaining what you aren’t doing (and why)
  • sequencing work so the org doesn’t invent a story for you

this sounds like “communication,” but I think it’s more precise to call it engineering judgment at larger radii.

the failure mode is the same as with agents:

  • people fill in missing context with plausible completions
  • you get misalignment that nobody intended
  • and then you pay for it later

a good pr description starts by helping a reviewer, but then it helps your future teammate, your future oncall, your future self, and anyone trying to understand what changed and why. this is the job!


what this changes about onboarding and growth

in remote teams, onboarding is mostly archaeology.

people learn your system by:

  • reading prs
  • reading code
  • reading runbooks
  • reading incident docs

if those artifacts aren’t interrogable (i.e., if they don’t encode intent and constraints), then onboarding turns into a game of telepathy.

for juniors, this can be weirdly good news:

  • you don’t need to be the fastest typist
  • you need to become fluent in constraints, verification, and failure modes

for seniors, it explains why the job can feel heavier:

  • you may write less code
  • but you’re responsible for coherence

so it seems to me that agents didn’t remove senior work, they mostly exposed it.


closing

we’ve said this before: agents make code cheaper; they don’t seem to make judgment cheap.

so the job shifts toward making your intent, constraints, and verification legible; first to the agent, then to your reviewer, then to your team, and eventually to the rest of the org.

if the future junior can’t sit next to you and absorb how you work, they’ll learn from what you actually wrote down.

which means pr descriptions probably can’t be a formality anymore. they might be the actual apprenticeship surface.


epilogue: what it might mean to “review” context

there’s a follow-on idea I can’t stop thinking about.

in the past, review wasn’t just reading:

  • you checked the branch out
  • you ran tests
  • you poked at it in a repl
  • you wrote a quick harness

that was a way of interrogating the work.

if the pr description now carries a meaningful chunk of the engineering judgment, we might need an equivalent way to interrogate that too.

maybe that looks like:

  • linking a pr to the coding session that produced it
  • replaying the session in a sandbox
  • branching the session to ask your own “what if…” questions
  • and (somehow) merging the results back, just like we used to merge commits from multiple humans into one pr

this opens a bunch of weird questions:

  • what’s the “diff” for reasoning?
  • what’s a merge conflict between two interpretations?
  • what does it mean to rebase intent?

we invented branches, diffs, and reviews to collaborate on code.

if context is now the work, we’ll need equally good ways to collaborate on reasoning. who’s building this?

January 05, 2026

From directional steps to deep links

Kody with a K's Blog ·

4 min read

In the Cloudflare docs - as with every docs site - we have standards for how we write.

Specifically, we have standards for how we write navigational instructions. And - until recently - those standards were to write something like the following:

  1. Log in to the Cloudflare dashboard.
  2. Select your account and zone.
  3. Go to DNS > Records. …

We chose that standard a while ago because we didn’t want to assume that someone would already be inside the Cloudflare dash before navigating in. We also reasoned that navigational instructions would help users move around the dash more easily.

That strategy served us well as a team standard for a few years… but with one clear weakness.

In some situations — predominantly phased rollouts of features — our docs struggled to help our users:

  • Keeping one variation of instructions invariably confused the other group.
  • Keeping both variations of instructions confused everyone (and was way harder to maintain).

These struggles were an acceptable outcome for small rollouts, but we then had to support a major dashboard navigation overhaul.

It was shaping up to be a logistical and experiential nightmare.

In light of that challenge, we revived an option from our original discussion of our writing standards: deep links into the Cloudflare dashboard.

You see, the Cloudflare dashboard can actually navigate you pretty far just from a URL, so long as that URL follows something like the pattern below.

https://dash.cloudflare.com/?to=/:account/:zone/dns/records
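That pattern is mechanical enough to generate. Here’s a sketch of the kind of helper a component like Dash Button might wrap — the function name is mine, not the actual implementation. The `:account` and `:zone` placeholders are left as-is on purpose, because the dashboard resolves them itself after login.

```python
# Hypothetical helper for building dashboard deep links from a route.
# The placeholders (:account, :zone) are resolved by the dashboard itself.
DASH_BASE = "https://dash.cloudflare.com/?to="

def dash_deep_link(route: str) -> str:
    """Build a deep link into the Cloudflare dashboard from a route path."""
    if not route.startswith("/"):
        raise ValueError("route must start with '/'")
    return DASH_BASE + route

# The DNS records page from the pattern above:
link = dash_deep_link("/:account/:zone/dns/records")
assert link == "https://dash.cloudflare.com/?to=/:account/:zone/dns/records"
```

With a single audited list of routes feeding a helper like this, every button in the docs stays consistent by construction.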

We’d originally rejected this idea because we couldn’t get an auditable source of truth from the dashboard team. However, with the upcoming rollout (and the known struggles of working on phased rollouts), we revisited our requirements with them.

It took a little bit of negotiation, but because we were committing to so much work (thousands of updates), they agreed to help us and give us a list of routes for the dashboard.

We turned that source of truth into a simple component, Dash Button.

It takes the route as an input and gives you back a stylized button, meaning our instructions now looked a bit more like:

  1. In the Cloudflare dashboard, go to the WAF page:
    Go to WAF

As “simple” changes tend to go, this one took a metric f*ck-ton of work. Incredible kudos to the rest of the team here.

The results of this work were fourfold.

The phased rollout went along without anyone complaining about the docs, which is generally your sign of success for docs.

Also, so long as internal teams set up redirects on their paths, we could continue to support users without knowing where they were coming from. We didn’t need to know which condition someone was in; the dash team did (and could route them appropriately).

As of this writing, the Dash Button is one of our most popular components, used almost 1000 times across our docs.

Usage of dash button component

Man, do I love that we track usage numbers!

In a huge surprise to me, our docs got a shoutout during an open floor discussion with a few support folks.

They explicitly mentioned how having deep links throughout our docs prevented them from:

  • Having to work through the dashboard navigation themselves
  • Having to walk customers through the dashboard navigation

That last one was especially painful, they told us: “No no, click over on DNS. No, not WAF. DNS. Okay, then Records.” Much simpler to give a customer a link.

The other unexpected benefit was how much easier this has made it to audit (and then fix) changes in the dashboard nav.

Of course, we all know that you should let your writer know when something in the dashboard has changed… just as we all know that often doesn’t happen. Instead, you end up reacting to changes reported by angry customers.

But — when you have a source of truth to audit against — boom! You get PRs like this one, where we realize a bunch of those routes are wrong without someone having to stumble upon them.

GitHub PR updating dash routes

39 times we fixed something before someone reported it, booyah!

January 03, 2026

where good ideas come from (for coding agents)

Sunil Pai ·

3 January 2026

(and the part where users have to level up)

I’ve been thinking about why some people absolutely cook with coding agents, and some people bounce off them hard. I had a thought last week: if llms are “next token predictors” in the small (i.e., sentence finishers) then in the large they’re closer to “thought completers.” you give them a few crumbs of context, they infer the genre, then they sprint down the most likely path in idea-space. which makes “good prompting” feel less like magic words and more like navigation: you’re steering the model toward a region of the space where the next steps are both plausible and useful. I wanted a better map for that, so I used steven johnson’s “where good ideas come from” as a rubric (the seven patterns that reliably produce interesting ideas) and tried applying it to coding agents: where they’re naturally strong, where they reliably drift, and what a user has to supply (constraints, context, oracles, loops) to make the whole thing converge.

tl;dr: a plausible “week in the life” you can map onto your own codebase. the point is to make the user-adaptation story concrete: agents are excellent at adjacent-possible work, but they only become reliably useful when you supply constraints, context, an oracle, and a loop.


the idea-space metaphor (and what the seven ways add to it)

it’s tempting to picture an llm as navigating a huge multidimensional “idea-space”: your prompt lights up certain internal features, which reshapes the probability landscape of what comes next, and generation is basically a trajectory through that landscape. in that framing, context engineering is just steering - adding constraints, examples, and relevant artifacts so the model’s “next steps” stay in the neighborhood you care about. johnson’s seven ways are useful here because they explain which kinds of trajectories llms find naturally, and which ones require help: models are natively strong at smooth, local moves like the adjacent possible (small diffs, incremental refinements) and at platforms (interfaces, scaffolds, reusable primitives), and they can do exaptation well when you explicitly state affordances and constraints. they’re weaker where progress depends on reality pushing back - error and serendipity - unless you give them feedback channels like tests, benchmarks, traces, and experiments that create a gradient toward truth. and they only approximate liquid networks and slow hunches when you supply diverse “voices” (prior art, docs, debates) and persist ideas long enough to recombine later. the point isn’t that llms can’t roam the space; it’s that they need mechanisms that select and validate the paths worth taking.

quick sidequest: the seven ways

steven johnson’s “where good ideas come from” is one of those lists that sounds like it belongs on a poster until you use it as a diagnostic tool. here’s the version that matters for engineering:

  1. the adjacent possible - most “new” ideas are the next reachable step from what already exists. stairs, not teleportation.
  2. liquid networks - ideas show up when partial thoughts collide: people, yes, but also artifacts (docs, code, past debates).
  3. the slow hunch - many good ideas start half-baked. you keep them around until they meet the missing piece.
  4. serendipity - luck plus recognition; you notice the useful anomaly when it appears.
  5. error - failure is information; feedback turns wandering into convergence.
  6. exaptation - repurpose a thing built for one job into a different job. reuse as invention.
  7. platforms - stable primitives and standards let lots of people build lots of things faster and safer.

now: drop an llm coding agent into this picture. what changes?

my take: the seven patterns don’t go away. agents just amplify some of them and brutally expose where you’ve been relying on implicit human context for the others.

let’s walk through that with one running example.


the running example: “make webhook ingestion reliable” (totally plausible, not actually shipped)

imagine a webhook ingestion service:

  • handler validates signature
  • stores event
  • enqueues downstream job

and prod keeps reminding you that the world is adversarial:

  • partners retry aggressively → duplicates
  • downstream sometimes fails halfway → partial side effects
  • p99 latency is creeping up → every “fix” risks making tail worse

the goal, as a human would say it: reliable ingestion with idempotency and bounded retries, without making latency worse.

the goal, as an agent hears it: “write some code that sounds like reliability.”

that mismatch is the whole story.

so here’s the one-week simulation.


day 1: I ask for “reliability.” the agent gives me plausible nonsense.

the naive prompt is basically:

make webhook ingestion reliable. handle duplicates and retries. keep latency reasonable.

the agent does what continuation machines do when you hand them vibes: it fills in the blanks with the most likely reliability narrative it has seen before.

so it might invent a new “reliability module,” add a retry helper (even if your repo already has one), choose a payload-hash idempotency key because it sounds right, and sprinkle logging everywhere like it’s free.

and the code might be clean! which is the annoying part. because it can be clean and still wrong.

in this simulation, you catch three problems quickly:

  • payload hashes aren’t stable identifiers for retries in the real world
  • retries in the request handler are a p99 tax (and can trigger more retries, which is a fun kind of circular misery)
  • duplicating retry logic is how you end up with a repo that has “one retry policy per mood”

so you don’t merge it. you don’t argue with it. you just learn the lesson: if you ask an agent for a vibe, it will give you a vibe-shaped completion.


day 2: adjacent possible - I stop asking for outcomes and start asking for stairs.

this is the first user adaptation: take the big thing and turn it into rungs small enough to verify.

the staircase looks like:

  • step 1: idempotency at ingestion (no duplicate enqueue)
  • step 2: bounded retries in the worker (not the handler)
  • step 3: dead-letter path + replay
  • step 4: metrics that tell us if it’s working

then you create an oracle for step 1. not a paragraph. an actual check.

maybe it’s a test that says:

  • same (partner_id, event_id) arrives twice → only one enqueue happens
  • second request returns quickly and doesn’t redo expensive work
  • storage failure behavior is explicit (fail closed vs fail open is a choice, not an accident)
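that oracle can even be made executable. a minimal sketch, with hypothetical names and an in-memory set standing in for whatever datastore the real guard would use:

```python
# hypothetical idempotency guard; an in-memory set stands in for real storage.
class IdempotencyGuard:
    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()  # (partner_id, event_id) pairs

    def first_delivery(self, partner_id: str, event_id: str) -> bool:
        """True only the first time this (partner_id, event_id) is seen."""
        key = (partner_id, event_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

enqueued: list[tuple[str, str]] = []

def ingest(guard: IdempotencyGuard, partner_id: str, event_id: str) -> None:
    # duplicate deliveries return fast and never enqueue a second job
    if guard.first_delivery(partner_id, event_id):
        enqueued.append((partner_id, event_id))

guard = IdempotencyGuard()
ingest(guard, "acme", "evt_1")
ingest(guard, "acme", "evt_1")  # partner retry of the same delivery
assert len(enqueued) == 1  # same (partner_id, event_id) twice -> one enqueue
```

note that the storage-failure behavior from the list still has to be decided explicitly; this sketch only pins the duplicate-enqueue invariant.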

then the prompt becomes boring on purpose:

implement step 1 only. keep the diff small. don’t invent new abstractions. make these tests pass.

suddenly the agent looks competent again, because this is its strength: incremental diffs along a well-lit path.

the “adjacent possible” isn’t just a creativity concept; it’s also a safety concept. small rungs are harder to misunderstand.


day 3: liquid networks - I build a context packet so it stops inventing my codebase.

even with good decomposition, agents have a habit: they’ll “helpfully” create new mini-frameworks unless you force them to collide with your existing ones.

so you manufacture a liquid network.

not by dumping the whole repo, but by curating the collision points.

in this simulation, you assemble a tiny context packet:

  • the canonical retry policy already used elsewhere
  • your error taxonomy types
  • logging/metrics rules (especially what not to log)
  • the queue abstraction you must use
  • one prior PR that did retries correctly in your house style

and you tell the agent, explicitly, to reuse what exists:

for step 2, reuse <retry_policy_file>, follow <error_types_file>, and cite the prior art you’re copying. do not add new abstractions unless you justify them.

this is one of the weirdly satisfying moments in agent work: the output starts to look like it came from someone who has actually been in your codebase for a while.

liquid networks aren’t just social. they’re documentary. agents need the documentary version.


day 4: slow hunch - a real design question appears, and we don’t pretend it’s settled.

around now you hit the question you can’t solve with a patch:

do you ack the webhook only after downstream succeeds, or ack on ingestion and process asynchronously?

there are real trade-offs here: partner timeouts, retry behavior, your p99 budget, operational complexity, and what “correctness” means for side effects.

in this simulation you have a hunch, but not certainty:

ack quickly, but make downstream idempotent and observable; add replay; make partial failure survivable.

so you do the “slow hunch” move: you write that hypothesis down and you refuse to force closure yet.

then you ask the agent to help refine it without floating off into generic advice:

given our constraints (partner retries within ~5s, p99 target X, current failure modes), lay out the trade-offs. then propose one small experiment that reduces uncertainty.

the useful output isn’t the prose. it’s the experiment. you want something that creates evidence.

slow hunch becomes a workflow: capture partial ideas, propose tests, run tiny experiments, update the hunch log.

agents won’t incubate for you. but they’re quite good at helping you tend incubation.


day 5: serendipity - I feed it anomalies instead of asking it to “be creative.”

serendipity in software is rarely “brainstorming.” it’s “something weird happened in prod, and someone noticed.”

agents can help with the noticing part if you give them the weirdness.

so in this simulation you bring:

  • slow traces
  • error logs (sanitized)
  • a couple incident summaries
  • maybe support-ticket clusters

and you ask for something constrained:

cluster failure modes. tell me the weirdest pattern that might matter. for the top 3, propose a hypothesis and one targeted change or experiment to confirm/deny it.

now you’re engineering serendipity: exposure plus recognition.

you’re not asking for originality in a vacuum. you’re asking for hypotheses anchored in reality signals.


day 6: error - we make the feedback loop the main character.

this is the turning point where the whole thing stops feeling like promptcraft and starts feeling like engineering again.

the user imposes workflow constraints that force convergence:

  • no patch unless it serves an oracle (test, benchmark, lint rule, property check)
  • diffs must be small enough for a human to review in one sitting
  • after each change: run the suite
  • for reliability changes: add at least one failure-mode test, not just happy path

the agent’s job becomes a loop:

  • propose patch
  • run tests
  • observe failure
  • patch
  • repeat until green
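the loop is simple enough to sketch as a tiny driver. `propose_patch` and `run_oracle` below are hypothetical stand-ins for the agent and your test suite:

```python
# sketch of the day-6 rule: no patch "lands" unless the oracle goes green.
def converge(propose_patch, run_oracle, max_iters: int = 5):
    """Drive propose -> check -> feedback until the oracle passes."""
    feedback = None
    for i in range(1, max_iters + 1):
        patch = propose_patch(feedback)
        ok, feedback = run_oracle(patch)
        if ok:
            return patch, i  # green: the patch is allowed to land
    raise RuntimeError("oracle never went green; escalate to a human")

# toy example: the "agent" converges once it sees the failure message
def propose(feedback):
    return "retry in worker" if feedback else "retry in handler"

def oracle(patch):
    if patch == "retry in worker":
        return True, None
    return False, "p99 regression: retries must not run in the handler"

patch, iterations = converge(propose, oracle)
assert patch == "retry in worker" and iterations == 2
```

the bound on iterations matters as much as the loop itself: it’s what keeps a wandering agent from burning a day chasing a red suite.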

this is where people’s experiences diverge dramatically. teams with solid verification culture feel like they’ve gained leverage. teams without it feel like they’ve gained a chaos multiplier.

error isn’t a tax. it’s steering.


day 7: exaptation + platforms - we stop patching and extract primitives.

by day 7 you could plausibly have “fixed the problem” locally. fewer duplicates, bounded retries, DLQ, metrics.

but the meta-problem remains: you’ll build ingestion endpoints again. and you don’t want to rediscover the same lessons every time.

so you ask the platform question:

what are the smallest primitives we wish existed at the start of this week?

in this simulation you extract a small substrate:

  • an idempotency_guard(partner_id, event_id) helper with crisp semantics
  • one canonical retry policy implementation (and a rule: don’t invent another)
  • DLQ + replay workflow that’s operable by humans
  • a metrics schema that makes reliability legible (duplicate rate, retry rate, dlq depth, replay success)
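here’s a minimal sketch of what that one canonical retry policy might contain: bounded attempts, exponential backoff, full jitter. the name and numbers are illustrative, not prescriptive:

```python
import random

# sketch of a canonical retry policy: bounded, exponential, fully jittered.
def backoff_schedule(max_attempts: int = 5, base_s: float = 0.5,
                     cap_s: float = 30.0, rng=random.random):
    """Yield the sleep before each retry (the first attempt sleeps nothing)."""
    for attempt in range(1, max_attempts):
        exp = min(cap_s, base_s * (2 ** attempt))
        yield rng() * exp  # full jitter: uniform in [0, exp)

# deterministic check at the jitter upper bound (rng always returns 1.0)
delays = list(backoff_schedule(rng=lambda: 1.0))
assert delays == [1.0, 2.0, 4.0, 8.0]  # bounded: four retries, then give up
```

full jitter is the deliberate choice here: it stops a fleet of workers from retrying in lockstep and hammering the downstream at the same instant.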

then you do exaptation on purpose: reuse an existing outbox-ish or backoff-ish pattern already in the repo, but only after stating the affordances like physics:

  • we can tolerate at-least-once delivery, but side effects must be idempotent
  • we cannot add a new datastore
  • p99 at the edge is non-negotiable
  • no payload logging
  • rollback must be safe

with affordances named, reuse becomes safe and boring (the best kind). without them, reuse becomes clever and fragile.

finally you ask for the interface before the implementation:

design the primitives first. show how a future engineer adds a new handler using them. then implement one reference handler. keep APIs small. document invariants.

agents tend to do well here. scaffolding and boundary drawing are structured composition problems, and models are oddly strong at those… as long as you force them to respect your local laws.


what changed across the week wasn’t the model. it was the user.

in the simulation, the agent didn’t become smarter. the user became more explicit.

  • constraints moved from tribal knowledge to written laws
  • oracles became the interface (“make this test pass without breaking these invariants”)
  • context became curated rather than dumped
  • the loop became non-negotiable: small diffs, run checks, iterate

and once you do that, the seven ways start working with the agent rather than against you:

  • adjacent possible: stairs, not leaps
  • liquid networks: curated collisions with repo truth
  • slow hunch: persistent hypotheses, refined by evidence
  • serendipity: anomaly feeds turned into hypotheses
  • error: tests and checks as steering surfaces
  • exaptation: reuse, but only after affordances are named
  • platforms: extract primitives so next week is easier than this week

the practical punchline

agents make code cheaper. they do not make judgment cheap.

so the scarce skill becomes: expressing constraints, designing oracles, curating context, and running tight feedback loops. if you can do that, agents feel like leverage. if you can’t, they feel like accelerating into fog: fast, smooth, and directly toward the cliff.

epilogue: ok, but doesn’t this mostly work for seniors?

yeah, mostly.

this flow works particularly well for experienced engineers because they already carry the “implicit spec” in their heads:

  • the constraints you didn’t write down
  • the failure modes you only learn after being paged
  • the trade-offs you can smell
  • the verification reflex that turns “looks right” into “is right”

agents don’t supply that for free. they amplify whatever objective you actually manage to encode, which means seniors get outsized value early because they can encode better objectives, pick better oracles, and notice plausible-but-wrong output before it ships.

but juniors can gain the missing “context” faster in this world… if you restructure learning on purpose.

have them own the spec + constraints + non-goals + acceptance tests. let the agent draft implementation. then require them to:

  • iterate through the error loop (run ci, fix failures, explain what invariant broke)
  • support changes with “citations” to existing repo patterns
  • write a short “how this fails in prod + what to watch” note

the exact guardrails vary by company:

  • startups: keep it lightweight (small diffs, a couple tests, basic observability)
  • growth orgs: formalize playbooks and perf guardrails
  • big tech: emphasize blessed primitives and rollout discipline
  • regulated/safety-critical: shift juniors toward evidence and traceability with strong gates
  • consultancies: focus juniors on rapid context extraction and runnable harnesses

but the core idea is consistent: let agents accelerate implementation, while juniors are trained (and evaluated) on objective engineering, verification, and operational judgment, not keystrokes.


appendix: the context packet (a tight template)

a context packet is a small artifact that stops the agent (and reviewers) from guessing. it pins the objective, establishes what “truth” is, and installs an oracle so the work converges instead of meandering.

use it for anything non-trivial: reliability, perf, migrations, refactors, cross-cutting changes.

template (copy/paste)

goal (1 sentence):
what outcome are we trying to produce? (not the mechanism)

non-goals:
what is explicitly out of scope? (the “helpful creativity” kill-switch)

constraints / invariants:
the laws of physics: budgets, safety properties, compatibility rules, forbidden actions.
examples: p99 < __, idempotent under retries, no retries at edge, no pii logs, no new deps, backwards compatible.

authority order:
when sources disagree, what wins?
default: tests/ci > current code behavior > current docs/runbooks > old docs/lore.

repo anchors (3–10 links):
the files that define truth for this change: entrypoints, core helpers, types, config, metrics.

prior art / blessed patterns:
where should we copy from? what must we reuse? what must we avoid reinventing?

oracle (definition of done):
the checks that decide success: tests to add, edge cases, benchmarks, static checks, canary signals.

examples (if tests aren’t ready yet):
3–5 concrete input → expected output cases, including failure/edge cases.

risk + rollout/rollback:
how could this fail, what do we watch, how do we deploy safely, how do we undo?

agent instructions (optional, procedural):
keep diffs small; cite anchors/prior art used; don’t add abstractions without justification; run tests each step; stop after step N.


a filled example (webhook reliability)

goal: prevent duplicate downstream effects when partners retry the same webhook delivery.

non-goals: no new datastore; no partner-facing response changes; no retries inside the http handler; no large refactors.

constraints: idempotent under retries; p99 handler latency < X; worker retries bounded with jitter; no payload logging; feature flag + safe rollback.

authority order: tests/ci > code > runbooks > old docs.

repo anchors: handler, queue abstraction, retry policy module, error taxonomy, metrics/logging helpers.

prior art: link to the existing bounded-retry implementation and any prior ingestion endpoint that’s “done right.”

oracle: add tests (duplicate enqueues once; retries bounded; poison → dlq); run ci; run handler benchmark; canary and watch duplicate rate/latency/queue depth.

examples: duplicate request; storage timeout; poison payload; downstream transient failure.

risk/rollout: flag on at 1%; monitor key metrics; rollback by disabling flag.

agent instructions: implement step 1 only; reuse retry policy; keep diff reviewable; run tests; summarize invariants preserved.


why it works: it turns “senior intuition” into explicit constraints and executable truth. agents stop guessing, juniors learn faster, reviews become about invariants instead of vibes.

Agentic coding is the default

Kristian Freeman ·

At some point last year, agentic coding became the default in my workflow. This causes a lot of FUD for some developers. Will I forget how to code? When I learn new things, will I actually understand how they work? Honestly, I'm not sure of the answers to these questions. But the productivity gains are too incredible to ignore. I can kick off brand new projects and make sizeable progress in minutes; I can finish projects in hours. I can throw OpenCode (yep, I migrated from Claude Code in December) at a problem and Opus 4.5 will keep it cruising for 10-20 minutes without me. I can run _multiple_ OpenCode instances for multiple projects at once, and jump around doing some light context switching to ship at a pace that is probably 100x what I could have done before. I don't really know what that means for the future of software development. Things are clearly changing, and if you aren't interested or trying these tools, I think you're going to be left behind. Will my skills - the ones I spent over a decade building from scratch - become rusty? Probably. But does that matter when AI is consuming software, creating software, and becoming the way our tools interact with each other? Maybe not.

December 24, 2025

Lime's billing model is encouraging cyclists to run red lights

tk.gg ·


You're a cyclist at an intersection in London waiting for a red light. You're on a dockless hire bike. Every second you wait, you're paying for being conscientious.

Every junction you stop at, another cyclist chooses to run the light.

This isn't a bug. It's encoded into the design of the pricing. And it creates a dangerous incentive.

Try it yourself. The simulator below shows a route derived from Lime's own data on its most popular locations: Blackfriars Bridge to Oxford Street. Watch how much time can be spent waiting at lights, and how much that waiting could cost.

<LimeBikeSimulator client:only="react" precomputedRoute={route} />


Lime charges by the minute. That's a reasonable business model for a service where time correlates with value, but riders think about a journey in terms of distance. Time spent waiting at a red light isn't valuable to you; it's time spent obeying traffic law. And unlike the same situation in a taxi, the legal consequences of skipping a red light on a bike are inconsequential most of the time, and borne only by the rider. Stopping is now perceived far more as a social contract than as a legal obligation.

Here's the problem: with this model, running a red light saves money. The longer the light, the more you save. You can feel it every time you approach a junction and the light turns amber. Do you brake and pay to wait? Or do you push harder?
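
To put numbers on it, a back-of-envelope sketch in TypeScript; the per-minute rate here is an assumed illustrative figure, not Lime's published pricing:

```typescript
// Back-of-envelope cost of waiting at a red light under per-minute billing.
// PER_MINUTE_GBP is an assumed illustrative rate, not Lime's actual price.
const PER_MINUTE_GBP = 0.29;

function waitCostGbp(waitSeconds: number): number {
  return (waitSeconds / 60) * PER_MINUTE_GBP;
}
```

At that rate, a single 90-second red light costs a bit over 43p of pure waiting, and a commute with several long lights adds up quickly.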

Putting a monetary incentive on skipping that light is dangerous for pedestrians, for other cyclists, and for the riders themselves. And the danger multiplies with the weight of a 30kg electric bike underneath you.

Lime's billing model has turned riding safely into a premium feature that adds an extra 10-30% to the cost of your ride.

A trivial fix?

Lime bikes have GPS. They track your speed in real-time. They know when you're moving and when you're not. The data exists. You can even download it from your account.

Pausing billing when a rider's speed drops below a threshold for more than a few seconds would be straightforward. No new hardware. No new data. Just a different calculation in the billing logic. If the calculation requires full journey and bike telemetry, it could even be applied after the ride as a refund.
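
A sketch of what that different calculation might look like, in TypeScript; the sample format, stop threshold, and grace period are all assumptions for illustration:

```typescript
// Hypothetical billing calculation that pauses the meter during long stops.
// Sample shape, threshold, and grace period are assumed values, not Lime's.

interface Sample {
  t: number;        // unix seconds
  speedKph: number; // GPS-derived speed
}

const STOP_THRESHOLD_KPH = 2; // below this, treat the bike as stopped
const GRACE_SECONDS = 5;      // short stops (give-way lines etc.) still billed

function billableSeconds(samples: Sample[]): number {
  let billed = 0;
  let stoppedSince: number | null = null;

  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const dt = samples[i].t - prev.t;

    if (prev.speedKph < STOP_THRESHOLD_KPH) {
      stoppedSince ??= prev.t;
      // Stop the meter only once the stop outlasts the grace period.
      if (samples[i].t - stoppedSince > GRACE_SECONDS) continue;
    } else {
      stoppedSince = null;
    }
    billed += dt;
  }
  return billed;
}
```

With per-second GPS samples, the meter simply stops accruing once a stop outlasts the grace period, and resumes on the first moving sample.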

If Lime wanted this to apply only at traffic lights, this demo shows how open data makes that possible. If Lime wanted to prevent abuse, they could set a proportionate limit of 'free' stoppage time per trip, or change their model entirely to a combination of distance and time.

The choice to charge for stop time isn't a technical limitation. It's a product decision that prioritises revenue over rider safety.

In a quote to The Times, Hal Stevenson, director of policy at Lime, said that the “pricing model had a very low impact on whether people did or didn’t stop at red lights. We don’t subscribe to the idea that people are making these decisions about their safety, for the basis of 10 or 20p.”

But this came from research that Lime commissioned themselves in July. They launched a 'Respect the Red' campaign to 'encourage riders to stop at red lights', yet surely a crucial part of that campaign would be examining the behavioural effect of how they charge for time.

Hackney found a workaround

Hackney Council has partially addressed this. From October 2025, e-bike rides in the borough were capped at £1.75 for 30 minutes — the same price as a bus fare. The meter still runs, but the financial pressure disappears. You pay the same whether you sprint through junctions or wait patiently at every light, provided your journey is less than 30 minutes total.

It's a pragmatic fix that removes the incentive without requiring Lime to change their billing logic (they just sell 30-minute bundles for a cheaper price). But it only works within Hackney's boundaries, and it required council intervention to negotiate. The rest of London is still paying by the minute.[^1]
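
Hackney's cap can be modelled in a few lines of TypeScript. The metered rate and the behaviour past 30 minutes are assumptions here; only the £1.75 / 30-minute cap comes from the article:

```typescript
// Capped versus pure per-minute billing. PER_MINUTE_GBP and the over-cap
// behaviour are illustrative assumptions; the £1.75 cap is from the article.
const PER_MINUTE_GBP = 0.29;
const HACKNEY_CAP_GBP = 1.75;

function meteredFare(minutes: number): number {
  return minutes * PER_MINUTE_GBP;
}

function hackneyFare(minutes: number): number {
  // Within the 30-minute bundle the price is flat, so waiting at lights is free.
  if (minutes <= 30) return HACKNEY_CAP_GBP;
  // Assumed: over-cap time reverts to per-minute billing.
  return HACKNEY_CAP_GBP + (minutes - 30) * PER_MINUTE_GBP;
}
```

Under the cap, two rides of the same length cost the same whether the rider waits at every light or jumps them, which is exactly what removes the incentive.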

Try building your own commute and see how much Lime's tax on safety might cost you.

<LimeBikeSimulator client:only="react" height={600} />

<details class='mb-8 rounded bg-slate-400/10 p-4'> <summary>About this simulation</summary> This is a simplified model. Traffic lights are placed using OpenStreetMap data, which includes signals on main roads even where separate cycleways exist. Light timings are randomised and cycle predictably rather than responding to real-world factors like traffic density or time of day. The actual stop-time cost on your commute will vary. </details>

[^1]: All of the above also applies to Forest and Voi, who also charge per minute in most of London.

December 21, 2025