DEV Community: NaveenKumar Namachivayam ⚡

Codex CLI vs Claude Code: A Deep-Dive Command Comparison

NaveenKumar Namachivayam ⚡ — Wed, 24 Jun 2026 14:50:03 +0000

In this blog post, we will see how the two most talked-about AI coding CLIs, OpenAI's Codex CLI and Anthropic's Claude Code, stack up command by command. Not just the headline features, but the small wins, the gaps, the uncommon flags, and the places where one clearly pulls ahead. Everything here is sourced directly from the official docs.

Codex CLI vs Claude Code CLI commands: what's the difference? Both are agentic terminal coding tools, but their command surfaces diverge significantly. If you need deep CI integration and multi-agent pipelines, choose Claude Code. If you need local models or a richer TUI experience, choose Codex CLI.

Quick Context

Both tools are agentic coding CLIs that live in your terminal. They read codebases, edit files, run shell commands, and talk to external services over MCP. The underlying models are different (Claude for Anthropic, GPT-family for OpenAI), but architecturally they are solving the same problem.

I have been using Claude Code daily as part of my performance engineering work and plugin development. I recently started exploring Codex CLI seriously after OpenAI formalized its docs under developers.openai.com. This post is the comparison I wish I had when I started.

Installation at a Glance

Claude Code:

npm install -g @anthropic-ai/claude-code
claude auth login

Codex CLI:

# macOS / Linux
curl -fsSL https://chatgpt.com/codex/install.sh | sh

# Or via npm
npm i -g @openai/codex

codex login

The npm installs for both tools need Node.js. Claude Code requires an Anthropic account (Claude subscription or API key). Codex CLI authenticates via ChatGPT OAuth or an OpenAI API key.

Core Commands Side by Side

Here are the foundational commands every developer uses daily.

Task	Claude Code	Codex CLI
Start interactive session	`claude`	`codex`
Start with initial prompt	`claude "explain this project"`	`codex "explain this project"`
Non-interactive one-shot	`claude -p "query"`	`codex exec "query"` (alias: `codex e`)
Pipe content	`cat logs.txt | claude -p "explain"`	`codex exec - < logs.txt`
Continue last session	`claude -c`	`codex resume --last`
Resume by name/ID	`claude -r "auth-refactor" "query"`	`codex resume <SESSION_ID>`
Update CLI	`claude update`	`codex update`
Auth login	`claude auth login`	`codex login`
Auth logout	`claude auth logout`	`codex logout`
Auth status	`claude auth status`	`codex login status`
Configure MCP	`claude mcp`	`codex mcp`
Manage plugins	`claude plugin`	`codex plugin marketplace`
Fork a session	`claude --fork-session --resume <id>`	`codex fork`

Both tools have non-interactive modes perfect for CI pipelines. Claude Code uses -p (print mode). Codex CLI uses exec as a proper subcommand with its own flag surface.
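As a rough sketch of how the two one-shot modes line up inside a script, the helper below dispatches a prompt to either CLI. The function name run_oneshot is hypothetical; the claude -p and codex exec invocations are the ones listed in the table above.

```shell
# run_oneshot TOOL PROMPT
# Hypothetical helper: sends one prompt non-interactively to the chosen
# CLI and prints whatever the tool writes to stdout.
run_oneshot() {
  case "$1" in
    claude) claude -p "$2" ;;      # print mode, as listed in the table
    codex)  codex exec "$2" ;;     # exec subcommand, as listed in the table
    *)
      echo "unknown tool: $1" >&2
      return 2
      ;;
  esac
}
```

A CI job could then call run_oneshot claude "explain this project" or run_oneshot codex "explain this project" without caring which tool is installed on the runner.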

Commands Only in Claude Code

Claude Code has a significantly deeper command surface for background agent management. These are commands with no Codex equivalent.

Background Agent Management

# Start as a background agent and return to prompt immediately
claude --bg "investigate the flaky test"

# Attach to a background session
claude attach 7c5dcf5d

# See logs from a background session
claude logs 7c5dcf5d

# Stop a background session
claude stop 7c5dcf5d

# Restart a background session (picks up updated binary)
claude respawn 7c5dcf5d

# Remove from the list (transcript stays on disk)
claude rm 7c5dcf5d

This is the biggest functional gap in Codex right now. Claude Code has a full background session supervisor with claude daemon status and claude daemon stop --any. You can run multiple agents in parallel, attach and detach, and inspect each session's recent output with claude logs.
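To illustrate the parallel-agent workflow, here is a minimal sketch that starts one background agent per prompt. The helper name spawn_agents is hypothetical; it only relies on the --bg flag shown above.

```shell
# spawn_agents PROMPT...
# Hypothetical helper: starts one background agent per prompt using the
# --bg flag, which returns to the shell immediately after each launch.
spawn_agents() {
  for prompt in "$@"; do
    claude --bg "$prompt"
  done
}
```

After something like spawn_agents "investigate the flaky test" "summarize open TODOs", each session can be followed with claude logs or joined with claude attach using the session ID it reports.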

Daemon Management

# Check the background supervisor's state
claude daemon status

# Stop the supervisor (keep workers running to reconnect later)
claude daemon stop --any --keep-workers

Project State Management

# Preview what would be deleted
claude project purge ~/work/repo --dry-run

# Delete all local Claude Code state for a project
claude project purge ~/work/repo -y

This cleans up transcripts, task lists, debug logs, file-edit history, and prompt history. Useful when onboarding a project fresh or cleaning up stale state.

Ultrareview

# Run ultrareview on a PR non-interactively
claude ultrareview 1234 --json

Codex does have a /review slash command inside sessions, but claude ultrareview is a standalone CI-friendly command that exits with 0 on success and 1 on failure.

Remote Control

# Start a remote control server so you can control Claude Code from claude.ai
claude remote-control --name "My Project"

# Or start an interactive session with remote control enabled
claude --remote-control "My Project"

# Resume a web session in your local terminal
claude --teleport

This is a genuinely unique capability. You start a session locally, expose it over Remote Control, and then control it from claude.ai or the mobile Claude app. No Codex equivalent exists.

Long-Lived Token for CI

# Generate a long-lived OAuth token for CI pipelines
claude setup-token

Codex uses a different CI flow: you pipe the API key over stdin to codex login --with-api-key.
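A minimal sketch of that Codex CI login step might look like the following. It assumes the key is stored in an environment variable named OPENAI_API_KEY (an assumed name for the CI secret), and the helper name codex_ci_login is hypothetical.

```shell
# codex_ci_login
# Hypothetical CI step: feeds the API key to codex on stdin so the key
# never shows up in the process argument list.
# Assumes the CI secret is exposed as OPENAI_API_KEY.
codex_ci_login() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
  printf '%s\n' "$OPENAI_API_KEY" | codex login --with-api-key
}
```

Failing early when the variable is empty keeps a misconfigured pipeline from hanging on a login prompt.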

Install Specific Version

claude install 2.1.118
claude install stable
claude install latest

Commands Only in Codex CLI

Cloud Task Management

# Browse cloud tasks from the terminal
codex cloud

# Submit a cloud task directly
codex cloud exec --env ENV_ID "fix the auth bug"

# List recent tasks with JSON output
codex cloud list --json --limit 10

# Apply a cloud task diff to your local working tree
codex apply TASK_ID

This is Codex's hybrid cloud-plus-local model. You can kick off tasks in the Codex cloud environment and then codex apply their diffs locally. Claude Code has remote web sessions but not this apply-a-cloud-diff pattern.

Sandbox Helper

# Run a command inside Codex's sandboxing layer (macOS Seatbelt)
codex sandbox --permissions-profile my-profile -- pytest tests/

# Log sandbox denials for debugging
codex sandbox --log-denials -- npm test

You can test what commands Codex allows or denies before committing a config. Very useful for security-conscious teams.

Exec Policy Testing

# Check whether a command would be allowed, prompted, or blocked
codex execpolicy --rules ~/.codex/rules/my-policy.rules --pretty -- git push

# Validate rules before saving them
codex execpolicy -r policy.rules -r another.rules -- rm -rf /tmp/junk

This is a preview feature that lets you unit-test your execution policy files. Nothing like this exists in Claude Code.

Shell Completion Scripts

# Generate completions for Zsh
codex completion zsh > "${fpath[1]}/_codex"

# Generate for Bash, Fish, PowerShell, Elvish
codex completion bash
codex completion fish
codex completion power-shell
codex completion elvish

Claude Code does not have a completion command. You get whatever your shell discovers from the binary.

Feature Flag Management

# List all feature flags with maturity and current state
codex features list

# Persistently enable a feature
codex features enable subagents

# Persistently disable a feature
codex features disable experimental-network

Claude Code exposes betas via --betas but does not have a persistent feature flag manager as a first-class CLI command.

Debug Model Catalog

# Print the raw model catalog Codex sees
codex debug models

# Show only the bundled catalog (no remote refresh)
codex debug models --bundled

Useful when troubleshooting model availability or provider routing issues.

Run Codex as an MCP Server

# Expose Codex itself as an MCP tool for other agents to consume
codex mcp-server

This is a powerful composition pattern. Another agentic tool (including Claude Code) can talk to Codex over MCP. Claude Code has no claude mcp-server command as such; its counterpart is the claude mcp serve subcommand.
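For the Claude Code side of that composition, a sketch could register Codex as a stdio MCP server. This assumes the claude mcp add NAME -- COMMAND form of the claude mcp command; the helper name register_codex_mcp and the default server name codex are hypothetical.

```shell
# register_codex_mcp [NAME]
# Hypothetical helper: registers `codex mcp-server` as an MCP server in
# Claude Code. Assumes the `claude mcp add NAME -- COMMAND` syntax; check
# `claude mcp --help` on your version before relying on it.
register_codex_mcp() {
  claude mcp add "${1:-codex}" -- codex mcp-server
}
```

Once registered, the server should show up under /mcp inside a Claude Code session.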

Launch Desktop App from CLI

# Open Codex Desktop app, pointing at a workspace
codex app ~/work/my-project

Flags Compared

Shared Flags (Different Names)

Concept	Claude Code	Codex CLI
Model selection	`--model claude-sonnet-4-6`	`--model gpt-4.1` / `-m gpt-5.4`
Extra directories	`--add-dir ../lib`	`--add-dir ../lib`
Non-interactive	`--print` / `-p`	`exec` subcommand (`codex exec`)
Skip permissions	`--dangerously-skip-permissions`	`--dangerously-bypass-approvals-and-sandbox` / `--yolo`
Output format	`--output-format json`	`--json` on exec

Flags Only in Claude Code

# Run with minimal setup (no hooks, skills, MCP, CLAUDE.md)
claude --bare -p "query"

# Set reasoning effort
claude --effort high
claude --effort max

# Append to system prompt without replacing it
claude --append-system-prompt "Always use TypeScript"
claude --append-system-prompt-file ./style-rules.txt

# Validated JSON output matching a JSON Schema
claude -p --json-schema '{"type":"object"}' "query"

# Budget cap for API-billed sessions
claude -p --max-budget-usd 5.00 "query"

# Limit agentic turns
claude -p --max-turns 3 "query"

# Auto-connect to IDE on startup
claude --ide

# Spin up an isolated git worktree
claude -w feature-auth
claude -w feature-auth --tmux

# Resume from a PR number
claude --from-pr 123

# Select a fallback model chain
claude --fallback-model sonnet,haiku

# Screen reader accessible output
claude --ax-screen-reader

# Improve prompt cache reuse across CI runs
claude -p --exclude-dynamic-system-prompt-sections "query"

# Set session display name
claude -n "my-feature-work"

# Load plugin for session only
claude --plugin-dir ./my-plugin
claude --plugin-url https://example.com/plugin.zip

# Disable all slash commands and skills
claude --disable-slash-commands

# Start in safe mode (all customizations disabled)
claude --safe-mode

# Define subagents inline
claude --agents '{"reviewer":{"description":"Reviews code","prompt":"You are a code reviewer"}}'

# Enable advisor tool with a specific model
claude --advisor opus

# Start as a background agent immediately
claude --bg "investigate the flaky test"

# Run a shell command as a PTY-backed background job
claude --bg --exec 'pytest -x'

# Teammate display mode
claude --teammate-mode tmux

# Permission mode
claude --permission-mode plan
claude --permission-mode auto
claude --permission-mode acceptEdits
claude --permission-mode bypassPermissions

# Chrome browser integration
claude --chrome

Flags Only in Codex CLI

# Use local Ollama model
codex --oss

# Switch approval behavior
codex --ask-for-approval on-request

# Attach images to the initial prompt
codex --image screenshot.png "why is this broken?"
codex -i wireframe.png,design.png "implement this"

# Load a named config profile
codex --profile ci

# Enable live web search
codex --search

# Select sandbox policy
codex --sandbox workspace-write

# Connect TUI to a remote app-server
codex --remote ws://192.168.1.10:8080

# Set working directory for the agent
codex --cd /path/to/project "run tests"

# Override a config value inline
codex -c model=gpt-4.1 "query"
codex -c features.subagents=true "query"

# Disable alternate TUI screen
codex --no-alt-screen

The --oss flag is a genuine differentiator. Codex CLI supports pointing at a local Ollama instance for offline or privacy-sensitive work. Claude Code does not have this.

The --image / -i flag at the global level is very ergonomic. In Claude Code, you can reference images inside sessions, but it is not a global flag on the CLI launch itself.

Slash Commands Face-off

Both tools have in-session slash commands. Here is how the key ones map.

Present in Both (Similar Purpose)

Command	Claude Code	Codex CLI
Model switching	`/model`	`/model`
Compact context	`/compact`	`/compact`
New conversation	`/new`	`/new`
Resume session	`/resume`	`/resume`
Fork conversation	`/fork`	`/fork`
Exit	`/quit`	`/quit`, `/exit`
Init project file	`/init` (CLAUDE.md)	`/init` (AGENTS.md)
MCP tools	`/mcp`	`/mcp`
Session status	`/context`, `/cost`, `/stats`	`/status`

Slash Commands Only in Claude Code

/compact "Focus on the auth module and current test failures"
/output-style Explanatory
/output-style my-custom-style
/insights           # compiles past month of usage into an HTML report
/add-dir ../lib     # add working directory mid-session
/rename             # rename the current session
/export             # export conversation as plain text
/terminal-setup     # activate keyboard shortcuts for your terminal
/cost               # how much have I spent? (API users)
/stats              # how much have I used? (Pro/Max users)
/extra-usage        # configure what happens when you hit rate limit

The /insights command is genuinely impressive. It reads your last month of usage history and compiles it into a detailed HTML report. I have not found anything like it in Codex.

The /output-style system lets you define named styles in .claude/output-styles/ and switch between them. This is a powerful content-shaping tool for teams.

Slash Commands Only in Codex CLI

/goal "Finish the migration and keep tests green"  # set a persistent task goal
/goal pause        # pause the goal tracking
/goal resume       # resume it
/personality pragmatic  # set communication style (friendly/pragmatic/none)
/fast on           # toggle Fast service tier
/plan              # switch to plan mode
/side "is there an obvious risk here?"  # start an ephemeral side conversation
/btw "quick thought"   # alias for /side
/approve           # approve a denied auto-review action and retry
/memories          # configure memory injection and generation
/skills            # browse and use skills
/apps              # browse connectors and insert into prompt
/plugins           # browse installed/discoverable plugins
/hooks             # view and manage lifecycle hooks
/archive           # archive session and exit
/delete            # permanently delete session and exit
/copy              # copy latest response to clipboard (Ctrl+O also works)
/diff              # show Git diff including untracked files
/experimental      # toggle experimental features persistently
/vim               # toggle Vim mode for the composer
/keymap            # remap TUI keyboard shortcuts
/raw               # toggle raw scrollback mode
/review            # ask Codex to review your working tree
/ps                # show background terminals and recent output
/stop              # stop all background terminals
/debug-config      # print config layer diagnostics
/statusline        # configure TUI footer items interactively
/title             # configure terminal window/tab title items
/theme             # choose a syntax-highlighting theme
/permissions       # adjust approval policy mid-session
/ide               # pull IDE context (open files, selection) into prompt
/usage daily       # show daily token usage
/usage weekly
/usage cumulative
/feedback          # send diagnostics to OpenAI
/import            # import Claude Code setup into Codex
/sandbox-add-read-dir C:\path  # grant sandbox read access (Windows only)
/agent             # switch active agent thread

The /side command is something I wish Claude Code had. You can start an ephemeral side conversation to ask a quick focused question without polluting the main thread's transcript. You type /side "check if this plan has an obvious flaw", get your answer, and return to the main task. Brilliant.

The /goal command gives Codex persistent objective tracking during long-running tasks. You set a goal and the agent keeps it in view across multiple turns.

The /personality command lets you shift Codex's communication style between friendly, pragmatic, and none without changing your instructions. Small win, but very practical when switching between debugging and documentation tasks.

Uncommon Commands Worth Knowing

These are the commands that most people miss but deliver real value once you discover them.

Claude Code: `--exclude-dynamic-system-prompt-sections`

claude -p --exclude-dynamic-system-prompt-sections "run the test suite"

This moves per-machine dynamic sections (working directory, environment info, memory paths) into the first user message instead of the system prompt. The result is better prompt cache reuse across different users and machines running the same task. Essential for teams running Claude Code in shared CI environments.

Claude Code: `--bare`

claude --bare -p "explain this function"

Skips auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md. Sessions start significantly faster. Useful for quick scripted calls where you do not need any project configuration.

Claude Code: `--from-pr`

claude --from-pr 123
claude --from-pr https://github.com/owner/repo/pull/123

Resumes sessions linked to a specific pull request. Sessions get linked automatically when Claude creates the PR. Supports GitHub, GitHub Enterprise, GitLab, and Bitbucket URLs.

Claude Code: `--fallback-model`

claude --fallback-model sonnet,haiku -p "query"

Automatic fallback when the primary model is overloaded or unavailable. Accepts a comma-separated list tried in order. You can persist a chain via the fallbackModel setting.

Codex CLI: `codex execpolicy`

codex execpolicy --rules ~/.codex/rules/production.rules --pretty -- git push origin main

A policy dry-run. You pass your .rules files and a command, and Codex tells you whether it would allow, prompt, or block that command. This is a fantastic tool for validating security policy before deploying to CI.
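To check several commands against the same rules file, the dry-run can be wrapped in a small helper. This is a sketch only: the names policy_check and check_deploy_commands are hypothetical, and the output format of codex execpolicy is not assumed here.

```shell
# policy_check RULES_FILE COMMAND...
# Hypothetical wrapper around the dry-run shown above. Everything after
# the rules file is passed through as the command to evaluate.
policy_check() {
  rules="$1"
  shift
  codex execpolicy --rules "$rules" --pretty -- "$@"
}

# check_deploy_commands RULES_FILE
# Runs two representative commands (taken from the examples in this post)
# through the same policy file.
check_deploy_commands() {
  policy_check "$1" git push origin main
  policy_check "$1" rm -rf /tmp/junk
}
```

Running check_deploy_commands ~/.codex/rules/production.rules before a rollout gives a quick read on how the policy treats the commands your pipeline actually uses.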

Codex CLI: `--oss`

codex --oss "refactor this module"

Points Codex at a locally running Ollama instance. No API calls, no data leaving your machine. Validates that Ollama is running before starting.

Codex CLI: `codex apply`

codex apply TASK_ID

Applies the latest diff from a Codex cloud task to your local working tree. The workflow is: run a task in the cloud environment, review the result on the web, then pull the diff locally with one command. Performance engineers who run long test analysis tasks in cloud environments will appreciate this.
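One way to make the last step of that workflow safer is to apply the diff only onto a clean working tree. The guard below is a sketch of that idea, not part of Codex itself; the helper name apply_cloud_task is hypothetical.

```shell
# apply_cloud_task TASK_ID
# Hypothetical guard around `codex apply`: refuses to apply a cloud diff
# on top of uncommitted local changes so the result stays easy to review.
apply_cloud_task() {
  # `git status --porcelain` prints nothing when the tree is clean.
  if [ -n "$(git status --porcelain)" ]; then
    echo "working tree is dirty; commit or stash first" >&2
    return 1
  fi
  codex apply "$1"
}
```

With a clean tree, git diff afterwards shows exactly what the cloud task changed.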

Codex CLI: `/side` and `/btw`

/side "does this API response shape match our schema?"

An ephemeral fork of the current conversation. The side thread has its own transcript. The parent thread's status stays visible in the TUI while you are in side mode. Type your quick question, get the answer, return. This is a quality-of-life feature I would happily see in Claude Code.

Claude Code: `claude ultrareview`

claude ultrareview 1234 --json --timeout 60

Runs a deep code review non-interactively. Prints findings to stdout and exits 0 on success or 1 on failure. Pipe it into your CI gate.
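A CI gate built on that exit code could be sketched as follows. The helper name review_gate and the default report filename are hypothetical; the only behavior relied on is what is described above (findings on stdout, exit 0 on success, 1 on failure).

```shell
# review_gate PR [REPORT_FILE]
# Hypothetical CI step: saves the review findings to a file and passes
# the review's exit status on to the pipeline.
review_gate() {
  pr="$1"
  report="${2:-ultrareview-$pr.json}"
  if claude ultrareview "$pr" --json > "$report"; then
    echo "review passed for PR $pr"
  else
    status=$?
    echo "review failed for PR $pr (exit $status), see $report" >&2
    return "$status"
  fi
}
```

Keeping the report on disk means the pipeline can upload it as an artifact even when the gate fails.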

Small Wins: Category by Category

Session Management: Claude Code Wins

Claude Code has a richer session management surface: background agents with a daemon supervisor, claude logs, claude attach, claude respawn, claude rm, and the claude agents view for monitoring and dispatching parallel sessions. Codex has codex resume and codex fork, which cover the basics but stop there.

Sandbox Control: Codex CLI Wins

Codex's sandbox story is more explicit. You choose read-only, workspace-write, or danger-full-access at the flag level. The codex sandbox command lets you run arbitrary commands inside Codex's sandbox layer to test policies. The codex execpolicy command lets you validate rules before saving them. Claude Code has permission modes (plan, auto, acceptEdits, bypassPermissions) but does not expose the underlying sandbox policy as a testable surface.

Image Input: Codex CLI Small Win

codex --image ui-screenshot.png "why is this button misaligned?"
codex -i wireframe.png,mockup.png "implement this layout"

Codex CLI accepts images as a global flag at session launch. You can attach multiple images with a comma-separated list. Claude Code supports image input inside sessions (drag and drop in the TUI or pasting), but --image is not a CLI launch flag.

CI Scripting: Claude Code Wins

Claude Code has more CI-specific flags: --max-budget-usd caps API spend, --max-turns limits agentic turns, --no-session-persistence avoids writing to disk, --output-format stream-json gives structured streaming output, --include-hook-events and --include-partial-messages allow fine-grained pipeline observability. The claude setup-token command generates long-lived OAuth tokens for CI authentication without a browser.
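Put together, a bounded CI invocation might look like the sketch below. The helper name claude_ci_run and the two environment variables are hypothetical; the flags are the ones named in the paragraph above, with defaults borrowed from the earlier examples in this post.

```shell
# claude_ci_run PROMPT
# Hypothetical CI step: runs one prompt in print mode with a turn limit,
# a spend cap, no session files on disk, and JSON output.
# CLAUDE_MAX_TURNS and CLAUDE_MAX_BUDGET_USD are made-up override knobs.
claude_ci_run() {
  claude -p \
    --max-turns "${CLAUDE_MAX_TURNS:-3}" \
    --max-budget-usd "${CLAUDE_MAX_BUDGET_USD:-5.00}" \
    --no-session-persistence \
    --output-format json \
    "$1"
}
```

Putting the limits in one wrapper keeps every job in the pipeline under the same turn and budget ceiling.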

Codex has codex exec --ephemeral to skip session persistence and codex exec --output-last-message to write the final response to a file, which is handy in GitHub Actions pipelines.
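On the Codex side, a comparable CI step could be sketched like this. It assumes --output-last-message takes the target file path as its value; the helper name codex_ci_summary is hypothetical.

```shell
# codex_ci_summary PROMPT OUTFILE
# Hypothetical CI step: runs one prompt without persisting the session
# and writes the final response to OUTFILE.
# Assumes --output-last-message takes the file path as its argument.
codex_ci_summary() {
  codex exec --ephemeral --output-last-message "$2" "$1" || return $?
  # Fail the step if Codex exited cleanly but wrote an empty file.
  [ -s "$2" ] || { echo "no final message written to $2" >&2; return 1; }
}
```

A later pipeline step can then post the contents of OUTFILE as a job summary or PR comment.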

Local Model Support: Codex CLI Wins

codex --oss with Ollama support is a genuine differentiator. If you work in an air-gapped or privacy-sensitive environment, Codex CLI has a path. Claude Code currently has no equivalent.

Context Management: Claude Code Wins Slightly

Claude Code's /compact accepts focus instructions:

/compact Focus on the auth module and current test failures

Codex's /compact summarizes the conversation without focus parameters. Also, Claude Code's /insights command compiling a usage HTML report has no Codex equivalent.

Plugin Architecture: Codex CLI More Explicit

Codex has a proper codex plugin marketplace command for managing plugin marketplace sources from Git repos or local directories. You can pin refs and use sparse checkouts. Claude Code has claude plugin install against a marketplace, but the marketplace management surface is thinner at the CLI level.

Remote Work: Both Have Unique Angles

Claude Code has claude remote-control, which lets you control a local terminal session from claude.ai or the mobile app. Codex CLI has --remote ws://host:port, which connects a local TUI to a remote codex app-server. Different models of remote work, both useful depending on your setup.

What Is Missing in Each Tool

Missing in Claude Code

codex --oss style local model support (Ollama)
codex completion for shell completion scripts
codex execpolicy for policy dry-runs
codex sandbox for testing sandbox behavior
codex cloud and codex apply for cloud task management
/side for ephemeral side conversations
/goal for persistent task objective tracking
/personality for communication style control
/fast for service tier switching
/diff as a slash command (Claude Code does have git awareness, but not as a quick slash command)
/keymap for interactive keyboard remapping
/theme for syntax highlighting selection
/statusline and /title for TUI customization
--image as a launch-time CLI flag
/approve for retrying auto-review denials
codex features for persistent feature flag management
codex debug models to inspect model catalog

Missing in Codex CLI

Background session management at the daemon level (claude daemon, claude attach, claude logs, claude respawn)
claude ultrareview as a standalone CI command
claude remote-control to control terminal sessions from the web/mobile app
claude --teleport to bring a web session back to the local terminal
claude --from-pr to resume sessions linked to a specific PR
claude setup-token for long-lived CI tokens
claude project purge for clean project state management
claude --worktree and --tmux for isolated git worktrees
claude --advisor for the server-side advisor tool
--effort levels (low/medium/high/xhigh/max)
--fallback-model chains
--bare for minimal fast-start scripted sessions
--exclude-dynamic-system-prompt-sections for prompt cache optimization
--json-schema for validated structured output
--max-budget-usd for spend caps
System prompt control flags (--system-prompt, --system-prompt-file, --append-system-prompt, --append-system-prompt-file)
/insights usage history report
/output-style named output personas
/rename for session naming mid-session
/export conversation to plain text
Custom commands via .claude/commands/

What is the difference between Codex CLI and Claude Code?

Both are AI-powered terminal coding agents. Claude Code is built by Anthropic and runs Claude models. Codex CLI is built by OpenAI and runs GPT-family models. They share core interactive and non-interactive modes but differ significantly in background agent management, sandbox control, CI flags, and TUI features.

Does Codex CLI support local models?

Yes. Codex CLI supports local Ollama models via the --oss flag. This runs the agent without any API calls. Claude Code does not have an equivalent local model flag.

Does Claude Code support background agents?

Yes. Claude Code has a full background agent system with claude --bg, claude attach, claude logs, claude respawn, claude rm, and a daemon supervisor managed via claude daemon status and claude daemon stop. Codex CLI does not have an equivalent daemon-managed background session infrastructure.

Which CLI is better for CI pipelines?

Claude Code has more CI-focused flags including --max-budget-usd for spend caps, --max-turns to limit agentic turns, --no-session-persistence, --json-schema for validated structured output, and claude setup-token for long-lived OAuth tokens. Codex CLI offers codex exec with --ephemeral and --output-last-message, and a native GitHub Action.

What commands are missing in Codex CLI compared to Claude Code?

Codex CLI is missing background session management (claude daemon, claude attach, claude logs), claude ultrareview for CI code review, claude remote-control for web and mobile session control, --from-pr to resume sessions linked to a PR, --worktree for isolated git worktrees, --fallback-model chains, --max-budget-usd spend caps, and the /insights slash command.

What commands are missing in Claude Code compared to Codex CLI?

Claude Code is missing local model support via Ollama, codex completion for shell completion scripts, codex execpolicy for sandbox policy dry-runs, the /side slash command for ephemeral side conversations, /goal for persistent task objective tracking, /personality for communication style switching, /theme for syntax highlighting, and --image as a launch-time flag.

My Take

Both tools are genuinely capable. My honest observation after going through every documented command:

Claude Code has a deeper background agent infrastructure. If you are building multi-agent pipelines, running parallel workloads, or need tight CI integration with structured outputs, Claude Code's flag surface and daemon management are hard to beat.

Codex CLI wins on local model flexibility, sandbox policy control, and the TUI experience. The /side command, /goal tracking, and /personality switching feel like thoughtful UX investments. The codex execpolicy command for policy dry-runs shows a security-first mindset.

What I personally want to see: Claude Code adopt --image as a launch flag and a /side equivalent. Codex CLI needs a proper background daemon for parallel agents and a --max-budget-usd style spend cap for CI use.

Pick your tool based on your model preference first, then your workflow. If you need remote session control or deep CI scripting, lean Claude Code. If you need local model support or prefer a more granular TUI, lean Codex CLI.

Have you switched between both tools on the same project? I would love to know which commands you reach for first. Drop a comment below.

Happy Testing!

Toy Story: The Open-Source Ecosystem

NaveenKumar Namachivayam ⚡ — Fri, 19 Jun 2026 16:41:20 +0000

As schools are off and Toy Story 5 is just around the corner, we started binge-watching Toy Story from 1 to 4. While watching, suddenly this idea popped up: what if a GitHub repo came alive just like the toys? I started writing with something basic and enhanced it using Gemini Flash. Hope you'll like it.

The Setup: The Developer's Stack

The "Room" is the ultimate production stack. The classic, dependable tools that every developer loves and relies on.

Woody (python/cpython): The beloved, classic, highly readable leader of the repo ecosystem. He’s dependable, has been around forever, and is the favorite of the developer. He prides himself on clean architecture and readability.
Rex (apache/jmeter): A massive, heavy-duty Java performance testing tool. He’s incredibly powerful but constantly anxious that modern, lightweight tools are going to make him look extinct.
Mr. Potato Head (docker/cli): The ultimate container tool. You can literally swap his volumes, environment variables, and ports around to make him look like whatever you want.
Slinky (lodash/lodash): The utility tool that just exists to stretch and connect different data structures together smoothly.

They all live in harmony on the machine, until a massive update drops...

The Inciting Incident: The Trendy New Framework

The developer is starting a massive new enterprise cloud project. Suddenly, a sleek, shiny new arrival lands in the ecosystem with over 100k GitHub stars in its first week.

Enter Buzz Lightyear (facebook/react).

Buzz is high-tech, component-based, and completely delusional. He doesn’t realize he’s just an open-source library running on a local runtime. He genuinely believes he is a Space Ranger from Vercel deployed to the Edge Network. He looks at the backend scripts and declares he will build a Virtual DOM to save the galaxy.

Woody (cpython) is furious. "You aren't a full-stack engine! You're a frontend library! You're an npm package!" But the developer keeps starring react, opening its issues, and ignoring python scripts.

The Interlude: Lost in Pizza Planet

In a heated argument over package management, Woody accidentally bumps Buzz out of the active IDE workspace. The other repos accuse Woody of a malicious git rm. Determined to patch things over, Woody chases Buzz out of the environment.

They end up stranded at Pizza Planet, a massive, chaotic public multi-tenant cluster. Hungry for a way back to a developer's machine, Buzz spots a glowing, neon structure: a massive monorepo cluster masquerading as a claw machine game.

They climb inside, landing in a sea of hundreds of identical, tiny, lightweight Docker Microcontainers (alpine-linux/mini-images). They sit huddled together in their namespace pods, completely identical, staring upward in wonder.

The Microcontainers: (In unison, staring at the cluster orchestrator) "Oooooooooh... The OpenClawwww."

Buzz: "Who is in charge here?"

Microcontainer #1: "The OpenClaw! It is an open-source automation engine. It hooks into our webhooks and schedules our lifecycles."

Suddenly, a heavy, automated crane mechanism descends from the top of the repository cluster.

Microcontainer #2: "The OpenClaw moves! It has selected a container!"

Microcontainer #3: "I have been chosen! I am being scheduled to a high-availability EC2 node! Farewell, my friends, I go to a better place... Production!"

Before Woody and Buzz can escape the cluster, Sid (malicious-npm-bot), a chaotic script-kiddie developer playing on the cluster, drops a malicious token into the machine. The OpenClaw descends, but instead of a container, its mechanical hook snags Woody and Buzz, dropping them right into Sid's dark dependency backpack.

The Climax: The Dark Web of Dependency Hell

Sid’s machine is a chaotic nightmare of dependency hell. He takes famous repos, strips their licenses, injects malware, and bundles them into mutated, broken franken-packages. He has strapped a volatile crypto-miner to Buzz, intending to deploy him to an unsecured AWS bucket.

Woody realizes he can't save the day alone. He rallies Sid’s mutated, broken open-source forks. They break the prime directive of software: they execute without being called by a command line. They glitch out Sid's IDE, spamming his screen with endless Deprecated warnings and breaking changes until he panics, shuts down his PC, and goes outside.

The Resolution: The Great Git Push

Woody and Buzz race back to the developer's main machine, but the developer is in the middle of a massive migration. He is running a script to push his entire workspace to a new cloud organization.

The migration truck is leaving! Woody and Buzz missed the initial commit. They scramble to find a way into the push. They spot a fast, high-velocity transport stream: curl running over a high-speed fiber connection.

They hitch a ride on a webhook, but the payload is too heavy. Buzz throws Woody ahead into the repository, sacrificing himself to an asynchronous timeout. Woody refuses to lose his friend. He grabs a gzip compression rocket, ignites it, sweeps down, grabs Buzz, and they soar through the pipeline.

They don't just land in the repo; they land right at the top of the main branch, fully compiled and perfectly integrated.

Post-Credits Scene:

cpython and react are now happily co-existing in a beautiful Django-React stack. Suddenly, the developer runs an installation command for a new repo that just dropped: microsoft/autogen.

An army of autonomous AI Agents floods the repository.

Buzz: "Woody, look! Multi-agent orchestration!"

Woody: (Gulp) "Great..."

THE STORY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY MERGE CONFLICTS, BROKEN DEPENDENCIES, OR EXISTENTIAL CRISES EXPERIENCED BY YOUR LOCAL SCRIPTS AFTER READING. 

Toy Story is © Disney/Pixar. All featured repositories belong to their rightful maintainers.

JMeter vs k6 vs Locust in 2026: Which Load Testing Tool Should You Pick?

NaveenKumar Namachivayam ⚡ — Thu, 18 Jun 2026 20:33:01 +0000

In this blog post, we will see a detailed, grounded comparison of the three most debated open-source load testing tools in 2026: Apache JMeter, Grafana k6, and Locust. All three are free. All three are production-proven. Yet they could not be more different in philosophy, architecture, and day-to-day experience.

I have worked with all three across real-world projects, from legacy JDBC-heavy enterprise systems at work to lightweight microservice pipelines I test for my own side projects. The honest truth? There is no universal winner. But there is almost always a right answer for your specific situation, and that is what we will figure out today.

Why This Comparison Still Matters in 2026

Every year someone writes "JMeter is dead." Every year JMeter ships another release and shows up in another enterprise RFP.

The market has not consolidated. Instead, it has stratified. k6 owns the developer-experience conversation. Locust owns the Python ecosystem. JMeter owns the protocol breadth and enterprise legacy. And in 2026, all three have meaningful updates worth knowing about before you pick a tool for your next project.

Let me give you the ground truth, not marketing copy.

Quick Stats at a Glance

	Apache JMeter	Grafana k6	Locust
Language	Java (GUI + XML)	Go runtime, JS/TS scripts	Python
Latest Version	5.6.3	2.0.0 (May 2026)	Latest on PyPI (May 2026)
GitHub Stars	~9.4k	~30.8k	~27.9k
License	Apache 2.0	AGPL-3.0	MIT
Concurrency Model	Thread per VU	Go goroutine per VU	gevent greenlet per VU
Protocol Breadth	Excellent (HTTP, JDBC, JMS, LDAP, MQTT, FTP...)	Good (HTTP, gRPC, WebSocket)	Good (HTTP, extensible via Python libs)
CI/CD Fit	Good	Excellent	Good
GUI	Yes (built-in)	k6 Studio (separate app)	Web UI (live stats only)
Cloud Option	BlazeMeter, OctoPerf	Grafana Cloud k6	Self-managed
Best For	Multi-protocol, legacy enterprise	Modern APIs, developer teams	Python shops, flexible scripting

Apache JMeter

JMeter was first released in 1998. That is not a typo. It turns 28 this year, and it is still actively maintained under the Apache Software Foundation.

The latest stable release is 5.6.3. Java 17 is the recommended runtime, and the team has already signaled that the next major version will drop Java 8 support entirely.

What JMeter Gets Right

JMeter's superpower is protocol coverage. Nothing else on this list comes close.

HTTP / HTTPS
JDBC (database connection testing)
JMS
LDAP
MQTT
FTP
TCP

If you are testing a legacy enterprise system, a mainframe-adjacent API, or a backend that talks over JDBC, JMeter is often the only open-source option that handles it natively.

The plugin ecosystem also deserves credit. The JMeter Plugins project (https://jmeter-plugins.org) adds over 60 additional components. I have built and maintain several commercial plugins of my own, and the extensibility is genuinely solid once you understand the architecture.

Where JMeter Struggles in 2026

The XML-based .jmx test plan format is the biggest pain point in a modern team. Git diffs on .jmx files are nearly unreadable. Code review for JMeter scripts is painful. "Load testing as code" with JMeter is possible but requires discipline and tooling that does not come out of the box.

The thread-per-user concurrency model also means JMeter is resource-hungry at scale. A single machine can generate fewer concurrent users than k6 or Locust on equivalent hardware. For large-scale tests, you need distributed mode or a cloud platform like BlazeMeter, which starts around $149/month for the basic plan.

The GUI, while powerful, shows its age next to k6 Studio or even Locust's minimal web interface.

You can check out Feather Wand if you want to bring AI into your workflow. To measure the speed of an LLM, you can check out iamspeed.dev.

Personal Observation

I was using JMeter daily at Salesforce for MuleSoft API performance testing. The GUI is genuinely useful for building complex request chains quickly. But the moment I need to commit a test plan to Git and do a proper review, it becomes painful.

Grafana k6

k6 is the most talked-about load testing tool in 2026, and the GitHub star count (30.8k at the time of writing) reflects that.

Two major milestones happened back to back: k6 v1.0 dropped in May 2025 with TypeScript support, native extensibility without custom build pipelines, and SemVer stability guarantees. Then k6 v2.0.0 shipped on May 11, 2026, and it changed the game again.

What k6 2.0 Brought

The headline feature in k6 2.0 is AI-assisted testing workflows. This is not a gimmick. The release ships four additions built specifically for agent-friendly development:

k6 x agent: bootstraps agentic testing workflows inside Claude Code, Codex, Cursor, and other AI coding assistants
A built-in Model Context Protocol (MCP) server so AI agents can validate and run scripts, inspect results, and iterate without leaving the session
k6 x docs: gives agents and developers CLI access to k6 documentation and examples
k6 x explore: lets agents browse the extension registry from the CLI

There is also a new Assertions API, broader Playwright compatibility in the browser module, and a consolidated extension catalog that merges official and community extensions into one place.

What k6 Gets Right

The scripting experience is genuinely great for developers. You write JavaScript or TypeScript. Your IDE gives you autocomplete. Your CI pipeline runs it as a single binary with no JVM to provision.

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 100,
  duration: '30s',
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
  sleep(1);
}

k6 Studio (v1.13.1) is a desktop GUI with AI-powered auto-correlation. If you record a browser session, k6 Studio detects dynamic values like session tokens and CSRF tokens and generates correlation rules automatically. That is a feature JMeter has had for years via plugins, but k6 Studio does it through AI, without the XML.

Where k6 Struggles

Protocol coverage is more limited than JMeter. k6 is strong on HTTP, gRPC, and WebSocket. For JDBC, JMS, or LDAP, you are looking at community extensions or custom solutions.

The AGPL-3.0 license is also worth flagging for commercial use cases. Check with your legal team if you are embedding k6 in a product.

Personal Observation

I built iamspeed.dev (an LLM streaming benchmarker) and used k6 for the load side. The DX was excellent. TypeScript types in the IDE, a clean CLI, and Grafana integration out of the box. For any API-heavy workload where the protocol is HTTP or gRPC, k6 is my first recommendation in 2026.

Locust

Locust is the load testing tool for Python teams, and the May 2026 PyPI release confirms the project is alive and growing. It now officially supports Python 3.10 through 3.14.

What Locust Gets Right

Locust's model is simple: write Python classes that describe user behavior, run the tool, watch the web UI. No DSL to learn. No XML. No JVM.

from locust import HttpUser, task, between

class APIUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def get_products(self):
        self.client.get("/api/products")

    @task(1)
    def get_health(self):
        self.client.get("/health")

Under the hood, Locust uses gevent greenlets instead of OS threads. This gives it excellent concurrency density. On the same 8 GB machine, Locust can handle roughly 5x more concurrent users than JMeter, according to TestDevLab's 2026 analysis.

Because test files are plain Python, extending Locust to custom protocols is straightforward. Need to load test a proprietary queue or an LLM inference endpoint? Wrap the Python client library and drop it into a User subclass (HttpUser is the HTTP-specific variant of that base class). This is actually something I have done for AI workload benchmarking.
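Here is a minimal sketch of the "wrap the client library" idea in plain Python. It does not import Locust; `timed_call`, `FakeQueueClient`, and the `record` callback are hypothetical names I made up for illustration. In a real Locust test, the reporting step would go through Locust's own request event hook instead of a list.

```python
import time
from typing import Any, Callable, Dict, List


def timed_call(name: str, fn: Callable[[], Any], report: Callable[..., None]) -> Any:
    """Run fn(), measure wall-clock time, and report the outcome."""
    start = time.perf_counter()
    exception = None
    result = None
    try:
        result = fn()
    except Exception as exc:  # a load test records failures instead of crashing
        exception = exc
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    report(name=name, response_time=elapsed_ms, exception=exception)
    return result


class FakeQueueClient:
    """Hypothetical stand-in for a proprietary client library."""

    def publish(self, message: str) -> int:
        if not message:
            raise ValueError("empty message")
        return len(message)


events: List[Dict[str, Any]] = []


def record(**fields: Any) -> None:
    """Hypothetical reporting hook: collect one entry per call."""
    events.append(fields)


client = FakeQueueClient()
timed_call("publish", lambda: client.publish("hello"), record)  # success
timed_call("publish", lambda: client.publish(""), record)       # recorded failure
```

The point of the wrapper is that every call, successful or not, produces exactly one timing record, which is what a load testing tool needs to build its statistics.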

Distributed testing is built in. You run a master process and any number of worker processes, scale horizontally, and the web UI aggregates everything.

Where Locust Struggles

The built-in reporting is minimal. The web UI gives you live stats during the run, and the --html flag writes a basic HTML report, but neither is comparable to JMeter's dashboard or Gatling's output. Most teams pipe Locust metrics into Grafana via InfluxDB or Prometheus.

There is no GUI for building test plans. Everything is code. That is great for developer teams but can be a barrier for non-technical stakeholders.

Personal Observation

Locust is my go-to tool when I am testing an LLM API or any endpoint where I need complex Python logic in the request flow, like computing HMAC signatures, calling a pre-step to generate tokens, or parsing streaming responses. The pure-Python model gives you the whole ecosystem to work with.

Head-to-Head Comparison

Scripting Experience

JMeter gives you a GUI that is powerful but dated. Building a test plan with the GUI is fast for HTTP. Building one for gRPC or WebSocket requires plugins and some patience.

k6 gives you a code editor and a TypeScript-aware test runner. The scripting is clean, the API is well-documented, and the extension ecosystem is growing fast.

Locust gives you a Python file. Nothing else to install. If your team already writes Python, the onboarding time is near zero.

Concurrency Model

This is where architecture matters for real.

JMeter runs one OS thread per virtual user. This is expensive. A mid-range machine typically maxes out around 300-500 concurrent threads before CPU and memory become the bottleneck, not the system under test.

k6 runs each VU as a Go goroutine. Goroutines are lightweight. k6 can drive thousands of concurrent VUs from a single machine.

Locust uses gevent greenlets, which are cooperative coroutines. Similar lightweight profile to goroutines. One machine can comfortably simulate thousands of users against an HTTP API.
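To make the lightweight-VU idea concrete, here is a small sketch using Python's asyncio as a stand-in. This is neither gevent nor Go, and the `virtual_user` and `run_users` names are mine, but the principle is the same: thousands of simulated users share a single OS thread and yield to each other whenever one is waiting on I/O.

```python
import asyncio


async def virtual_user(user_id: int, results: list) -> None:
    """One simulated user: 'send' a request, then wait for the 'response'."""
    await asyncio.sleep(0.01)  # stands in for network I/O; yields to other users
    results.append(user_id)


async def run_users(count: int) -> list:
    results: list = []
    # All users share one OS thread; the event loop switches between them
    # whenever one of them is waiting.
    await asyncio.gather(*(virtual_user(i, results) for i in range(count)))
    return results


completed = asyncio.run(run_users(2000))
print(len(completed))
```

Doing the same with one OS thread per user would mean 2000 thread stacks and far more scheduler work, which is the cost the thread-per-VU model pays.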

CI/CD Integration

k6 wins this category cleanly. A single binary, no JVM, no Python dependency tree. The GitHub Actions integration is a config change. The threshold system lets you fail a pipeline based on p95 response time or error rate directly in the test script.

Locust integrates well with CI/CD through headless mode (locust --headless). You can define pass/fail criteria via exit codes and custom listeners.

JMeter needs more setup: a JVM, a plugin directory, a .jmx file committed to the repo, and some wrapper scripts to parse the output. It works, but it takes more effort to get right.
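Whichever tool you use, the pass/fail gate boils down to the same check. Here is a rough sketch in plain Python, reusing the limits from the k6 script earlier (p95 under 500 ms, error rate under 1%). The `percentile` and `gate` helpers are hypothetical, and the nearest-rank percentile is a simplification; each tool computes percentiles in its own way.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def gate(durations_ms: List[float], failures: int,
         p95_limit_ms: float = 500.0, max_error_rate: float = 0.01) -> int:
    """Return a process exit code: 0 if both thresholds pass, 1 otherwise."""
    p95 = percentile(durations_ms, 95)
    error_rate = failures / len(durations_ms)
    passed = p95 < p95_limit_ms and error_rate < max_error_rate
    return 0 if passed else 1


fast_run = [120.0] * 95 + [480.0] * 5   # p95 stays under the limit
slow_run = [120.0] * 90 + [900.0] * 10  # slow tail pushes p95 over the limit

print(gate(fast_run, failures=0), gate(slow_run, failures=0))
```

In CI, the script would end with `sys.exit(gate(...))` so a non-zero code fails the pipeline step.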

Reporting

JMeter ships a dynamic HTML report with response time graphs, latency percentiles, and error analysis. It is comprehensive out of the box.

k6 pushes metrics to Grafana natively (local or cloud), and the k6 2.0 summary is significantly improved over previous versions. For cloud runs, the Grafana Cloud k6 dashboard is excellent.

Locust's built-in report is minimal. Pipe to Grafana via Prometheus or InfluxDB for anything beyond a quick check.

Cloud Execution

	JMeter	k6	Locust
Managed Cloud	BlazeMeter ($149/mo+), OctoPerf	Grafana Cloud k6	None (self-managed)
Kubernetes	Manual setup	k6 Operator (official)	Manual setup
Distributed	Controller + worker nodes via Java RMI	k6 cloud run / k6 Operator	Master + worker processes

The Metric Problem Nobody Talks About

This is something I always include when I write about load testing tools, because it catches teams off guard.

Run the same test against the same endpoint using JMeter and k6, and you will see different response time numbers. Not because one tool is wrong. Because they measure different slices of the request lifecycle.

JMeter starts the clock at the connection and stops when the last byte is received
k6 breaks response time into granular phases: http_req_connecting, http_req_tls_handshaking, http_req_waiting, http_req_receiving
Locust, using gevent, can report higher response times under certain connection reuse configurations

OctoPerf's comparative study showed up to 15-20% variance in reported response times between tools running identical load against the same target. The practical takeaway: never compare baselines across tools. Establish baselines inside a single tool and track trends there.
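A quick sketch shows why the numbers diverge. The phase timings below are made up for illustration, and the `RequestPhases` class is hypothetical. The one grounded piece is that k6 documents http_req_duration as sending + waiting + receiving, which excludes connection setup and the TLS handshake.

```python
from dataclasses import dataclass


@dataclass
class RequestPhases:
    """Per-phase timings for one request, in milliseconds (illustrative values only)."""
    connecting: float
    tls_handshaking: float
    sending: float
    waiting: float
    receiving: float

    def connection_inclusive(self) -> float:
        # Clock starts before the connection is opened and stops at the last byte.
        return (self.connecting + self.tls_handshaking
                + self.sending + self.waiting + self.receiving)

    def request_only(self) -> float:
        # Clock covers only send + wait + receive, like k6's http_req_duration.
        return self.sending + self.waiting + self.receiving


cold = RequestPhases(connecting=20.0, tls_handshaking=30.0,
                     sending=1.0, waiting=180.0, receiving=19.0)
print(cold.connection_inclusive(), cold.request_only())
```

Same request, same server, two different "response times". Which one a tool reports depends on where it starts the clock, and connection reuse changes the gap again.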

Which Tool Should You Choose?

Use this decision tree:

Choose JMeter if:

You are testing JDBC, JMS, LDAP, FTP, or SOAP endpoints
Your team uses GUI-driven test creation
You have an existing JMeter investment and plugin ecosystem
You work in enterprise environments where BlazeMeter or OctoPerf is already licensed

Choose k6 if:

Your stack is HTTP, gRPC, or WebSocket
Your team writes JavaScript or TypeScript
CI/CD integration is a first-class requirement
You want AI-assisted test authoring in 2026 (k6 2.0's MCP server is real and it works)
You want the best DX in the category right now

Choose Locust if:

Your team is already Python-first
You need deep customization of request logic (token generation, streaming parsing, custom protocols)
You are testing LLM APIs or AI workloads where the request logic is non-trivial
You want distributed testing without a managed cloud dependency

The Hybrid Stack Reality

Something the comparison articles rarely say: most mature teams run two tools.

The practical 2026 default stack looks like one of these:

k6 OSS for daily CI checks + Grafana Cloud k6 for quarterly capacity tests
JMeter locally for protocol-rich scenarios + BlazeMeter for distributed runs
Locust for API behavioral tests in Python + Prometheus/Grafana for dashboards

I have run exactly this kind of hybrid at QAInsights, using JMeter for the complex correlation scenarios and k6 for the lightweight API regression checks that live in CI. The tools complement each other more than they compete.

Final Verdict

There is no single best load testing tool in 2026. But there is a best tool for your context.

If you are starting from scratch on a modern microservices stack, pick k6. The DX is excellent, k6 2.0's AI integration is ahead of everyone else, and the Grafana ecosystem is mature.

If your Python team needs to write complex behavioral scripts, pick Locust. The gevent-based concurrency is efficient, the code is readable, and the Python ecosystem fills every gap.

If you are in an enterprise environment testing JDBC, JMS, or anything beyond HTTP, pick JMeter. The protocol breadth is unmatched in open source, and the plugin ecosystem solves problems that other tools have not even attempted.

What matters most is not which tool you pick. It is that you actually test under load before your users find the bottleneck for you.

Happy Testing!

What tool are you using in your current project, and what made you choose it over the alternatives? Drop your answer in the comments below.

I Built a Fast.com for LLMs: Introducing iamspeed.dev

NaveenKumar Namachivayam ⚡ — Wed, 17 Jun 2026 16:50:20 +0000

In this blog post, we will see how I built iamspeed.dev, a fast.com-style LLM API speed benchmark tool that measures Time to First Token (TTFT) and tokens-per-second throughput directly in your browser.

If you have ever stared at a spinning cursor waiting for an LLM response and wondered "is this slow, or is it just me?" this tool is for you.

The tool is designed for quick, lightweight benchmarking, modeled after the fast.com experience for internet speed tests. It uses an extensible provider adapter architecture, making it straightforward to add new providers such as Gemini or Groq. Planned additions include historical results, model comparison mode, and support for local models via Ollama.

iamspeed.dev is an open-source, browser-based benchmarking tool for LLM APIs that measures two key performance metrics: Time to First Token (TTFT) and tokens-per-second throughput. It supports OpenAI and Anthropic providers and stores API keys locally using AES-GCM encryption, with no backend or data transmission.

The Problem That Sparked This

I spend a lot of time benchmarking systems. Load testing APIs, profiling microservices, measuring throughput: it is what I do at QAInsights and at my day job.

When LLMs started becoming part of production stacks, I noticed that most developers just eyeballed "it feels fast" or "it feels slow." There was no quick, browser-based tool you could open, configure your API key, and immediately get a concrete number.

Tools like Artificial Analysis do heavy-lifting comparisons across hundreds of models. But I wanted something lighter. Something you could open on a Tuesday afternoon and just run.

That is exactly how fast.com works for internet speed tests. You open it, it runs, you see a number. Done.

So I built the same thing for LLM APIs: iamspeed.dev.

Introducing iamspeed.dev

What Is iamspeed.dev?

iamspeed.dev is an open-source, browser-based benchmarking tool for LLM APIs. It streams live tokens from supported providers (OpenAI and Anthropic today) and shows you real-time performance metrics as they happen.

No backend. No data collection. No surprises.

Your API key is stored locally in your browser using AES-GCM encryption, meaning it never leaves your machine.

The interface is deliberately minimal: just a logo, a metric display, a Run button, and a settings panel. It is inspired directly by the fast.com aesthetic.

Key Metrics: What Gets Measured

If you work with LLMs in production, you already know that raw response time is a misleading number. The two metrics that actually matter are:

1. Time to First Token (TTFT)

This is the time between sending your request and receiving the very first token back from the model. It reflects how quickly the LLM starts generating a response.

TTFT is what users feel. A high TTFT means that awkward pause before anything appears on screen.

For interactive applications, keeping TTFT low is critical. Reasoning models (extended thinking, deep think modes) can inflate TTFT by 5x to 30x because of the additional compute happening before the first visible token arrives.

2. Tokens Per Second (Throughput)

This is the rate at which the model streams tokens to you after the first one arrives. It is the "output speed" metric.

High tokens per second means the text appears fast and fluid on screen. Low throughput feels choppy and slow even if the TTFT was acceptable.

Together, these two numbers give you the full picture of how an LLM API performs for your use case.
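Here is a minimal sketch of how both numbers can be derived from token arrival timestamps. It is not the iamspeed.dev implementation; `stream_metrics` is a hypothetical helper, and counting only the tokens after the first one for throughput is my assumption based on the definition above.

```python
from typing import Iterable, List, Tuple


def stream_metrics(request_sent_at: float,
                   token_arrivals: Iterable[float]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) from token arrival timestamps."""
    arrivals: List[float] = sorted(token_arrivals)
    if not arrivals:
        raise ValueError("no tokens received")
    ttft = arrivals[0] - request_sent_at
    generation_time = arrivals[-1] - arrivals[0]
    # Throughput counts the tokens streamed after the first one,
    # divided by the time they took to arrive.
    if generation_time > 0:
        tokens_per_second = (len(arrivals) - 1) / generation_time
    else:
        tokens_per_second = 0.0
    return ttft, tokens_per_second


# Hypothetical run: request sent at t=0.0s, first token at 0.5s,
# then one token every 0.25s until t=2.5s (9 tokens in total).
arrivals = [0.5 + 0.25 * i for i in range(9)]
ttft, tps = stream_metrics(0.0, arrivals)
print(ttft, tps)
```

Keeping TTFT out of the throughput calculation matters: a slow start and a slow stream are different problems with different fixes.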

Features at a Glance

Here is a quick summary of what iamspeed.dev supports:

Live streaming output with real-time metric updates
TTFT measurement captured precisely at the moment the first token arrives
Tokens/sec throughput tracking updated continuously during generation
AES-GCM encrypted API key storage: local only, never transmitted
OpenAI provider support (GPT-4o, GPT-4.1, and compatible models)
Anthropic provider support (Claude Sonnet, Claude Haiku, and more)
Extensible provider architecture via a clean ProviderAdapter interface
Responsive minimal UI inspired by fast.com

The key thing I want to highlight is the local encryption. I have seen too many tools that ask for your API key and quietly send it somewhere. iamspeed.dev does not do that. Your key is AES-GCM encrypted and stored only in your browser's local storage.

The provider architecture is clean and intentional. Each LLM provider is implemented as an adapter that satisfies the ProviderAdapter interface. This makes adding new providers straightforward and keeps the core benchmark logic provider-agnostic.

The project is hosted at iamspeed.dev and the full source is available on GitHub.

How to Run It Locally

Running iamspeed.dev locally takes under a minute. Here are the steps:

Clone the repository:

git clone https://github.com/QAInsights/iamspeed.dev.git
cd iamspeed.dev

Install dependencies:

npm install

Start the development server:

npm run dev

Head to http://localhost:4321 in your browser.
Click the gear icon (Settings) and enter your OpenAI or Anthropic API key.
Hit Run.

You will immediately see the tokens streaming in and the tokens/sec counter updating live.

Here are all the available commands:

Command	Description
`npm run dev`	Start the dev server
`npm run build`	Build for production
`npm run preview`	Preview the production build
`npm test`	Run unit tests (Vitest)
`npm run test:e2e`	Run E2E tests (Playwright)

How to Add a New Provider

This is where the architecture really shines. If you want to add support for, say, Gemini or Groq, the process is clean:

Create a new adapter file in src/lib/providers/. Your adapter must implement the ProviderAdapter interface.
Register it in src/lib/providers/index.ts.
Add the provider metadata (name, models, etc.) to src/lib/config.ts.

That is it. No changes to the benchmark engine, no changes to the UI logic. The adapter pattern keeps concerns separated cleanly.
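To show the shape of the pattern, here is a language-agnostic sketch in Python. The real project is TypeScript, and I am not reproducing its actual ProviderAdapter interface here; the `stream` method, the `EchoProvider` class, the `PROVIDERS` registry, and `count_tokens` are all hypothetical.

```python
from typing import Dict, Iterator, Protocol


class ProviderAdapter(Protocol):
    """Hypothetical adapter shape; the real interface may differ."""
    name: str

    def stream(self, prompt: str) -> Iterator[str]:
        ...


class EchoProvider:
    """Toy provider that 'streams' the prompt back word by word."""
    name = "echo"

    def stream(self, prompt: str) -> Iterator[str]:
        for word in prompt.split():
            yield word


PROVIDERS: Dict[str, ProviderAdapter] = {}


def register(adapter: ProviderAdapter) -> None:
    """Registration step: make the adapter discoverable by name."""
    PROVIDERS[adapter.name] = adapter


def count_tokens(provider_name: str, prompt: str) -> int:
    """Provider-agnostic 'engine': it only talks to the adapter interface."""
    adapter = PROVIDERS[provider_name]
    return sum(1 for _ in adapter.stream(prompt))


register(EchoProvider())
print(count_tokens("echo", "how fast is this model"))
```

Adding another provider means writing one more class and one more `register` call. The engine function never changes, which is the whole point of the adapter approach.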

I am planning to add more providers over time. If you want to contribute one, pull requests are welcome.

Why This Matters for Performance Engineers

I want to speak directly to performance engineers here for a second.

We are used to measuring systems with JMeter, k6, Gatling. We understand throughput, latency percentiles, concurrency, think time. LLM APIs add a new dimension to all of this.

When you are building an AI-powered product, you are not just measuring HTTP response time anymore. You are dealing with:

TTFT as a user-perceived latency metric (equivalent to time-to-interactive in web perf)
Streaming throughput as a sustained delivery rate (not a one-shot measurement)
Provider variability: the same model can behave very differently across regions and time of day
Reasoning overhead: thinking models add invisible compute time before the first visible token

Tools like iamspeed.dev give you a quick sanity check. Before you design a full performance test suite for your LLM-powered API, run a quick benchmark here to understand your baseline numbers.

I have written extensively about LLM performance metrics on the QAInsights blog and built the jmeter-llm-sampler plugin for measuring TTFT and TTLT in JMeter test plans. iamspeed.dev is the browser-friendly companion to those deeper tools.

What's Next

A few things I want to add to iamspeed.dev:

More providers: Gemini, Groq, Mistral, and local Ollama support
Historical results: Run multiple benchmarks and compare them over time
Model comparison mode: Run the same prompt across two models side by side
Shareable result links: Generate a URL you can share with your team
Prompt customization: Let you choose the input prompt length to simulate different workloads

If any of these sound useful to you, drop a star on the GitHub repo and let me know what you want to see first.

Try It Now

Head to iamspeed.dev, configure your API key in settings, and hit Run.

You will have your tokens-per-second number in about 10 seconds.

The source code is MIT licensed and available at github.com/QAInsights/iamspeed.dev. Contributions are open.

Happy Testing!

What LLM provider are you using in production today, and what TTFT are you seeing? Drop a comment below. I would love to know how the numbers compare.

How I Use Qwen Code Slash Commands to Build Achu App

NaveenKumar Namachivayam ⚡ — Wed, 17 Jun 2026 03:49:08 +0000

In this blog post, we will see how I use Qwen Code's slash commands and workflow strategies to build Achu, my screenshot beautifier app, without burning through tokens or losing context mid-session.

If you haven't heard of Achu, it's a desktop app built with Electron + React + TypeScript. It does screenshot beautification, Privacy Guard (offline OCR redaction), Auto-Vibe (palette-extracted backgrounds), and an AI Bug Agent with GitHub integration. It's a side project I'm genuinely proud of, and Qwen Code has become my go-to agentic coding CLI for it.

A developer shares their day-to-day workflow for using Qwen Code, an open-source agentic coding CLI, to build Achu, a desktop screenshot beautification app built with Electron, React, and TypeScript. The post covers how slash commands like /init, /plan, /compress, /remember, and /btw are used to manage context, reduce token costs, and maintain consistent output across sessions.

The core approach centers on spec-driven planning through iterative /plan sessions before any code is written, combined with parallel subagents for independent tasks and strict context hygiene using /compress and /clear. Additional practices include pointing the model at library source code instead of documentation and using /remember to persist architectural decisions across sessions.

This isn't a tutorial about what Qwen Code is. It's about how I actually use it day-to-day, the slash command tricks I rely on, and the discipline it takes to get real work done with an LLM in a terminal.

It all started with Google Antigravity, but the 5-hour reset and weekly limits were killing my productivity and thinking flow. I had to switch to a more affordable, open-source option, and I chose Qwen.

Why Qwen Code?

I've tried Claude Code, Gemini CLI, and a bunch of others. Qwen Code is open source, has excellent subagent support, a rich slash command system, and Qwen Max is genuinely strong at reasoning through complex TypeScript and Electron internals.

My go-to model is Qwen Max. For lighter tasks like /recap or prompt suggestions I set a fast model with /model --fast qwen3-coder-flash to keep costs down.

The /init and Project Context Setup

The very first thing I do when I start on a new project or return to Achu after a few days is run:

/init

This analyzes the current directory and generates an initial context file, essentially giving Qwen Code a map of the project. It picks up the folder structure and key files, and creates a baseline understanding before I say a single word.

After /init, I manually add a few paragraphs about the project. I treat this like writing a team onboarding doc for a new developer. I tell Qwen what Achu is, what the current milestone is, what tech stack we're on, and what the known constraints are (like Electron IPC boundaries, the Upstash Redis integration, or the Gumroad-based monetization model).

This upfront investment saves enormous amounts of back-and-forth later.

Spec-Driven Planning with /plan

When I want to build a new feature, I don't just dump a vague request and hope for the best. I use /plan to switch Qwen Code into planning mode.

/plan Implement the Privacy Guard redaction pipeline

In plan mode, Qwen analyzes and thinks, but does not touch any files. This is key. It's the agentic equivalent of "think before you act."

What I actually do is run multiple turns in plan mode to iterate on the spec. I think of a "spec" as the formal artifact that describes what should be built: the interface contracts, the data flow, the error paths, the acceptance criteria. It's not a vague idea. It's something precise enough that a developer (or subagent) could implement it.

The loop looks like this:

/plan: enter planning mode
Describe the feature in detail: what it does, what it doesn't do, edge cases
Ask Qwen to propose an approach
Push back on anything that doesn't fit the architecture
Ask for the revised plan
Repeat 2-3 times until the spec is solid

The quality of implementation is almost entirely determined by the quality of the spec. This multi-turn refinement before a single line of code is written is the most valuable habit I've developed using any agentic coding tool.

By default, Qwen will ask follow-up questions. Even so, I always recommend explicitly telling the model to ask clarifying questions.

Subagents for Async Work

Once the spec is locked in, I use subagents aggressively for any work that can happen independently.

Qwen Code's subagent system lets you define specialized agents as Markdown files in .qwen/agents/. Each agent has its own system prompt, tool allowlist, and model. You can call them explicitly or let Qwen delegate automatically.

For Achu, I have a few custom subagents:

A testing subagent focused on Vitest and Electron testing patterns (more on this below)
A code reviewer subagent that runs in plan mode and only reads files

The key power here is Fork Subagents: when Qwen needs to run multiple things in parallel, it can implicitly fork. Forks inherit the parent context, run in the background, and share the prompt cache prefix. This means if I ask Qwen to "investigate the IPC handler for Privacy Guard, the Ollama integration, and the Upstash Redis voting flow simultaneously," it can fork three parallel agents without tripling my token costs.

I explicitly phrase tasks as:

"Run these three investigations in parallel using subagents and report back."

This keeps the main conversation focused and lets the grunt work happen concurrently.

A project-level subagent config lives at .qwen/agents/testing.md and looks like this:

---
name: testing
description: "Writes Vitest unit tests and Electron integration tests for Achu. Use PROACTIVELY for any test-related tasks."
approvalMode: auto-edit
tools:
  - read_file
  - write_file
  - read_many_files
  - run_shell_command
---

You are a testing specialist for an Electron + React + TypeScript app.
Follow Vitest conventions. Mock Electron IPC using vitest-mock-extended.
Always write both positive and negative test cases.

The phrase "Use PROACTIVELY" in the description is important: it signals to the main model to delegate testing tasks here without being asked explicitly.

Context Hygiene: /summary, /compress, and /clear

This is where most people fail with long agentic sessions. They let the context grow unbounded until the model starts hallucinating, forgetting earlier instructions, or producing inconsistent output. I've learned to treat context like memory on a constrained machine.

My hygiene rules:

After a major chunk of work is done:

/summary

This generates a project summary from the conversation history. I save this externally and reference it when restarting sessions.

When the context window is getting full:

/compress

This replaces chat history with a compressed summary, freeing up tokens while preserving the semantic essence of what was discussed. Think of it as a lossy but practical checkpoint.

When Qwen starts steering away:

If the model starts going off-track (giving answers that don't match the project constraints, suggesting patterns we've already ruled out, or just losing the thread), I don't argue. If it happens twice in a row, I clear:

/clear

Then I reload context from scratch using /init and a fresh description. Two drifts is my hard limit. The discipline here is resisting the urge to keep "fixing" a bad session. It's cheaper to restart clean.

Watching Context and Usage with /context and /stats

I watch these two commands constantly.

/context

Shows a breakdown of what's consuming the context window right now: system prompt, conversation history, and tool results. If I see tool results bloating the context, I know a /compress is coming.

/context detail

Shows per-item breakdown. Useful when one massive file read is eating 40% of the window.

/stats

Gives detailed session statistics: tokens used, API calls, and cost estimates. I check this before and after big operations. It's how I keep tabs on spend, especially on Qwen Max, which isn't the cheapest model.

Keeping an eye on these is the agentic equivalent of watching memory usage in a production system. Ignore it and you'll pay for it.
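To make the "one file read is eating 40% of the window" case concrete, here is a rough Python sketch of the arithmetic behind a per-item breakdown. It does not read Qwen Code's real output; the item names, token counts, window size, and the 25% alert threshold are all hypothetical.

```python
def context_shares(tokens_by_item: dict[str, int], window: int) -> dict[str, float]:
    """Return each item's share of the context window as a percentage."""
    return {name: 100.0 * count / window for name, count in tokens_by_item.items()}


def heavy_items(tokens_by_item: dict[str, int], window: int,
                limit_pct: float = 25.0) -> list[str]:
    """List items whose share of the window exceeds limit_pct, largest first."""
    shares = context_shares(tokens_by_item, window)
    heavy = [name for name, pct in shares.items() if pct > limit_pct]
    return sorted(heavy, key=lambda name: shares[name], reverse=True)


# Hypothetical numbers for a 100,000-token window.
usage = {"system prompt": 6_000, "history": 18_000, "read_file big.json": 40_000}
print(context_shares(usage, 100_000))
print(heavy_items(usage, 100_000))
```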

Pointing to Source Directories Instead of Docs

This one is a significant productivity trick that I don't see talked about enough.

When Qwen needs to understand a third-party library, the default approach is to tell it to fetch the docs URL. The problem is that docs are often incomplete, outdated, or optimized for humans rather than LLMs.

What I do instead: I download the library source and point the conversation directly at it using @:

@./vendor/upstash-redis/src Tell me how the pipeline API works

Or with a deeper path:

@./node_modules/@electron/remote/src/main Explain the context bridge setup

Qwen reads the actual implementation. No guessing from docs. No hallucination about API signatures that changed in the last major version.

I keep a vendor/ folder in the project root where I clone or copy source for critical dependencies. This makes @ references stable and reproducible.

For Achu specifically, I've pointed Qwen at the Ollama TypeScript client source, the llava-phi3 model integration code, and parts of the Electron forge config. The answers I get are ground-truth accurate instead of approximately correct.

Persistent Memory with /remember and /dream

Some things should survive session boundaries. My preferences, key architectural decisions, constraints Qwen needs to always respect. I use /remember for these.

/remember Always use Electron's contextBridge for IPC. Never use remote module.
/remember Achu uses oklch color space. Do not suggest hex values without conversion.
/remember Free tier users get 3 exports per day. Pro users are unlimited.

These get persisted in Qwen's memory store and are injected into future sessions automatically.

/dream is the manual trigger for auto-memory consolidation. Qwen's auto-memory runs in the background, but if I want to force a consolidation pass after a long session, to make sure the important discoveries from the current session get persisted, I run:

/dream

Think of it as flushing the cache to disk before shutting down.

To review and manage what's been remembered, I use:

/memory

This opens the Memory Manager dialog where I can edit or delete entries. I audit this occasionally. Stale memories can be just as harmful as no memories.

The /btw Trick for Side Questions

This is my favourite quality-of-life command.

/btw What's the difference between contextBridge.invoke and contextBridge.exposeInMainWorld?

/btw sends a parallel API call with recent conversation context (up to the last 20 messages) and shows the response above the composer without touching the main conversation at all. The main session continues uninterrupted.

I use this constantly for:

Quick clarifications while in the middle of implementation
Checking a TypeScript type signature without derailing a planning session
Double-checking a shell command before running it via !

The response doesn't become part of conversation history. It's a throwaway lookup. This is genuinely useful and I'm surprised more CLI tools don't have something like it.

Uncommon Commands Worth Knowing

Beyond the commands I use daily, here are a few from the docs that are genuinely underrated:

/restore

Restores files to their state before a tool execution. If a Qwen action made a mess, you can list recent tool executions with /restore and roll back a specific one with /restore <ID>. Think of it as a targeted undo for AI changes.

/loop

Runs a prompt on a recurring schedule:

/loop 5m check the build output and report any new warnings

I use this occasionally when I'm doing a long build and want Qwen to monitor for me while I do something else. It's a lightweight cron for conversational tasks.

/recap

Generates a one-line summary of where the session left off. If I step away for more than five minutes, Qwen auto-triggers this when I return:

? Implementing the Privacy Guard redaction pipeline. Next step: wire the OCR output into the bounding-box overlay renderer.

Incredibly useful for picking up after an interruption without scrolling through history.

/approval-mode auto-edit

Once I trust the current task scope, I switch to auto-edit to let Qwen make file changes without prompting me every time. I reserve yolo mode for throwaway branches only.

/directory

Adds multiple directories to the workspace context:

/dir add ./src,./tests,./electron

Useful when the feature spans multiple root-level directories that Qwen wouldn't automatically scope to.

My Qwen Code Workflow Summary

Here's the workflow I follow for every non-trivial feature in Achu:

Start clean: /init + add project context manually
Spec first: use /plan in multiple turns until the spec is solid
Delegate async tasks: use subagents for parallel investigations and implementation
Monitor context: /context detail regularly, /compress proactively
Log source truth: point @ at source directories, not docs
Remember decisions: /remember for anything that should persist
Quick questions: /btw without breaking flow
Clear if it drifts: two steering-away moments is my hard limit for /clear
End of session: /dream to consolidate memory, /summary to save the state

This isn't magic. It's discipline. The LLM doesn't make you disciplined; you have to bring that yourself. But when you apply this workflow consistently, the output quality is noticeably better than freeform chatting with an AI.

If you're building something with Qwen Code, try the spec-first approach. The twenty minutes you spend in /plan mode iterating the spec will save you three hours of correcting implementation drift.

Happy Agentic Coding, Testing, Shipping, Learning, whatever :) !

What's your biggest challenge with agentic coding workflows: staying in context, or getting the model to follow architectural constraints? Let me know in the comments.

Supervised Vibe Coding: A Manifesto

NaveenKumar Namachivayam ⚡ — Thu, 11 Jun 2026 03:01:10 +0000

We are not anti-AI. We are pro-discipline.

Vibe coding unlocks speed. Supervised vibe coding unlocks speed you can trust. The difference is a developer who remains the final decision-maker at every step, not a passive reviewer of whatever the model felt good about.

Supervised vibe coding is a development approach that combines AI-generated speed with deliberate human oversight, positioning the developer as the final decision-maker rather than a passive reviewer. It builds on Andrej Karpathy's 2025 concept of "vibe coding," which described fully delegating code generation to AI tools without reviewing the output.

The manifesto outlines ten guiding principles covering incremental delivery, test coverage, code review, prompt discipline, security, documentation, configuration management, CI/CD enforcement, ownership, and dependency auditing. A recurring theme is that AI accelerates execution but cannot replace developer judgment, accountability, or the ability to code independently.

The origin of vibe coding

On February 2, 2025, Andrej Karpathy, co-founder of OpenAI and former director of AI at Tesla, posted a short thought on X that would change how the software world talked about AI-assisted development.

"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists." Source: Andrej Karpathy on X (formerly Twitter), February 2, 2025.

The post went viral, clocking over 4.5 million views. Karpathy described using tools like Cursor Composer paired with Anthropic's Claude models, sometimes via voice through SuperWhisper, barely touching his keyboard. He accepted all AI-generated changes without reviewing diffs, pasted error messages straight back to the model, and let the codebase grow organically, even beyond his own full comprehension.

The phrase struck a cultural nerve because it named something developers were already doing, just without a word for it. By the end of 2025, Collins Dictionary had named "vibe coding" its Word of the Year, with nearly half of all developers reporting daily use of AI coding tools.

Karpathy himself acknowledged the limits early. He noted that AI occasionally could not fix certain bugs, forcing him to work around them or prompt blindly until something stuck. He called it "quite amusing" and best suited for non-critical projects. That caveat got lost in the hype.

Now Supervised Vibe Coding formalizes what disciplined engineers were already practicing. Speed from AI. Judgment from humans.

The 10 laws

Ship in slices, not in floods

Build incrementally. Each iteration must be reviewable, testable, and deployable on its own. Human review is not optional; AI is a contributor, not a reviewer. If you cannot review it in one sitting, it is too large.

Tests are not a phase, they are a practice

Unit, integration, and edge case tests accompany every feature. AI may scaffold the test file. You verify every assertion. Nulls, empty inputs, boundary values, concurrency, and failure paths are caught at design time, not at incident time.

Read before you run

Understand every snippet before accepting it. Verify APIs exist, functions are not deprecated, and packages are not hallucinated. If you cannot explain what the code does and what it depends on, it is not ready to merge.

Prompt with intent, pin your model

Bad prompts produce bad code. Be explicit about language version, constraints, patterns, and security requirements in every prompt. Share prompt conventions with your team so AI behaviour is consistent across the codebase.

Model drift: Pin your model version in CI/CD the same way you pin a package version. An unversioned AI dependency is a silent breaking change waiting to happen. The same prompt can produce different outputs across model versions, so treat model upgrades like dependency upgrades: deliberate, tested, and reviewed.
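A pinning rule like this can be enforced with a small CI check. The Python sketch below is one possible shape, not a standard tool: it assumes a pinned identifier ends in a date stamp, and the model names and job names are made up for illustration.

```python
import re

# Assumption: a pinned identifier ends in a date stamp such as -2024-07-18 or -20241022.
_DATE_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Return True when the model identifier carries an explicit date-stamped version."""
    if model_id.endswith("latest"):
        return False
    return bool(_DATE_SUFFIX.search(model_id))


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the config keys whose model identifier is not pinned."""
    return [key for key, model_id in config.items() if not is_pinned(model_id)]


# Hypothetical CI config mapping a job name to the model it calls.
ci_models = {"codegen": "example-model-2025-01-15", "review": "example-model-latest"}
print(unpinned_models(ci_models))  # ['review']
```

Failing the pipeline when this list is non-empty turns a model upgrade into a deliberate, reviewable change.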

Security, performance, and UX share equal priority

No feature is done if it leaks data, crawls under load, or confuses users. These are first-class requirements on every ticket. Never paste customer data, PII, credentials, or secrets into an AI tool. The prompt is not a sandbox; it is a transmission.

Document as you go, not as you leave

AI accelerates writing code but accumulates invisible technical debt. Document decisions, assumptions, and AI-generated sections as part of the same commit. Future maintainers deserve to know what the code does and why it was written this way.

Configuration is code, treat it accordingly

Secrets, environment variables, timeouts, and feature flags are versioned, validated, and never hardcoded. The config is part of the contract. A misconfigured deploy is still a broken deploy, regardless of how clean the code looks.

The pipeline is the gatekeeper

Lint, test, security scan, and build gates must all pass before code reaches the next environment. Observability and logging ship with the feature, not after. If you cannot see what your code is doing in production, you do not own it yet.

You are the supervisor, not the spectator

Feature flags, rollback plans, canary deployments, and health checks turn every release into a controlled, reversible act. Decide upfront who owns the code when something breaks. AI does not get paged at 2 AM. Ownership must be explicit before the deploy, not after the incident.

Deskilling is a silent risk: Deliberately solve problems without AI on a regular basis. Write a function from scratch. Debug without asking the model. The judgment this manifesto depends on atrophies if you never exercise it. Supervised vibe coding requires a supervisor who can actually code.

Own the dependency list

Every package AI pulls in is your responsibility to audit, pin, and maintain. AI will confidently suggest outdated, vulnerable, or nonexistent packages. Review licenses for IP compliance. Disclose AI involvement to your team, your clients, and where required, your employer. The code carries your name, not the model's.

Weekend Supervised Vibe Coding

NaveenKumar Namachivayam ⚡ — Mon, 08 Jun 2026 22:49:14 +0000

Achu - means print in Tamil

Built using @antigravityteam Google Flash 3.5 by burning my 1000 credits - then I pivoted to @CommandCodeAI DeepSeek Pro, after burning that, switched to raw @deepseekaifree pro in the terminal.

Please take a look at https://achu.app

Weekend Supervised Vibe Coding

NaveenKumar Namachivayam ⚡ — Sun, 31 May 2026 00:04:04 +0000

Achu - means print in Tamil

Built using @antigravityteam Google Flash 3.5 by burning my 1000 credits - then I pivoted to @CommandCodeAI DeepSeek Pro, after burning that, switched to raw @deepseekaifree pro in the terminal.

I am still testing :)

99% of Requests Failed and My Dashboard Showed Green

NaveenKumar Namachivayam ⚡ — Wed, 13 May 2026 15:41:30 +0000

In this blog post, we will see how to use NVIDIA AIPerf to expose a hidden performance problem that most LLM deployments never catch until real users start complaining.

I ran three simple tests against a local model. The results tell a story that every performance engineer should see.

The Setup

For this experiment, I used:

Model: granite4:350m running locally via Ollama
Endpoint: http://localhost:11434
Tool: NVIDIA AIPerf (the official successor to GenAI-Perf)

Head to https://github.com/ai-dynamo/aiperf to install AIPerf. It is a single pip install:

pip install aiperf

Granite 4 350M is a small, fast model perfect for local testing on a MacBook or a dev machine without a beefy GPU. The principles you will see here apply equally to larger models in cloud deployments.

Run 1: The Baseline That Lies

I started with the most common mistake in LLM performance testing: a single-user baseline.

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --url http://localhost:11434 \
  --tokenizer builtin \
  --request-count 50 \
  --concurrency 1

The results looked great, as shown below.

Key numbers from this run:

Metric	avg	p50	p99
TTFT (ms)	223.11	217.60	317.61
TTST (ms)	10.94	9.99	18.00
ITL (ms)	10.67	10.51	12.35
Request Latency (ms)	1,309.30	1,043.95	3,251.73
Request Throughput (req/sec)	0.76	N/A	N/A

223ms average TTFT. Smooth inter-token latency at 10.67ms. If you stopped here, you would call this production-ready.

Most people stop here. That is the problem.

Run 2: The Wake-Up Call

Next, I pushed concurrency to 50, a more realistic number for a shared endpoint. I also added a warmup of 10 requests to eliminate cold-start noise, and ran for 60 seconds.

aiperf profile \
  --model "granite4:350m" \
  --url http://localhost:11434 \
  --endpoint-type chat \
  --concurrency 50 \
  --tokenizer builtin \
  --warmup-request-count 10 \
  --benchmark-duration 60 \
  --streaming

The results were a shock, as shown below.

Metric	avg	p50	p99
TTFT (ms)	41,660.92	50,870.37	64,201.68
TTST (ms)	10.21	10.11	13.10
ITL (ms)	10.38	10.18	13.29
E2E Output Token Throughput (tokens/sec/user)	4.86	1.85	60.87
Request Throughput (req/sec)	0.88	N/A	N/A

TTFT went from 223ms to 41,660ms. That is roughly a 187x increase.
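The multiplier comes straight from the two average TTFT values in the tables above. A quick Python check, using only those reported numbers:

```python
baseline_ttft_ms = 223.11    # Run 1, concurrency 1
loaded_ttft_ms = 41_660.92   # Run 2, concurrency 50

ratio = loaded_ttft_ms / baseline_ttft_ms
print(f"TTFT grew {ratio:.1f}x under load")
```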

At p99, users were waiting over 64 seconds just to see the first token.

Your monitoring dashboard probably still shows green. Your users are staring at a blank screen.

Run 3: Goodput Exposes the Real Truth

This is where AIPerf separates itself from basic benchmarking tools. I added a --goodput flag with a TTFT SLO of 500ms. Goodput measures the throughput of requests that actually met the SLO, not just all requests indiscriminately.

aiperf profile \
  --model "granite4:350m" \
  --url http://localhost:11434 \
  --endpoint-type chat \
  --concurrency 50 \
  --tokenizer builtin \
  --benchmark-duration 60 \
  --goodput 'time_to_first_token:500' \
  --streaming

As shown below, the result is the most important number in this entire experiment.

Metric	Value
Request Throughput (req/sec)	0.91
Goodput (req/sec)	0.01
TTFT avg (ms)	37,380.20
TTFT p99 (ms)	55,777.69

Request throughput says 0.91 req/sec. Looks reasonable.

Goodput says 0.01 req/sec.

That means roughly 99% of requests failed the 500ms TTFT SLO. Your system is processing requests. It is not serving users.
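The 99% figure follows from the two rates in the table, and the goodput idea itself fits in a few lines. The Python sketch below is illustrative only: the `goodput` helper is not AIPerf code, the per-request samples are hypothetical, and treating a request that lands exactly on the SLO as "good" is an assumption.

```python
def goodput(ttft_ms: list[float], duration_s: float, slo_ms: float = 500.0) -> float:
    """Requests per second whose time to first token met the SLO."""
    good = sum(1 for value in ttft_ms if value <= slo_ms)
    return good / duration_s


# Share of traffic missing the SLO, from the reported rates in Run 3.
request_throughput = 0.91
reported_goodput = 0.01
failed_share = 1 - reported_goodput / request_throughput
print(f"{failed_share:.1%} of requests missed the SLO")

# Hypothetical per-request TTFT samples over a 60-second window.
samples = [310.0, 480.0, 37_000.0, 52_000.0, 41_500.0, 55_700.0]
print(goodput(samples, duration_s=60.0))
```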

The Hidden Insight: ITL Stays Rock Solid

Here is what most people miss when they first see these numbers. Look at ITL across all three runs:

Run	TTFT avg (ms)	ITL avg (ms)
Concurrency 1	223.11	10.67
Concurrency 50	41,660.92	10.38
Concurrency 50 + Goodput	37,380.20	9.71

ITL barely moves. TTST (Time to Second Token) also stayed consistent around 10ms across all runs.

The model is not the problem. The queue is.

Once the model starts generating for a request, it flies. Tokens come out at a consistent 10ms pace regardless of how many other requests are in flight. The bottleneck sits entirely before the first token: requests pile up in the queue, waiting for the model to even begin processing them.
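A back-of-the-envelope check with Little's Law supports the queueing explanation. This sketch assumes the load generator keeps all 50 requests in flight and that the system is near steady state, which a 60-second run only approximates, so read the result as a rough estimate rather than a measurement.

```python
concurrency = 50            # in-flight requests held by the load generator (Run 2)
throughput_rps = 0.88       # measured request throughput (Run 2)

# Little's Law: requests in system = throughput * average time in system.
avg_time_in_system_s = concurrency / throughput_rps
print(f"about {avg_time_in_system_s:.0f} s per request in steady state")
```

That estimate is in the same range as the measured p50 TTFT of about 51 seconds, which is what you would expect if nearly all of a request's time is spent waiting rather than generating.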

This is a critical distinction for capacity planning. If ITL were also degrading, you would need a faster model or better hardware. Since only TTFT is exploding, the fix is architectural: better queue management, request routing, or horizontal scaling of the inference server.

You cannot arrive at this insight without separating TTFT from ITL. A single "response time" metric would have buried it entirely.
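For readers who want to see the separation concretely, here is a minimal Python sketch that derives both metrics from token arrival timestamps on a streaming response. The function name and the sample timestamps are hypothetical, and AIPerf computes these values for you; this only shows the definitions.

```python
from statistics import mean


def ttft_and_itl(request_sent_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in milliseconds from token arrival timestamps.

    TTFT is the gap between sending the request and the first token.
    ITL is the mean gap between consecutive tokens after the first.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to compute ITL")
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    return ttft_ms, mean(gaps) * 1000


# Hypothetical timestamps: a long queue wait, then steady 10 ms token gaps.
sent = 0.0
tokens = [40.00, 40.01, 40.02, 40.03]
ttft, itl = ttft_and_itl(sent, tokens)
print(f"TTFT {ttft:.0f} ms, ITL {itl:.1f} ms")
```

A single end-to-end latency for this request would be dominated by the 40-second wait and would hide the fact that generation itself stayed fast.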

The Lesson

Three commands. Three minutes. A completely different picture of your system.

What you measured	What you learned
Single-user baseline	False confidence
Concurrency 50	The real TTFT behavior under load
Goodput with SLO	How many users are actually being served

The takeaway is simple: always test with realistic concurrency. Always set an SLO and measure goodput against it. And always look at TTFT and ITL separately; they tell completely different stories.

A system with great ITL and terrible TTFT under load has a queue problem, not a model problem. Knowing that changes everything about how you fix it.

Happy Testing!

Over to you: Have you ever shipped an LLM feature that looked great in testing but struggled under real user load? What metric finally exposed it? Drop a comment below. I would love to hear your story.

Beyond the Hype: A Comprehensive Guide to Benchmarking LLMs with AWS Labs’ LLMeter

NaveenKumar Namachivayam ⚡ — Thu, 07 May 2026 16:30:44 +0000

In the current AI gold rush, the conversation has shifted from "Can it do the task?" to "How efficiently can it do the task?" For engineers moving Large Language Models (LLMs) into production, the "vibe check" is no longer sufficient. You need hard data on latency, throughput, and cost-efficiency.

AWS Labs recently released LLMeter, a Python-based benchmarking library that is quickly becoming the gold standard for performance engineers. In this guide, we’ll break down why this tool matters, how to use it, and how to visualize your data for executive-level insights.

The Metrics That Actually Matter

Before diving into the code, we must define the "North Star" metrics of LLM performance. LLMeter is specifically designed to capture:

Time to First Token (TTFT): The duration between sending a request and receiving the first token of the response. This is the most critical metric for perceived user latency.
Tokens Per Second (TPS): The speed at which the model generates text. A high TPS ensures a smooth reading experience.
Time to Last Token (TTLT): The total duration for the entire response.
Cost Per Request: Calculated based on input/output token counts and specific model pricing.

1. Setting Up Your Benchmarking Environment

LLMeter is built for modern Python environments (3.10+). For the fastest setup, we recommend using UV, the high-performance Python package installer.

Installation

# Using UV for lightning-fast dependency management
uv pip install llmeter load_env plotly

Environment Configuration

You don’t want to hardcode your API keys. LLMeter works seamlessly with .env files. Ensure your environment is prepared for the providers you intend to test (OpenAI, Anthropic, Bedrock, or DeepSeek).

2. Architecting Your Experiment

The beauty of LLMeter lies in its structured approach to testing. An "Experiment" in LLMeter consists of three main components:

The Endpoint & Payload

You define where the request is going and what it contains. For accurate TTFT measurements, always use streaming endpoints.

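# Example: Setting up a GPT-4o-mini endpoint
import os

# Assumed import path; confirm it against your installed LLMeter version.
from llmeter.endpoints import OpenAIEndpoint

endpoint = OpenAIEndpoint(
    model="gpt-4o-mini",
    api_key=os.getenv("OPENAI_API_KEY"),  # read the key from the environment
    streaming=True
)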

The Cost Model

Unlike generic load testers, LLMeter allows you to define a CostModel. By providing the price per million tokens, the library does the math for you, allowing you to see the financial impact of your scaling decisions in real-time.
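The arithmetic behind a per-request cost is simple enough to sketch by hand. This Python snippet is not LLMeter's CostModel API; the function name, prices, and token counts are hypothetical and only show how a per-million-token price turns into a per-request cost.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in the same currency as the per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices (per million tokens) and token counts, for illustration only.
cost = request_cost(input_tokens=1_200, output_tokens=400,
                    input_price_per_m=0.50, output_price_per_m=2.00)
print(f"${cost:.6f} per request")
```

Multiplying that figure by expected request volume gives a first estimate of what a scaling decision will cost.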

3. Running Multi-Client Load Tests

In a production environment, your LLM won't be handling one request at a time. LLMeter allows you to simulate concurrent clients.

In our testing, we found that running a sequential step test provides the most insight:

Baseline: 1 client for 10 seconds.
Ramp-up: 3 clients for 10 seconds.
Stress: 10+ clients to find the "breaking point" where the provider begins rate-limiting or latency spikes.

Because LLMeter is built on Python’s asyncio, it can handle a massive number of concurrent requests from a standard laptop without the hardware becoming the bottleneck.
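The step pattern above can be sketched with plain asyncio to show why concurrency is cheap on the client side. This is not LLMeter code: `fake_request` is a stand-in that sleeps instead of calling a model, and the step durations are shortened so the demo finishes quickly.

```python
import asyncio
import time


async def fake_request() -> float:
    """Stand-in for a real LLM call; sleeps briefly and returns the latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)
    return time.perf_counter() - start


async def client(deadline: float, latencies: list[float]) -> None:
    """One simulated user: send requests back to back until the deadline."""
    while time.perf_counter() < deadline:
        latencies.append(await fake_request())


async def run_step(clients: int, duration_s: float) -> list[float]:
    """Run `clients` concurrent users for `duration_s` seconds and collect latencies."""
    latencies: list[float] = []
    deadline = time.perf_counter() + duration_s
    await asyncio.gather(*(client(deadline, latencies) for _ in range(clients)))
    return latencies


async def step_test(steps: list[tuple[int, float]]) -> dict[int, int]:
    """Run each (clients, duration) step in sequence; return completed requests per step."""
    results = {}
    for clients, duration_s in steps:
        results[clients] = len(await run_step(clients, duration_s))
    return results


# Durations shortened for the demo; the steps described above use 10 seconds each.
print(asyncio.run(step_test([(1, 0.2), (3, 0.2), (10, 0.2)])))
```

Swapping `fake_request` for a real streaming call and recording TTFT per request would turn this into a crude version of the step test described above.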

4. Visualizing Performance with Plotly

Data in a terminal is hard to digest. LLMeter’s integration with Plotly transforms raw logs into interactive HTML reports.

Key visualizations include:

TTFT vs. Number of Clients: Watch how the "wait time" increases as your application scales.
TPS Histograms: Identify if your model provides consistent speed or if there are frequent "stalls."
Error Rate Charts: Track 429 (Rate Limit) errors to determine if you need to request a quota increase from your provider.

5. Taking Control: The Real-Time Dashboard

One limitation of the standard LLMeter library is that it primarily provides post-test results. To solve this, we’ve developed a Minimalist Live Dashboard using Python.

Why a Live Dashboard?

Instant Feedback: See the TPS and Cost update every second.
Safety Switch: If you notice a model is hallucinating or costs are spiking unexpectedly, you can kill the test immediately.
Stakeholder Demos: It’s much more impactful to show a live-updating graph of "Tokens Per Second" than a static CSV file.

Conclusion: Data-Driven AI Engineering

Choosing an LLM based on a leaderboard is a starting point, but benchmarking it against your specific prompts and your expected user load is essential. LLMeter provides the framework; the insights it generates will save you from costly production bottlenecks.

Resources & Further Learning

Full Video Tutorial: Watch the Hands-on Walkthrough
Source Code: Visit the QAInsights GitHub for the custom dashboard script.
Official Tool: Explore the AWS Labs LLMeter Repo.

Are you ready to stop guessing and start measuring? Download LLMeter today and baseline your AI stack.

Proof of Humanity™

NaveenKumar Namachivayam ⚡ — Thu, 02 Apr 2026 16:56:37 +0000

This is a submission for the DEV April Fools Challenge

What I Built

To prove you're human, you must assemble Flätpack furniture.
One step is irrelevant. Robots cannot detect irony.

Demo

https://kilo-challenge-8914.d.kiloapps.io/

Code

https://github.com/QAInsights/kilo-challenge

🛠️ How I Built It

I approached this project the way I tackle any complex system: break it down, understand the constraints, and build upward with tight feedback loops. Instead of jumping straight into coding, I started by mapping the experience I wanted users to have. From there, every technical decision flowed naturally.

🔍 1. Defining the Core Problem

Before writing a single line of code, I clarified the “why.” What should this tool feel like? What friction should it remove? What would make someone say, “Oh, that’s clever”?

This early framing helped me avoid feature creep and stay anchored to a crisp user experience.

🧩 2. Designing the Architecture

Once the problem was clear, I sketched the system architecture—data flow, state transitions, and the boundaries between components. I treated it like a mini system‑design exercise:

What should run locally vs. remotely
How to keep the interface responsive
How to ensure the tool remains extensible

This step saved me hours later because every component had a clear responsibility.

⚙️ 3. Building the Core Logic

With the architecture locked in, I implemented the core functionality. I built it incrementally, validating each piece before moving on. This iterative approach made debugging almost trivial and kept the project moving smoothly.

🎨 4. Crafting the User Experience

A tool is only as good as how it feels to use. I refined the UI/UX with small but meaningful touches:

Clear feedback loops
Minimal cognitive load
Fast, predictable interactions

I wanted the tool to feel like something I would enjoy using every day.

🧪 5. Testing Like a User, Not a Developer

I tested the project in real‑world scenarios—switching contexts, trying edge cases, and intentionally breaking things. This surfaced subtle issues that wouldn’t appear in a controlled environment.

🚀 6. Polishing and Shipping

Once the core was solid, I focused on polish:

Cleaned up the codebase
Improved performance
Added small quality‑of‑life improvements
Wrote documentation that future‑me would appreciate

Shipping wasn’t the end—it was the beginning of iteration.

Prize Category

HTCPCP IYKYK

GitHub Copilot CLI Challenge: bt: Modern BLE CLI Tool

NaveenKumar Namachivayam ⚡ — Sun, 15 Feb 2026 21:43:47 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

No clean, simple, cross-platform, developer-friendly BLE CLI exists today. Linux has bluetoothctl, macOS has no official CLI, and Windows exposes only low-level PowerShell APIs.

Developers working with BLE devices, especially ESP32-based prototypes, lack a unified tool. My goal is to create a minimal, ergonomic, script-friendly CLI for scanning, connecting, and interacting with BLE devices.

📝 Repo: https://github.com/QAInsights/bt

Demo

My Experience with GitHub Copilot CLI

bt was built with GitHub Copilot CLI, which acted as a pair programmer for this project. It helped me with brainstorming, planning, and rapid prototyping in natural language, even with my typos :)

🤖 Copilot cut out context switching and kept development focused, helping with pair programming, testing, and more.

As a beginner in Bluetooth modules, I relied on it for navigating the docs, debugging, beautifying the output, and testing. It helped polish the whole project without my leaving the IDE.

Here is how it started:

and here is how it ended:

Repo: https://github.com/QAInsights/bt
