
Skills, MCP, and Ecosystem Playbook for Coding Agents (2026-02-17)

Scope (Track 3 of 3)

This is Track 3/3 in the research set:

  1. Binary tools (see docs/binary-tools-token-efficiency-landscape.md).
  2. Cloud services (see docs/paid-cloud-services-token-efficiency-landscape.md).
  3. Skills, MCP, and ecosystem patterns for reliable coding-agent operation.

1. Executive summary

  1. Treat skills as the primary token-control primitive for long-running coding agents: both Codex and Claude Code use metadata-first/progressive loading patterns, which keeps the baseline context smaller than dumping full runbooks into every prompt. Codex Skills Claude Code subagents
  2. Run complex work through explicit decomposition loops with machine-readable checkpoints (codex exec --json, structured output schemas, and per-step verification hooks) so retries are targeted instead of full-context restarts. Codex non-interactive mode Codex security/sandbox
  3. Use Claude Code's built-in controls (/compact, /clear, /cost, subagents, hooks, plan mode) to cap context growth, enforce permission boundaries, and keep sessions in a predictable budget envelope. Slash commands Costs Permissions Hooks
  4. Put LiteLLM in front of automation and team workflows for cache/routing/fallback/budget policy; this gives one place to enforce spend and reliability across providers and local/remote model paths. LiteLLM caching LiteLLM routing LiteLLM fallbacks LiteLLM budget routing LiteLLM budgets + rate limits
  5. Adopt eval gates early (prompt/agent regression checks + trace-level grading) so skill changes and prompt scaffolds improve success rate instead of drifting into expensive prompt churn. Promptfoo CI/CD Promptfoo Action OpenAI Agent Evals Testing Agent Skills with Evals

2. High-ROI skill and workflow patterns

Skill packaging and trigger discipline

  • Codex: use AGENTS.md layering plus skills (SKILL.md) with precise description boundaries. Keep descriptions explicit about when to use and when not to use. AGENTS.md guide Codex Skills
  • Claude Code: use project/user subagents for specialized flows (for example, incident-triage, test-runner) so each delegated task runs in a separate context window. Claude subagents
  • Practical mechanism: metadata-first skill loading reduces default prompt payload; full instructions load only on trigger. Codex Skills

Task decomposition for complex incidents

  • Start in analysis-only mode (plan in Claude permissions, read-only in Codex sandbox) to produce a bounded plan before code edits. Claude permissions Codex security
  • Convert each step into a verifiable subtask with a required artifact (diff, failing test, passing test, or schema-validated JSON summary). Codex non-interactive mode
  • Run subagents/skills for narrow units rather than a single monolithic prompt; see the loop sketch after this list. Claude subagents Codex Skills
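A minimal sketch of such a decomposition loop, assuming an illustrative plan.txt (one subtask per line) and a repo-specific targeted test command; only codex exec flags already cited in this report are used:

```bash
#!/usr/bin/env bash
# Illustrative decomposition loop: one bounded, non-interactive codex exec call
# per subtask, with a targeted verification command gating each step.
# plan.txt and "npm run test:targeted" are placeholder names for this sketch.
set -euo pipefail

step=0
while IFS= read -r subtask; do
  step=$((step + 1))
  echo "== Step ${step}: ${subtask}"

  # JSONL events become the per-step checkpoint artifact instead of transcript paste-back.
  codex exec --json \
    "Implement only this subtask, then stop: ${subtask}" \
    > ".artifacts/step-${step}.jsonl"

  # Per-step verification: a failure retries this step only, not the whole session.
  if ! npm run test:targeted; then
    echo "Step ${step} failed verification; retry or escalate this step." >&2
    exit 1
  fi
done < plan.txt
```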

Context hygiene and memory discipline

  • Claude Code: use /compact with task-specific focus instructions, /clear between unrelated tasks, and CLAUDE.md memory for stable conventions. Slash commands Memory
  • Codex: keep durable guidance in global/project AGENTS.md; keep task state in artifacts (JSON, markdown logs) and resume only when required. AGENTS.md guide Codex non-interactive mode
  • For shell-heavy sessions, use compaction/output filters (rtk, repomix, files-to-prompt) before model calls; see the sketch below. rtk repomix files-to-prompt
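For example, a sketch of pre-call context packing with the adjacent tools named above (paths are placeholders; confirm current flag names with each tool's --help):

```bash
# Sketch: pack a narrow, deterministic slice of the repo before a model call
# instead of pasting raw find/cat output into context. Paths are placeholders.

# Deterministic subset selection with files-to-prompt (Claude-style XML output):
files-to-prompt src/payments -e py --cxml > .artifacts/context.xml

# One-shot repo packaging scoped to the affected module with repomix:
npx repomix --include "src/payments/**" -o .artifacts/repo-pack.md
```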

Tool-call governance and reproducibility

Verification loops that lift success rate

3. Comparison table (agent frameworks/patterns/tools)

Scoring: 0-3 (higher is better). For Human operator load and Time-to-adopt, 3 means lower burden / faster adoption.

| Pattern / Tool | Category | Token savings potential | Success-rate lift (complex tasks) | Reproducibility | Human operator load | Time-to-adopt | Works with Codex + Claude Code + LiteLLM | Maturity snapshot (2026-02-17 UTC) |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- | --- |
| Codex AGENTS.md + Skills | Skill system | 3 | 2 | 3 | 2 | 2 | Codex: native; Claude Code: pattern-compatible; LiteLLM: indirect | openai/codex: 60,847 stars, pushed 2026-02-17, Apache-2.0 (API) |
| Codex non-interactive (codex exec --json) + rules | Automation + checkpointing | 2 | 3 | 3 | 2 | 2 | Codex: native; Claude Code: analogous patterns; LiteLLM: indirect | Same repo snapshot as above |
| Claude Code subagents + hooks + permissions/plan mode | Delegation + guardrails | 3 | 3 | 2 | 2 | 2 | Claude Code: native; Codex: pattern-compatible; LiteLLM: gateway-compatible | Anthropic docs product (no public OSS repo metadata in cited sources) |
| LiteLLM gateway policies + cache + fallbacks + budgets | Routing/cost control plane | 3 | 2 | 3 | 2 | 2 | Codex: proxy path unverified; Claude Code: documented LLM gateway path; LiteLLM: native | BerriAI/litellm: 36,195 stars, pushed 2026-02-17, NOASSERTION (API) |
| Promptfoo eval gates | Verification + regression control | 1 | 3 | 3 | 2 | 3 | Works as external gate for all three | promptfoo/promptfoo: 10,488 stars, pushed 2026-02-17, MIT (API) |
| SWE-agent harness patterns + SWE-bench | Complex task benchmarking | 1 | 3 | 3 | 1 | 1 | Framework-level; usable with Codex/Claude/LiteLLM as model backends | SWE-agent/SWE-agent: 18,495 stars, pushed 2026-02-17, MIT (API) |
| OpenHands skills/microagents | Alternative skill ecosystem | 2 | 2 | 2 | 2 | 2 | Pattern-compatible; useful for cross-tool runbook design | OpenHands/OpenHands: 67,910 stars, pushed 2026-02-17, NOASSERTION (API) |
| rtk runtime output compaction | Terminal-output token compression | 3 | 1 | 2 | 3 | 3 | Claude Code: explicit hook workflow; Codex/LiteLLM: indirect | rtk-ai/rtk: 823 stars, pushed 2026-02-17, MIT (API) |

4. End-to-end playbook for complex issue resolution

A. Intake and triage (analysis-only)

  1. Run in non-destructive mode first.
  2. Produce a bounded plan with explicit success criteria and affected files.
# Codex: read-only triage with machine-readable events
codex exec --json \
  --sandbox read-only \
  "Analyze failing CI on this repo. Output: root-cause hypotheses, exact files, and minimal fix plan."
# Claude Code session pattern
1) /permissions -> set mode to plan
2) Ask for: root cause + minimal diff strategy + verification checklist
3) Switch to edit mode only after plan approval

Sources: Codex non-interactive mode Codex security Claude permissions

B. Skill/subagent bootstrap for incident class

Create one reusable runbook skill and one verifier subagent.

---
name: incident-ci-fix
description: Use when CI failed on a recent commit and you need minimal-risk root cause and fix; do not use for broad refactors.
---
### Inputs
- failing job logs
- target commit SHA

### Workflow
1. Reproduce failure with the exact command.
2. Isolate smallest failing unit.
3. Patch only files in affected module.
4. Re-run failing checks, then full gate.

### Definition of done
- failing check now passes
- no new lint/type/test regressions
- concise root-cause note with before/after evidence

# .claude/agents/verification-lead.md
---
name: verification-lead
description: Validate fixes with deterministic command order and reject unresolved risk.
---
- Run targeted failing test first, then full suite.
- Reject if patch increases scope without evidence.

Sources: Codex Skills Claude subagents

C. Enforce runtime guardrails

Use runtime policy, not prompt-only policy.

{
  "permissions": {
    "allow": [
      "Bash(rg *)",
      "Bash(git diff *)",
      "Bash(* --version)",
      "Task(verification-lead)"
    ],
    "deny": [
      "Bash(git push *)",
      "Read(./.env)",
      "Read(./secrets/**)"
    ],
    "defaultMode": "acceptEdits"
  }
}

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/block-dangerous.sh"
          }
        ]
      }
    ]
  }
}

Sources: Claude permissions Claude hooks
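The block-dangerous.sh script referenced above is not defined in the cited sources; a minimal sketch of one possible implementation, assuming the documented PreToolUse contract (JSON payload on stdin, exit code 2 blocks the call) and a deny list chosen purely for illustration:

```bash
#!/usr/bin/env bash
# .claude/hooks/block-dangerous.sh (illustrative sketch, not from the cited docs)
# PreToolUse hooks receive a JSON payload on stdin; for Bash tool calls the
# command text is under .tool_input.command. Exit 2 blocks the call and returns
# stderr to the model; exit 0 lets it proceed.
set -euo pipefail

payload=$(cat)
cmd=$(printf '%s' "$payload" | jq -r '.tool_input.command // empty')

# Deny patterns chosen for illustration; tune to your environment.
deny_patterns='rm -rf /|git push --force|curl .*\| *sh|DROP TABLE'

if printf '%s' "$cmd" | grep -Eq "$deny_patterns"; then
  echo "Blocked by policy hook: $cmd" >&2
  exit 2
fi
exit 0
```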

D. Put LiteLLM in front for route and spend control

# litellm_config.yaml
model_list:
  - model_name: claude_primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      tags: ["interactive", "paid"]

  - model_name: codex_api
    litellm_params:
      model: openai/gpt-5-codex
      api_key: os.environ/OPENAI_API_KEY
      tags: ["automation", "paid"]

  - model_name: local_fast
    litellm_params:
      model: ollama/qwen3-coder
      api_base: http://127.0.0.1:11434
      tags: ["interactive", "free"]

litellm_settings:
  cache: true
  cache_params:
    type: redis
    ttl: 900

router_settings:
  routing_strategy: cost-based-routing
  fallbacks:
    - {"claude_primary": ["codex_api", "local_fast"]}

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

litellm --config ./litellm_config.yaml

Sources: LiteLLM caching LiteLLM routing LiteLLM fallbacks Tag-based routing Claude Code LLM gateway
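To put Claude Code behind this proxy, the documented gateway path is to override the Anthropic endpoint; a minimal sketch, assuming LiteLLM's default local port (4000) and a proxy-issued virtual key (the key value is a placeholder):

```bash
# Sketch: run Claude Code against the LiteLLM proxy instead of calling the
# provider API directly. Port 4000 is LiteLLM's default; the key is a placeholder.
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-litellm-placeholder-virtual-key"
claude
```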

E. Verification gate before merge

# 1) deterministic test order
npm run test:targeted
npm run lint
npm run typecheck
npm test

# 2) agent eval gate (example)
npx promptfoo@latest eval -c promptfooconfig.yaml -o .artifacts/promptfoo.json
# fail fast if pass-rate under threshold
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' .artifacts/promptfoo.json)
awk -v p="$PASS_RATE" 'BEGIN { if (p < 95) exit 1 }'

Sources: Promptfoo CI/CD Promptfoo Action OpenAI evaluation best practices
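The promptfooconfig.yaml referenced above is not shown in the cited sources; a minimal sketch of one possible gate config, written here as a bootstrap script (model id, fixture path, and assertions are illustrative):

```bash
# Sketch: bootstrap a minimal promptfoo gate config. The model id, fixture path,
# and assertions are illustrative, not taken from the cited sources.
cat > promptfooconfig.yaml <<'YAML'
prompts:
  - "Diagnose this CI failure and state the root cause in one paragraph:\n{{failure_log}}"
providers:
  - anthropic:messages:claude-sonnet-4-6   # illustrative; match your routed model
tests:
  - vars:
      failure_log: file://fixtures/ci-failure-flaky-test.txt
    assert:
      - type: contains
        value: "root cause"
      - type: llm-rubric
        value: "Identifies the failing test and proposes a minimal, low-risk fix"
YAML
```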

F. Checkpoint output contract

Require the final agent output to include:
  • root cause (1 paragraph)
  • diff summary (file-level)
  • verification evidence (exact commands + pass/fail)
  • rollback plan

Use codex exec --output-schema for machine-readable handoff in automation. Codex non-interactive mode
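A minimal sketch of that handoff, assuming --output-schema accepts a JSON Schema file path (per the non-interactive mode docs cited above); the schema fields mirror the checkpoint contract, and the file name is illustrative:

```bash
# Sketch: define the handoff contract once, then require it on automated runs.
# handoff.schema.json is an illustrative file name; fields mirror the contract above.
cat > handoff.schema.json <<'JSON'
{
  "type": "object",
  "required": ["root_cause", "diff_summary", "verification_evidence", "rollback_plan"],
  "properties": {
    "root_cause": { "type": "string" },
    "diff_summary": { "type": "array", "items": { "type": "string" } },
    "verification_evidence": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["command", "passed"],
        "properties": {
          "command": { "type": "string" },
          "passed": { "type": "boolean" }
        }
      }
    },
    "rollback_plan": { "type": "string" }
  }
}
JSON

codex exec --json --output-schema handoff.schema.json \
  "Apply the approved fix plan and emit the final summary per the schema."
```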

5. Token-efficiency tactics for coding agents

  1. Use metadata-only skill discovery by default; avoid embedding full runbooks in system prompts. Codex Skills
  2. Write skill descriptions as routing boundaries with negative examples to reduce accidental skill activation. Shell + Skills + Compaction
  3. Use Claude subagents for separation of concerns so each delegated task has independent context. Claude subagents
  4. Compact aggressively during long sessions and clear context between unrelated tasks (/compact, /clear). Slash commands
  5. Track session spend continuously (/cost) and cap high-variance sessions early. Costs
  6. Set MCP output limits (MAX_MCP_OUTPUT_TOKENS) to prevent accidental token floods from large tool outputs; see the sketch after this list. Claude MCP
  7. Use JSONL events (codex exec --json) for logs/artifacts instead of pasting verbose transcripts back into context. Codex non-interactive mode
  8. Use gateway response caching and semantic caching for repeat calls in iterative debugging loops. LiteLLM caching
  9. Use budget routing and tagged routes to keep exploratory tasks on cheaper lanes and reserve premium models for high-impact steps. Budget routing Tag-based routing
  10. Add runtime output compression at the shell layer (rtk) when token waste comes from command output, and use adjacent tools (repomix, files-to-prompt, aider repo map) when waste comes from broad repo context. rtk repomix files-to-prompt aider repo map
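Items 6 and 7 as concrete shell-level settings (the token cap value and artifact path are illustrative):

```bash
# Sketch: cap MCP tool output for a Claude Code session (value is illustrative)
# and keep verbose non-interactive transcripts on disk instead of in context.
MAX_MCP_OUTPUT_TOKENS=25000 claude

codex exec --json "Re-run the failing check and summarize the outcome in two sentences." \
  > .artifacts/rerun-events.jsonl
```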

Explicit rtk adjacency pass (when each is better/worse)

  • Better than rtk: repomix for one-shot whole-repo prompt packaging; files-to-prompt for deterministic subset selection; aider repo map for persistent token-bounded codebase maps. repomix files-to-prompt aider repo map
  • Better than those: rtk when the dominant waste is repeated shell/test/log output in coding-agent loops, especially with Claude Code hook-based auto-rewrite. rtk

6. Anti-patterns and failure modes

  1. Monolithic mega-prompts replacing skills.
     • Failure: high baseline token burn and brittle behavior drift.
     • Fix: split into skill/subagent contracts with explicit triggers. Codex Skills Claude subagents

  2. Prompt-only safety controls.
     • Failure: agents still run risky commands when the prompt and runtime policy disagree.
     • Fix: enforce permissions/rules/hooks at runtime. Claude permissions Claude hooks Codex rules

  3. Unbounded MCP/server output.
     • Failure: context flooding and low-signal token spend.
     • Fix: set output caps and add pre-filtering commands. Claude MCP

  4. Cost control without telemetry.
     • Failure: cannot attribute improvements or regressions.
     • Fix: attach gateway logging callbacks and spend/budget reports. LiteLLM logging LiteLLM budgets + rate limits

  5. No regression loop for skills.
     • Failure: "works once" runbooks regress silently.
     • Fix: add small prompt sets + CI eval gates + benchmark slices for incident classes. Testing Agent Skills with Evals Promptfoo CI/CD SWE-bench

  6. Using bypass modes in shared environments.
     • Failure: avoidable security and compliance risk.
     • Fix: least-privilege defaults and managed policy restrictions. Codex security Claude permissions

7. Suggested default operating model (Codex/Claude Code + LiteLLM)

Role split

Minimal architecture

  1. Repository standards in AGENTS.md (Codex) and CLAUDE.md (Claude Code memory).
  2. Shared skill/subagent registry in-repo (.codex/skills/, .claude/agents/).
  3. Gateway policies in litellm_config.yaml with tagged routes, fallbacks, and budgets.
  4. CI gate: targeted tests + eval metrics + structured final summary artifact.

Compatibility notes

  • Claude Code has documented LLM gateway configuration including LiteLLM examples. Claude Code LLM gateway
  • Direct Codex CLI -> third-party gateway wiring is unverified in cited public docs for this report; validate in your environment before broad standardization.

8. Adopt now / try next / monitor

Adopt now

  1. Ship one incident-focused skill and one verifier subagent per top recurring incident class.
  2. Enable runtime guardrails (Codex rules / Claude permissions + hooks) before scaling agent autonomy.
  3. Put LiteLLM budgets + caching in front of automated runs and track weekly spend-by-route.
  4. Add CI eval gates with explicit pass thresholds and artifact outputs.

Try next

  1. Add tag-based routing (free, paid, critical) and fallback trees tied to reliability SLOs.
  2. Add rtk hook-first compression for shell-heavy loops; measure pre/post token deltas via session metrics.
  3. Add benchmark subsets (SWE-bench-style incident buckets) for quarterly runbook quality checks.

Monitor

  1. Skills over-triggering or under-triggering after prompt or naming changes.
  2. Budget routing misconfiguration that accidentally routes critical work to underpowered models.
  3. Security drift from permission bypasses or expanding network allowlists.

All links below were accessed on 2026-02-17 unless noted.

Global targeted queries executed

  • llm token efficiency dev tools
  • prompt caching coding agent
  • context compression terminal output coding agent
  • liteLLM routing cost control
  • claude code workflow token usage
  • codex agent productivity patterns
  • llm observability token cost tracing
  • agent runbook skill system

Required discovery sources

Primary implementation references

10. Merged Coding-Agent Iterative Tooling Shortlist

This section merges the former iterative/shortlist coding-agent tooling docs into Track 3.

Corpus snapshot (curated coding-agent iterative set):

  • Total repositories: 59

| Category | Count |
| --- | ---: |
| Coding Benchmark Harness | 12 |
| Evaluation & Regression | 18 |
| Experiment Orchestration | 12 |
| Gateway/Policy/Quality Gates | 8 |
| Prompt/Strategy Optimization | 9 |

Category Tables

Experiment Orchestration (12)

| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
| --- | ---: | --- | --- |
| anthropics/claude-code | 67372 | 2026-02-17 | Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands. |
| Aider-AI/aider | 40703 | 2026-02-17 | aider is AI pair programming in your terminal |
| wshobson/agents | 28788 | 2026-02-17 | Intelligent automation and multi-agent orchestration for Claude Code |
| thedotmack/claude-mem | 28783 | 2026-02-17 | A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future sessions. |
| musistudio/claude-code-router | 27974 | 2026-02-17 | Use Claude Code as the foundation for coding infrastructure, allowing you to decide how to interact with the model while enjoying updates from Anthropic. |
| BloopAI/vibe-kanban | 21376 | 2026-02-17 | Get 10X more out of Claude Code, Codex or any coding agent |
| SuperClaude-Org/SuperClaude_Framework | 20844 | 2026-02-17 | A configuration framework that enhances Claude Code with specialized commands, cognitive personas, and development methodologies. |
| winfunc/opcode | 20570 | 2026-02-17 | A powerful GUI app and Toolkit for Claude Code - Create custom agents, manage interactive Claude Code sessions, run secure background agents, and more. |
| farion1231/cc-switch | 18647 | 2026-02-17 | A cross-platform desktop All-in-One assistant tool for Claude Code, Codex, OpenCode & Gemini CLI. |
| ruvnet/claude-flow | 14152 | 2026-02-17 | The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, distributed swarm intelligence, RAG integration, and native Claude Code support via MCP protocol. Ranked #1 in agent-based frameworks. |
| ryoppippi/ccusage | 10780 | 2026-02-17 | A CLI tool for analyzing Claude Code/Codex CLI usage from local JSONL files. |
| danicat/tenkai | 11 | 2026-02-10 | Experimentation framework for coding agents |

Prompt/Strategy Optimization (9)

| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
| --- | ---: | --- | --- |
| microsoft/PromptWizard | 3767 | 2026-02-17 | Task-Aware Agent-driven Prompt Optimization Framework |
| zou-group/textgrad | 3368 | 2026-02-17 | TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients. Published in Nature. |
| SalesforceAIResearch/promptomatix | 909 | 2026-02-17 | An Automatic Prompt Optimization Framework for Large Language Models |
| shobrook/promptimal | 298 | 2026-01-15 | A very fast, very minimal prompt optimizer |
| austin-starks/Promptimizer | 210 | 2026-02-13 | An Automated AI-Powered Prompt Optimization Framework |
| CTLab-ITMO/CoolPrompt | 174 | 2026-02-16 | Automatic Prompt Optimization Framework |
| developzir/gepa-mcp | 46 | 2025-12-26 | MCP server integrating GEPA (Genetic-Evolutionary Prompt Architecture) for automatic prompt optimization with Claude Desktop |
| Bubobot-Team/mcp-prompt-optimizer | 21 | 2026-02-16 | Advanced MCP server providing cutting-edge prompt optimization tools with research-backed strategies |
| johnpsasser/claude-code-prompt-optimizer | 19 | 2026-02-12 | AI-powered prompt optimization hook for Claude Code. Transforms simple prompts into comprehensive, structured instructions using Claude Opus 4.1's advanced reasoning capabilities. |

Evaluation & Regression (18)

| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
| --- | ---: | --- | --- |
| langfuse/langfuse | 22009 | 2026-02-17 | Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23 |
| openai/evals | 17694 | 2026-02-17 | Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. |
| confident-ai/deepeval | 13697 | 2026-02-17 | The LLM Evaluation Framework |
| vibrantlabsai/ragas | 12636 | 2026-02-17 | Supercharge Your LLM Application Evaluations |
| EleutherAI/lm-evaluation-harness | 11439 | 2026-02-17 | A framework for few-shot evaluation of language models. |
| promptfoo/promptfoo | 10489 | 2026-02-17 | Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. |
| Arize-ai/phoenix | 8575 | 2026-02-17 | AI Observability & Evaluation |
| open-compass/opencompass | 6671 | 2026-02-17 | OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc) over 100+ datasets. |
| Giskard-AI/giskard-oss | 5117 | 2026-02-17 | Open-Source Evaluation & Testing library for LLM Agents |
| Helicone/helicone | 5116 | 2026-02-17 | Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 |
| openai/simple-evals | 4352 | 2026-02-17 | |
| Agenta-AI/agenta | 3849 | 2026-02-17 | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place. |
| truera/trulens | 3093 | 2026-02-17 | Evaluation and Tracking for LLM Experiments and AI Agents |
| uptrain-ai/uptrain | 2338 | 2026-02-17 | UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them. |
| UKGovernmentBEIS/inspect_ai | 1746 | 2026-02-17 | Inspect: A framework for large language model evaluations |
| langchain-ai/openevals | 901 | 2026-02-16 | Readymade evaluators for your LLM apps |
| langchain-ai/agentevals | 476 | 2026-02-12 | Readymade evaluators for agent trajectories |
| hidai25/eval-view | 44 | 2026-02-17 | Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI. |

Coding Benchmark Harness (12)

| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
| --- | ---: | --- | --- |
| SWE-bench/SWE-bench | 4303 | 2026-02-17 | SWE-bench: Can Language Models Resolve Real-world Github Issues? |
| SWE-agent/mini-swe-agent | 2913 | 2026-02-17 | The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo, but scores >74% on SWE-bench verified! |
| openai/SWELancer-Benchmark | 1439 | 2026-02-11 | This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" |
| augmentcode/augment-swebench-agent | 855 | 2026-02-12 | The #1 open-source SWE-bench Verified implementation |
| multi-swe-bench/multi-swe-bench | 321 | 2026-02-17 | Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving |
| JARVIS-Xs/SE-Agent | 232 | 2026-02-11 | SE-Agent is a self-evolution framework for LLM Code agents. It enables trajectory-level evolution to exchange information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. On SWE-bench Verified, it achieves SOTA performance |
| microsoft/SWE-bench-Live | 162 | 2026-02-17 | [NeurIPS 2025 D&B] SWE-bench Goes Live! |
| Aider-AI/aider-swe-bench | 79 | 2026-01-28 | Harness used to benchmark aider against SWE Bench benchmarks |
| amazon-science/SWE-PolyBench | 77 | 2026-01-30 | SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents |
| smallcloudai/refact-bench | 63 | 2026-01-22 | A benchmarking tool for evaluating AI coding assistants on real-world software engineering tasks from the SWE-Bench dataset. |
| SWE-bench/sb-cli | 56 | 2026-02-15 | Run SWE-bench evaluations remotely |
| logic-star-ai/insights | 49 | 2026-02-12 | We track and analyze the activity and performance of autonomous code agents in the wild |

Gateway/Policy/Quality Gates (8)

| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
| --- | ---: | --- | --- |
| BerriAI/litellm | 36197 | 2026-02-17 | Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM] |
| tensorzero/tensorzero | 10971 | 2026-02-17 | TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation. |
| Portkey-AI/gateway | 10630 | 2026-02-17 | A blazing fast AI Gateway with integrated guardrails. Route to 200+ LLMs, 50+ AI Guardrails with 1 fast & friendly API. |
| messkan/prompt-cache | 196 | 2026-02-13 | Cut LLM costs by up to 80% and unlock sub-millisecond responses with intelligent semantic caching. A drop-in, provider-agnostic LLM proxy written in Go with sub-millisecond response |
| traceloop/hub | 165 | 2026-02-16 | High-scale LLM gateway, written in Rust. OpenTelemetry-based observability included |
| Dimon94/cc-devflow | 97 | 2026-02-08 | One-command requirement development flow for Claude Code - Complete workflow system with sub-agents, quality gates, and intelligent automation |
| usetig/sage | 83 | 2026-02-16 | An LLM council that reviews your coding agent's every move |
| rigour-labs/rigour | 11 | 2026-02-17 | Local-first quality gate + fix-loop controller for AI coding agents (CLI + MCP). |