# Skills, MCP, and Ecosystem Playbook for Coding Agents (2026-02-17)

## Scope (Track 3 of 3)
This is Track 3/3 in the research set:
1. Binary tools (see docs/binary-tools-token-efficiency-landscape.md).
2. Cloud services (see docs/paid-cloud-services-token-efficiency-landscape.md).
3. Skills, MCP, and ecosystem patterns for reliable coding-agent operation.
## 1. Executive summary
- Treat skills as the primary token-control primitive for long-running coding agents: both Codex and Claude Code use metadata-first/progressive loading patterns, which keeps the baseline context smaller than dumping full runbooks into every prompt. (Codex Skills; Claude Code subagents)
- Run complex work through explicit decomposition loops with machine-readable checkpoints (`codex exec --json`, structured output schemas, and per-step verification hooks) so retries are targeted instead of full-context restarts. (Codex non-interactive mode; Codex security/sandbox)
- Use Claude Code's built-in controls (`/compact`, `/clear`, `/cost`, subagents, hooks, plan mode) to cap context growth, enforce permission boundaries, and keep sessions in a predictable budget envelope. (Slash commands; Costs; Permissions; Hooks)
- Put LiteLLM in front of automation and team workflows for cache/routing/fallback/budget policy; this gives one place to enforce spend and reliability across providers and local/remote model paths. (LiteLLM caching; LiteLLM routing; LiteLLM fallbacks; LiteLLM budget routing; LiteLLM budgets + rate limits)
- Adopt eval gates early (prompt/agent regression checks + trace-level grading) so skill changes and prompt scaffolds improve success rate instead of drifting into expensive prompt churn. (Promptfoo CI/CD; Promptfoo Action; OpenAI Agent Evals; Testing Agent Skills with Evals)
## 2. High-ROI skill and workflow patterns

### Skill packaging and trigger discipline

- Codex: use `AGENTS.md` layering plus skills (`SKILL.md`) with precise `description` boundaries. Keep descriptions explicit about when to use and when not to use. (AGENTS.md guide; Codex Skills)
- Claude Code: use project/user subagents for specialized flows (for example, `incident-triage`, `test-runner`) so each delegated task runs in a separate context window. (Claude subagents)
- Practical mechanism: metadata-first skill loading reduces default prompt payload; full instructions load only on trigger. (Codex Skills)
### Task decomposition for complex incidents

- Start in analysis-only mode (`plan` in Claude permissions, read-only in Codex sandbox) to produce a bounded plan before code edits. (Claude permissions; Codex security)
- Convert each step into a verifiable subtask with a required artifact (diff, failing test, passing test, or schema-validated JSON summary); a sketch of one such artifact follows this list. (Codex non-interactive mode)
- Run subagents/skills for narrow units rather than a single monolithic prompt. (Claude subagents; Codex Skills)
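A per-step checkpoint can be as small as one record per subtask. The field names below are illustrative only, not a Codex or Claude Code schema; the point is that each step emits something a retry loop can verify without replaying the whole transcript.

```json
{
  "step": "reproduce-failure",
  "command": "npm run test:targeted",
  "status": "fail",
  "artifact": "logs/repro-2026-02-17.txt",
  "next_action": "isolate the smallest failing unit"
}
```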
### Context hygiene and memory discipline

- Claude Code: use `/compact` with task-specific focus instructions (example below), `/clear` between unrelated tasks, and `CLAUDE.md` memory for stable conventions. (Slash commands; Memory)
- Codex: keep durable guidance in global/project `AGENTS.md`; keep task state in artifacts (JSON, markdown logs) and resume only when required. (AGENTS.md guide; Codex non-interactive mode)
- For shell-heavy sessions, use compaction/output filters (`rtk`, `repomix`, `files-to-prompt`) before model calls. (rtk; repomix; files-to-prompt)
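For the focused-compaction point, the instruction text passed to `/compact` is free-form; the wording below is purely illustrative.

```text
/compact Keep the failing test name, the current diff, and open TODOs; drop exploration output and resolved threads.
```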
### Tool-call governance and reproducibility

- Claude Code: enforce runtime policy with permissions + hooks (`PreToolUse`, `PostToolUse`), not only prompt text. (Permissions; Hooks)
- Codex: use rules (`prefix_rule`) and sandbox/approval combinations to control escalation paths. (Codex rules; Codex security)
- LiteLLM: enforce team/key/provider budgets and tagged routes to avoid runaway exploration loops; see the key sketch below. (LiteLLM budgets + rate limits; Tag-based routing; Budget routing)
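On the LiteLLM side, one concrete form of that governance is issuing budget-capped virtual keys for automation runs. The request below is a sketch against the proxy's documented `/key/generate` endpoint; the alias, limits, and metadata values are placeholders to adapt.

```bash
# Issue a budget-capped virtual key for automation runs (values are placeholders).
curl -s http://127.0.0.1:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "ci-automation",
        "max_budget": 25,
        "budget_duration": "7d",
        "rpm_limit": 60,
        "metadata": {"team": "incident-response"}
      }'
```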
### Verification loops that lift success rate

- Add eval gates in CI (`promptfoo eval`, thresholds, JSON artifacts) and keep a compact benchmark set for recurring incident classes. (Promptfoo CI/CD; Promptfoo Action)
- Use benchmark-style harnesses for regression visibility on agent changes (SWE-bench ecosystem, SWE-agent workflows). (SWE-bench; SWE-agent; SWE-bench leaderboards)
## 3. Comparison table (agent frameworks/patterns/tools)
Scoring: 0-3 (higher is better). For Human operator load and Time-to-adopt, 3 means lower burden / faster adoption.
| Pattern / Tool | Category | Token savings potential | Success-rate lift (complex tasks) | Reproducibility | Human operator load | Time-to-adopt | Works with Codex + Claude Code + LiteLLM | Maturity snapshot (2026-02-17 UTC) |
|---|---|---|---|---|---|---|---|---|
| Codex AGENTS.md + Skills | Skill system | 3 | 2 | 3 | 2 | 2 | Codex: native; Claude Code: pattern-compatible; LiteLLM: indirect | openai/codex: 60,847 stars, pushed 2026-02-17, Apache-2.0 (API) |
| Codex non-interactive (`codex exec --json`) + rules | Automation + checkpointing | 2 | 3 | 3 | 2 | 2 | Codex: native; Claude Code: analogous patterns; LiteLLM: indirect | Same repo snapshot as above |
| Claude Code subagents + hooks + permissions + plan mode | Delegation + guardrails | 3 | 3 | 2 | 2 | 2 | Claude Code: native; Codex: pattern-compatible; LiteLLM: gateway-compatible | Anthropic docs product (no public OSS repo metadata in cited sources) |
| LiteLLM gateway policies + cache + fallbacks + budgets | Routing/cost control plane | 3 | 2 | 3 | 2 | 2 | Codex: proxy path unverified; Claude Code: documented LLM gateway path; LiteLLM: native | BerriAI/litellm: 36,195 stars, pushed 2026-02-17, NOASSERTION (API) |
| Promptfoo eval gates | Verification + regression control | 1 | 3 | 3 | 2 | 3 | Works as external gate for all three | promptfoo/promptfoo: 10,488 stars, pushed 2026-02-17, MIT (API) |
| SWE-agent harness patterns + SWE-bench | Complex task benchmarking | 1 | 3 | 3 | 1 | 1 | Framework-level; usable with Codex/Claude/LiteLLM as model backends | SWE-agent/SWE-agent: 18,495 stars, pushed 2026-02-17, MIT (API) |
| OpenHands skills/microagents | Alternative skill ecosystem | 2 | 2 | 2 | 2 | 2 | Pattern-compatible; useful for cross-tool runbook design | OpenHands/OpenHands: 67,910 stars, pushed 2026-02-17, NOASSERTION (API) |
| rtk runtime output compaction | Terminal-output token compression | 3 | 1 | 2 | 3 | 3 | Claude Code: explicit hook workflow; Codex/LiteLLM: indirect | rtk-ai/rtk: 823 stars, pushed 2026-02-17, MIT (API) |
## 4. End-to-end playbook for complex issue resolution

### A. Intake and triage (analysis-only)
- Run in non-destructive mode first.
- Produce a bounded plan with explicit success criteria and affected files.
```bash
# Codex: read-only triage with machine-readable events
codex exec --json \
  --sandbox read-only \
  "Analyze failing CI on this repo. Output: root-cause hypotheses, exact files, and minimal fix plan."
```

```text
# Claude Code session pattern
1) /permissions -> set mode to plan
2) Ask for: root cause + minimal diff strategy + verification checklist
3) Switch to edit mode only after plan approval
```

Sources: Codex non-interactive mode; Codex security; Claude permissions
### B. Skill/subagent bootstrap for incident class
Create one reusable runbook skill and one verifier subagent.
```markdown
---
name: incident-ci-fix
description: Use when CI failed on a recent commit and you need minimal-risk root cause and fix; do not use for broad refactors.
---

### Inputs
- failing job logs
- target commit SHA

### Workflow
1. Reproduce failure with the exact command.
2. Isolate smallest failing unit.
3. Patch only files in affected module.
4. Re-run failing checks, then full gate.

### Definition of done
- failing check now passes
- no new lint/type/test regressions
- concise root-cause note with before/after evidence
```

```markdown
# .claude/agents/verification-lead.md
---
name: verification-lead
description: Validate fixes with deterministic command order and reject unresolved risk.
---
- Run targeted failing test first, then full suite.
- Reject if patch increases scope without evidence.
```

Sources: Codex Skills; Claude subagents
### C. Enforce runtime guardrails
Use runtime policy, not prompt-only policy.
```json
{
  "permissions": {
    "allow": [
      "Bash(rg *)",
      "Bash(git diff *)",
      "Bash(* --version)",
      "Task(verification-lead)"
    ],
    "deny": [
      "Bash(git push *)",
      "Read(./.env)",
      "Read(./secrets/**)"
    ],
    "defaultMode": "acceptEdits"
  }
}
```

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/block-dangerous.sh"
          }
        ]
      }
    ]
  }
}
```
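The `block-dangerous.sh` command referenced above is not shown in the cited docs; the script below is a minimal sketch assuming the documented hook contract (the pending tool call arrives as JSON on stdin, and exit code 2 blocks the call with stderr fed back to the model). The deny patterns are placeholders to tune per repository.

```bash
#!/usr/bin/env bash
# .claude/hooks/block-dangerous.sh (illustrative sketch)
set -euo pipefail

# For Bash tool calls, the shell command sits under .tool_input.command.
cmd="$(jq -r '.tool_input.command // empty')"

case "$cmd" in
  *"rm -rf /"*|*"git push --force"*|*"curl "*"| sh"*)
    # Exit code 2 blocks the tool call; stderr is surfaced to the model.
    echo "Blocked by policy: dangerous command pattern: $cmd" >&2
    exit 2
    ;;
esac

exit 0
```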
Sources: Claude permissions; Claude hooks
### D. Put LiteLLM in front for route and spend control
```yaml
# litellm_config.yaml
model_list:
  - model_name: claude_primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      tags: ["interactive", "paid"]
  - model_name: codex_api
    litellm_params:
      model: openai/gpt-5-codex
      api_key: os.environ/OPENAI_API_KEY
      tags: ["automation", "paid"]
  - model_name: local_fast
    litellm_params:
      model: ollama/qwen3-coder
      api_base: http://127.0.0.1:11434
      tags: ["interactive", "free"]

litellm_settings:
  cache: true
  cache_params:
    type: redis
    ttl: 900

router_settings:
  routing_strategy: cost-based-routing
  fallbacks:
    - {"claude_primary": ["codex_api", "local_fast"]}

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
```
```bash
litellm --config ./litellm_config.yaml
```
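With the proxy running, Claude Code can be pointed at it through the LLM-gateway environment variables in the cited docs; the endpoint, port, and key below are placeholders, and the exact base-URL path for the LiteLLM Anthropic route should be confirmed against your proxy version.

```bash
# Route Claude Code traffic through the LiteLLM proxy (values are placeholders).
export ANTHROPIC_BASE_URL="http://127.0.0.1:4000"
export ANTHROPIC_AUTH_TOKEN="sk-litellm-virtual-key"   # key issued by the proxy
claude   # subsequent requests inherit the routing/cache/budget policies above
```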
Sources: LiteLLM caching; LiteLLM routing; LiteLLM fallbacks; Tag-based routing; Claude Code LLM gateway
### E. Verification gate before merge
```bash
# 1) deterministic test order
npm run test:targeted
npm run lint
npm run typecheck
npm test

# 2) agent eval gate (example)
npx promptfoo@latest eval -c promptfooconfig.yaml -o .artifacts/promptfoo.json

# fail fast if pass-rate under threshold
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' .artifacts/promptfoo.json)
awk -v p="$PASS_RATE" 'BEGIN { if (p < 95) exit 1 }'
```
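The gate assumes a `promptfooconfig.yaml` in the repository. A minimal sketch is below; the prompt file, provider ID, fixture paths, and assertions are placeholders to adapt to your incident classes rather than a prescribed layout.

```yaml
# promptfooconfig.yaml (illustrative sketch)
prompts:
  - file://prompts/incident-ci-fix.txt
providers:
  - id: anthropic:messages:claude-sonnet-4-6
tests:
  - vars:
      failing_log: file://fixtures/ci-failure-missing-dep.log
    assert:
      - type: contains
        value: "Root cause"
      - type: llm-rubric
        value: Proposes a minimal fix scoped to the affected module, with verification commands.
```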
Sources: Promptfoo CI/CD; Promptfoo Action; OpenAI evaluation best practices
### F. Checkpoint output contract

Require the final agent output to include:

- root cause (1 paragraph)
- diff summary (file-level)
- verification evidence (exact commands + pass/fail)
- rollback plan
Use `codex exec --output-schema` for machine-readable handoff in automation; a schema sketch follows. (Codex non-interactive mode)
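A sketch of that contract expressed as a JSON Schema, assuming `--output-schema` accepts a standard JSON Schema document as the cited non-interactive docs describe; the field names simply mirror the bullet list above.

```json
{
  "type": "object",
  "required": ["root_cause", "diff_summary", "verification_evidence", "rollback_plan"],
  "properties": {
    "root_cause": { "type": "string", "description": "One-paragraph root cause" },
    "diff_summary": {
      "type": "array",
      "items": { "type": "string" },
      "description": "File-level summary of the patch"
    },
    "verification_evidence": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["command", "result"],
        "properties": {
          "command": { "type": "string" },
          "result": { "type": "string", "enum": ["pass", "fail"] }
        }
      }
    },
    "rollback_plan": { "type": "string" }
  }
}
```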
## 5. Token-efficiency tactics for coding agents
- Use metadata-only skill discovery by default; avoid embedding full runbooks in system prompts. (Codex Skills)
- Write skill descriptions as routing boundaries with negative examples to reduce accidental skill activation. (Shell + Skills + Compaction)
- Use Claude subagents for separation of concerns so each delegated task has independent context. (Claude subagents)
- Compact aggressively during long sessions and clear context between unrelated tasks (`/compact`, `/clear`). (Slash commands)
- Track session spend continuously (`/cost`) and cap high-variance sessions early. (Costs)
- Set MCP output limits (`MAX_MCP_OUTPUT_TOKENS`) to prevent accidental token floods from large tool outputs; see the snippet after this list. (Claude MCP)
- Use JSONL events (`codex exec --json`) for logs/artifacts instead of pasting verbose transcripts back into context. (Codex non-interactive mode)
- Use gateway response caching and semantic caching for repeat calls in iterative debugging loops. (LiteLLM caching)
- Use budget routing and tagged routes to keep exploratory tasks on cheaper lanes and reserve premium models for high-impact steps. (Budget routing; Tag-based routing)
- Add runtime output compression at the shell layer (`rtk`) when token waste comes from command output, and use adjacent tools (`repomix`, `files-to-prompt`, `aider` repo map) when waste comes from broad repo context. (rtk; repomix; files-to-prompt; aider repo map)
### Explicit rtk adjacency pass (when each is better/worse)

- Better than `rtk`: `repomix` for one-shot whole-repo prompt packaging; `files-to-prompt` for deterministic subset selection; `aider` repo map for persistent token-bounded codebase maps. (repomix; files-to-prompt; aider repo map)
- Better than those: `rtk` when the dominant waste is repeated shell/test/log output in coding-agent loops, especially with Claude Code hook-based auto-rewrite. (rtk)
## 6. Anti-patterns and failure modes

- Monolithic mega-prompts replacing skills.
  - Failure: high baseline token burn and brittle behavior drift.
  - Fix: split into skill/subagent contracts with explicit triggers. (Codex Skills; Claude subagents)
- Prompt-only safety controls.
  - Failure: agents still run risky commands when the prompt and runtime policy disagree.
  - Fix: enforce permissions/rules/hooks at runtime. (Claude permissions; Claude hooks; Codex rules)
- Unbounded MCP/server output.
  - Failure: context flooding and low-signal token spend.
  - Fix: set output caps and add pre-filtering commands. (Claude MCP)
- Cost control without telemetry.
  - Failure: cannot attribute improvements or regressions.
  - Fix: attach gateway logging callbacks and spend/budget reports. (LiteLLM logging; LiteLLM budgets + rate limits)
- No regression loop for skills.
  - Failure: "works once" runbooks regress silently.
  - Fix: add small prompt sets + CI eval gates + benchmark slices for incident classes. (Testing Agent Skills with Evals; Promptfoo CI/CD; SWE-bench)
- Using bypass modes in shared environments.
  - Failure: avoidable security and compliance risk.
  - Fix: least-privilege defaults and managed policy restrictions. (Codex security; Claude permissions)
## 7. Suggested default operating model (Codex/Claude Code + LiteLLM)

### Role split

- Codex: high-throughput automation, JSON checkpoint generation, reproducible batch runs. (Codex non-interactive mode)
- Claude Code: deep interactive debugging/refactor loops with subagents, hooks, and plan/permission controls. (Claude subagents; Hooks; Permissions)
- LiteLLM: shared routing/cache/fallback/budget policy layer for team workflows and API-driven automation. (LiteLLM routing; LiteLLM caching; LiteLLM budget routing)
### Minimal architecture

- Repository standards in `AGENTS.md` (Codex) and `CLAUDE.md` (Claude Code memory).
- Shared skill/subagent registry in-repo (`.codex/skills/`, `.claude/agents/`).
- Gateway policies in `litellm_config.yaml` with tagged routes, fallbacks, and budgets.
- CI gate: targeted tests + eval metrics + structured final summary artifact (layout sketch below).
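Put together, the footprint is small. The tree below is a sketch: the file names match the components referenced in this report, while the exact paths are conventions to adapt rather than requirements of any tool.

```text
repo/
├── AGENTS.md                         # Codex durable guidance
├── CLAUDE.md                         # Claude Code memory / conventions
├── .codex/skills/incident-ci-fix/SKILL.md
├── .claude/agents/verification-lead.md
├── .claude/settings.json             # permissions + hooks wiring
├── .claude/hooks/block-dangerous.sh
├── litellm_config.yaml               # routes, cache, fallbacks, budgets
└── promptfooconfig.yaml              # CI eval gate
```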
### Compatibility notes

- Claude Code has documented LLM gateway configuration, including LiteLLM examples. (Claude Code LLM gateway)
- Direct Codex CLI -> third-party gateway wiring is unverified in the cited public docs for this report; validate in your environment before broad standardization.
## 8. Adopt now / try next / monitor

### Adopt now
- Ship one incident-focused skill and one verifier subagent per top recurring incident class.
- Enable runtime guardrails (Codex rules / Claude permissions + hooks) before scaling agent autonomy.
- Put LiteLLM budgets + caching in front of automated runs and track weekly spend-by-route.
- Add CI eval gates with explicit pass thresholds and artifact outputs.
### Try next

- Add tag-based routing (`free`, `paid`, `critical`) and fallback trees tied to reliability SLOs.
- Add `rtk` hook-first compression for shell-heavy loops; measure pre/post token deltas via session metrics.
- Add benchmark subsets (SWE-bench-style incident buckets) for quarterly runbook quality checks.
### Monitor
- Skills over-triggering or under-triggering after prompt or naming changes.
- Budget routing misconfiguration that accidentally routes critical work to underpowered models.
- Security drift from permission bypasses or expanding network allowlists.
## 9. Appendix (search log + links)
All links below were accessed on 2026-02-17 unless noted.
### Global targeted queries executed

- llm token efficiency dev tools
- prompt caching coding agent
- context compression terminal output coding agent
- liteLLM routing cost control
- claude code workflow token usage
- codex agent productivity patterns
- llm observability token cost tracing
- agent runbook skill system
### Required discovery sources

- Hacker News / Show HN:
  - Show HN: File-based sub-agents for Codex CLI
  - Show HN: 2d platformer built with Codex skills
  - HN search: Claude Code hooks
  - HN search: Codex skills
- Lobsters:
  - Claude Skills may be a bigger deal than MCP
  - Claude Skills, anywhere: making them first-class in Codex CLI
  - How I Use Every Claude Code Feature
- GitHub Trending and Search:
  - GitHub Trending
  - GitHub search: codex skills
  - GitHub search: claude code hooks
  - GitHub search: litellm routing
- Reddit (required subreddits):
  - r/LocalLLaMA search: coding agents
  - r/LLMDevs search: LiteLLM
  - r/MachineLearning search: SWE-agent
  - r/OpenAI search: Codex CLI
  - r/ClaudeAI search: skills
- Awesome lists:
  - Awesome-LLMOps (tensorchord)
  - awesome-claude-code
  - awesome-ai-agents
  - Awesome-LLM-RAG
### Primary implementation references

- Codex:
  - Codex AGENTS.md guide
  - Codex Skills
  - Codex non-interactive mode
  - Codex rules
  - Codex security/sandbox
  - Codex MCP
- Claude Code:
  - Subagents
  - Hooks
  - Slash commands
  - Memory
  - Costs
  - Permissions
  - LLM gateway configuration
- LiteLLM:
  - Caching
  - Routing
  - Fallbacks
  - Tag-based routing
  - Budget routing
  - Budgets + rate limits
  - Logging
- Verification and benchmarks:
  - Promptfoo CI/CD
  - Promptfoo Action
  - OpenAI Agent Evals
  - OpenAI evaluation best practices
  - Testing Agent Skills Systematically with Evals
  - SWE-bench
  - SWE-agent
  - SWE-bench leaderboards
- rtk and adjacent token-efficiency tools:
  - rtk
  - repomix
  - files-to-prompt
  - aider repo map
## 10. Merged Coding-Agent Iterative Tooling Shortlist
This section merges the former iterative/shortlist coding-agent tooling docs into Track 3.
Corpus snapshot (curated coding-agent iterative set):

- Total repositories: 59

| Category | Count |
| --- | ---: |
| Coding Benchmark Harness | 12 |
| Evaluation & Regression | 18 |
| Experiment Orchestration | 12 |
| Gateway/Policy/Quality Gates | 8 |
| Prompt/Strategy Optimization | 9 |
### Category Tables

#### Experiment Orchestration (12)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| anthropics/claude-code | 67372 | 2026-02-17 | Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands. |
| Aider-AI/aider | 40703 | 2026-02-17 | aider is AI pair programming in your terminal |
| wshobson/agents | 28788 | 2026-02-17 | Intelligent automation and multi-agent orchestration for Claude Code |
| thedotmack/claude-mem | 28783 | 2026-02-17 | A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future sessions. |
| musistudio/claude-code-router | 27974 | 2026-02-17 | Use Claude Code as the foundation for coding infrastructure, allowing you to decide how to interact with the model while enjoying updates from Anthropic. |
| BloopAI/vibe-kanban | 21376 | 2026-02-17 | Get 10X more out of Claude Code, Codex or any coding agent |
| SuperClaude-Org/SuperClaude_Framework | 20844 | 2026-02-17 | A configuration framework that enhances Claude Code with specialized commands, cognitive personas, and development methodologies. |
| winfunc/opcode | 20570 | 2026-02-17 | A powerful GUI app and Toolkit for Claude Code - Create custom agents, manage interactive Claude Code sessions, run secure background agents, and more. |
| farion1231/cc-switch | 18647 | 2026-02-17 | A cross-platform desktop All-in-One assistant tool for Claude Code, Codex, OpenCode & Gemini CLI. |
| ruvnet/claude-flow | 14152 | 2026-02-17 | The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, distributed swarm intelligence, RAG integration, and native Claude Code support via MCP protocol. Ranked #1 in agent-based frameworks. |
| ryoppippi/ccusage | 10780 | 2026-02-17 | A CLI tool for analyzing Claude Code/Codex CLI usage from local JSONL files. |
| danicat/tenkai | 11 | 2026-02-10 | Experimentation framework for coding agents |
#### Prompt/Strategy Optimization (9)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| microsoft/PromptWizard | 3767 | 2026-02-17 | Task-Aware Agent-driven Prompt Optimization Framework |
| zou-group/textgrad | 3368 | 2026-02-17 | TextGrad: Automatic ''Differentiation'' via Text -- using large language models to backpropagate textual gradients. Published in Nature. |
| SalesforceAIResearch/promptomatix | 909 | 2026-02-17 | An Automatic Prompt Optimization Framework for Large Language Models |
| shobrook/promptimal | 298 | 2026-01-15 | A very fast, very minimal prompt optimizer |
| austin-starks/Promptimizer | 210 | 2026-02-13 | An Automated AI-Powered Prompt Optimization Framework |
| CTLab-ITMO/CoolPrompt | 174 | 2026-02-16 | Automatic Prompt Optimization Framework |
| developzir/gepa-mcp | 46 | 2025-12-26 | MCP server integrating GEPA (Genetic-Evolutionary Prompt Architecture) for automatic prompt optimization with Claude Desktop |
| Bubobot-Team/mcp-prompt-optimizer | 21 | 2026-02-16 | Advanced MCP server providing cutting-edge prompt optimization tools with research-backed strategies |
| johnpsasser/claude-code-prompt-optimizer | 19 | 2026-02-12 | AI-powered prompt optimization hook for Claude Code. Transforms simple prompts into comprehensive, structured instructions using Claude Opus 4.1's advanced reasoning capabilities. |
#### Evaluation & Regression (18)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| langfuse/langfuse | 22009 | 2026-02-17 | Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23 |
| openai/evals | 17694 | 2026-02-17 | Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. |
| confident-ai/deepeval | 13697 | 2026-02-17 | The LLM Evaluation Framework |
| vibrantlabsai/ragas | 12636 | 2026-02-17 | Supercharge Your LLM Application Evaluations |
| EleutherAI/lm-evaluation-harness | 11439 | 2026-02-17 | A framework for few-shot evaluation of language models. |
| promptfoo/promptfoo | 10489 | 2026-02-17 | Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. |
| Arize-ai/phoenix | 8575 | 2026-02-17 | AI Observability & Evaluation |
| open-compass/opencompass | 6671 | 2026-02-17 | OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets. |
| Giskard-AI/giskard-oss | 5117 | 2026-02-17 | Open-Source Evaluation & Testing library for LLM Agents |
| Helicone/helicone | 5116 | 2026-02-17 | Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 |
| openai/simple-evals | 4352 | 2026-02-17 | |
| Agenta-AI/agenta | 3849 | 2026-02-17 | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place. |
| truera/trulens | 3093 | 2026-02-17 | Evaluation and Tracking for LLM Experiments and AI Agents |
| uptrain-ai/uptrain | 2338 | 2026-02-17 | UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them. |
| UKGovernmentBEIS/inspect_ai | 1746 | 2026-02-17 | Inspect: A framework for large language model evaluations |
| langchain-ai/openevals | 901 | 2026-02-16 | Readymade evaluators for your LLM apps |
| langchain-ai/agentevals | 476 | 2026-02-12 | Readymade evaluators for agent trajectories |
| hidai25/eval-view | 44 | 2026-02-17 | Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI. |
#### Coding Benchmark Harness (12)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| SWE-bench/SWE-bench | 4303 | 2026-02-17 | SWE-bench: Can Language Models Resolve Real-world Github Issues? |
| SWE-agent/mini-swe-agent | 2913 | 2026-02-17 | The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo, but scores >74% on SWE-bench verified! |
| openai/SWELancer-Benchmark | 1439 | 2026-02-11 | This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" |
| augmentcode/augment-swebench-agent | 855 | 2026-02-12 | The #1 open-source SWE-bench Verified implementation |
| multi-swe-bench/multi-swe-bench | 321 | 2026-02-17 | Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving |
| JARVIS-Xs/SE-Agent | 232 | 2026-02-11 | SE-Agent is a self-evolution framework for LLM Code agents. It enables trajectory-level evolution to exchange information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. On SWE-bench Verified, it achieves SOTA performance |
| microsoft/SWE-bench-Live | 162 | 2026-02-17 | [NeurIPS 2025 D&B] SWE-bench Goes Live! |
| Aider-AI/aider-swe-bench | 79 | 2026-01-28 | Harness used to benchmark aider against SWE Bench benchmarks |
| amazon-science/SWE-PolyBench | 77 | 2026-01-30 | SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents |
| smallcloudai/refact-bench | 63 | 2026-01-22 | A benchmarking tool for evaluating AI coding assistants on real-world software engineering tasks from the SWE-Bench dataset. |
| SWE-bench/sb-cli | 56 | 2026-02-15 | Run SWE-bench evaluations remotely |
| logic-star-ai/insights | 49 | 2026-02-12 | We track and analyze the activity and performance of autonomous code agents in the wild |
#### Gateway/Policy/Quality Gates (8)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| BerriAI/litellm | 36197 | 2026-02-17 | Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM] |
| tensorzero/tensorzero | 10971 | 2026-02-17 | TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation. |
| Portkey-AI/gateway | 10630 | 2026-02-17 | A blazing fast AI Gateway with integrated guardrails. Route to 200+ LLMs, 50+ AI Guardrails with 1 fast & friendly API. |
| messkan/prompt-cache | 196 | 2026-02-13 | Cut LLM costs by up to 80% and unlock sub-millisecond responses with intelligent semantic caching. A drop-in, provider-agnostic LLM proxy written in Go with sub-millisecond response |
| traceloop/hub | 165 | 2026-02-16 | High-scale LLM gateway, written in Rust. OpenTelemetry-based observability included |
| Dimon94/cc-devflow | 97 | 2026-02-08 | One-command requirement development flow for Claude Code - Complete workflow system with sub-agents, quality gates, and intelligent automation |
| usetig/sage | 83 | 2026-02-16 | An LLM council that reviews your coding agent's every move |
| rigour-labs/rigour | 11 | 2026-02-17 | Local-first quality gate + fix-loop controller for AI coding agents (CLI + MCP). |