# Skills, MCP, and Ecosystem Playbook for Coding Agents (2026-02-17)

## Scope (Track 3 of 3)
This is Track 3/3 in the research set:
1. Binary tools (see docs/binary-tools-token-efficiency-landscape.md).
2. Cloud services (see docs/paid-cloud-services-token-efficiency-landscape.md).
3. Skills, MCP, and ecosystem patterns for reliable coding-agent operation.
## 1. Executive summary
- Treat skills as the primary token-control primitive for long-running coding agents: both Codex and Claude Code use metadata-first/progressive loading patterns, which keeps the baseline context smaller than dumping full runbooks into every prompt. (Codex Skills; Claude Code subagents)
- Run complex work through explicit decomposition loops with machine-readable checkpoints (`codex exec --json`, structured output schemas, and per-step verification hooks) so retries are targeted instead of full-context restarts. (Codex non-interactive mode; Codex security/sandbox)
- Use Claude Code's built-in controls (`/compact`, `/clear`, `/cost`, subagents, hooks, plan mode) to cap context growth, enforce permission boundaries, and keep sessions in a predictable budget envelope. (Slash commands; Costs; Permissions; Hooks)
- Put LiteLLM in front of automation and team workflows for cache/routing/fallback/budget policy; this gives one place to enforce spend and reliability across providers and local/remote model paths. (LiteLLM caching; LiteLLM routing; LiteLLM fallbacks; LiteLLM budget routing; LiteLLM budgets + rate limits)
- Adopt eval gates early (prompt/agent regression checks + trace-level grading) so skill changes and prompt scaffolds improve success rate instead of drifting into expensive prompt churn. (Promptfoo CI/CD; Promptfoo Action; OpenAI Agent Evals; Testing Agent Skills with Evals)
## 2. High-ROI skill and workflow patterns

### Skill packaging and trigger discipline

- Codex: use `AGENTS.md` layering plus skills (`SKILL.md`) with precise `description` boundaries. Keep descriptions explicit about when to use and when not to use. (AGENTS.md guide; Codex Skills)
- Claude Code: use project/user subagents for specialized flows (for example, `incident-triage`, `test-runner`) so each delegated task runs in a separate context window. (Claude subagents)
- Practical mechanism: metadata-first skill loading reduces default prompt payload; full instructions load only on trigger. (Codex Skills)
### Task decomposition for complex incidents

- Start in analysis-only mode (`plan` in Claude permissions, read-only in Codex sandbox) to produce a bounded plan before code edits. (Claude permissions; Codex security)
- Convert each step into a verifiable subtask with a required artifact (diff, failing test, passing test, or schema-validated JSON summary); a sketch of one such artifact follows this list. (Codex non-interactive mode)
- Run subagents/skills for narrow units rather than a single monolithic prompt. (Claude subagents; Codex Skills)
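A per-step checkpoint can be as small as one record per subtask. The field names below are illustrative only, not a Codex or Claude Code schema; the point is that each step emits something a retry loop can verify without replaying the whole transcript.

```json
{
  "step": "reproduce-failure",
  "command": "npm run test:targeted",
  "status": "fail",
  "artifact": "logs/repro-2026-02-17.txt",
  "next_action": "isolate the smallest failing unit"
}
```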
### Context hygiene and memory discipline

- Claude Code: use `/compact` with task-specific focus instructions (example below), `/clear` between unrelated tasks, and `CLAUDE.md` memory for stable conventions. (Slash commands; Memory)
- Codex: keep durable guidance in global/project `AGENTS.md`; keep task state in artifacts (JSON, markdown logs) and resume only when required. (AGENTS.md guide; Codex non-interactive mode)
- For shell-heavy sessions, use compaction/output filters (`rtk`, `repomix`, `files-to-prompt`) before model calls. (rtk; repomix; files-to-prompt)
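For the focused-compaction point, the instruction text passed to `/compact` is free-form; the wording below is purely illustrative.

```text
/compact Keep the failing test name, the current diff, and open TODOs; drop exploration output and resolved threads.
```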
### Tool-call governance and reproducibility

- Claude Code: enforce runtime policy with permissions + hooks (`PreToolUse`, `PostToolUse`), not only prompt text. (Permissions; Hooks)
- Codex: use rules (`prefix_rule`) and sandbox/approval combinations to control escalation paths. (Codex rules; Codex security)
- LiteLLM: enforce team/key/provider budgets and tagged routes to avoid runaway exploration loops; see the key sketch below. (LiteLLM budgets + rate limits; Tag-based routing; Budget routing)
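On the LiteLLM side, one concrete form of that governance is issuing budget-capped virtual keys for automation runs. The request below is a sketch against the proxy's documented `/key/generate` endpoint; the alias, limits, and metadata values are placeholders to adapt.

```bash
# Issue a budget-capped virtual key for automation runs (values are placeholders).
curl -s http://127.0.0.1:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "ci-automation",
        "max_budget": 25,
        "budget_duration": "7d",
        "rpm_limit": 60,
        "metadata": {"team": "incident-response"}
      }'
```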
### Verification loops that lift success rate

- Add eval gates in CI (`promptfoo eval`, thresholds, JSON artifacts) and keep a compact benchmark set for recurring incident classes. (Promptfoo CI/CD; Promptfoo Action)
- Use benchmark-style harnesses for regression visibility on agent changes (SWE-bench ecosystem, SWE-agent workflows). (SWE-bench; SWE-agent; SWE-bench leaderboards)
## 3. Comparison table (agent frameworks/patterns/tools)
Scoring: 0-3 (higher is better). For Human operator load and Time-to-adopt, 3 means lower burden / faster adoption.
| Pattern / Tool | Category | Token savings potential | Success-rate lift (complex tasks) | Reproducibility | Human operator load | Time-to-adopt | Works with Codex + Claude Code + LiteLLM | Maturity snapshot (2026-02-17 UTC) |
|---|---|---|---|---|---|---|---|---|
| Codex AGENTS.md + Skills | Skill system | 3 | 2 | 3 | 2 | 2 | Codex: native; Claude Code: pattern-compatible; LiteLLM: indirect | openai/codex: 60,847 stars, pushed 2026-02-17, Apache-2.0 (API) |
| Codex non-interactive (`codex exec --json`) + rules | Automation + checkpointing | 2 | 3 | 3 | 2 | 2 | Codex: native; Claude Code: analogous patterns; LiteLLM: indirect | Same repo snapshot as above |
| Claude Code subagents + hooks + permissions + plan mode | Delegation + guardrails | 3 | 3 | 2 | 2 | 2 | Claude Code: native; Codex: pattern-compatible; LiteLLM: gateway-compatible | Anthropic docs product (no public OSS repo metadata in cited sources) |
| LiteLLM gateway policies + cache + fallbacks + budgets | Routing/cost control plane | 3 | 2 | 3 | 2 | 2 | Codex: proxy path unverified; Claude Code: documented LLM gateway path; LiteLLM: native | BerriAI/litellm: 36,195 stars, pushed 2026-02-17, NOASSERTION (API) |
| Promptfoo eval gates | Verification + regression control | 1 | 3 | 3 | 2 | 3 | Works as external gate for all three | promptfoo/promptfoo: 10,488 stars, pushed 2026-02-17, MIT (API) |
| SWE-agent harness patterns + SWE-bench | Complex task benchmarking | 1 | 3 | 3 | 1 | 1 | Framework-level; usable with Codex/Claude/LiteLLM as model backends | SWE-agent/SWE-agent: 18,495 stars, pushed 2026-02-17, MIT (API) |
| OpenHands skills/microagents | Alternative skill ecosystem | 2 | 2 | 2 | 2 | 2 | Pattern-compatible; useful for cross-tool runbook design | OpenHands/OpenHands: 67,910 stars, pushed 2026-02-17, NOASSERTION (API) |
| rtk runtime output compaction | Terminal-output token compression | 3 | 1 | 2 | 3 | 3 | Claude Code: explicit hook workflow; Codex/LiteLLM: indirect | rtk-ai/rtk: 823 stars, pushed 2026-02-17, MIT (API) |
## 4. End-to-end playbook for complex issue resolution

### A. Intake and triage (analysis-only)
- Run in non-destructive mode first.
- Produce a bounded plan with explicit success criteria and affected files.
```bash
# Codex: read-only triage with machine-readable events
codex exec --json \
  --sandbox read-only \
  "Analyze failing CI on this repo. Output: root-cause hypotheses, exact files, and minimal fix plan."
```

```text
# Claude Code session pattern
1) /permissions -> set mode to plan
2) Ask for: root cause + minimal diff strategy + verification checklist
3) Switch to edit mode only after plan approval
```

Sources: Codex non-interactive mode; Codex security; Claude permissions
### B. Skill/subagent bootstrap for incident class
Create one reusable runbook skill and one verifier subagent.
```markdown
---
name: incident-ci-fix
description: Use when CI failed on a recent commit and you need minimal-risk root cause and fix; do not use for broad refactors.
---

### Inputs
- failing job logs
- target commit SHA

### Workflow
1. Reproduce failure with the exact command.
2. Isolate smallest failing unit.
3. Patch only files in affected module.
4. Re-run failing checks, then full gate.

### Definition of done
- failing check now passes
- no new lint/type/test regressions
- concise root-cause note with before/after evidence
```

```markdown
# .claude/agents/verification-lead.md
---
name: verification-lead
description: Validate fixes with deterministic command order and reject unresolved risk.
---
- Run targeted failing test first, then full suite.
- Reject if patch increases scope without evidence.
```

Sources: Codex Skills; Claude subagents
### C. Enforce runtime guardrails
Use runtime policy, not prompt-only policy.
```json
{
  "permissions": {
    "allow": [
      "Bash(rg *)",
      "Bash(git diff *)",
      "Bash(* --version)",
      "Task(verification-lead)"
    ],
    "deny": [
      "Bash(git push *)",
      "Read(./.env)",
      "Read(./secrets/**)"
    ],
    "defaultMode": "acceptEdits"
  }
}
```

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/block-dangerous.sh"
          }
        ]
      }
    ]
  }
}
```
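The `block-dangerous.sh` command referenced above is not shown in the cited docs; the script below is a minimal sketch assuming the documented hook contract (the pending tool call arrives as JSON on stdin, and exit code 2 blocks the call with stderr fed back to the model). The deny patterns are placeholders to tune per repository.

```bash
#!/usr/bin/env bash
# .claude/hooks/block-dangerous.sh (illustrative sketch)
set -euo pipefail

# For Bash tool calls, the shell command sits under .tool_input.command.
cmd="$(jq -r '.tool_input.command // empty')"

case "$cmd" in
  *"rm -rf /"*|*"git push --force"*|*"curl "*"| sh"*)
    # Exit code 2 blocks the tool call; stderr is surfaced to the model.
    echo "Blocked by policy: dangerous command pattern: $cmd" >&2
    exit 2
    ;;
esac

exit 0
```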
Sources: Claude permissions; Claude hooks
### D. Put LiteLLM in front for route and spend control
```yaml
# litellm_config.yaml
model_list:
  - model_name: claude_primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      tags: ["interactive", "paid"]
  - model_name: codex_api
    litellm_params:
      model: openai/gpt-5-codex
      api_key: os.environ/OPENAI_API_KEY
      tags: ["automation", "paid"]
  - model_name: local_fast
    litellm_params:
      model: ollama/qwen3-coder
      api_base: http://127.0.0.1:11434
      tags: ["interactive", "free"]

litellm_settings:
  cache: true
  cache_params:
    type: redis
    ttl: 900

router_settings:
  routing_strategy: cost-based-routing
  fallbacks:
    - {"claude_primary": ["codex_api", "local_fast"]}

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
```
```bash
litellm --config ./litellm_config.yaml
```
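With the proxy running, Claude Code can be pointed at it through the LLM-gateway environment variables in the cited docs; the endpoint, port, and key below are placeholders, and the exact base-URL path for the LiteLLM Anthropic route should be confirmed against your proxy version.

```bash
# Route Claude Code traffic through the LiteLLM proxy (values are placeholders).
export ANTHROPIC_BASE_URL="http://127.0.0.1:4000"
export ANTHROPIC_AUTH_TOKEN="sk-litellm-virtual-key"   # key issued by the proxy
claude   # subsequent requests inherit the routing/cache/budget policies above
```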
Sources: LiteLLM caching; LiteLLM routing; LiteLLM fallbacks; Tag-based routing; Claude Code LLM gateway
### E. Verification gate before merge
```bash
# 1) deterministic test order
npm run test:targeted
npm run lint
npm run typecheck
npm test

# 2) agent eval gate (example)
npx promptfoo@latest eval -c promptfooconfig.yaml -o .artifacts/promptfoo.json

# fail fast if pass-rate under threshold
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' .artifacts/promptfoo.json)
awk -v p="$PASS_RATE" 'BEGIN { if (p < 95) exit 1 }'
```
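The gate assumes a `promptfooconfig.yaml` in the repository. A minimal sketch is below; the prompt file, provider ID, fixture paths, and assertions are placeholders to adapt to your incident classes rather than a prescribed layout.

```yaml
# promptfooconfig.yaml (illustrative sketch)
prompts:
  - file://prompts/incident-ci-fix.txt
providers:
  - id: anthropic:messages:claude-sonnet-4-6
tests:
  - vars:
      failing_log: file://fixtures/ci-failure-missing-dep.log
    assert:
      - type: contains
        value: "Root cause"
      - type: llm-rubric
        value: Proposes a minimal fix scoped to the affected module, with verification commands.
```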
Sources: Promptfoo CI/CD; Promptfoo Action; OpenAI evaluation best practices
### F. Checkpoint output contract

Require the final agent output to include:

- root cause (1 paragraph)
- diff summary (file-level)
- verification evidence (exact commands + pass/fail)
- rollback plan
Use `codex exec --output-schema` for machine-readable handoff in automation; a schema sketch follows. (Codex non-interactive mode)
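A sketch of that contract expressed as a JSON Schema, assuming `--output-schema` accepts a standard JSON Schema document as the cited non-interactive docs describe; the field names simply mirror the bullet list above.

```json
{
  "type": "object",
  "required": ["root_cause", "diff_summary", "verification_evidence", "rollback_plan"],
  "properties": {
    "root_cause": { "type": "string", "description": "One-paragraph root cause" },
    "diff_summary": {
      "type": "array",
      "items": { "type": "string" },
      "description": "File-level summary of the patch"
    },
    "verification_evidence": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["command", "result"],
        "properties": {
          "command": { "type": "string" },
          "result": { "type": "string", "enum": ["pass", "fail"] }
        }
      }
    },
    "rollback_plan": { "type": "string" }
  }
}
```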
## 5. Token-efficiency tactics for coding agents
- Use metadata-only skill discovery by default; avoid embedding full runbooks in system prompts. (Codex Skills)
- Write skill descriptions as routing boundaries with negative examples to reduce accidental skill activation. (Shell + Skills + Compaction)
- Use Claude subagents for separation of concerns so each delegated task has independent context. (Claude subagents)
- Compact aggressively during long sessions and clear context between unrelated tasks (`/compact`, `/clear`). (Slash commands)
- Track session spend continuously (`/cost`) and cap high-variance sessions early. (Costs)
- Set MCP output limits (`MAX_MCP_OUTPUT_TOKENS`) to prevent accidental token floods from large tool outputs; see the snippet after this list. (Claude MCP)
- Use JSONL events (`codex exec --json`) for logs/artifacts instead of pasting verbose transcripts back into context. (Codex non-interactive mode)
- Use gateway response caching and semantic caching for repeat calls in iterative debugging loops. (LiteLLM caching)
- Use budget routing and tagged routes to keep exploratory tasks on cheaper lanes and reserve premium models for high-impact steps. (Budget routing; Tag-based routing)
- Add runtime output compression at the shell layer (`rtk`) when token waste comes from command output, and use adjacent tools (`repomix`, `files-to-prompt`, `aider` repo map) when waste comes from broad repo context. (rtk; repomix; files-to-prompt; aider repo map)
### Explicit rtk adjacency pass (when each is better/worse)

- Better than `rtk`: `repomix` for one-shot whole-repo prompt packaging; `files-to-prompt` for deterministic subset selection; `aider` repo map for persistent token-bounded codebase maps. (repomix; files-to-prompt; aider repo map)
- Better than those: `rtk` when the dominant waste is repeated shell/test/log output in coding-agent loops, especially with Claude Code hook-based auto-rewrite. (rtk)
## 6. Anti-patterns and failure modes

- Monolithic mega-prompts replacing skills.
  - Failure: high baseline token burn and brittle behavior drift.
  - Fix: split into skill/subagent contracts with explicit triggers. (Codex Skills; Claude subagents)
- Prompt-only safety controls.
  - Failure: agents still run risky commands when the prompt and runtime policy disagree.
  - Fix: enforce permissions/rules/hooks at runtime. (Claude permissions; Claude hooks; Codex rules)
- Unbounded MCP/server output.
  - Failure: context flooding and low-signal token spend.
  - Fix: set output caps and add pre-filtering commands. (Claude MCP)
- Cost control without telemetry.
  - Failure: cannot attribute improvements or regressions.
  - Fix: attach gateway logging callbacks and spend/budget reports. (LiteLLM logging; LiteLLM budgets + rate limits)
- No regression loop for skills.
  - Failure: "works once" runbooks regress silently.
  - Fix: add small prompt sets + CI eval gates + benchmark slices for incident classes. (Testing Agent Skills with Evals; Promptfoo CI/CD; SWE-bench)
- Using bypass modes in shared environments.
  - Failure: avoidable security and compliance risk.
  - Fix: least-privilege defaults and managed policy restrictions. (Codex security; Claude permissions)
## 7. Suggested default operating model (Codex/Claude Code + LiteLLM)

### Role split

- Codex: high-throughput automation, JSON checkpoint generation, reproducible batch runs. (Codex non-interactive mode)
- Claude Code: deep interactive debugging/refactor loops with subagents, hooks, and plan/permission controls. (Claude subagents; Hooks; Permissions)
- LiteLLM: shared routing/cache/fallback/budget policy layer for team workflows and API-driven automation. (LiteLLM routing; LiteLLM caching; LiteLLM budget routing)
### Minimal architecture

- Repository standards in `AGENTS.md` (Codex) and `CLAUDE.md` (Claude Code memory).
- Shared skill/subagent registry in-repo (`.codex/skills/`, `.claude/agents/`).
- Gateway policies in `litellm_config.yaml` with tagged routes, fallbacks, and budgets.
- CI gate: targeted tests + eval metrics + structured final summary artifact (layout sketch below).
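Put together, the footprint is small. The tree below is a sketch: the file names match the components referenced in this report, while the exact paths are conventions to adapt rather than requirements of any tool.

```text
repo/
├── AGENTS.md                         # Codex durable guidance
├── CLAUDE.md                         # Claude Code memory / conventions
├── .codex/skills/incident-ci-fix/SKILL.md
├── .claude/agents/verification-lead.md
├── .claude/settings.json             # permissions + hooks wiring
├── .claude/hooks/block-dangerous.sh
├── litellm_config.yaml               # routes, cache, fallbacks, budgets
└── promptfooconfig.yaml              # CI eval gate
```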
### Compatibility notes

- Claude Code has documented LLM gateway configuration, including LiteLLM examples. (Claude Code LLM gateway)
- Direct Codex CLI -> third-party gateway wiring is unverified in the cited public docs for this report; validate in your environment before broad standardization.
## 8. Adopt now / try next / monitor

### Adopt now
- Ship one incident-focused skill and one verifier subagent per top recurring incident class.
- Enable runtime guardrails (Codex rules / Claude permissions + hooks) before scaling agent autonomy.
- Put LiteLLM budgets + caching in front of automated runs and track weekly spend-by-route.
- Add CI eval gates with explicit pass thresholds and artifact outputs.
### Try next

- Add tag-based routing (`free`, `paid`, `critical`) and fallback trees tied to reliability SLOs.
- Add `rtk` hook-first compression for shell-heavy loops; measure pre/post token deltas via session metrics.
- Add benchmark subsets (SWE-bench-style incident buckets) for quarterly runbook quality checks.
### Monitor
- Skills over-triggering or under-triggering after prompt or naming changes.
- Budget routing misconfiguration that accidentally routes critical work to underpowered models.
- Security drift from permission bypasses or expanding network allowlists.
## 9. Appendix (search log + links)
All links below were accessed on 2026-02-17 unless noted.
### Global targeted queries executed

- llm token efficiency dev tools
- prompt caching coding agent
- context compression terminal output coding agent
- liteLLM routing cost control
- claude code workflow token usage
- codex agent productivity patterns
- llm observability token cost tracing
- agent runbook skill system
### Required discovery sources

- Hacker News / Show HN:
  - Show HN: File-based sub-agents for Codex CLI
  - Show HN: 2d platformer built with Codex skills
  - HN search: Claude Code hooks
  - HN search: Codex skills
- Lobsters:
  - Claude Skills may be a bigger deal than MCP
  - Claude Skills, anywhere: making them first-class in Codex CLI
  - How I Use Every Claude Code Feature
- GitHub Trending and Search:
  - GitHub Trending
  - GitHub search: codex skills
  - GitHub search: claude code hooks
  - GitHub search: litellm routing
- Reddit (required subreddits):
  - r/LocalLLaMA search: coding agents
  - r/LLMDevs search: LiteLLM
  - r/MachineLearning search: SWE-agent
  - r/OpenAI search: Codex CLI
  - r/ClaudeAI search: skills
- Awesome lists:
  - Awesome-LLMOps (tensorchord)
  - awesome-claude-code
  - awesome-ai-agents
  - Awesome-LLM-RAG
### Primary implementation references

- Codex:
  - Codex AGENTS.md guide
  - Codex Skills
  - Codex non-interactive mode
  - Codex rules
  - Codex security/sandbox
  - Codex MCP
- Claude Code:
  - Subagents
  - Hooks
  - Slash commands
  - Memory
  - Costs
  - Permissions
  - LLM gateway configuration
- LiteLLM:
  - Caching
  - Routing
  - Fallbacks
  - Tag-based routing
  - Budget routing
  - Budgets + rate limits
  - Logging
- Verification and benchmarks:
  - Promptfoo CI/CD
  - Promptfoo Action
  - OpenAI Agent Evals
  - OpenAI evaluation best practices
  - Testing Agent Skills Systematically with Evals
  - SWE-bench
  - SWE-agent
  - SWE-bench leaderboards
- rtk and adjacent token-efficiency tools:
  - rtk
  - repomix
  - files-to-prompt
  - aider repo map
## 10. Merged Coding-Agent Iterative Tooling Shortlist
This section merges the former iterative/shortlist coding-agent tooling docs into Track 3.
Corpus snapshot (curated coding-agent iterative set):

- Total repositories: 59

| Category | Count |
| --- | ---: |
| Coding Benchmark Harness | 12 |
| Evaluation & Regression | 18 |
| Experiment Orchestration | 12 |
| Gateway/Policy/Quality Gates | 8 |
| Prompt/Strategy Optimization | 9 |
### Category Tables

#### Experiment Orchestration (12)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| anthropics/claude-code | 67372 | 2026-02-17 | Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands. |
| Aider-AI/aider | 40703 | 2026-02-17 | aider is AI pair programming in your terminal |
| wshobson/agents | 28788 | 2026-02-17 | Intelligent automation and multi-agent orchestration for Claude Code |
| thedotmack/claude-mem | 28783 | 2026-02-17 | A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future sessions. |
| musistudio/claude-code-router | 27974 | 2026-02-17 | Use Claude Code as the foundation for coding infrastructure, allowing you to decide how to interact with the model while enjoying updates from Anthropic. |
| BloopAI/vibe-kanban | 21376 | 2026-02-17 | Get 10X more out of Claude Code, Codex or any coding agent |
| SuperClaude-Org/SuperClaude_Framework | 20844 | 2026-02-17 | A configuration framework that enhances Claude Code with specialized commands, cognitive personas, and development methodologies. |
| winfunc/opcode | 20570 | 2026-02-17 | A powerful GUI app and Toolkit for Claude Code - Create custom agents, manage interactive Claude Code sessions, run secure background agents, and more. |
| farion1231/cc-switch | 18647 | 2026-02-17 | A cross-platform desktop All-in-One assistant tool for Claude Code, Codex, OpenCode & Gemini CLI. |
| ruvnet/claude-flow | 14152 | 2026-02-17 | The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, distributed swarm intelligence, RAG integration, and native Claude Code support via MCP protocol. Ranked #1 in agent-based frameworks. |
| ryoppippi/ccusage | 10780 | 2026-02-17 | A CLI tool for analyzing Claude Code/Codex CLI usage from local JSONL files. |
| danicat/tenkai | 11 | 2026-02-10 | Experimentation framework for coding agents |
#### Prompt/Strategy Optimization (9)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| microsoft/PromptWizard | 3767 | 2026-02-17 | Task-Aware Agent-driven Prompt Optimization Framework |
| zou-group/textgrad | 3368 | 2026-02-17 | TextGrad: Automatic ''Differentiation'' via Text -- using large language models to backpropagate textual gradients. Published in Nature. |
| SalesforceAIResearch/promptomatix | 909 | 2026-02-17 | An Automatic Prompt Optimization Framework for Large Language Models |
| shobrook/promptimal | 298 | 2026-01-15 | A very fast, very minimal prompt optimizer |
| austin-starks/Promptimizer | 210 | 2026-02-13 | An Automated AI-Powered Prompt Optimization Framework |
| CTLab-ITMO/CoolPrompt | 174 | 2026-02-16 | Automatic Prompt Optimization Framework |
| developzir/gepa-mcp | 46 | 2025-12-26 | MCP server integrating GEPA (Genetic-Evolutionary Prompt Architecture) for automatic prompt optimization with Claude Desktop |
| Bubobot-Team/mcp-prompt-optimizer | 21 | 2026-02-16 | Advanced MCP server providing cutting-edge prompt optimization tools with research-backed strategies |
| johnpsasser/claude-code-prompt-optimizer | 19 | 2026-02-12 | AI-powered prompt optimization hook for Claude Code. Transforms simple prompts into comprehensive, structured instructions using Claude Opus 4.1's advanced reasoning capabilities. |
#### Evaluation & Regression (18)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| langfuse/langfuse | 22009 | 2026-02-17 | Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23 |
| openai/evals | 17694 | 2026-02-17 | Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. |
| confident-ai/deepeval | 13697 | 2026-02-17 | The LLM Evaluation Framework |
| vibrantlabsai/ragas | 12636 | 2026-02-17 | Supercharge Your LLM Application Evaluations |
| EleutherAI/lm-evaluation-harness | 11439 | 2026-02-17 | A framework for few-shot evaluation of language models. |
| promptfoo/promptfoo | 10489 | 2026-02-17 | Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. |
| Arize-ai/phoenix | 8575 | 2026-02-17 | AI Observability & Evaluation |
| open-compass/opencompass | 6671 | 2026-02-17 | OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets. |
| Giskard-AI/giskard-oss | 5117 | 2026-02-17 | Open-Source Evaluation & Testing library for LLM Agents |
| Helicone/helicone | 5116 | 2026-02-17 | Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 |
| openai/simple-evals | 4352 | 2026-02-17 | |
| Agenta-AI/agenta | 3849 | 2026-02-17 | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place. |
| truera/trulens | 3093 | 2026-02-17 | Evaluation and Tracking for LLM Experiments and AI Agents |
| uptrain-ai/uptrain | 2338 | 2026-02-17 | UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them. |
| UKGovernmentBEIS/inspect_ai | 1746 | 2026-02-17 | Inspect: A framework for large language model evaluations |
| langchain-ai/openevals | 901 | 2026-02-16 | Readymade evaluators for your LLM apps |
| langchain-ai/agentevals | 476 | 2026-02-12 | Readymade evaluators for agent trajectories |
| hidai25/eval-view | 44 | 2026-02-17 | Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI. |
#### Coding Benchmark Harness (12)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| SWE-bench/SWE-bench | 4303 | 2026-02-17 | SWE-bench: Can Language Models Resolve Real-world Github Issues? |
| SWE-agent/mini-swe-agent | 2913 | 2026-02-17 | The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo, but scores >74% on SWE-bench verified! |
| openai/SWELancer-Benchmark | 1439 | 2026-02-11 | This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" |
| augmentcode/augment-swebench-agent | 855 | 2026-02-12 | The #1 open-source SWE-bench Verified implementation |
| multi-swe-bench/multi-swe-bench | 321 | 2026-02-17 | Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving |
| JARVIS-Xs/SE-Agent | 232 | 2026-02-11 | SE-Agent is a self-evolution framework for LLM Code agents. It enables trajectory-level evolution to exchange information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. On SWE-bench Verified, it achieves SOTA performance |
| microsoft/SWE-bench-Live | 162 | 2026-02-17 | [NeurIPS 2025 D&B] SWE-bench Goes Live! |
| Aider-AI/aider-swe-bench | 79 | 2026-01-28 | Harness used to benchmark aider against SWE Bench benchmarks |
| amazon-science/SWE-PolyBench | 77 | 2026-01-30 | SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents |
| smallcloudai/refact-bench | 63 | 2026-01-22 | A benchmarking tool for evaluating AI coding assistants on real-world software engineering tasks from the SWE-Bench dataset. |
| SWE-bench/sb-cli | 56 | 2026-02-15 | Run SWE-bench evaluations remotely |
| logic-star-ai/insights | 49 | 2026-02-12 | We track and analyze the activity and performance of autonomous code agents in the wild |
#### Gateway/Policy/Quality Gates (8)
| Repository | Stars | Updated (UTC) | Why it matters for Track 3 |
|---|---|---|---|
| BerriAI/litellm | 36197 | 2026-02-17 | Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM] |
| tensorzero/tensorzero | 10971 | 2026-02-17 | TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation. |
| Portkey-AI/gateway | 10630 | 2026-02-17 | A blazing fast AI Gateway with integrated guardrails. Route to 200+ LLMs, 50+ AI Guardrails with 1 fast & friendly API. |
| messkan/prompt-cache | 196 | 2026-02-13 | Cut LLM costs by up to 80% and unlock sub-millisecond responses with intelligent semantic caching. A drop-in, provider-agnostic LLM proxy written in Go with sub-millisecond response |
| traceloop/hub | 165 | 2026-02-16 | High-scale LLM gateway, written in Rust. OpenTelemetry-based observability included |
| Dimon94/cc-devflow | 97 | 2026-02-08 | One-command requirement development flow for Claude Code - Complete workflow system with sub-agents, quality gates, and intelligent automation |
| usetig/sage | 83 | 2026-02-16 | An LLM council that reviews your coding agent's every move |
| rigour-labs/rigour | 11 | 2026-02-17 | Local-first quality gate + fix-loop controller for AI coding agents (CLI + MCP). |