Orca Benchmarks
One Agent. Any Model.
Orca is a model-agnostic coding agent. Bring your own key, pick your model, and Orca does the rest — search, understand, edit, verify.
Evaluated on SWE-bench Lite — 300 real GitHub issues, official Docker evaluation, fully reproducible.
Orca v2.8.0 | Last updated: 2026-06-06
Issues Resolved
22
out of 300
Pass Rate
7.3%
11.1% of evaluated
Patches Generated
297
99% of instances
Actively Improving
17%+
projected after v2.8.0 fixes
Why Model-Agnostic Matters
Many coding agents are tied to a single model family: Claude Code works only with Claude, and Codex CLI works only with GPT models. Orca works with any model, so you can use cheap models for routine tasks, powerful models for complex ones, or your enterprise's approved model.
BYOK
Bring Your Own Key
15+
Supported Models
You Choose
Price vs Performance
Agent Comparison
SWE-bench Lite scores. Each agent uses its best available model.
| Agent | Score | Model Flexibility | Open Source |
|---|---|---|---|
| Claude Code | 49.0% | Locked to Claude | No |
| Codex CLI | 45.2% | Locked to GPT-4.1 | Yes |
| Aider | 26.3% | Multiple models | Yes |
| Devin | 20.0% | Locked to proprietary model | No |
| Orca | 7.3% | Any model (BYOK) | Yes |
Orca's score reflects the first full run. Active optimization is underway — 7/13 re-tested instances now pass after v2.8.0 improvements.
Active Improvement
Orca is improving every week. After agent optimizations in v2.8.0, 7/13 previously-failed instances now pass.
7/13
previously-failed instances now pass
Projected full-run score: 17%+
What Changed
- Smarter tool orchestration: grep over semantic search
- Raw code visibility: agent sees uncompressed file contents
- Structured workflow with action deadlines
- Read-loop detection: forces edit after repeated reads
- Source-only edits: agent fixes code, not tests
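As an illustration of the read-loop detection idea listed above, here is a minimal sketch in Python. It is not Orca's actual implementation: the class name `ReadLoopDetector`, the action labels `"read"` and `"edit"`, and the threshold of three consecutive reads are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ReadLoopDetector:
    """Counts consecutive read actions and signals when an edit should be forced.

    Hypothetical helper for illustration; the threshold is an assumed value.
    """

    max_consecutive_reads: int = 3
    consecutive_reads: int = 0

    def record(self, action: str) -> bool:
        """Record one agent action; return True if the next action should be an edit."""
        if action == "read":
            self.consecutive_reads += 1
        else:
            # Any non-read action (an edit, a test run) breaks the loop.
            self.consecutive_reads = 0
        return self.consecutive_reads >= self.max_consecutive_reads


detector = ReadLoopDetector()
actions = ["read", "read", "edit", "read", "read", "read"]
signals = [detector.record(a) for a in actions]
print(signals)  # [False, False, False, False, False, True]
```

Only the third read in a row trips the detector; the earlier edit resets the counter.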
Resolved Issues (22)
Real GitHub issues fixed by Orca. Each one verified with the project's own test suite in Docker.
How Orca Solves Issues
Search
grep across the codebase to find relevant files
Understand
Read the specific function, understand the root cause
Fix
Edit the source code with a surgical patch
Verify
Run tests to confirm the fix works
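The four steps above can be sketched on a toy in-memory repository. This is a simplified illustration, not Orca's code: the functions `search`, `fix`, and `verify`, the file names, and the bug itself are invented for the example, and a real agent would run the project's test suite rather than a single check.

```python
import re
from typing import Callable, Dict, List

Files = Dict[str, str]


def search(files: Files, pattern: str) -> List[str]:
    """Search: return paths whose contents match the regex, like grep -l."""
    rx = re.compile(pattern)
    return sorted(path for path, text in files.items() if rx.search(text))


def fix(files: Files, path: str, old: str, new: str) -> Files:
    """Fix: return a copy of the files with one targeted replacement applied."""
    if old not in files[path]:
        raise ValueError(f"{old!r} not found in {path}")
    patched = dict(files)
    patched[path] = files[path].replace(old, new, 1)
    return patched


def verify(files: Files, check: Callable[[Files], bool]) -> bool:
    """Verify: run a check against the (patched) files."""
    return check(files)


def mean_is_correct(files: Files) -> bool:
    """Stand-in for a test suite: load the module source and call mean()."""
    namespace: dict = {}
    exec(files["src/mathutil.py"], namespace)
    return namespace["mean"]([1, 2, 3]) == 2


# Hypothetical repository with an off-by-one bug in mean().
repo: Files = {
    "src/mathutil.py": "def mean(xs):\n    return sum(xs) / (len(xs) - 1)\n",
    "tests/test_mathutil.py": "from src.mathutil import mean\n",
}

hits = search(repo, r"def mean\(")          # Search
source = repo[hits[0]]                      # Understand: read the function
patched = fix(repo, hits[0], "(len(xs) - 1)", "len(xs)")  # Fix
print(hits, verify(repo, mean_is_correct), verify(patched, mean_is_correct))
# ['src/mathutil.py'] False True
```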
Evaluation Methodology
Official SWE-bench Harness
We use the official swebench v4.1.0 package with Docker containers. Each instance runs in an isolated environment matching the original repo's Python version and dependencies.
Scoring
An instance is “resolved” only if ALL failing tests now pass AND ALL previously-passing tests still pass. No partial credit — it either works or it doesn't.
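The all-or-nothing rule can be written as a short check. This is a sketch of the criterion, not the harness's own code: `FAIL_TO_PASS` and `PASS_TO_PASS` are the SWE-bench dataset fields listing the two groups of tests, while the function name `is_resolved`, the test names, and the `"PASSED"` status string are assumptions for this example.

```python
from typing import Dict, List


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    test_status maps a test name to its outcome; a missing test counts as a failure.
    """
    required = fail_to_pass + pass_to_pass
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical instance: one test must start passing, two must keep passing.
fail_to_pass = ["test_fix"]
pass_to_pass = ["test_a", "test_b"]

full = {"test_fix": "PASSED", "test_a": "PASSED", "test_b": "PASSED"}
regression = {"test_fix": "PASSED", "test_a": "FAILED", "test_b": "PASSED"}

print(is_resolved(full, fail_to_pass, pass_to_pass))        # True
print(is_resolved(regression, fail_to_pass, pass_to_pass))  # False
```

The second case fixes the target test but breaks an existing one, so it earns no credit.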
Transparency
99 instances had Docker build failures (sympy, sphinx, scikit-learn) and were not evaluated. Our reported score of 7.3% is against all 300 instances, not just the evaluated ones.
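The headline figures can be reproduced with a few lines of arithmetic. The counts come from this page; the denominator for the "of evaluated" rate is an assumption, since the page does not state it. Using 297 generated patches minus 99 unevaluated instances (198) reproduces the reported 11.1%, whereas 300 minus 99 (201) would give 10.9%.

```python
resolved = 22
total_instances = 300
patches_generated = 297
not_evaluated = 99  # Docker build failures

# Headline pass rate: resolved issues over all 300 instances.
pass_rate_all = resolved / total_instances

# Share of instances for which a patch was produced.
patch_rate = patches_generated / total_instances

# Assumption: evaluated = generated patches minus unevaluated instances.
evaluated_assumed = patches_generated - not_evaluated
pass_rate_evaluated = resolved / evaluated_assumed

print(f"{pass_rate_all:.1%}")        # 7.3%
print(f"{patch_rate:.0%}")           # 99%
print(evaluated_assumed)             # 198
print(f"{pass_rate_evaluated:.1%}")  # 11.1%
```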
Try Orca
Bring your own API key. Pick any model. Start fixing bugs.
npm install -g @axplusb/kepler
Or use the web app