17 OSS PR AI Code Review: MergeMitra vs CodeRabbit vs Greptile

May 5, 2026 | Vidya Shree B V

MergeMitra vs. CodeRabbit vs. Greptile across 17 replayed OSS pull requests: a practical enterprise benchmark on Plane, Infisical, Formbricks, and Twenty.

Executive Summary

Objective: Evaluate which AI PR reviewer provides the most enterprise value when reviewing real, replayed pull requests from modern open-source products.

Context & Approach: We evaluated MergeMitra, CodeRabbit, and Greptile across four source reports covering Plane, Infisical, Formbricks, and Twenty. The corpus included 17 replayed PR cases and 51 tool reviews.

Some cases were issue-led or revert-led, with public follow-up evidence showing a real regression. Other cases were larger feature or cross-dependency PRs, where the evaluation measured confirmed predictive review value: product behavior, security, data integrity, queue behavior, API contracts, optimistic UI state, and test coverage.

Key Takeaways:

MergeMitra won all four repository-level evaluations: Plane, Infisical, Formbricks, and Twenty.
MergeMitra had the strongest bug depth and cross-system reasoning, especially where the bug lived outside the changed line: auth/session boundaries, queues, optimistic UI, metadata propagation, schema contracts, and backend/frontend behavior.
CodeRabbit was the runner-up. It produced broad exploratory reviews and several excellent point findings, but its output needed more senior triage because of duplicates, noisy style/lint comments, over-prioritized claims, and missed public regressions.
Greptile was concise and sometimes very sharp, including a standout Twenty security catch, but it was too sparse and inconsistent to be the primary enterprise reviewer on this corpus.

Recommendation: Use MergeMitra as the primary AI PR reviewer for this benchmark profile. Use CodeRabbit as a broad second pass when a team can afford careful triage. Use Greptile as a concise supplemental reviewer when quiet output matters more than complete coverage.

Background

Code review is not a complete defect-prevention system. QA, tests, staging, canaries, monitoring, and incident response all catch bugs that a reviewer cannot see from a diff.

The enterprise question is more specific: when a regression is visible from changed code and surrounding context, which AI reviewer is most likely to find it, explain the impact, and avoid drowning the team in low-value comments?

That is what this benchmark measures. Bug depth and regression prevention are weighted highest. Maintainability, tests, security, performance, accessibility, and signal-to-noise still matter, but a single confirmed regression catch is worth more than a long list of style suggestions.

1. Benchmark at a Glance

Dimension	Value
Tools evaluated	MergeMitra, CodeRabbit, Greptile
Repositories	Plane, Infisical, Formbricks, Twenty
PR cases	17 replayed cases
Tool reviews	51 total PR reviews
Evaluation type	Historical replay benchmark with issue-led, revert-led, and feature/cross-dependency cases
Primary metric	Confirmed bug/regression detection and enterprise review value
Secondary metrics	Cross-file reasoning, product/data integrity, auth/security, tests, performance, accessibility, maintainability, signal-to-noise

No replayed PR cases were excluded.

2. Codebases Under Test

Codebase	Domain	Cases	Why it is useful
Plane	Product/project management	5	Editor lifecycle, realtime sync, cover images, auth sync, and UI behavior
Infisical	Secrets management	5	Auth/session correctness, token handling, tenant isolation, gateway enrollment, audit behavior
Formbricks	Survey platform	4	SSO behavior, media contracts, BullMQ response pipeline, API migration, accessibility
Twenty	CRM/productivity platform	3	Bulk updates, filters, optimistic metadata, permissions, transactions, frontend state

Together, these repositories stress exactly the places where code review matters most: shared helpers, auth/session boundaries, background work, product state, API contracts, permissions, UI behavior, and tests.

3. How We Set Up the Test

Step 1 - Replay historical PRs

Each source report replayed historical pull requests into mirror repositories, one mirror per tool. The replay cases used pinned base and head SHAs so the tools saw the repository state from the original PR window, not today's main branch.
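As a minimal sketch of what a pinned replay could look like, the snippet below builds the shell commands for one case. The branch naming scheme, the remote name, and the mirror repository slug are illustrative assumptions; the source reports do not describe their exact replay tooling.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCase:
    case_id: str   # e.g. "Plane-4"
    base_sha: str  # pinned base commit from the original PR window
    head_sha: str  # pinned head / cutoff commit


def replay_commands(case: ReplayCase, mirror_remote: str, mirror_repo: str) -> list[str]:
    """Return shell commands that recreate one PR in a single tool's mirror.

    Branch names below are hypothetical; only the pinning idea comes from the text.
    """
    slug = case.case_id.lower()
    base_branch = f"replay/{slug}/base"
    head_branch = f"replay/{slug}/head"
    return [
        # Pin both branches to exact commits so the reviewer never sees today's main.
        shlex.join(["git", "push", mirror_remote, f"{case.base_sha}:refs/heads/{base_branch}"]),
        shlex.join(["git", "push", mirror_remote, f"{case.head_sha}:refs/heads/{head_branch}"]),
        # Open the replayed PR between the two pinned branches.
        shlex.join([
            "gh", "pr", "create",
            "--repo", mirror_repo,
            "--base", base_branch,
            "--head", head_branch,
            "--title", f"Replay {case.case_id}",
            "--body", "Historical replay",
        ]),
    ]
```

Running the same function once per tool mirror would give each reviewer an identical diff to review.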

Step 2 - Separate issue-led and feature cases

The benchmark used two case types:

Issue-led or revert-led cases: A later public issue, fix PR, or revert gave ground-truth evidence. A tool only counted as catching the primary bug when it identified the faulty code and impact consistent with that public follow-up.
Feature or cross-dependency cases: These did not always have one public gold bug. They were scored for confirmed predictive review value: real integration risks, product/data failures, auth/security gaps, missing tests, performance costs, accessibility issues, and false or noisy claims.
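The issue-led rule can be sketched as a small grading function. The "caught" condition (faulty code plus matching impact) follows the description above; the threshold for "Partial" is an assumption made for illustration, since the source reports do not define it precisely.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CAUGHT_PRIMARY = "Caught primary"
    PARTIAL = "Partial"
    MISSED_PRIMARY = "Missed primary"


@dataclass(frozen=True)
class Finding:
    names_faulty_code: bool         # points at the code the public follow-up fixed
    explains_matching_impact: bool  # describes impact consistent with that follow-up


def grade_issue_led(findings: list[Finding]) -> Outcome:
    """Grade one tool's review of one issue-led or revert-led case."""
    # A catch requires a single finding that has both the location and the impact.
    if any(f.names_faulty_code and f.explains_matching_impact for f in findings):
        return Outcome.CAUGHT_PRIMARY
    # Assumption: one of the two without the other counts as partial credit.
    if any(f.names_faulty_code or f.explains_matching_impact for f in findings):
        return Outcome.PARTIAL
    return Outcome.MISSED_PRIMARY
```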

Step 3 - Validate before scoring

The reports verified target PR existence, expected state, base SHA, head SHA, and commit counts. Where a CodeRabbit PR had to be reopened because the original mirror PR was closed, the same-diff replacement was used and called out:

Plane-4 used CodeRabbit PR #7 instead of closed PR #5.
Infisical-1 used CodeRabbit PR #9 instead of closed PR #1.
Formbricks-2 used CodeRabbit PR #7 instead of closed PR #3.
Formbricks-3 used CodeRabbit PR #10 instead of closed PR #4.
Twenty-3 used CodeRabbit PR #11 instead of closed PR #8.
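A pre-scoring check along these lines might look like the sketch below. The `observed` dictionary keys and the expected `"open"` state are hypothetical; the SHA comparison uses prefix matching because the inventory lists abbreviated SHAs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedCase:
    case_id: str
    base_sha: str        # abbreviated SHA as listed in the inventory
    head_sha: str
    commits: int
    state: str = "open"  # assumed expected state for a reviewable mirror PR


def validate_case(expected: ExpectedCase, observed: Optional[dict]) -> list[str]:
    """Return a list of mismatches; an empty list means the case can be scored."""
    if observed is None:
        return [f"{expected.case_id}: mirror PR not found"]
    problems = []
    if observed["state"] != expected.state:
        problems.append(f"{expected.case_id}: state is {observed['state']}")
    # Prefix match lets an abbreviated expected SHA match a full observed SHA.
    if not observed["base_sha"].startswith(expected.base_sha):
        problems.append(f"{expected.case_id}: base SHA mismatch")
    if not observed["head_sha"].startswith(expected.head_sha):
        problems.append(f"{expected.case_id}: head SHA mismatch")
    if observed["commits"] != expected.commits:
        problems.append(f"{expected.case_id}: commit count is {observed['commits']}")
    return problems
```

A closed mirror PR would surface here as a state mismatch, which is the situation that led to the same-diff replacements listed above.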

Step 4 - Collect and validate evidence

The evaluations collected PR metadata, changed files, commit lists, reviews, inline review comments, issue comments, compare JSON, and compare diffs through the GitHub CLI and GitHub API. Bot walkthroughs, summaries, and generated checklists were not treated as findings unless they made concrete, checkable claims.
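For illustration, the evidence types listed above map onto standard GitHub REST API paths, which can be fetched with `gh api <path>`. The helper name and dictionary keys are assumptions; the reports do not publish their collection scripts.

```python
def evidence_endpoints(repo: str, pr_number: int, base_sha: str, head_sha: str) -> dict[str, str]:
    """Map each evidence type to a GitHub REST path usable with `gh api <path>`.

    `repo` is an "owner/name" slug for a mirror repository (hypothetical here).
    """
    pr = f"repos/{repo}/pulls/{pr_number}"
    return {
        "metadata": pr,
        "changed_files": f"{pr}/files",
        "commits": f"{pr}/commits",
        "reviews": f"{pr}/reviews",
        # Inline review comments live under the pull request resource...
        "inline_review_comments": f"{pr}/comments",
        # ...while top-level PR conversation comments live under the issue resource.
        "issue_comments": f"repos/{repo}/issues/{pr_number}/comments",
        # Compare JSON; requesting the diff media type on the same path
        # (Accept: application/vnd.github.diff) should return the raw diff instead.
        "compare": f"repos/{repo}/compare/{base_sha}...{head_sha}",
    }
```

Keeping inline review comments and issue comments separate matters for scoring, because bot walkthroughs and summaries usually arrive as issue comments while concrete findings tend to be inline.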

Findings were validated by reading exact base..head diffs and tracing caller/callee relationships, frontend state, backend services, permissions, auth/session/token flows, database behavior, queue/job handoffs, API contracts, tests, and UI behavior where relevant.

4. Full PR Inventory

Repo	Case	Upstream / follow-up context	Greptile PR	CodeRabbit PR	MergeMitra PR	Base SHA	Head / cutoff SHA	Commits
Plane	Plane-1: editor initialization	`makeplane/plane#3013`, follow-up `#3025`	#1	#1	#1	`bffba6b9dcbe`	`2df534689ca1`	1
Plane	Plane-2: grouped loader regression	`makeplane/plane#5210`, follow-up `#5238`	#2	#2	#2	`0839666d81a8`	`d04366632858`	1
Plane	Plane-3: realtime/editor feature	`makeplane/plane#8294`	#4	#4	#4	`20510bb2dd53`	`b4fb4b6eda83`	4
Plane	Plane-4: static cover images	`makeplane/plane#8184`	#5	#7	#5	`e6d584fde7e6`	`e667a5496561`	2
Plane	Plane-5: auth sync	`makeplane/plane#8336`	#6	#6	#6	`22339b9786e9`	`0df31a84ee0b`	6
Infisical	Infisical-1: stale auth/session cache	`Infisical/infisical#6002`, follow-ups `#6058`, `#6059`, `#6065`	#1	#9	#1	`e15e40d0fdc1`	`4849bbc765b3`	4
Infisical	Infisical-2: SAML select-org flow	`Infisical/infisical#5652`, follow-ups `#5663`, `#5667`	#2	#2	#2	`df9bed951781`	`8a5119718966`	2
Infisical	Infisical-3: stale selected org	`Infisical/infisical#2412`, follow-up `#2421`	#3	#3	#3	`f213c75ede1a`	`36e3e4c1b583`	1
Infisical	Infisical-4: large auth refactor	`Infisical/infisical#5947`	#4	#4	#4	`9c6573bd71bc`	`359c0261c42a`	24
Infisical	Infisical-5: gateway/PAM enrollment	`Infisical/infisical#6020`	#5	#5	#5	`f3af0f7f4f89`	`db7c851436af`	25
Formbricks	Formbricks-1: SSO hardening	`formbricks/formbricks#7702`, follow-ups `#7728`, `#7755`	#1	#1	#1	`d96304d86d78`	`885a81d2b85b`	1
Formbricks	Formbricks-2: welcome-card media	`formbricks/formbricks#7497`, follow-ups `#7712`, `#7720`	#3	#7	#3	`1e7817fb69f9`	`6c871b5cd5f4`	1
Formbricks	Formbricks-3: BullMQ response pipeline	`formbricks/formbricks#7695`	#4	#10	#4	`ebaa2d363ce9`	`8de5079db380`	9
Formbricks	Formbricks-4: v3 survey overview	`formbricks/formbricks#7741`	#6	#6	#6	`a1a11b2bb8c0`	`848a85bb3453`	1
Twenty	Twenty-1: bulk update	`twentyhq/twenty#16384`, issue `#17117`, fix `#17213`	#4	#4	#4	`3e57aa14d390`	`b52bcd6e349c`	1
Twenty	Twenty-2: select filter propagation	`twentyhq/twenty#12082`, follow-up `#12352` with cutoff caveat	#6	#6	#6	`c98439d76ae6`	`9b72bce76042`	19
Twenty	Twenty-3: navigation menu optimistic metadata	`twentyhq/twenty#18710`	#8	#11	#8	`994215e0dca5`	`28747d3dd26f`	2

5. Overall Results

Repository	Winner	Runner-up	Greptile	CodeRabbit	MergeMitra
Plane	MergeMitra	CodeRabbit	Caught Plane-2 and some feature bugs, but missed Plane-1 and was sparse on large PRs.	Broad exploratory coverage, especially Plane-3, but missed both public issue-led regressions.	Caught both public issue-led regressions and had the strongest feature integration findings.
Infisical	MergeMitra	CodeRabbit	Useful on smaller PRs, but missed Infisical-1, skipped Infisical-4, and raised a false P1.	Strong on Infisical-5 gateway/PAM authorization issues, but missed Infisical-1, skipped Infisical-4, and produced noisy critical claims.	Only tool to review the large auth refactor and catch the Infisical-1 stale auth/session class; strongest security/auth coverage.
Formbricks	MergeMitra	CodeRabbit	Concise and found some real defects, but missed too many historical and cross-system risks.	Found the critical Prisma `upsert` shape bug and strong BullMQ issues, but had duplicates, locale noise, and a material false positive.	Best confirmed coverage across auth behavior, queues, product/data integrity, performance, accessibility, and tests.
Twenty	MergeMitra	CodeRabbit	Had the clearest inline Twenty-3 SSE permission leak and strong Twenty-2 comments, but was narrow and missed much of Twenty-1.	Broad and useful on tests, edge cases, DnD, and rollback issues, but noisier and less direct on security framing.	Highest confirmed signal across stale UI, permissions, filter propagation, transactions, optimistic metadata, and tests.

Tool	Overall pattern	Strengths	Weaknesses
Greptile	Concise supplemental reviewer with uneven coverage.	Focused comments, clear findings when correct, useful security/product catches in some cases.	Missed many primary regressions, sparse on large cross-system PRs, skipped Infisical-4 due to file limits, some overstated findings.
CodeRabbit	Broad runner-up, best as an aggressive second pass.	Strong breadth, useful test/security edge cases, several excellent point findings such as Formbricks Prisma `upsert` and Infisical PAM bypasses.	More noise, duplicated comments, over-prioritized claims, missed public regressions, skipped Infisical-4.
MergeMitra	Strongest primary enterprise reviewer in this corpus.	Best bug depth, cross-system reasoning, product/data integrity, auth/security, targeted tests, and senior-review framing.	Verbose, sometimes stylistic, occasional weak source-of-truth or architecture comments, still missed important all-tool issues.

6. Aggregate Scorecard

The scores below are simple unweighted averages of per-repository scorecards, rounded to one decimal, where the category was comparable across the four source reports.

Category	Greptile	CodeRabbit	MergeMitra	Interpretation
Bug depth / regression detection	2.9	3.7	4.6	MergeMitra leads clearly; it had the strongest bug depth in every repository-level evaluation.
Cross-file / cross-system reasoning	2.7	3.7	4.8	MergeMitra led across auth/session, queues, metadata, optimistic UI, and backend/frontend contracts.
Security / authorization	2.7	3.4	4.3	MergeMitra led overall; Greptile had one standout Twenty security catch but was inconsistent.
Maintainability	2.5	3.4	4.2	CodeRabbit and MergeMitra both had maintainability signal; MergeMitra tied it more often to product risk.
Tests	1.5	2.7	4.6	MergeMitra most consistently tied test gaps to concrete regression paths.
Performance	2.2	2.7	3.9	MergeMitra surfaced more scaling, queue, count-query, and hot-path concerns.
Signal-to-noise	3.8	2.9	4.3	Greptile was quiet but incomplete; MergeMitra had the best useful signal overall; CodeRabbit required the most triage.

Product behavior/data integrity and accessibility were not averaged because they were not scored consistently across all four reports.
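The averaging rule can be sketched as follows for a single tool. The input shape is an assumption, and the example values in the usage note are placeholders, not scores from the source reports.

```python
from statistics import mean
from typing import Optional


def aggregate(per_repo: dict[str, dict[str, Optional[float]]]) -> dict[str, float]:
    """Average one tool's category scores across repositories.

    `per_repo` maps repository -> category -> score, with None (or a missing
    key) meaning the category was not scored in that repository's report.
    """
    categories: set[str] = set()
    for scores in per_repo.values():
        categories.update(scores)

    result: dict[str, float] = {}
    for category in sorted(categories):
        values = [scores.get(category) for scores in per_repo.values()]
        # Skip categories that were not scored comparably in every report.
        if any(v is None for v in values):
            continue
        result[category] = round(mean(values), 1)  # unweighted, one decimal
    return result
```

With placeholder input such as `{"repo_a": {"tests": 4.0, "accessibility": 3.0}, "repo_b": {"tests": 5.0, "accessibility": None}}`, only `tests` is averaged and `accessibility` is dropped, which mirrors how the two categories above were left out of the scorecard.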

7. Repository-by-Repository Results

Plane

Winner: MergeMitra.

MergeMitra caught both public issue-led regressions and produced the strongest cross-file predictions on large feature PRs. Its most important confirmed findings included the Plane-1 editor initialization regression, Plane-2 grouped-loader height regression, Plane-3 compile blocker and unregistered TitleSyncExtension, Plane-4 static cover image issues, and Plane-5 auth-sync/profile-data risks.

The biggest gap: all tools missed the Plane-5 avatar-sync SSRF expansion. MergeMitra also had one clear trust hit around external profile-cover source of truth.

Infisical

Winner: MergeMitra.

Infisical is a secrets-management platform, so auth, session, token, tenant, and audit correctness matter more than style polish. MergeMitra was the only tool that reviewed the large auth refactor and caught the Infisical-1 stale auth/session regression class. It also surfaced refresh-token exposure, cookie-path drift, unsigned signup-token trust, domain uniqueness race, MFA URL leak, OIDC default-login bug, gateway identity drift, permission mismatch, relay persistence, audit gaps, rollback, and tests.

The biggest gap: all tools missed the exact Infisical-2 SAML redirect loop. Greptile and CodeRabbit also skipped Infisical-4 due to file limits.

Formbricks

Winner: MergeMitra.

MergeMitra most consistently connected changes to product behavior, data integrity, async handoff risk, rollback behavior, performance, accessibility, and test coverage. Its strongest confirmed findings included BullMQ producer-only deployments skipping jobs, stale response snapshots, webhook failures not triggering retries, v1 delete status drift, optimistic empty-state risk, keyboard accessibility, full-count cost, route contract gaps, and adjacent media-contract bugs.

The biggest gap: no tool fully caught the exact historical SSO break, the exact welcome-card logo sizing regression, or the enqueue helper swallowing failures. CodeRabbit did catch a critical Prisma upsert shape bug that MergeMitra missed.

Twenty

Winner: MergeMitra.

MergeMitra produced the most confirmed actionable signal across stale bulk-update UI, permission gating, view-filter corruption, transaction boundaries, optimistic metadata drift, folder cascade behavior, and tests. It led on Twenty-2 and Twenty-3 and had the highest signal on Twenty-1, even though it missed the most dangerous empty-selection bug along with the other tools.

The biggest gap: all tools missed the exact Twenty-1 empty-selection mass-update bug, the repeated FindMany/cache-churn root cause, and the side-panel/dropdown click-outside deselection path.

8. Case Library

Repo	Case	Type	Expected risk / bug class	Greptile	CodeRabbit	MergeMitra
Plane	Plane-1: editor initialization	Issue-led	Hook dependency drift and editor initialization regression	Missed primary	Missed primary	Caught primary
Plane	Plane-2: grouped loader regression	Issue-led	`RenderIfVisible` placeholder height collapse	Caught primary	Missed primary	Caught primary
Plane	Plane-3: realtime/editor feature	Feature/cross-dependency	Title/body sync drift, collaboration races, editor lifecycle	Found some title/race issues; missed blockers	Broad but noisy; missed blockers	Best overall: compile blocker, extension gap, event plumbing, fallback loop, tests
Plane	Plane-4: static cover images	Feature/cross-dependency	Static/uploaded/external source-of-truth, fallback UI, accessibility	Found duplicate declaration and tab mismatch; one overstated claim	Found duplicate, asset mismatch, a11y, orphan/persistence	Best overall: duplicate, asset drift, orphan risk, tab mismatch, a11y, tests
Plane	Plane-5: auth sync	Feature/cross-dependency	Provider sync mismatch, avatar data loss, admin/backend drift	Found double sync/avatar churn; unsafe guard	Found avatar data loss and S3/delete issues; noisy	Best overall: display-name reset, null avatar, destructive replacement, hot path, tests
Infisical	Infisical-1: stale auth/session cache	Issue-led	React Query stale auth/session data and redirect loops	Missed	Missed	Caught stale auth token, stale org/user hierarchy, sub-org reset
Infisical	Infisical-2: SAML select-org flow	Issue-led	SAML refresh-token failures and redirect loops	Partial	Partial	Partial/best, but missed exact SAML variant
Infisical	Infisical-3: stale selected org	Issue-led	Stale selected org causing permission errors/black screens	Caught primary	Caught primary	Partial
Infisical	Infisical-4: large auth refactor	Feature/cross-dependency	Token/cookie/session/SSO/email-domain regressions	Skipped at file limit	Skipped at file limit	Strong: refresh token, cookie path, token trust, MFA leak, OIDC bug
Infisical	Infisical-5: gateway/PAM enrollment	Feature/cross-dependency	Gateway actor auth, PAM, audit, relay, rollback, token lifecycle	Mixed; real issues plus false P1	Strong but noisy; best PAM null-gateway bypasses	Strong/broad: identity drift, permission mismatch, relay, audit, tests
Formbricks	Formbricks-1: SSO hardening	Issue-led	Legacy same-email SSO/account-linking break	Missed primary	Missed primary; found Prisma `upsert` bug	Partial; issue notes warned behavior change
Formbricks	Formbricks-2: welcome-card media	Issue-led	Logo/media sizing and image-vs-video rendering drift	Missed historical logo bug; found adjacent media issues	Missed historical logo bug; found persistence concern	Missed historical logo bug; caught adjacent media bugs and tests
Formbricks	Formbricks-3: BullMQ response pipeline	Feature/cross-dependency	Queue handoff, stale snapshots, retry semantics, metering	Strong on config gate, metering, fire-and-forget	Strong/broad but noisy	Best overall: config, fire-and-forget, stale snapshot, webhook retry, tests
Formbricks	Formbricks-4: v3 survey overview	Feature/cross-dependency	API/UI contract, delete/audit, optimistic UI, a11y, count cost	Mostly low severity	Found nested button-in-link and tests/schema; one false positive	Best overall: delete drift, optimistic empty state, a11y, count cost, tests
Twenty	Twenty-1: bulk update	Issue-led/feature stress	Empty-selection mass update, stale UI, deselection, cache churn	Mostly missed primary	Mostly missed primary; broader on permissions/a11y/cancel	Highest signal but partial; missed empty-selection mass update
Twenty	Twenty-2: select filter propagation	Issue-led with cutoff caveat	Filter propagation, display-value drift, deletion refresh	Strong on cutoff server defects	Stronger breadth but noisier	Highest signal: server defects, operands, transaction drift, fanout, tests
Twenty	Twenty-3: navigation menu optimistic metadata	Feature/cross-dependency	Optimistic metadata drift, SSE enrichment, permissions, DnD, performance	High-signal but narrow; clearest SSE permission leak	Broad/useful/noisy	Highest overall: partial replacement, raw create gap, folder cascade, rollback, sort, tests

9. Tool-by-Tool Observations

MergeMitra - the best primary reviewer

MergeMitra's strongest pattern was following changed code into downstream behavior. It repeatedly connected a local diff to the product state, auth boundary, queue worker, schema consumer, optimistic UI path, or missing test that would actually determine whether the PR was safe.

That is why it won all four repository-level evaluations. Its weakness was not hallucination-heavy output; it was occasional verbosity and some maintainability comments that were true but not PR-blocking.

CodeRabbit - the broad second pass

CodeRabbit produced broad exploratory reviews and several high-value point findings. It was especially useful on test hygiene, security edges, authorization details, and line-level bugs. It caught notable issues such as the Formbricks Prisma upsert shape bug and Infisical PAM null-gateway bypasses.

The trade-off was triage. The source reports repeatedly found duplicated comments, noisy style/lint feedback, over-prioritized claims, and missed public regressions. CodeRabbit is useful, but the benchmark does not support using it as the sole merge gate without severity calibration.

Greptile - concise but incomplete

Greptile's best comments were focused and easy to act on. It had the clearest inline Twenty-3 SSE permission leak and several good product/security catches.

Its problem was coverage. It missed too many primary regressions, was sparse on large cross-system PRs, and skipped Infisical-4 because of file-limit behavior. For teams that want a quiet second opinion, Greptile has a place. For primary enterprise review, it was too narrow on this evidence.

10. False Positives and Noise

Greptile

Plane-4: overstated project static-image persistence loss; the source report found the upload status path can set project.cover_image_asset_id.
Plane-5: suggested an `if not is_signup` guard that would be unsafe because `is_signup` is inverted in the current code.
Infisical-5: raised a false P1 auth-bypass/unreachable enrollment claim; the global auth extraction returns early without an Authorization header.
Formbricks-4: totalCount feedback was low-value because it admitted current behavior was functionally correct.
Twenty-1 and Twenty-2: dynamic key feedback was over-prioritized as P1, and Tmp type naming was cleanup-level.

CodeRabbit

Plane-1: "missing React import" was likely a false positive because React.FC was used as an erased type in a Next/TypeScript setup.
Infisical-5: repeated the same false critical v3 enrollment auth-bypass claim as Greptile and mixed security findings with low-value clipboard, Prettier, and unused-import comments.
Formbricks-4: audit-status feedback was a confirmed false positive because the wrapper defaults to failure and only changes to success on response.ok.
Formbricks-1 and Formbricks-3: locale-string and fire-and-forget comments inflated volume without proportional reasoning.
Twenty-3: an incomplete WorkspaceEntity cast was marked critical even though the real issue was system-auth/per-subscriber permission semantics.

MergeMitra

Plane-4: external profile-cover source-of-truth was the clearest trust hit; the backend computed cover_image_url from cover_image_asset or cover_image, so returning cover_image for external URLs was the correct persistence path.
Infisical: no clear false positive was confirmed in scored inline comments, though some maintainability comments were low priority.
Formbricks: no material hallucination was found in high-severity comments; weaker cases were prioritization issues.
Twenty: no material hallucination was found in high-severity comments; weaker cases included duplicated optimistic-create feedback, generic emitter coupling, and force-cast comments.

11. Missed Issues

No tool caught everything. These all-tool misses matter because they show where human review, tests, and runtime validation still have to carry the system.

Plane

Plane-5 avatar sync SSRF expansion: all tools missed that avatar sync now runs on normal synced OAuth sign-in and performs server-side requests.get(avatar_url, timeout=10) on provider-supplied URLs without scheme, host, DNS, redirect, or private-network validation.
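To make the missing validation concrete, here is a minimal sketch of a pre-fetch check covering scheme, host, DNS, and private-network validation. It is not Plane's code; the function names and the injectable resolver are assumptions introduced for illustration and testing.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit

Resolver = Callable[[str], Iterable[str]]


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_avatar_url(url: str, resolver: Resolver = resolve_host) -> bool:
    """Return True only for https URLs whose host resolves to public addresses."""
    try:
        parts = urlsplit(url)
        if parts.scheme != "https" or not parts.hostname:
            return False
        addresses = list(resolver(parts.hostname))
        if not addresses:
            return False
        # Reject loopback, private, link-local, and other non-global ranges.
        return all(ipaddress.ip_address(addr).is_global for addr in addresses)
    except (OSError, ValueError):
        # DNS failures and malformed URLs or addresses are treated as unsafe.
        return False
```

This check alone is not a complete defense. Redirects would still need to be disabled or re-validated on each hop (for example with `allow_redirects=False` in `requests`), and a DNS answer can change between validation and fetch unless the request connects to the validated address.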

Infisical

Infisical-2 exact SAML redirect loop: all tools missed the missing guard for SAML auth-method variants.
Infisical-4 resend email-verification token-shape mismatch: all tools missed that resend email verification decodes username from a signup token that contains email, userId, and aliasId, but no username.
Infisical-4 impossible CLI callback condition: all tools missed a branch that enters only when callbackPort is truthy, then checks authToken && !callbackPort.

Formbricks

Formbricks-1 exact historical SSO break: no tool fully caught the branch that blocks same-email users without a canonical Account row by throwing OAuthAccountNotLinked.
Formbricks-2 welcome-card logo sizing regression: no tool caught the exact logo/media regression fixed by upstream follow-ups.
Formbricks-3 enqueue helper swallowing failures: no tool fully called out that awaiting the enqueue helper is insufficient because the helper catches snapshot hydration failures and logs rejected queue adds without throwing.
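The enqueue pattern in the last item can be illustrated with a small Python analogue. Formbricks itself uses BullMQ in TypeScript, so every name below is hypothetical; the sketch only shows why awaiting a helper does not surface a failure the helper has already caught.

```python
import asyncio
import logging

logger = logging.getLogger("response_pipeline")


async def failing_queue_add(job: dict) -> None:
    """Stand-in for a queue client whose add call is rejected."""
    raise ConnectionError("queue unavailable")


async def enqueue_swallowing(job: dict) -> None:
    """Mirrors the described helper: logs the failure and returns normally."""
    try:
        await failing_queue_add(job)
    except Exception as exc:
        logger.warning("enqueue failed: %s", exc)  # logged, never re-raised


async def enqueue_reporting(job: dict) -> bool:
    """Alternative that tells the caller whether the job was actually queued."""
    try:
        await failing_queue_add(job)
    except Exception as exc:
        logger.warning("enqueue failed: %s", exc)
        return False
    return True


async def handle_response(job: dict) -> str:
    # Awaiting does not help here: the helper never raises, so the caller
    # reports success even though no job reached the queue.
    await enqueue_swallowing(job)
    return "accepted"
```

Returning a status (or re-raising) gives the caller something to act on, such as retrying or failing the request.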

Twenty

Twenty-1 empty-selection mass-update bug: no tool caught that selection mode with an empty selection can return no ID filter and apply bulk update to the broader view filter.
Twenty-1 repeated FindMany/cache-churn root cause: no tool fully identified the repeated FindMany/cache-refetch mechanism.
Twenty-1 side-panel/dropdown deselection path: no tool caught that clicking inside the bulk-update panel or dropdown could clear Kanban selection and create accidental mass-update risk.
Twenty-3 N+1 enrichment cost: no tool directly called out per-event navigation menu metadata enrichment cost.
Twenty-3 record favorites not visible optimistically: no tool directly caught that optimistic record items without targetRecordIdentifier are filtered out until mutation/SSE reconciliation.
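The Twenty-1 empty-selection bug at the top of this list follows a common pattern, sketched below. This is not Twenty's code: the function names and filter shape are hypothetical, and the sketch only shows how an empty selection can silently fall through to the broader view filter.

```python
def build_filter_unguarded(mode: str, selected_ids: list[str], view_filter: dict) -> dict:
    """Illustrates the bug class: an empty selection adds no ID filter."""
    if mode == "selection" and selected_ids:
        return {"and": [view_filter, {"id": {"in": selected_ids}}]}
    # Falls through for an empty selection, so the update targets the whole view.
    return view_filter


def build_filter_guarded(mode: str, selected_ids: list[str], view_filter: dict) -> dict:
    """Refuses to build a bulk-update filter for an empty selection."""
    if mode == "selection":
        if not selected_ids:
            raise ValueError("selection mode requires at least one selected record")
        return {"and": [view_filter, {"id": {"in": selected_ids}}]}
    return view_filter
```

A test that calls the filter builder with selection mode and an empty ID list is the kind of targeted regression test that would expose this path.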

12. Recommendation

Use MergeMitra as the primary AI PR reviewer for enterprise adoption on this benchmark profile.

It produced the strongest confirmed bug/regression detection, the best cross-system reasoning, and the most useful senior-review framing across all four repositories. The source reports consistently showed MergeMitra moving reviewer effort from "find the dangerous flows from scratch" to "validate and prioritize a strong shortlist."

Use CodeRabbit as the runner-up when the team wants a broad second pass and can afford senior-engineer triage. It is especially valuable for aggressive line-level bug hunting, security point findings, test failures, and edge-case checks. It should not be the sole gate without severity calibration.

Use Greptile as a concise supplemental reviewer when the team wants fewer comments and can tolerate narrower coverage. It is useful as a quick sanity check, but this benchmark does not support it as the primary merge-control signal.

The expected review-time impact is directional, not measured. MergeMitra should save the most time on risky PRs that touch shared helpers, auth/session boundaries, queues, optimistic state, API contracts, metadata propagation, or product data integrity. It does not replace senior reviewers, and high-severity findings still require human validation.

13. Caveats

The sample is 17 replayed PR cases across four repositories, not every possible language, architecture, team workflow, or tool configuration.
Some PRs were selected because they had known public regressions, so the benchmark intentionally stresses bug-finding when risk is present.
Feature/cross-dependency cases were scored by confirmed predictive review value, not by one public gold bug.
Some findings were validated statically rather than by executing full test suites.
Tool outputs can drift because these are LLM-backed reviewers.
Vendors may improve or regress after this evaluation.
Teams making a long-term buying decision should rerun this methodology on their own recent incidents, reverts, and large cross-dependency PRs.