17 OSS PR AI Code Review: MergeMitra vs CodeRabbit vs Greptile
MergeMitra vs. CodeRabbit vs. Greptile across 17 replayed OSS pull requests
A practical enterprise benchmark on Plane, Infisical, Formbricks, and Twenty.
Executive Summary
Objective: The objective was to evaluate which AI PR reviewer provides the strongest enterprise value when reviewing real, replayed pull requests from modern open-source products.
Context & Approach: We evaluated MergeMitra, CodeRabbit, and Greptile across four source reports covering Plane, Infisical, Formbricks, and Twenty. The corpus included 17 replayed PR cases and 51 tool reviews.
Some cases were issue-led or revert-led, with public follow-up evidence showing a real regression. Other cases were larger feature or cross-dependency PRs, where the evaluation measured confirmed predictive review value: product behavior, security, data integrity, queue behavior, API contracts, optimistic UI state, and test coverage.
Key Takeaways:
- MergeMitra won all four repository-level evaluations: Plane, Infisical, Formbricks, and Twenty.
- MergeMitra had the strongest bug depth and cross-system reasoning, especially where the bug lived outside the changed line: auth/session boundaries, queues, optimistic UI, metadata propagation, schema contracts, and backend/frontend behavior.
- CodeRabbit was the runner-up. It produced broad exploratory reviews and several excellent point findings, but its output needed more senior triage because of duplicates, noisy style/lint comments, over-prioritized claims, and missed public regressions.
- Greptile was concise and sometimes very sharp, including a standout Twenty security catch, but it was too sparse and inconsistent to be the primary enterprise reviewer on this corpus.
Recommendation: Use MergeMitra as the primary AI PR reviewer for this benchmark profile. Use CodeRabbit as a broad second pass when a team can afford careful triage. Use Greptile as a concise supplemental reviewer when quiet output matters more than complete coverage.
Background
Code review is not a complete defect-prevention system. QA, tests, staging, canaries, monitoring, and incident response all catch bugs that a reviewer cannot see from a diff.
The enterprise question is more specific: when a regression is visible from changed code and surrounding context, which AI reviewer is most likely to find it, explain the impact, and avoid drowning the team in low-value comments?
That is what this benchmark measures. Bug depth and regression prevention are weighted highest. Maintainability, tests, security, performance, accessibility, and signal-to-noise still matter, but a single confirmed regression catch is worth more than a long list of style suggestions.
1. Benchmark at a Glance
No replayed PR cases were excluded.
2. Codebases Under Test
Together, these repositories stress exactly the places where code review matters most: shared helpers, auth/session boundaries, background work, product state, API contracts, permissions, UI behavior, and tests.
3. How We Set Up the Test
Step 1 - Replay historical PRs
Each source report replayed historical pull requests into mirror repositories, one mirror per tool. The replay cases used pinned base and head SHAs so the tools saw the repository state from the original PR window, not today's main branch.
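A minimal sketch of what a pinned replay could look like, assuming one git remote per tool mirror. The `ReplayCase` type, the branch naming scheme, and the remote name are hypothetical; the source reports do not publish their replay tooling. The function only builds the commands, so it is safe to inspect before anything is pushed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCase:
    """One replayed PR, pinned to commits from the original PR window."""
    case_id: str   # e.g. "Plane-1"
    base_sha: str  # commit the original PR was opened against
    head_sha: str  # last commit of the original PR


def replay_commands(case: ReplayCase, mirror_remote: str) -> list[list[str]]:
    """Build the git/gh commands that recreate the PR in one mirror repository."""
    base_branch = f"replay/{case.case_id}/base"  # hypothetical naming scheme
    head_branch = f"replay/{case.case_id}/head"
    return [
        # Push the pinned commits to fresh branches, not today's main branch.
        ["git", "push", mirror_remote, f"{case.base_sha}:refs/heads/{base_branch}"],
        ["git", "push", mirror_remote, f"{case.head_sha}:refs/heads/{head_branch}"],
        # Open the replay PR between the two pinned branches.
        ["gh", "pr", "create", "--base", base_branch, "--head", head_branch,
         "--title", f"Replay {case.case_id}", "--body", "Replayed historical PR"],
    ]
```

Running the same command list once per mirror gives each tool an identical base and head, which is the property the replay step depends on.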
Step 2 - Separate issue-led and feature cases
The benchmark used two case types:
- Issue-led or revert-led cases: A later public issue, fix PR, or revert gave ground-truth evidence. A tool only counted as catching the primary bug when it identified the faulty code and impact consistent with that public follow-up.
- Feature or cross-dependency cases: These did not always have one public gold bug. They were scored for confirmed predictive review value: real integration risks, product/data failures, auth/security gaps, missing tests, performance costs, accessibility issues, and false or noisy claims.
Step 3 - Validate before scoring
The reports verified target PR existence, expected state, base SHA, head SHA, and commit counts. Where a CodeRabbit PR had to be reopened because the original mirror PR was closed, the same-diff replacement was used and called out:
- Plane-4 used CodeRabbit PR #7 instead of closed PR #5.
- Infisical-1 used CodeRabbit PR #9 instead of closed PR #1.
- Formbricks-2 used CodeRabbit PR #7 instead of closed PR #3.
- Formbricks-3 used CodeRabbit PR #10 instead of closed PR #4.
- Twenty-3 used CodeRabbit PR #11 instead of closed PR #8.
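The pre-scoring checks can be expressed as a comparison between the expected and the observed PR snapshot. This is a sketch under assumptions: `PrSnapshot` and `validation_errors` are hypothetical names, and the observed values would come from whatever metadata source the evaluation used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrSnapshot:
    state: str         # e.g. "open" or "closed"
    base_sha: str
    head_sha: str
    commit_count: int


def validation_errors(expected: PrSnapshot, observed: Optional[PrSnapshot]) -> list[str]:
    """Return every mismatch between the expected and observed replay PR."""
    if observed is None:
        return ["target PR does not exist"]
    errors = []
    for field in ("state", "base_sha", "head_sha", "commit_count"):
        want, got = getattr(expected, field), getattr(observed, field)
        if want != got:
            errors.append(f"{field}: expected {want!r}, observed {got!r}")
    return errors
```

An empty list means the case can be scored. A `state` mismatch with matching SHAs and commit count corresponds to the closed-mirror situation listed above, where a same-diff replacement PR was used instead.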
Step 4 - Collect and validate evidence
The evaluations collected PR metadata, changed files, commit lists, reviews, inline review comments, issue comments, compare JSON, and compare diffs through the GitHub CLI and GitHub API. Bot walkthroughs, summaries, and generated checklists were not treated as findings unless they made concrete, checkable claims.
Findings were validated by reading exact base..head diffs and tracing caller/callee relationships, frontend state, backend services, permissions, auth/session/token flows, database behavior, queue/job handoffs, API contracts, tests, and UI behavior where relevant.
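For readers who want to reproduce the evidence collection, the sketch below maps each artifact to a `gh api` call against the GitHub REST API. The exact commands used by the source reports are not published, so the mapping and the `evidence_commands` helper are assumptions; the endpoints themselves are standard GitHub REST routes.

```python
def evidence_commands(repo: str, pr: int, base_sha: str, head_sha: str) -> dict[str, list[str]]:
    """Map each evidence artifact to a GitHub CLI command (repo is "owner/name")."""
    pulls = f"repos/{repo}/pulls/{pr}"
    compare = f"repos/{repo}/compare/{base_sha}...{head_sha}"
    return {
        "metadata": ["gh", "api", pulls],
        "changed_files": ["gh", "api", "--paginate", f"{pulls}/files"],
        "commits": ["gh", "api", "--paginate", f"{pulls}/commits"],
        "reviews": ["gh", "api", "--paginate", f"{pulls}/reviews"],
        # Inline review comments live under the pull request...
        "inline_comments": ["gh", "api", "--paginate", f"{pulls}/comments"],
        # ...while top-level PR conversation comments live under the issue.
        "issue_comments": ["gh", "api", "--paginate", f"repos/{repo}/issues/{pr}/comments"],
        "compare_json": ["gh", "api", compare],
        # Same compare endpoint, requested as a unified diff via the Accept header.
        "compare_diff": ["gh", "api", "-H", "Accept: application/vnd.github.diff", compare],
    }
```

Each command can be passed to `subprocess.run(cmd, capture_output=True, check=True)` and the output stored per case, so that every scored finding can be traced back to a saved artifact.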
4. Full PR Inventory
5. Overall Results
6. Aggregate Scorecard
The scores below are simple unweighted averages of per-repository scorecards, rounded to one decimal, where the category was comparable across the four source reports.
Product behavior/data integrity and accessibility were not averaged because they were not scored consistently across all four reports.
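The aggregation rule described above can be sketched as follows. The category names and numbers in the usage example are placeholders chosen to show the arithmetic, not scores from this benchmark.

```python
from statistics import mean


def aggregate_scores(per_repo: dict[str, dict[str, float]]) -> dict[str, float]:
    """Unweighted mean per category, keeping only categories scored in every report."""
    reports = list(per_repo.values())
    # A category is comparable only if every repository report scored it.
    comparable = set(reports[0]).intersection(*reports[1:])
    return {
        category: round(float(mean(report[category] for report in reports)), 1)
        for category in sorted(comparable)
    }


# Placeholder inputs for one tool; NOT benchmark results.
example = {
    "Plane": {"bug_depth": 8, "tests": 6, "accessibility": 5},
    "Infisical": {"bug_depth": 9, "tests": 7},
    "Formbricks": {"bug_depth": 8, "tests": 7, "accessibility": 6},
    "Twenty": {"bug_depth": 7, "tests": 6},
}
print(aggregate_scores(example))  # {'bug_depth': 8.0, 'tests': 6.5}
```

In the placeholder data, `accessibility` is dropped because only two of the four reports scored it, which mirrors why product behavior/data integrity and accessibility are left out of the aggregate.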
7. Repository-by-Repository Results
Plane
Winner: MergeMitra.
MergeMitra caught both public issue-led regressions and produced the strongest cross-file predictions on large feature PRs. Its most important confirmed findings included the Plane-1 editor initialization regression, Plane-2 grouped-loader height regression, Plane-3 compile blocker and unregistered TitleSyncExtension, Plane-4 static cover image issues, and Plane-5 auth-sync/profile-data risks.
The biggest gap: all tools missed the Plane-5 avatar-sync SSRF expansion. MergeMitra also had one clear trust hit around external profile-cover source of truth.
Infisical
Winner: MergeMitra.
Infisical is a secrets-management platform, so auth, session, token, tenant, and audit correctness matter more than style polish. MergeMitra was the only tool that reviewed the large auth refactor and caught the Infisical-1 stale auth/session regression class. It also surfaced refresh-token exposure, cookie-path drift, unsigned signup-token trust, domain uniqueness race, MFA URL leak, OIDC default-login bug, gateway identity drift, permission mismatch, relay persistence, audit gaps, rollback, and tests.
The biggest gap: all tools missed the exact Infisical-2 SAML redirect loop. Greptile and CodeRabbit also skipped Infisical-4 due to file limits.
Formbricks
Winner: MergeMitra.
MergeMitra most consistently connected changes to product behavior, data integrity, async handoff risk, rollback behavior, performance, accessibility, and test coverage. Its strongest confirmed findings included BullMQ producer-only deployments skipping jobs, stale response snapshots, webhook failures not triggering retries, v1 delete status drift, optimistic empty-state risk, keyboard accessibility, full-count cost, route contract gaps, and adjacent media-contract bugs.
The biggest gap: no tool fully caught the exact historical SSO break, the exact welcome-card logo sizing regression, or the enqueue helper swallowing failures. CodeRabbit did catch a critical Prisma upsert shape bug that MergeMitra missed.
Twenty
Winner: MergeMitra.
MergeMitra produced the most confirmed actionable signal across stale bulk-update UI, permission gating, view-filter corruption, transaction boundaries, optimistic metadata drift, folder cascade behavior, and tests. It led on Twenty-2 and Twenty-3 and had the highest signal on Twenty-1, even though it missed the most dangerous empty-selection bug along with the other tools.
The biggest gap: all tools missed the exact Twenty-1 empty-selection mass-update bug, repeated FindMany/cache-churn root cause, and side-panel/dropdown click-outside deselection path.
8. Case Library
9. Tool-by-Tool Observations
MergeMitra - the best primary reviewer
MergeMitra's strongest pattern was following changed code into downstream behavior. It repeatedly connected a local diff to the product state, auth boundary, queue worker, schema consumer, optimistic UI path, or missing test that would actually determine whether the PR was safe.
That is why it won all four repository-level evaluations. Its weakness was not hallucination-heavy output; it was occasional verbosity and some maintainability comments that were true but not PR-blocking.
CodeRabbit - the broad second pass
CodeRabbit produced broad exploratory reviews and several high-value point findings. It was especially useful on test hygiene, security edges, authorization details, and line-level bugs. It caught notable issues such as the Formbricks Prisma upsert shape bug and Infisical PAM null-gateway bypasses.
The trade-off was triage. The source reports repeatedly found duplicated comments, noisy style/lint feedback, over-prioritized claims, and missed public regressions. CodeRabbit is useful, but the benchmark does not support using it as the sole merge gate without severity calibration.
Greptile - concise but incomplete
Greptile's best comments were focused and easy to act on. It had the clearest inline Twenty-3 SSE permission leak and several good product/security catches.
Its problem was coverage. It missed too many primary regressions, was sparse on large cross-system PRs, and skipped Infisical-4 because of file-limit behavior. For teams that want a quiet second opinion, Greptile has a place. For primary enterprise review, it was too narrow on this evidence.
10. False Positives and Noise
Greptile
- Plane-4: overstated project static-image persistence loss; the source report found the upload status path can set `project.cover_image_asset_id`.
- Plane-5: suggested an unsafe `if not is_signup` guard because `is_signup` is inverted in the current code.
- Infisical-5: raised a false P1 auth-bypass/unreachable enrollment claim; the global auth extraction returns early without an `Authorization` header.
- Formbricks-4: `totalCount` feedback was low-value because it admitted current behavior was functionally correct.
- Twenty-1 and Twenty-2: dynamic `key` feedback was over-prioritized as P1, and `Tmp` type naming was cleanup-level.
CodeRabbit
- Plane-1: "missing React import" was likely a false positive because `React.FC` was used as an erased type in a Next/TypeScript setup.
- Infisical-5: repeated the same false critical v3 enrollment auth-bypass claim as Greptile and mixed security findings with low-value clipboard, Prettier, and unused-import comments.
- Formbricks-4: audit-status feedback was a confirmed false positive because the wrapper defaults to failure and only changes to success on `response.ok`.
- Formbricks-1 and Formbricks-3: locale-string and fire-and-forget comments inflated volume without proportional reasoning.
- Twenty-3: an incomplete `WorkspaceEntity` cast was marked critical even though the real issue was system-auth/per-subscriber permission semantics.
MergeMitra
- Plane-4: external profile-cover source-of-truth was the clearest trust hit; the backend computed `cover_image_url` from `cover_image_asset` or `cover_image`, so returning `cover_image` for external URLs was the correct persistence path.
- Infisical: no clear false positive was confirmed in scored inline comments, though some maintainability comments were low priority.
- Formbricks: no material hallucination was found in high-severity comments; weaker cases were prioritization issues.
- Twenty: no material hallucination was found in high-severity comments; weaker cases included duplicated optimistic-create feedback, generic emitter coupling, and force-cast comments.
11. Missed Issues
No tool caught everything. These all-tool misses matter because they show where human review, tests, and runtime validation still have to carry the system.
Plane
- Plane-5 avatar sync SSRF expansion: all tools missed that avatar sync now runs on normal synced OAuth sign-in and performs server-side `requests.get(avatar_url, timeout=10)` on provider-supplied URLs without scheme, host, DNS, redirect, or private-network validation.
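To make the missing checks concrete, here is a minimal sketch of a pre-fetch guard covering scheme, host, DNS, and private-network validation. It is not Plane's code or the upstream fix; `is_safe_avatar_url` and `resolve_host` are hypothetical helpers, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit


def resolve_host(host: str) -> list[str]:
    """Default resolver: every IP address the host name currently maps to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_safe_avatar_url(
    url: str, resolve: Callable[[str], Iterable[str]] = resolve_host
) -> bool:
    """Reject URLs that could make the server fetch internal resources."""
    parts = urlsplit(url)
    # Scheme and host checks: only plain web URLs with a host are allowed.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = list(resolve(parts.hostname))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    if not addresses:
        return False
    for raw in addresses:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            return False
        # Private-network check: every resolved address must be publicly routable.
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified):
            return False
    return True
```

A guard like this is only part of the defense. Redirects still need to be disabled or re-validated hop by hop, and the fetch should connect to the address that was validated, since a host name can resolve differently between the check and the request.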
Infisical
- Infisical-2 exact SAML redirect loop: all tools missed the missing guard for SAML auth-method variants.
- Infisical-4 resend email-verification token-shape mismatch: all tools missed that resend email verification decodes `username` from a signup token that contains `email`, `userId`, and `aliasId`, but no `username`.
- Infisical-4 impossible CLI callback condition: all tools missed a branch that enters only when `callbackPort` is truthy, then checks `authToken && !callbackPort`.
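The impossible CLI callback condition is easy to confirm by enumeration. The sketch below is a Python model of the described branch shape, not Infisical's TypeScript; the function names and sample values are illustrative.

```python
from itertools import product
from typing import Optional


def inner_branch_reached(auth_token: Optional[str], callback_port: Optional[int]) -> bool:
    """Model of the described branch: outer guard, then a contradictory inner check."""
    if callback_port:                          # outer branch requires a truthy port
        if auth_token and not callback_port:   # inner check requires a falsy port
            return True
    return False


def inner_branch_ever_runs() -> bool:
    """Try every combination of falsy and truthy inputs."""
    tokens = [None, "", "token"]
    ports = [None, 0, 4000]
    return any(inner_branch_reached(t, p) for t, p in product(tokens, ports))
```

`inner_branch_ever_runs()` returns `False`: once the outer guard has established that `callback_port` is truthy, `not callback_port` can never hold, so whatever the inner branch was meant to do is dead code.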
Formbricks
- Formbricks-1 exact historical SSO break: no tool fully caught the branch that blocks same-email users without a canonical `Account` row by throwing `OAuthAccountNotLinked`.
- Formbricks-2 welcome-card logo sizing regression: no tool caught the exact logo/media regression fixed by upstream follow-ups.
- Formbricks-3 enqueue helper swallowing failures: no tool fully called out that awaiting the enqueue helper is insufficient because the helper catches snapshot hydration failures and logs rejected queue adds without throwing.
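The enqueue-helper miss is a general pattern: awaiting a helper proves nothing if the helper catches its own failures. The following is a Python `asyncio` model of that pattern, not Formbricks code; all names, including `QueueUnavailable` and the job payload, are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("pipeline")


class QueueUnavailable(Exception):
    """Stand-in for a rejected queue add."""


async def add_to_queue(job: dict) -> None:
    # Simulate the queue rejecting the job.
    raise QueueUnavailable("queue rejected the job")


async def enqueue_swallowing(job: dict) -> None:
    """Failure mode described above: log the rejection, then return normally."""
    try:
        await add_to_queue(job)
    except QueueUnavailable:
        logger.error("enqueue failed for %s", job)


async def enqueue_reporting(job: dict) -> bool:
    """One possible alternative: report the outcome so the caller can react."""
    try:
        await add_to_queue(job)
    except QueueUnavailable:
        logger.error("enqueue failed for %s", job)
        return False
    return True


async def handle_response(enqueue) -> str:
    result = await enqueue({"responseId": "r1"})
    # With the swallowing helper, result is None and the caller cannot tell.
    return "retry-scheduled" if result is False else "assumed-enqueued"
```

`asyncio.run(handle_response(enqueue_swallowing))` yields `"assumed-enqueued"` even though the job was rejected, while the reporting variant yields `"retry-scheduled"`. Re-raising from the helper would work equally well; the point is that the failure has to reach the caller in some form.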
Twenty
- Twenty-1 empty-selection mass-update bug: no tool caught that selection mode with an empty selection can return no ID filter and apply bulk update to the broader view filter.
- Twenty-1 repeated FindMany/cache-churn root cause: no tool fully identified the repeated FindMany/cache-refetch mechanism.
- Twenty-1 side-panel/dropdown deselection path: no tool caught that clicking inside the bulk-update panel or dropdown could clear Kanban selection and create accidental mass-update risk.
- Twenty-3 N+1 enrichment cost: no tool directly called out per-event navigation menu metadata enrichment cost.
- Twenty-3 record favorites not visible optimistically: no tool directly caught that optimistic record items without `targetRecordIdentifier` are filtered out until mutation/SSE reconciliation.
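The Twenty-1 empty-selection miss is worth spelling out because the failure is silent. Below is a Python model of the described behavior, assuming a simple dictionary filter format; it is not Twenty's filter code, and both builder functions are hypothetical.

```python
from typing import Optional


def build_update_filter_unsafe(view_filter: dict, selected_ids: list[str]) -> dict:
    """Models the reported failure: an empty selection adds no ID constraint."""
    conditions = [view_filter]
    if selected_ids:
        conditions.append({"id": {"in": selected_ids}})
    return {"and": conditions}


def build_update_filter_guarded(
    view_filter: dict, selected_ids: list[str], selection_mode: bool
) -> Optional[dict]:
    """Guarded variant: selection mode with nothing selected updates nothing."""
    if selection_mode:
        if not selected_ids:
            return None  # caller should skip the mutation entirely
        return {"and": [view_filter, {"id": {"in": selected_ids}}]}
    # Only an explicit "whole view" action may fall back to the view filter.
    return {"and": [view_filter]}
```

With `view_filter = {"stage": {"eq": "OPEN"}}` and an empty selection, the unsafe builder returns a filter that matches every open record in the view, while the guarded builder returns `None`. This also shows why the click-outside deselection path matters: anything that clears the selection while the bulk-update panel is open turns a targeted update into a view-wide one.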
12. Recommendation
Use MergeMitra as the primary AI PR reviewer for enterprise adoption on this benchmark profile.
It produced the strongest confirmed bug/regression detection, the best cross-system reasoning, and the most useful senior-review framing across all four repositories. The source reports consistently showed MergeMitra moving reviewer effort from "find the dangerous flows from scratch" to "validate and prioritize a strong shortlist."
Use CodeRabbit as the runner-up when the team wants a broad second pass and can afford senior-engineer triage. It is especially valuable for aggressive line-level bug hunting, security point findings, test failures, and edge-case checks. It should not be the sole gate without severity calibration.
Use Greptile as a concise supplemental reviewer when the team wants fewer comments and can tolerate narrower coverage. It is useful as a quick sanity check, but this benchmark does not support it as the primary merge-control signal.
The expected review-time impact is directional, not measured. MergeMitra should save the most time on risky PRs that touch shared helpers, auth/session boundaries, queues, optimistic state, API contracts, metadata propagation, or product data integrity. It does not replace senior reviewers, and high-severity findings still require human validation.
13. Caveats
- The sample is 17 replayed PR cases across four repositories, not every possible language, architecture, team workflow, or tool configuration.
- Some PRs were selected because they had known public regressions, so the benchmark intentionally stresses bug-finding when risk is present.
- Feature/cross-dependency cases were scored by confirmed predictive review value, not by one public gold bug.
- Some findings were validated statically rather than by executing full test suites.
- Tool outputs can drift because these are LLM-backed reviewers.
- Vendors may improve or regress after this evaluation.
- Teams making a long-term buying decision should rerun this methodology on their own recent incidents, reverts, and large cross-dependency PRs.