17 OSS PR AI Code Review: MergeMitra vs CodeRabbit vs Greptile

Author: Vidya Shree B V

MergeMitra vs. CodeRabbit vs. Greptile across 17 replayed OSS pull requests

A practical enterprise benchmark on Plane, Infisical, Formbricks, and Twenty.

Executive Summary

Objective: Evaluate which AI PR reviewer provides the strongest enterprise value when reviewing real, replayed pull requests from modern open-source products.

Context & Approach: We evaluated MergeMitra, CodeRabbit, and Greptile across four source reports covering Plane, Infisical, Formbricks, and Twenty. The corpus included 17 replayed PR cases and 51 tool reviews.

Some cases were issue-led or revert-led, with public follow-up evidence showing a real regression. Other cases were larger feature or cross-dependency PRs, where the evaluation measured confirmed predictive review value: product behavior, security, data integrity, queue behavior, API contracts, optimistic UI state, and test coverage.

Key Takeaways:

  • MergeMitra won all four repository-level evaluations: Plane, Infisical, Formbricks, and Twenty.
  • MergeMitra had the strongest bug depth and cross-system reasoning, especially where the bug lived outside the changed line: auth/session boundaries, queues, optimistic UI, metadata propagation, schema contracts, and backend/frontend behavior.
  • CodeRabbit was the runner-up. It produced broad exploratory reviews and several excellent point findings, but its output needed more senior triage because of duplicates, noisy style/lint comments, over-prioritized claims, and missed public regressions.
  • Greptile was concise and sometimes very sharp, including a standout Twenty security catch, but it was too sparse and inconsistent to be the primary enterprise reviewer on this corpus.

Recommendation: Use MergeMitra as the primary AI PR reviewer for this benchmark profile. Use CodeRabbit as a broad second pass when a team can afford careful triage. Use Greptile as a concise supplemental reviewer when quiet output matters more than complete coverage.

Background

Code review is not a complete defect-prevention system. QA, tests, staging, canaries, monitoring, and incident response all catch bugs that a reviewer cannot see from a diff.

The enterprise question is more specific: when a regression is visible from changed code and surrounding context, which AI reviewer is most likely to find it, explain the impact, and avoid drowning the team in low-value comments?

That is what this benchmark measures. Bug depth and regression prevention are weighted highest. Maintainability, tests, security, performance, accessibility, and signal-to-noise still matter, but a single confirmed regression catch is worth more than a long list of style suggestions.

1. Benchmark at a Glance

| Dimension | Value |
| --- | --- |
| Tools evaluated | MergeMitra, CodeRabbit, Greptile |
| Repositories | Plane, Infisical, Formbricks, Twenty |
| PR cases | 17 replayed cases |
| Tool reviews | 51 total PR reviews |
| Evaluation type | Historical replay benchmark with issue-led, revert-led, and feature/cross-dependency cases |
| Primary metric | Confirmed bug/regression detection and enterprise review value |
| Secondary metrics | Cross-file reasoning, product/data integrity, auth/security, tests, performance, accessibility, maintainability, signal-to-noise |

No replayed PR cases were excluded.

2. Codebases Under Test

| Codebase | Domain | Cases | Why it is useful |
| --- | --- | --- | --- |
| Plane | Product/project management | 5 | Editor lifecycle, realtime sync, cover images, auth sync, and UI behavior |
| Infisical | Secrets management | 5 | Auth/session correctness, token handling, tenant isolation, gateway enrollment, audit behavior |
| Formbricks | Survey platform | 4 | SSO behavior, media contracts, BullMQ response pipeline, API migration, accessibility |
| Twenty | CRM/productivity platform | 3 | Bulk updates, filters, optimistic metadata, permissions, transactions, frontend state |

Together, these repositories stress exactly the places where code review matters most: shared helpers, auth/session boundaries, background work, product state, API contracts, permissions, UI behavior, and tests.

3. How We Set Up the Test

Step 1 - Replay historical PRs

Each source report replayed historical pull requests into mirror repositories, one mirror per tool. The replay cases used pinned base and head SHAs so the tools saw the repository state from the original PR window, not today's main branch.
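A minimal sketch of what one replay could look like, assuming the mirror is an ordinary Git remote and the PR is opened with the GitHub CLI. The `ReplayCase` type, the `replay/...` branch naming, and the PR title and body are hypothetical illustrations, not the source reports' actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCase:
    case_id: str   # e.g. "Plane-1"
    base_sha: str  # pinned commit the original PR was based on
    head_sha: str  # pinned last commit of the original PR window


def replay_commands(case: ReplayCase, remote: str) -> list[list[str]]:
    """Return the git/gh invocations that recreate one PR in a mirror.

    Branch names and PR text are hypothetical; only the pinning idea
    (push exact SHAs, then diff base against head) comes from the text.
    """
    slug = case.case_id.lower()
    base_branch = f"replay/{slug}-base"
    head_branch = f"replay/{slug}-head"
    return [
        # Publish both pinned commits as branches in the mirror repository.
        ["git", "push", remote, f"{case.base_sha}:refs/heads/{base_branch}"],
        ["git", "push", remote, f"{case.head_sha}:refs/heads/{head_branch}"],
        # Open the replayed PR between exactly those two states.
        ["gh", "pr", "create", "--base", base_branch, "--head", head_branch,
         "--title", f"Replay {case.case_id}", "--body", "Historical replay"],
    ]
```

Pushing the SHAs directly, rather than rebasing onto the current default branch, is what keeps each tool looking at the repository as it was during the original PR window.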

Step 2 - Separate issue-led and feature cases

The benchmark used two case types:

  • Issue-led or revert-led cases: A later public issue, fix PR, or revert gave ground-truth evidence. A tool only counted as catching the primary bug when it identified the faulty code and impact consistent with that public follow-up.
  • Feature or cross-dependency cases: These did not always have one public gold bug. They were scored for confirmed predictive review value: real integration risks, product/data failures, auth/security gaps, missing tests, performance costs, accessibility issues, and false or noisy claims.
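The issue-led rule above can be read as a small decision function. The sketch below is illustrative only: the `Finding` fields and the treatment of "partial" (code or impact identified, but not both) are assumptions, since the source reports state the "caught" rule but not a formal definition of partial credit.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CAUGHT = "caught primary"
    PARTIAL = "partial"
    MISSED = "missed primary"


@dataclass(frozen=True)
class Finding:
    names_faulty_code: bool  # points at the code the public follow-up changed
    explains_impact: bool    # describes the failure the follow-up reported


def score_issue_led(findings: list[Finding]) -> Verdict:
    """Classify one tool's review of one issue-led or revert-led case."""
    # "Caught" requires a single finding that has both code and impact.
    if any(f.names_faulty_code and f.explains_impact for f in findings):
        return Verdict.CAUGHT
    # Assumption: one of the two without the other earns partial credit.
    if any(f.names_faulty_code or f.explains_impact for f in findings):
        return Verdict.PARTIAL
    return Verdict.MISSED
```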

Step 3 - Validate before scoring

The reports verified target PR existence, expected state, base SHA, head SHA, and commit counts. Where a CodeRabbit PR had to be reopened because the original mirror PR was closed, the same-diff replacement was used and called out:

  • Plane-4 used CodeRabbit PR #7 instead of closed PR #5.
  • Infisical-1 used CodeRabbit PR #9 instead of closed PR #1.
  • Formbricks-2 used CodeRabbit PR #7 instead of closed PR #3.
  • Formbricks-3 used CodeRabbit PR #10 instead of closed PR #4.
  • Twenty-3 used CodeRabbit PR #11 instead of closed PR #8.
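The replacement rule and the pre-scoring checks can be sketched as follows. The PR numbers come from the list above; the function names and the metadata field names (`state`, `base_sha`, `head_sha`, `commits`) are hypothetical.

```python
# CodeRabbit replacements listed above: case -> (closed PR, same-diff replacement).
CODERABBIT_REPLACEMENTS = {
    "Plane-4": (5, 7),
    "Infisical-1": (1, 9),
    "Formbricks-2": (3, 7),
    "Formbricks-3": (4, 10),
    "Twenty-3": (8, 11),
}

# Hypothetical field names for the properties the reports verified.
CHECKED_FIELDS = ("state", "base_sha", "head_sha", "commits")


def scored_pr(tool: str, case_id: str, mirror_pr: int) -> int:
    """Return the PR number whose review is scored for this tool and case."""
    if tool == "CodeRabbit" and case_id in CODERABBIT_REPLACEMENTS:
        closed, replacement = CODERABBIT_REPLACEMENTS[case_id]
        if mirror_pr != closed:
            raise ValueError(f"{case_id}: expected closed PR #{closed}")
        return replacement
    return mirror_pr


def mismatches(expected: dict, observed: dict) -> list[str]:
    """List the checked fields where the mirror PR differs from the plan."""
    return [f for f in CHECKED_FIELDS if expected.get(f) != observed.get(f)]
```

A case would only move on to scoring when `mismatches` returns an empty list for every tool's mirror PR.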

Step 4 - Collect and validate evidence

The evaluations collected PR metadata, changed files, commit lists, reviews, inline review comments, issue comments, compare JSON, and compare diffs through the GitHub CLI and GitHub API. Bot walkthroughs, summaries, and generated checklists were not treated as findings unless they made concrete, checkable claims.
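As a rough illustration of that collection step, the sketch below builds `gh api` invocations for the standard GitHub REST endpoints that expose each evidence type. The endpoint paths are GitHub's documented ones; the function itself, the dictionary keys, and any mirror repository name passed in are assumptions rather than the reports' actual scripts.

```python
def evidence_commands(repo: str, pr: int, base: str, head: str) -> dict[str, list[str]]:
    """Map each evidence type to the `gh api` command that would fetch it.

    `repo` is "owner/name" of a mirror repository (hypothetical here).
    """
    paged = ["gh", "api", "--paginate"]  # list endpoints can span several pages
    pull = f"repos/{repo}/pulls/{pr}"
    compare = f"repos/{repo}/compare/{base}...{head}"
    return {
        "metadata": ["gh", "api", pull],
        "files": paged + [f"{pull}/files"],
        "commits": paged + [f"{pull}/commits"],
        "reviews": paged + [f"{pull}/reviews"],
        "inline_comments": paged + [f"{pull}/comments"],
        # Top-level PR conversation comments live on the issues endpoint.
        "issue_comments": paged + [f"repos/{repo}/issues/{pr}/comments"],
        "compare_json": ["gh", "api", compare],
        # The diff media type returns the raw base...head diff instead of JSON.
        "compare_diff": ["gh", "api", "-H", "Accept: application/vnd.github.diff", compare],
    }
```

Each command list can be passed to `subprocess.run(..., capture_output=True, check=True)` and the output stored next to the case ID, so that a later reader can re-validate a finding against the same evidence.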

Findings were validated by reading exact base..head diffs and tracing caller/callee relationships, frontend state, backend services, permissions, auth/session/token flows, database behavior, queue/job handoffs, API contracts, tests, and UI behavior where relevant.

4. Full PR Inventory

| Repo | Case | Upstream / follow-up context | Greptile PR | CodeRabbit PR | MergeMitra PR | Base SHA | Head / cutoff SHA | Commits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Plane | Plane-1: editor initialization | makeplane/plane#3013, follow-up #3025 | #1 | #1 | #1 | bffba6b9dcbe | 2df534689ca1 | 1 |
| Plane | Plane-2: grouped loader regression | makeplane/plane#5210, follow-up #5238 | #2 | #2 | #2 | 0839666d81a8 | d04366632858 | 1 |
| Plane | Plane-3: realtime/editor feature | makeplane/plane#8294 | #4 | #4 | #4 | 20510bb2dd53 | b4fb4b6eda83 | 4 |
| Plane | Plane-4: static cover images | makeplane/plane#8184 | #5 | #7 | #5 | e6d584fde7e6 | e667a5496561 | 2 |
| Plane | Plane-5: auth sync | makeplane/plane#8336 | #6 | #6 | #6 | 22339b9786e9 | 0df31a84ee0b | 6 |
| Infisical | Infisical-1: stale auth/session cache | Infisical/infisical#6002, follow-ups #6058, #6059, #6065 | #1 | #9 | #1 | e15e40d0fdc1 | 4849bbc765b3 | 4 |
| Infisical | Infisical-2: SAML select-org flow | Infisical/infisical#5652, follow-ups #5663, #5667 | #2 | #2 | #2 | df9bed951781 | 8a5119718966 | 2 |
| Infisical | Infisical-3: stale selected org | Infisical/infisical#2412, follow-up #2421 | #3 | #3 | #3 | f213c75ede1a | 36e3e4c1b583 | 1 |
| Infisical | Infisical-4: large auth refactor | Infisical/infisical#5947 | #4 | #4 | #4 | 9c6573bd71bc | 359c0261c42a | 24 |
| Infisical | Infisical-5: gateway/PAM enrollment | Infisical/infisical#6020 | #5 | #5 | #5 | f3af0f7f4f89 | db7c851436af | 25 |
| Formbricks | Formbricks-1: SSO hardening | formbricks/formbricks#7702, follow-ups #7728, #7755 | #1 | #1 | #1 | d96304d86d78 | 885a81d2b85b | 1 |
| Formbricks | Formbricks-2: welcome-card media | formbricks/formbricks#7497, follow-ups #7712, #7720 | #3 | #7 | #3 | 1e7817fb69f9 | 6c871b5cd5f4 | 1 |
| Formbricks | Formbricks-3: BullMQ response pipeline | formbricks/formbricks#7695 | #4 | #10 | #4 | ebaa2d363ce9 | 8de5079db380 | 9 |
| Formbricks | Formbricks-4: v3 survey overview | formbricks/formbricks#7741 | #6 | #6 | #6 | a1a11b2bb8c0 | 848a85bb3453 | 1 |
| Twenty | Twenty-1: bulk update | twentyhq/twenty#16384, issue #17117, fix #17213 | #4 | #4 | #4 | 3e57aa14d390 | b52bcd6e349c | 1 |
| Twenty | Twenty-2: select filter propagation | twentyhq/twenty#12082, follow-up #12352 with cutoff caveat | #6 | #6 | #6 | c98439d76ae6 | 9b72bce76042 | 19 |
| Twenty | Twenty-3: navigation menu optimistic metadata | twentyhq/twenty#18710 | #8 | #11 | #8 | 994215e0dca5 | 28747d3dd26f | 2 |

5. Overall Results

| Repository | Winner | Runner-up | Greptile | CodeRabbit | MergeMitra |
| --- | --- | --- | --- | --- | --- |
| Plane | MergeMitra | CodeRabbit | Caught Plane-2 and some feature bugs, but missed Plane-1 and was sparse on large PRs. | Broad exploratory coverage, especially Plane-3, but missed both public issue-led regressions. | Caught both public issue-led regressions and had the strongest feature integration findings. |
| Infisical | MergeMitra | CodeRabbit | Useful on smaller PRs, but missed Infisical-1, skipped Infisical-4, and raised a false P1. | Strong on Infisical-5 gateway/PAM authorization issues, but missed Infisical-1, skipped Infisical-4, and produced noisy critical claims. | Only tool to review the large auth refactor and catch the Infisical-1 stale auth/session class; strongest security/auth coverage. |
| Formbricks | MergeMitra | CodeRabbit | Concise and found some real defects, but missed too many historical and cross-system risks. | Found the critical Prisma upsert shape bug and strong BullMQ issues, but had duplicates, locale noise, and a material false positive. | Best confirmed coverage across auth behavior, queues, product/data integrity, performance, accessibility, and tests. |
| Twenty | MergeMitra | CodeRabbit | Had the clearest inline Twenty-3 SSE permission leak and strong Twenty-2 comments, but was narrow and missed much of Twenty-1. | Broad and useful on tests, edge cases, DnD, and rollback issues, but noisier and less direct on security framing. | Highest confirmed signal across stale UI, permissions, filter propagation, transactions, optimistic metadata, and tests. |

| Tool | Overall pattern | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Greptile | Concise supplemental reviewer with uneven coverage. | Focused comments, clear findings when correct, useful security/product catches in some cases. | Missed many primary regressions, sparse on large cross-system PRs, skipped Infisical-4 due to file limits, some overstated findings. |
| CodeRabbit | Broad runner-up, best as an aggressive second pass. | Strong breadth, useful test/security edge cases, several excellent point findings such as Formbricks Prisma upsert and Infisical PAM bypasses. | More noise, duplicated comments, over-prioritized claims, missed public regressions, skipped Infisical-4. |
| MergeMitra | Strongest primary enterprise reviewer in this corpus. | Best bug depth, cross-system reasoning, product/data integrity, auth/security, targeted tests, and senior-review framing. | Verbose, sometimes stylistic, occasional weak source-of-truth or architecture comments, still missed important all-tool issues. |

6. Aggregate Scorecard

The scores below are simple unweighted averages of per-repository scorecards, rounded to one decimal, where the category was comparable across the four source reports.

| Category | Greptile | CodeRabbit | MergeMitra | Interpretation |
| --- | --- | --- | --- | --- |
| Bug depth / regression detection | 2.9 | 3.7 | 4.6 | MergeMitra leads clearly; it won every repository-level bug-depth conclusion. |
| Cross-file / cross-system reasoning | 2.7 | 3.7 | 4.8 | MergeMitra led across auth/session, queues, metadata, optimistic UI, and backend/frontend contracts. |
| Security / authorization | 2.7 | 3.4 | 4.3 | MergeMitra led overall; Greptile had one standout Twenty security catch but was inconsistent. |
| Maintainability | 2.5 | 3.4 | 4.2 | CodeRabbit and MergeMitra both had maintainability signal; MergeMitra tied it more often to product risk. |
| Tests | 1.5 | 2.7 | 4.6 | MergeMitra most consistently tied test gaps to concrete regression paths. |
| Performance | 2.2 | 2.7 | 3.9 | MergeMitra surfaced more scaling, queue, count-query, and hot-path concerns. |
| Signal-to-noise | 3.8 | 2.9 | 4.3 | Greptile was quiet but incomplete; MergeMitra had the best useful signal overall; CodeRabbit required the most triage. |

Product behavior/data integrity and accessibility were not averaged because they were not scored consistently across all four reports.
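The aggregation rule in this section is simple enough to state as code. In the sketch below, the function name and the use of `None` for "not scored in this report" are assumptions, and any per-repository numbers passed in are placeholders, because the per-repository scorecards are not reproduced in this article.

```python
from statistics import fmean
from typing import Optional

REPOS = ("Plane", "Infisical", "Formbricks", "Twenty")


def aggregate(per_repo: dict[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean of the four per-repository scores, to one decimal.

    Returns None when any report did not score the category, mirroring
    the decision not to average inconsistently scored categories.
    """
    scores = [per_repo.get(repo) for repo in REPOS]
    if any(score is None for score in scores):
        return None
    return round(fmean(scores), 1)
```

Because the mean is unweighted, a repository with three cases (Twenty) moves a category score as much as one with five (Plane or Infisical), which is worth keeping in mind when reading small differences between tools.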

7. Repository-by-Repository Results

Plane

Winner: MergeMitra.

MergeMitra caught both public issue-led regressions and produced the strongest cross-file predictions on large feature PRs. Its most important confirmed findings included the Plane-1 editor initialization regression, Plane-2 grouped-loader height regression, Plane-3 compile blocker and unregistered TitleSyncExtension, Plane-4 static cover image issues, and Plane-5 auth-sync/profile-data risks.

The biggest gap: all tools missed the Plane-5 avatar-sync SSRF expansion. MergeMitra also had one clear trust hit around external profile-cover source of truth.

Infisical

Winner: MergeMitra.

Infisical is a secrets-management platform, so auth, session, token, tenant, and audit correctness matter more than style polish. MergeMitra was the only tool that reviewed the large auth refactor and caught the Infisical-1 stale auth/session regression class. It also surfaced refresh-token exposure, cookie-path drift, unsigned signup-token trust, domain uniqueness race, MFA URL leak, OIDC default-login bug, gateway identity drift, permission mismatch, relay persistence, audit gaps, rollback, and tests.

The biggest gap: all tools missed the exact Infisical-2 SAML redirect loop. Greptile and CodeRabbit also skipped Infisical-4 due to file limits.

Formbricks

Winner: MergeMitra.

MergeMitra most consistently connected changes to product behavior, data integrity, async handoff risk, rollback behavior, performance, accessibility, and test coverage. Its strongest confirmed findings included BullMQ producer-only deployments skipping jobs, stale response snapshots, webhook failures not triggering retries, v1 delete status drift, optimistic empty-state risk, keyboard accessibility, full-count cost, route contract gaps, and adjacent media-contract bugs.

The biggest gap: no tool fully caught the exact historical SSO break, the exact welcome-card logo sizing regression, or the enqueue helper swallowing failures. CodeRabbit did catch a critical Prisma upsert shape bug that MergeMitra missed.

Twenty

Winner: MergeMitra.

MergeMitra produced the most confirmed actionable signal across stale bulk-update UI, permission gating, view-filter corruption, transaction boundaries, optimistic metadata drift, folder cascade behavior, and tests. It led on Twenty-2 and Twenty-3 and had the highest signal on Twenty-1, even though it missed the most dangerous empty-selection bug along with the other tools.

The biggest gap: all tools missed the exact Twenty-1 empty-selection mass-update bug, repeated FindMany/cache-churn root cause, and side-panel/dropdown click-outside deselection path.

8. Case Library

| Repo | Case | Type | Expected risk / bug class | Greptile | CodeRabbit | MergeMitra |
| --- | --- | --- | --- | --- | --- | --- |
| Plane | Plane-1: editor initialization | Issue-led | Hook dependency drift and editor initialization regression | Missed primary | Missed primary | Caught primary |
| Plane | Plane-2: grouped loader regression | Issue-led | RenderIfVisible placeholder height collapse | Caught primary | Missed primary | Caught primary |
| Plane | Plane-3: realtime/editor feature | Feature/cross-dependency | Title/body sync drift, collaboration races, editor lifecycle | Found some title/race issues; missed blockers | Broad but noisy; missed blockers | Best overall: compile blocker, extension gap, event plumbing, fallback loop, tests |
| Plane | Plane-4: static cover images | Feature/cross-dependency | Static/uploaded/external source-of-truth, fallback UI, accessibility | Found duplicate declaration and tab mismatch; one overstated claim | Found duplicate, asset mismatch, a11y, orphan/persistence | Best overall: duplicate, asset drift, orphan risk, tab mismatch, a11y, tests |
| Plane | Plane-5: auth sync | Feature/cross-dependency | Provider sync mismatch, avatar data loss, admin/backend drift | Found double sync/avatar churn; unsafe guard | Found avatar data loss and S3/delete issues; noisy | Best overall: display-name reset, null avatar, destructive replacement, hot path, tests |
| Infisical | Infisical-1: stale auth/session cache | Issue-led | React Query stale auth/session data and redirect loops | Missed | Missed | Caught stale auth token, stale org/user hierarchy, sub-org reset |
| Infisical | Infisical-2: SAML select-org flow | Issue-led | SAML refresh-token failures and redirect loops | Partial | Partial | Partial/best, but missed exact SAML variant |
| Infisical | Infisical-3: stale selected org | Issue-led | Stale selected org causing permission errors/black screens | Caught primary | Caught primary | Partial |
| Infisical | Infisical-4: large auth refactor | Feature/cross-dependency | Token/cookie/session/SSO/email-domain regressions | Skipped at file limit | Skipped at file limit | Strong: refresh token, cookie path, token trust, MFA leak, OIDC bug |
| Infisical | Infisical-5: gateway/PAM enrollment | Feature/cross-dependency | Gateway actor auth, PAM, audit, relay, rollback, token lifecycle | Mixed; real issues plus false P1 | Strong but noisy; best PAM null-gateway bypasses | Strong/broad: identity drift, permission mismatch, relay, audit, tests |
| Formbricks | Formbricks-1: SSO hardening | Issue-led | Legacy same-email SSO/account-linking break | Missed primary | Missed primary; found Prisma upsert bug | Partial; issue notes warned behavior change |
| Formbricks | Formbricks-2: welcome-card media | Issue-led | Logo/media sizing and image-vs-video rendering drift | Missed historical logo bug; found adjacent media issues | Missed historical logo bug; found persistence concern | Missed historical logo bug; caught adjacent media bugs and tests |
| Formbricks | Formbricks-3: BullMQ response pipeline | Feature/cross-dependency | Queue handoff, stale snapshots, retry semantics, metering | Strong on config gate, metering, fire-and-forget | Strong/broad but noisy | Best overall: config, fire-and-forget, stale snapshot, webhook retry, tests |
| Formbricks | Formbricks-4: v3 survey overview | Feature/cross-dependency | API/UI contract, delete/audit, optimistic UI, a11y, count cost | Mostly low severity | Found nested button-in-link and tests/schema; one false positive | Best overall: delete drift, optimistic empty state, a11y, count cost, tests |
| Twenty | Twenty-1: bulk update | Issue-led/feature stress | Empty-selection mass update, stale UI, deselection, cache churn | Mostly missed primary | Mostly missed primary; broader on permissions/a11y/cancel | Highest signal but partial; missed empty-selection mass update |
| Twenty | Twenty-2: select filter propagation | Issue-led with cutoff caveat | Filter propagation, display-value drift, deletion refresh | Strong on cutoff server defects | Stronger breadth but noisier | Highest signal: server defects, operands, transaction drift, fanout, tests |
| Twenty | Twenty-3: navigation menu optimistic metadata | Feature/cross-dependency | Optimistic metadata drift, SSE enrichment, permissions, DnD, performance | High-signal but narrow; clearest SSE permission leak | Broad/useful/noisy | Highest overall: partial replacement, raw create gap, folder cascade, rollback, sort, tests |

9. Tool-by-Tool Observations

MergeMitra - the best primary reviewer

MergeMitra's strongest pattern was following changed code into downstream behavior. It repeatedly connected a local diff to the product state, auth boundary, queue worker, schema consumer, optimistic UI path, or missing test that would actually determine whether the PR was safe.

That is why it won all four repository-level evaluations. Its weakness was not hallucination-heavy output; it was occasional verbosity and some maintainability comments that were true but not PR-blocking.

CodeRabbit - the broad second pass

CodeRabbit produced broad exploratory reviews and several high-value point findings. It was especially useful on test hygiene, security edges, authorization details, and line-level bugs. It caught notable issues such as the Formbricks Prisma upsert shape bug and Infisical PAM null-gateway bypasses.

The trade-off was triage. The source reports repeatedly found duplicated comments, noisy style/lint feedback, over-prioritized claims, and missed public regressions. CodeRabbit is useful, but the benchmark does not support using it as the sole merge gate without severity calibration.

Greptile - concise but incomplete

Greptile's best comments were focused and easy to act on. It had the clearest inline Twenty-3 SSE permission leak and several good product/security catches.

Its problem was coverage. It missed too many primary regressions, was sparse on large cross-system PRs, and skipped Infisical-4 because of file-limit behavior. For teams that want a quiet second opinion, Greptile has a place. For primary enterprise review, it was too narrow on this evidence.

10. False Positives and Noise

Greptile

  • Plane-4: overstated project static-image persistence loss; the source report found the upload status path can set project.cover_image_asset_id.
  • Plane-5: suggested an if not is_signup guard that would be unsafe, because is_signup is inverted in the current code.
  • Infisical-5: raised a false P1 auth-bypass/unreachable enrollment claim; the global auth extraction returns early without an Authorization header.
  • Formbricks-4: totalCount feedback was low-value because it admitted current behavior was functionally correct.
  • Twenty-1 and Twenty-2: dynamic key feedback was over-prioritized as P1, and Tmp type naming was cleanup-level.

CodeRabbit

  • Plane-1: "missing React import" was likely a false positive because React.FC was used as an erased type in a Next/TypeScript setup.
  • Infisical-5: repeated the same false critical v3 enrollment auth-bypass claim as Greptile and mixed security findings with low-value clipboard, Prettier, and unused-import comments.
  • Formbricks-4: audit-status feedback was a confirmed false positive because the wrapper defaults to failure and only changes to success on response.ok.
  • Formbricks-1 and Formbricks-3: locale-string and fire-and-forget comments inflated volume without proportional reasoning.
  • Twenty-3: an incomplete WorkspaceEntity cast was marked critical even though the real issue was system-auth/per-subscriber permission semantics.

MergeMitra

  • Plane-4: external profile-cover source-of-truth was the clearest trust hit; the backend computed cover_image_url from cover_image_asset or cover_image, so returning cover_image for external URLs was the correct persistence path.
  • Infisical: no clear false positive was confirmed in scored inline comments, though some maintainability comments were low priority.
  • Formbricks: no material hallucination was found in high-severity comments; weaker cases were prioritization issues.
  • Twenty: no material hallucination was found in high-severity comments; weaker cases included duplicated optimistic-create feedback, generic emitter coupling, and force-cast comments.

11. Missed Issues

No tool caught everything. These all-tool misses matter because they show where human review, tests, and runtime validation still have to carry the system.

Plane

  • Plane-5 avatar sync SSRF expansion: all tools missed that avatar sync now runs on normal synced OAuth sign-in and performs server-side requests.get(avatar_url, timeout=10) on provider-supplied URLs without scheme, host, DNS, redirect, or private-network validation.
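To make the missing validation concrete, here is a minimal sketch of the kind of check that would run before a server-side fetch of a provider-supplied avatar URL. It is not Plane's code: the function names and the injectable `resolver` parameter are hypothetical, and a production guard would need more than this (see the note after the block).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to the address strings a client could connect to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_avatar_url(url: str, resolver=resolve_host) -> bool:
    """Return True only for http(s) URLs whose host resolves to public IPs."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        addresses = resolver(host)
        # Every resolved address must be globally routable; this rejects
        # loopback, private, and link-local (cloud metadata) targets.
        return bool(addresses) and all(
            ipaddress.ip_address(addr).is_global for addr in addresses
        )
    except (OSError, ValueError):
        return False  # DNS failure or unparseable address
```

This check alone does not close the gap: the fetch itself would also need redirects disabled (for example `allow_redirects=False` in `requests`) or re-validated per hop, and the connection should go to the address that was validated so a second DNS lookup cannot return a different one.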

Infisical

  • Infisical-2 exact SAML redirect loop: all tools missed the missing guard for SAML auth-method variants.
  • Infisical-4 resend email-verification token-shape mismatch: all tools missed that resend email verification decodes username from a signup token that contains email, userId, and aliasId, but no username.
  • Infisical-4 impossible CLI callback condition: all tools missed a branch that enters only when callbackPort is truthy, then checks authToken && !callbackPort.
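The impossible CLI callback condition is easy to verify by enumeration. The sketch below is a Python transliteration of the shape described above, not Infisical's code; the variable names follow the description and the sample values are arbitrary.

```python
from itertools import product


def inner_branch_reached(auth_token: str, callback_port: int) -> bool:
    """Mirror the described shape: an outer guard and a contradictory inner check."""
    if callback_port:                         # outer guard: port must be truthy
        if auth_token and not callback_port:  # inner check: port must be falsy
            return True
    return False


# Try every combination of present/absent token and port.
reachable = any(
    inner_branch_reached(token, port)
    for token, port in product(["token", ""], [8080, 0])
)
```

`reachable` is `False` for all four combinations, so whatever the inner branch was meant to handle never runs. A reviewer tracing the guard conditions rather than reading each line in isolation would see the contradiction.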

Formbricks

  • Formbricks-1 exact historical SSO break: no tool fully caught the branch that blocks same-email users without a canonical Account row by throwing OAuthAccountNotLinked.
  • Formbricks-2 welcome-card logo sizing regression: no tool caught the exact logo/media regression fixed by upstream follow-ups.
  • Formbricks-3 enqueue helper swallowing failures: no tool fully called out that awaiting the enqueue helper is insufficient because the helper catches snapshot hydration failures and logs rejected queue adds without throwing.
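The Formbricks-3 miss is a general pattern: awaiting a helper only surfaces failures the helper chooses to raise. The sketch below reproduces that pattern with `asyncio` in Python; it is an analogy rather than Formbricks' BullMQ code, and every name in it is hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("response_pipeline")


class QueueUnavailable(Exception):
    """Raised by the stand-in queue client when an add is rejected."""


async def add_to_queue(job: dict) -> None:
    # Stand-in for a queue client call that fails.
    raise QueueUnavailable("queue add rejected")


async def enqueue_swallowing(job: dict) -> None:
    """Helper that logs the failure and returns normally."""
    try:
        await add_to_queue(job)
    except QueueUnavailable:
        logger.error("enqueue failed for %s", job["responseId"])


async def enqueue_strict(job: dict) -> None:
    """Helper that lets the failure propagate to the caller."""
    await add_to_queue(job)


async def handle_response(enqueue) -> str:
    """Caller that awaits the helper and falls back only on an exception."""
    try:
        await enqueue({"responseId": "r1"})
    except QueueUnavailable:
        return "ran fallback"
    return "assumed queued"
```

With `enqueue_swallowing`, `asyncio.run(handle_response(enqueue_swallowing))` returns `"assumed queued"` even though nothing was queued; with `enqueue_strict` it returns `"ran fallback"`. Fixing the call site alone is therefore not enough: the helper has to raise or return an explicit status.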

Twenty

  • Twenty-1 empty-selection mass-update bug: no tool caught that selection mode with an empty selection can return no ID filter and apply bulk update to the broader view filter.
  • Twenty-1 repeated FindMany/cache-churn root cause: no tool fully identified the repeated FindMany/cache-refetch mechanism.
  • Twenty-1 side-panel/dropdown deselection path: no tool caught that clicking inside the bulk-update panel or dropdown could clear Kanban selection and create accidental mass-update risk.
  • Twenty-3 N+1 enrichment cost: no tool directly called out per-event navigation menu metadata enrichment cost.
  • Twenty-3 record favorites not visible optimistically: no tool directly caught that optimistic record items without targetRecordIdentifier are filtered out until mutation/SSE reconciliation.
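The Twenty-1 empty-selection miss listed above comes down to how a bulk-update filter is assembled. The sketch below shows the failure shape and one possible guard in Python; it is not Twenty's code, and the filter structure and function names are hypothetical.

```python
from typing import Optional


def build_bulk_filter_unsafe(view_filter: dict, selected_ids: list[str]) -> dict:
    """Failure shape: an empty selection silently adds no ID constraint."""
    bulk_filter = dict(view_filter)
    if selected_ids:
        bulk_filter["id"] = {"in": list(selected_ids)}
    return bulk_filter  # with no selection, this is the whole view filter


def build_bulk_filter_guarded(view_filter: dict, selected_ids: list[str]) -> Optional[dict]:
    """Guarded variant: in selection mode, no selection means no update."""
    if not selected_ids:
        return None  # caller should skip the mutation entirely
    return {**view_filter, "id": {"in": list(selected_ids)}}
```

With a view filter such as `{"stage": "open"}` and an empty selection, the unsafe builder returns `{"stage": "open"}`, so the update would apply to every record in the view, while the guarded builder returns `None`. This is also why the deselection path in the same list matters: anything that clears the selection just before submit turns a targeted edit into a mass update.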

12. Recommendation

Use MergeMitra as the primary AI PR reviewer for enterprise adoption on this benchmark profile.

It produced the strongest confirmed bug/regression detection, the best cross-system reasoning, and the most useful senior-review framing across all four repositories. The source reports consistently showed MergeMitra moving reviewer effort from "find the dangerous flows from scratch" to "validate and prioritize a strong shortlist."

Use CodeRabbit as the runner-up when the team wants a broad second pass and can afford senior-engineer triage. It is especially valuable for aggressive line-level bug hunting, security point findings, test failures, and edge-case checks. It should not be the sole gate without severity calibration.

Use Greptile as a concise supplemental reviewer when the team wants fewer comments and can tolerate narrower coverage. It is useful as a quick sanity check, but this benchmark does not support it as the primary merge-control signal.

The expected review-time impact is directional, not measured. MergeMitra should save the most time on risky PRs that touch shared helpers, auth/session boundaries, queues, optimistic state, API contracts, metadata propagation, or product data integrity. It does not replace senior reviewers, and high-severity findings still require human validation.

13. Caveats

  • The sample is 17 replayed PR cases across four repositories, not every possible language, architecture, team workflow, or tool configuration.
  • Some PRs were selected because they had known public regressions, so the benchmark intentionally stresses bug-finding when risk is present.
  • Feature/cross-dependency cases were scored by confirmed predictive review value, not by one public gold bug.
  • Some findings were validated statically rather than by executing full test suites.
  • Tool outputs can drift because these are LLM-backed reviewers.
  • Vendors may improve or regress after this evaluation.
  • Teams making a long-term buying decision should rerun this methodology on their own recent incidents, reverts, and large cross-dependency PRs.