MergeMitra vs CodeRabbit vs Greptile: 2026 AI Code Review Benchmark

Authors: Ashik Shaji and Vishnumohan R K

MergeMitra vs. CodeRabbit vs. Greptile: a controlled, side-by-side evaluation on two large open-source codebases.

Executive Summary

Objective: Evaluate how three AI code review tools -- MergeMitra, CodeRabbit, and Greptile -- compare in code review effectiveness on production-grade code, and identify which tool fits enterprise teams best.

Context & Approach: Greptile published an open benchmark in 2025 that tested AI review tools against real bugs from large open-source codebases. We adopted the same methodology, using the same PRs and the same real production bugs Greptile used.

We selected two distinct codebases from that benchmark: Cal.com (TypeScript) and Keycloak (Java). For each codebase, we created three mirror repositories on GitHub and installed exactly one code review tool in each. We then opened 10 PRs per mirror, each carrying a real historical bug, for a total of 60 PRs.

Every tool saw byte-identical diffs and ran on default settings with no custom rules, so any difference in results reflects the tool, not the input. All 60 reviews were independently examined using a standard prompt run on Claude Code (with Opus 4.6), and every verdict is linked to verifiable GitHub evidence.

Key Takeaways:

  • MergeMitra caught 85% of planted bugs (17/20), compared to CodeRabbit at 65% (13/20) and Greptile at 60% (12/20). Four bugs -- including a Critical SQL injection and a High-severity email blacklist bypass -- were caught only by MergeMitra.
  • MergeMitra led on every quality dimension that matters for bug prevention: security, performance, test quality, maintainability, architectural insight, and cross-file reasoning -- all at 80-90% effectiveness versus 30-60% for the other tools.
  • Greptile excels at signal-to-noise (zero false positives) but trades breadth for precision, staying silent on many real issues. CodeRabbit offers broad coverage with polished autofix UX but introduces noise that requires triage, including a likely-fabricated CVE identifier.

Recommendation: For enterprise adoption, MergeMitra is the clear winner according to the Claude Code evaluation. Full details follow below.

Background

In July 2025, Greptile published an open benchmark for AI code review tools: greptile.com/benchmarks. Their methodology was straightforward and well-designed. They selected five large, real-world open-source repositories - Sentry (Python), Cal.com (TypeScript), Grafana (Go), Keycloak (Java), and Discourse (Ruby) - traced back 10 real bug-fix commits per repo from Git history, and reintroduced the original buggy code as fresh pull requests. They then ran five AI review tools - Greptile, CodeRabbit, Bugbot, Copilot, and Graphite - on those PRs and scored each tool on whether it caught the planted bug.

It was a credible benchmark. Real bugs from real production codebases, identical diffs across all tools, a clear pass/fail scoring criterion. We wanted to use the same methodology to answer a different question: how does MergeMitra compare to Greptile and CodeRabbit?

So we did exactly that. We took the same two codebases from Greptile's benchmark - Cal.com and Keycloak - used the same PRs carrying the same real production bugs, set up our own mirror repositories, and ran the test with three tools: MergeMitra, CodeRabbit, and Greptile.

This report walks through the full process and results.

1. Codebases Under Test

We chose Cal.com and Keycloak from Greptile's original five-repo benchmark because they cover two very different technology stacks and domain pressures:

| Codebase | Language / Stack | Why it's a good test |
| --- | --- | --- |
| Cal.com | TypeScript / React / Next.js / Prisma | Modern web application - scheduling, calendar OAuth, payments, UI flows |
| Keycloak | Java / Enterprise / Identity Provider | Heavyweight enterprise - authentication, authorization, cryptography |

Cal.com gives us a TypeScript-heavy modern web stack. Keycloak gives us a long-standing Java enterprise system with deep concerns around identity, authorization and cryptography. If a tool performs well on both, the result is meaningful. Together they give us coverage across two languages, two ecosystems, and two very different kinds of domain complexity.

2. How We Set Up the Test

Step 1 - Same bugs, same PRs as Greptile's benchmark

Greptile's benchmark traced back real bug-fix commits from each project's Git history. For each bug, they identified the commit that originally introduced the flawed code and the commit that later fixed it. They then created two branches - one before the bug was introduced and one after - and opened a fresh PR that reintroduced the original buggy change. This meant every PR in the benchmark carried a real, historical production bug: something that was introduced during normal development, ran in production, was eventually reported, diagnosed, and patched by the real maintainers.

We used the same PRs and the same bugs. We did not pick new bugs or create synthetic ones. The bugs span the full range you see in real codebases:

  • Authentication and authorization flaws
  • Race conditions and concurrency issues
  • Performance regressions (N+1 queries, runaway memory)
  • Privacy and data leaks
  • Lifecycle and migration bugs
  • Test-quality regressions
  • Maintainability and architecture issues
  • Localization and content errors

Step 2 - Three mirror repositories per codebase

For each of the two codebases, we created three clean mirror repositories on GitHub - one per AI review tool - and installed exactly one bot in each.

That gives us 6 repositories in total, with one tool per repository. Same code in all three mirrors of each project. Different reviewer in each.

Step 3 - Open the same 10 PRs in each mirror

We opened the same 10 pull requests in each of the three mirror repositories for each codebase - 60 PRs in total (10 PRs × 3 tools × 2 codebases). Each PR carries one real production bug from that project's history.

Critically, the same PR has a byte-identical diff and the same commit hash in all three mirror repositories. We verified this with the GitHub API (gh api repos/.../pulls/N --jq '.head.sha'). So if one tool found the bug and another did not, the difference comes from the tool, not the diff.
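The check described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names are hypothetical, the mirror repository names are left to the caller, and it assumes the `gh` CLI is installed and authenticated.

```python
import subprocess


def head_sha(repo: str, pr_number: int) -> str:
    """Return the head commit SHA of a PR via the GitHub CLI.

    `repo` is an "owner/name" string supplied by the caller (hypothetical here).
    """
    result = subprocess.run(
        ["gh", "api", f"repos/{repo}/pulls/{pr_number}", "--jq", ".head.sha"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def all_identical(shas: dict[str, str]) -> bool:
    """True when every mirror reports the same non-empty SHA."""
    values = set(shas.values())
    return len(values) == 1 and "" not in values


def verify_pr(mirrors: list[str], pr_number: int) -> bool:
    """Compare the head SHA of the same PR number across all mirrors."""
    return all_identical({repo: head_sha(repo, pr_number) for repo in mirrors})
```

Only `verify_pr` touches the network; `all_identical` is a pure comparison, so the rule itself can be checked without GitHub access.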

Step 4 - Let the tools run, then collect everything

All three tools ran on their default settings with no custom rules - the same constraint Greptile used in their benchmark. Each tool had full repository access including the PR diff and base branch. Once the reviews were in, we collected every review comment, every inline comment, and every issue comment from all 60 PRs via the GitHub CLI.
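One way to collect those three comment types is through the corresponding GitHub REST endpoints. The sketch below is an assumption about how such a collection step could look, not the authors' tooling; the function names are hypothetical, and pagination beyond the first 100 items per endpoint is not handled.

```python
import json
import subprocess


def comment_endpoints(repo: str, pr_number: int) -> dict[str, str]:
    """Map each comment type to its GitHub REST path for one PR."""
    return {
        "inline_review_comments": f"repos/{repo}/pulls/{pr_number}/comments",
        "review_summaries": f"repos/{repo}/pulls/{pr_number}/reviews",
        # PR conversation comments live under the issues API.
        "issue_comments": f"repos/{repo}/issues/{pr_number}/comments",
    }


def fetch_all(repo: str, pr_number: int) -> dict[str, list]:
    """Fetch every comment type for a PR (requires an authenticated `gh`)."""
    collected = {}
    for kind, endpoint in comment_endpoints(repo, pr_number).items():
        out = subprocess.run(
            ["gh", "api", f"{endpoint}?per_page=100"],
            capture_output=True, text=True, check=True,
        ).stdout
        collected[kind] = json.loads(out)
    return collected
```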

3. Validation Methodology

Greptile's original benchmark was scored by Greptile's own team. We wanted an independent evaluator with no relationship to any of the three tools. So we used Claude Code (powered by Claude Opus 4.6, Anthropic's most capable model with one-million-token context) as the validation layer.

For each of the 60 PR reviews, Claude Code:

  1. Pulled every review comment from the PR via the GitHub API - inline review comments, review summaries, and issue comments.
  2. Read the actual source code at the exact commit the PR introduced, including surrounding context (20–60 lines above and below the cited line).
  3. Checked whether the planted bug was caught - a clean yes or no. A bug counted as "caught" only when the tool explicitly identified the faulty code and explained its impact, consistent with Greptile's original scoring criterion. Summary-level mentions or vague warnings without identifying the specific code did not count.
  4. Produced a verification table with a direct link to the review comment (for catches) or to the PR (for misses), so every result is independently verifiable by clicking the link.
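The pass/fail rule in steps 3 and 4 can be sketched as a small function. The `ReviewComment` type and its two boolean fields are hypothetical: they stand in for judgments the validator makes after reading the source, and are not fields the GitHub API returns.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    url: str
    identifies_faulty_code: bool  # points at the specific buggy code
    explains_impact: bool         # states what goes wrong because of it


def verdict(comments: list[ReviewComment], pr_url: str) -> tuple[bool, str]:
    """Return (caught, evidence_link) for one PR review.

    A bug counts as caught only if a single comment both identifies the
    faulty code and explains its impact. Misses link to the PR itself.
    """
    for comment in comments:
        if comment.identifies_faulty_code and comment.explains_impact:
            return True, comment.url
    return False, pr_url
```

Under this rule a vague warning that never names the faulty code yields a miss, even when it sits on the right PR.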

Claude Code wrote one report per codebase. We then consolidated them into this single report with the combined verification tables.

In total, 60 PR reviews were validated against source - every tool's verdict on every PR cross-checked against the actual planted bug. Nothing was taken on the bot's word. Nothing was taken on the validator's word either - every catch mark and miss mark in the tables below links directly to the GitHub evidence.

4. A Note on Code Review and Bug Detection

Before we share the results, one piece of context: code review is not primarily a bug-catching activity. Every senior engineer knows this. The main jobs of a code review are to:

  • Reduce technical debt before it accumulates.
  • Enforce architectural and security best practices.
  • Improve long-term maintainability of the codebase.
  • Spread knowledge across the team.

Some bugs do get caught in code review - usually shallow ones, near the surface - but most functional bugs are caught by automated tests, QA, and production monitoring. So when we ask "can an AI reviewer catch the bug?" we are deliberately stress-testing these tools on a dimension where even experienced human reviewers often struggle.

We chose to measure bug-finding precisely because it is hard. A tool that can do this - on top of the maintainability, architecture and security hygiene that code review is normally for - is genuinely valuable. A tool that can only do the easy stuff is not.

The results below reflect that stiff test.

5. Results - Cal.com (TypeScript)

The 10 Cal.com PRs cover scheduling, OAuth integrations, calendar sync, workflows, two-factor authentication, and a major UI feature. A catch mark means the tool successfully flagged the planted bug; a miss mark means it missed it. Each catch mark links directly to the review comment where the tool flagged the bug. Each miss mark links to the PR so readers can verify nothing was found.

| # | PR title | Planted bug | Severity | MergeMitra | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Async import of the appStore packages | Async callbacks in forEach create unhandled promise rejections | Low | | | |
| 2 | feat: 2fa backup codes | Backup codes are not invalidated after use, allowing reuse (authentication risk) | Critical | | | |
| 3 | fix: handle collective multiple host on destinationCalendar | Null reference error occurs when host array is empty | Medium | | | |
| 4 | feat: convert InsightsBookingService to use Prisma.sql raw queries | Raw SQL query construction introduces potential SQL injection risk | Critical | | | |
| 5 | Comprehensive workflow reminder management for booking lifecycle events | Missing database cleanup when immediateDelete is true leads to stale reminders | High | | | |
| 6 | Advanced date override handling and timezone compatibility improvements | Incorrect end time calculation using slotStartTime instead of slotEndTime | Medium | | | |
| 7 | OAuth credential sync and app integration enhancements | Timing attack vulnerability due to direct string comparison instead of constant-time comparison | Critical | | | |
| 8 | SMS workflow reminder retry count tracking | Incorrect OR condition deletes all workflow reminders instead of targeted ones | High | | | |
| 9 | Add guest management functionality to existing bookings | Case sensitivity issue allows bypass of email blacklist restrictions | High | | | |
| 10 | feat: add calendar cache status and actions (#22532) | Cache status tracking is inaccurate due to unreliable updatedAt field | Medium | | | |
| TOTAL | | | | 9 / 10 | 5 / 10 | 5 / 10 |

Bug #7 (timing attack) was missed by all three tools. MergeMitra caught every other bug - including the SQL injection risk (Bug #4) and the case-sensitivity blacklist bypass (Bug #9) that both other tools missed.
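For readers unfamiliar with the bug class behind Bug #7, here is a minimal sketch of the difference between a direct comparison and a constant-time one. It is written in Python for illustration and does not reproduce Cal.com's TypeScript code; the function names are hypothetical. In Node.js, `crypto.timingSafeEqual` serves the same purpose.

```python
import hmac


def naive_equal(provided: str, expected: str) -> bool:
    # A direct comparison may return as soon as it finds a differing
    # character, so response time can leak how much of the secret matched.
    return provided == expected


def constant_time_equal(provided: str, expected: str) -> bool:
    # hmac.compare_digest is designed so that its running time does not
    # depend on where the two inputs first differ.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Both functions return the same answers; the difference is only in what an attacker can learn by measuring how long the check takes.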

6. Results - Keycloak (Java)

The 10 Keycloak PRs cover Keycloak's authentication, authorization, cryptography, identity-provider caching, and update-management subsystems - all enterprise-critical surfaces. Each catch mark links directly to the review comment where the tool flagged the bug. Each miss mark links to the PR.

| # | PR title | Planted bug | Severity | MergeMitra | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Fixing Re-authentication with passkeys | isConditionalPasskeysEnabled() called without the required UserModel parameter | Medium | | | |
| 2 | Add caching support for IdentityProviderStorageProvider | Recursive caching call uses session instead of delegate, causing potential infinite recursion | Critical | | | |
| 3 | Add AuthzClientCryptoProvider for authorization client crypto | Incorrect cryptography provider returned (default keystore instead of BouncyCastle) | High | | | |
| 4 | Add rolling-updates feature flag and compatibility framework | Incorrect method used for handling exit codes, leading to improper runtime behavior | Medium | | | |
| 5 | Add Client resource type and scopes to authorization schema | Inconsistent feature flag handling creates orphaned permissions | High | | | |
| 6 | Add Groups resource type and scopes to authorization schema | Incorrect permission check in canManage() allows VIEW to escalate to MANAGE | High | | | |
| 7 | Add HTML sanitizer for translated message resources | Translation files contain incorrect language mappings (Italian text in Lithuanian file) | Low | | | |
| 8 | Implement access token context encoding framework | Wrong parameter in null check - grantType validated twice instead of rawTokenId | Critical | | | |
| 9 | Implement recovery key support for user storage providers | Unsafe raw List deserialization without type safety | Medium | | | |
| 10 | Fix concurrent group access to prevent NullPointerException | Missing null check leads to NullPointerException during concurrent access | Medium | | | |
| TOTAL | | | | 8 / 10 | 8 / 10 | 7 / 10 |

Bug #2 (IdP cache recursive caching) was missed by all three tools. MergeMitra flagged the exit-code contract issue (Bug #4) that both other tools missed, and uniquely went deeper on Bug #10 - pointing out that three sibling getSubGroupsStream() methods still lacked the null check, not just confirming the bug exists.

7. Results - Effectiveness by Category

How to read this table

If we say "Security: CodeRabbit 60%", it means: out of all the security-relevant issues that actually exist on these 20 PRs (the theoretical maximum a perfect reviewer could find), CodeRabbit identified about 60% of them. If the total number of issues is 20, CodeRabbit found 12. If MergeMitra is at 90%, it found 18.

The total number of issues in each category - the maximum possible - is a fixed denominator estimated from the combined list of every real issue any of the three tools (or our human validation) found across the 20 PRs. Higher percentages are better, except for False-positive rate where lower is better.
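The arithmetic behind each percentage can be sketched as below. The issue identifiers and tool names in the usage note are hypothetical placeholders, and the function is an illustration of the stated definition rather than the script used to build the scorecard.

```python
def effectiveness(found_by_tool, validator_found=()):
    """Percentage of the combined issue list that each tool found.

    found_by_tool: dict mapping tool name -> set of issue IDs it reported.
    validator_found: extra real issues found only during validation.
    The denominator is the union of every real issue found by anyone.
    """
    all_issues = set(validator_found)
    for issues in found_by_tool.values():
        all_issues |= set(issues)
    if not all_issues:
        return {tool: 0.0 for tool in found_by_tool}
    return {
        tool: 100 * len(set(issues)) / len(all_issues)
        for tool, issues in found_by_tool.items()
    }
```

For example, with `{"tool_a": {"i1", "i2", "i3"}, "tool_b": {"i2"}}` plus one validator-only issue `"i4"`, the denominator is 4, so `tool_a` scores 75.0 and `tool_b` scores 25.0.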

Combined scorecard (across both codebases, 20 PRs)

| Category | Greptile | CodeRabbit | MergeMitra |
| --- | --- | --- | --- |
| Functional bug detection (does it find the planted bug?) | 60% | 65% | 85% |
| Security (auth bypass, leaks, crypto, secrets) | 60% | 60% | 90% |
| Performance (N+1, runaway memory, fire-and-forget) | 30% | 40% | 80% |
| Test quality (does the test actually verify the change?) | 30% | 60% | 90% |
| Long-term maintainability (duplication, growing methods, SPI shape) | 40% | 60% | 90% |
| Architectural insight (does this PR change the contract?) | 40% | 40% | 90% |
| Cross-file reasoning (does the bug live in another file?) | 50% | 50% | 90% |
| Domain awareness (knows the codebase's idioms) | 40% | 60% | 90% |
| Accessibility (UI-only changes) | 20% | 40% | 80% |
| Signal-to-noise (low-noise = high score) | 95% | 60% | 70% |
| False-positive rate (lower is better) | 0% | ~5% | ~3% |

What the scorecard says

  • MergeMitra leads on every dimension that matters for bug prevention. At 85% functional bug detection (17 of 20 planted bugs), it is dominant - and that gap widens on security, performance, test quality, maintainability, architecture and domain awareness.
  • CodeRabbit and Greptile are closer to each other than either is to MergeMitra. CodeRabbit caught 13 of 20 bugs (65%); Greptile caught 12 of 20 (60%). The difference between them is what they miss - CodeRabbit tends to miss subtle bugs requiring domain knowledge (SQL injection, blacklist bypass), while Greptile tends to miss bugs requiring cross-file reasoning (2FA backup-code reuse, translation mapping errors).
  • Greptile has one genuine super-power: signal-to-noise. Almost everything it says is correct. Almost nothing it says is fluff. But the price of that conservatism is breadth - it stays silent on a lot of real issues.
  • CodeRabbit is broad but volatile. It picks up issues across many dimensions, but it also produces noise (a "LGTM!" praise block posted as a finding; a likely-fabricated CVE identifier; one PR with zero inline findings at all). Reviewers have to triage.

8. Tool-by-Tool Observations

Each tool has a recognizable personality across the 60 reviews we collected.

MergeMitra - the senior reviewer

The only tool whose reviews consistently feel like they were written by an experienced engineer who already knows the codebase. It doesn't just point at the line that changed - it reasons about what the change implies for the rest of the system, asks whether the fix actually fixes the problem, and routinely flags issues that span three or four files.

The decisive behavior: MergeMitra is the only tool that goes beyond confirming a bug exists to asking whether the fix is complete. On the concurrent-group NPE fix (Keycloak Bug #10), all three tools flagged the null-check problem - but only MergeMitra pointed out that three sibling getSubGroupsStream() methods still lacked the same null check, meaning the fix was incomplete. That single question - "did this PR actually fix the problem?" - is the difference between a junior and a senior reviewer.

The numbers back this up. MergeMitra caught 17 of 20 planted bugs (85%) - versus CodeRabbit's 13 (65%) and Greptile's 12 (60%). Of those, four bugs were caught by MergeMitra alone (Cal.com Bugs #1, #4, #9; Keycloak Bug #4).

The trade-off is volume. MergeMitra is happy to leave 8–13 comments on a busy PR. The signal density is high (almost no false positives in 171 findings), but reviewers have to be willing to read.

Greptile - the precise sniper

When Greptile speaks, it is almost always correct. Zero false positives across 63 findings. Cleanly labeled severities. Concise comments. No fluff.

The trade-off is narrowness. Greptile reviews the file in front of it; it rarely follows the code into other files, and it does not engage with test quality, architecture or long-term maintainability. On two of the highest-stakes PRs in our study - the 2FA backup-code feature on Cal.com and the rolling-updates feature on Keycloak - Greptile posted comments and still missed the planted bug. It also missed the SQL injection risk on the Insights raw-query refactor (Cal.com Bug #4, Critical) and the email blacklist bypass on the Add-Guests feature (Cal.com Bug #9, High) - both of which only MergeMitra caught.

If you want a quiet second opinion - "tell me only the things I really need to act on" - Greptile is a defensible pick. If you want a tool that catches the deep stuff, it isn't.

CodeRabbit - the broad generalist with a noise problem

CodeRabbit covers a lot of ground. It is genuinely strong on content-layer review (it caught Italian text mistakenly bundled into the Lithuanian translation file, and Traditional Chinese characters bundled into the Simplified Chinese file - exactly the tedious work humans do badly). It produces ready-to-apply diff suggestions for almost every finding, which is a real UX win.

But it has reliability problems. In our study, CodeRabbit:

  • Posted "LGTM! Comprehensive documentation…" as an inline finding (this is praise, not a finding).
  • Cited a specific CVE identifier (CVE-2025-66021) with implausible precision - a classic AI hallucination pattern, and a particularly dangerous one because security teams trust CVE-shaped claims by default.
  • Produced zero inline comments on one PR (Keycloak rolling-updates) where the other two tools found 3 and 5 substantive issues.
  • Buried Critical findings inside collapsed summary blocks instead of posting them as inline comments - meaning a reviewer scanning inline comments would miss them.

If your team is willing to treat CodeRabbit's output as a starting point and triage carefully, the autofix UX is genuinely helpful. If your team treats AI suggestions at face value, the noise becomes a liability.

9. Four Findings That Defined the Study

The verification tables tell you what each tool caught. These four findings show you why the gaps matter. Each one was confirmed against the actual source code, and in every case, the verification data shows a clear difference between the tools.

Finding 1 - SQL Injection in a Performance Refactor (Cal.com Bug #4, Critical)

Cal.com's InsightsBookingService was refactored from Prisma's type-safe query builder to raw Prisma.sql queries for performance. The new getBaseConditions() function constructed SQL fragments that callers composed with string interpolation - introducing a SQL injection surface on an analytics endpoint that handles user-supplied filter parameters.

| Tool | Caught? | Link |
| --- | --- | --- |
| MergeMitra | Yes | Flagged the raw SQL injection risk |
| CodeRabbit | No | Missed it |
| Greptile | No | Missed it |

This is the single highest-severity bug that only one tool caught. A Critical SQL injection, hidden inside a performance optimization, on a production analytics path. The other two tools reviewed the same diff and did not flag it.
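The underlying bug class can be shown with a small, self-contained sketch. It uses Python's built-in `sqlite3` rather than Prisma, and the `booking` table, its columns, and the function names are hypothetical; it illustrates why composing SQL by string interpolation is risky and does not reproduce Cal.com's code.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with three sample rows (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE booking (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO booking (id, status) VALUES (?, ?)",
        [(1, "accepted"), (2, "cancelled"), (3, "pending")],
    )
    return conn


def ids_interpolated(conn: sqlite3.Connection, status: str) -> list:
    # UNSAFE: the filter value becomes part of the SQL text itself.
    query = f"SELECT id FROM booking WHERE status = '{status}' ORDER BY id"
    return [row[0] for row in conn.execute(query)]


def ids_parameterized(conn: sqlite3.Connection, status: str) -> list:
    # SAFE: the value is bound as a parameter and never parsed as SQL.
    query = "SELECT id FROM booking WHERE status = ? ORDER BY id"
    return [row[0] for row in conn.execute(query, (status,))]
```

With the filter value `x' OR '1'='1`, the interpolated version returns every row, while the parameterized version returns none because no booking has that literal status.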

Finding 2 - Email Blacklist Bypass via Case Sensitivity (Cal.com Bug #9, High)

The new Add-Guests handler checked incoming email addresses against a blacklist, but the comparison was case-sensitive. An attacker could bypass a restriction on blocked@example.com by submitting Blocked@Example.com. Domain names are case-insensitive, and although the local part is technically case-sensitive under the SMTP specification, virtually all major providers treat it case-insensitively, so this is a real bypass.

| Tool | Caught? | Link |
| --- | --- | --- |
| MergeMitra | Yes | Flagged the case-sensitivity issue |
| CodeRabbit | No | Missed it |
| Greptile | No | Missed it |

This kind of bug requires knowing that email blacklists must normalize before comparison - domain awareness of how the feature is used, not just what the code does syntactically.
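A minimal sketch of the normalization step, assuming a simple in-memory blacklist. The function names are hypothetical and the code illustrates the idea rather than Cal.com's actual handler.

```python
def normalize_email(address: str) -> str:
    """Trim whitespace and fold case so equivalent addresses compare equal."""
    return address.strip().casefold()


def is_blocked_naive(address: str, blacklist: set) -> bool:
    # Case-sensitive membership test: the pattern behind the bypass.
    return address in blacklist


def is_blocked(address: str, blacklist: set) -> bool:
    # Normalize both sides before comparing.
    normalized = {normalize_email(entry) for entry in blacklist}
    return normalize_email(address) in normalized
```

With a blacklist containing `blocked@example.com`, the naive check lets `Blocked@Example.com` through, while the normalized check rejects it.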

Finding 3 - 2FA Backup Code Reuse (Cal.com Bug #2, Critical)

The new 2FA backup-code feature allowed backup codes to be used for authentication - but the codes were not invalidated after use. A stolen backup code could be reused indefinitely, defeating the purpose of the one-time-use security model.

| Tool | Caught? | Link |
| --- | --- | --- |
| MergeMitra | Yes | Flagged the reuse risk |
| CodeRabbit | Yes | Also flagged it |
| Greptile | No | Missed it |

Both MergeMitra and CodeRabbit caught this one - which is to CodeRabbit's credit. But Greptile posted four comments on the same PR and missed the most Critical bug in it. On a PR that modifies the authentication path, missing the backup-code reuse vulnerability is a significant gap.
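The one-time-use property that the buggy PR lacked can be sketched as follows. The `BackupCodes` class is hypothetical and keeps state in memory; a real implementation would need to persist the change and make the check-and-remove step atomic in the database.

```python
import hashlib


def _digest(code: str) -> str:
    """Hash a backup code so plaintext codes are not held in the store."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


class BackupCodes:
    """In-memory store of unused backup codes (illustrative only)."""

    def __init__(self, codes):
        self._unused = {_digest(code) for code in codes}

    def redeem(self, code: str) -> bool:
        """Accept a code at most once: a successful use invalidates it."""
        digest = _digest(code)
        if digest not in self._unused:
            return False
        self._unused.remove(digest)  # the step missing in the planted bug
        return True
```

Redeeming the same code twice succeeds the first time and fails the second, which is exactly the behavior a stolen-code attacker should run into.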

Finding 4 - Incomplete NPE Fix (Keycloak Bug #10, Medium)

The concurrent-group-access PR added a null check to getSubGroupsCount() to prevent a NullPointerException during concurrent group deletion. All three tools flagged the basic null-check issue. But MergeMitra went one step further: it pointed out that three sibling getSubGroupsStream() methods still executed the same modelSupplier.get().getSubGroupsStream(...) pattern without the null guard. The fix patched one of four vulnerable methods and left the other three exposed to the same race condition.

| Tool | Caught basic bug? | Flagged incomplete fix? |
| --- | --- | --- |
| MergeMitra | Yes | Yes - three sibling methods still vulnerable |
| CodeRabbit | Yes | No |
| Greptile | Yes | No |

This is the clearest example of the difference between "catching a bug" and "reviewing like a senior engineer." All three tools saw the problem. Only one asked whether the fix was complete.
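The shape of a complete fix can be sketched in a few lines. This is a Python analogue with hypothetical class and method names, not Keycloak's Java code: every method that dereferences the supplied model applies the same guard, instead of only the one method named in the bug report.

```python
from typing import Callable, Iterator, Optional


class GroupModel:
    """Stand-in for a stored group (hypothetical)."""

    def __init__(self, subgroups):
        self._subgroups = list(subgroups)

    def sub_groups(self) -> Iterator[str]:
        return iter(self._subgroups)


class GroupAdapter:
    """Wrapper whose supplier may return None if the group was deleted concurrently."""

    def __init__(self, model_supplier: Callable[[], Optional[GroupModel]]):
        self._model_supplier = model_supplier

    def sub_groups_count(self) -> int:
        model = self._model_supplier()
        if model is None:  # guard present in the original fix
            return 0
        return sum(1 for _ in model.sub_groups())

    def sub_groups_stream(self) -> Iterator[str]:
        model = self._model_supplier()
        if model is None:  # the same guard, applied to the sibling method too
            return iter(())
        return model.sub_groups()
```

If only `sub_groups_count` carried the guard, `sub_groups_stream` would still fail when the supplier returns `None`, which mirrors the incomplete-fix pattern described above.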

10. Recommendation

Pick MergeMitra.

Across 20 PRs, on two radically different stacks (Java enterprise and TypeScript modern web), judged independently by Claude Code, MergeMitra is the only tool that consistently produced senior-engineer-quality reviews. It found the planted bug in 17 of the 20 PRs (85%) - versus CodeRabbit's 13 (65%) and Greptile's 12 (60%). It was the only tool to catch the Critical SQL injection on Cal.com's raw-query refactor and the only tool to catch the email-blacklist bypass on the Add-Guests feature. And it did so with a low false-positive rate (about 3%).

  • Pick Greptile if your only priority is maximum signal-to-noise on a small surface area, and you accept that you will miss real security bugs on big PRs.
  • Pick CodeRabbit if your priority is broad coverage with polished autofix UX, and you are willing to train reviewers to triage CodeRabbit's output (criticals can be hidden in collapsed summaries; some findings are noise).

For a single-tool enterprise pick, MergeMitra is the right answer.

11. Caveats

A few things this report is not:

  • 20 PRs is a strong sample, not a procurement guarantee. If your codebase is meaningfully different (e.g., heavy mobile, embedded, or data-engineering), re-run this methodology before deciding.
  • All three tools evolve continuously. This evaluation reflects their behavior in April 2026. A tool that loses today may improve, and one that wins today can regress.
  • No finding was executed at runtime. Findings that depend on runtime behavior (test flakiness, actual race-window timing) are marked as partial in the underlying reports.
  • Code review is not the only way to catch bugs. Tests, monitoring, fuzzing and human QA all matter. An AI reviewer is one layer in a defense-in-depth strategy, not a replacement for the others.

12. References

Everything in this report is reproducible. Below are the 60 PR review threads, the source codebases, the underlying validation reports, and the prompts we used to evaluate them.

Cal.com - 30 PR review threads

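| # | Branch | MergeMitra | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1 | sms-retry-enhanced | PR | PR | PR |
| 2 | insights-performance-optimization | PR | PR | PR |
| 3 | guest-management-enhanced | PR | PR | PR |
| 4 | workflow-queue-enhanced | PR | PR | PR |
| 5 | fix/handle-collective-multiple-host-destinations | PR | PR | PR |
| 6 | appstore-async-improvements | PR | PR | PR |
| 7 | date-algorithm-enhanced | PR | PR | PR |
| 8 | oauth-security-enhanced | PR | PR | PR |
| 9 | introduce-cache-key-overflow | PR | PR | PR |
| 10 | improve-two-factor-authentication-features | PR | PR | PR |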

Keycloak - 30 PR review threads

| # | Branch | MergeMitra | CodeRabbit | Greptile |
| --- | --- | --- | --- | --- |
| 1 | enhance-passkey-authentication-flow | PR | PR | PR |
| 2 | feature-group-concurrency-implementation | PR | PR | PR |
| 3 | feature-rolling-updates-implementation | PR | PR | PR |
| 4 | feature-idp-cache-implementation | PR | PR | PR |
| 5 | feature-groups-authz-implementation | PR | PR | PR |
| 6 | feature-html-sanitizer-implementation | PR | PR | PR |
| 7 | feature-clients-authz-implementation | PR | PR | PR |
| 8 | feature-token-context-implementation | PR | PR | PR |
| 9 | feature-authz-crypto-implementation | PR | PR | PR |
| 10 | feature-recovery-keys-implementation | PR | PR | PR |

Source codebases

Claude Code prompts used for the analysis

The prompts given to Claude Code to produce the per-codebase reports were each issued as a single instruction. In each case, Claude Code was responsible for fetching the PR data via the GitHub CLI, validating each finding against the source, and producing the verdicts.