Building a Centralized PR Reviewer Agent with Claude Code

Anthropic recently shipped Code Review for Claude Code: a native, multi-agent PR reviewer that dispatches parallel agents to hunt bugs, verify findings, and post inline comments. It’s impressive. It also costs $15-25 per review.

Months before that launched, I’d already built something similar for my org using claude-code-action. Different tradeoffs: no multi-agent orchestration, but full control over prompts, triggers, and auth, at a fraction of the cost. Here’s the architecture I landed on: 13 lines of YAML per repo, everything else centralized.




The landscape: native vs. DIY

There are three ways to get AI-powered PR reviews on GitHub today.

Anthropic’s native Code Review is a multi-agent system. Parallel review agents cross-check findings for false positives and post inline comments. Anthropic reports 84% of large PRs get actionable findings with less than 1% false positive rate. It requires a Team or Enterprise plan, and at org scale, $15-25 per review gets expensive quickly.

claude-code-action@v1 is the official GitHub Action. One Claude Code session per PR event. You control the prompt, the model, the auth. Much cheaper (you pay only for API tokens), but you build the orchestration yourself.

The DIY centralized approach (this post) wraps claude-code-action in reusable workflows. One config governs the whole org. Smart triggers route PRs to different review types. You get centralized prompt iteration and cost controls without touching individual repos.

AI PR reviews are essential guardrails at this point. The only question is how much control and budget you want to keep over them.


The three-tier architecture

The design separates concerns into three layers:

graph TB
    subgraph "Repository Level"
        PR[Pull Request Event]
        Comment[PR Comment]
        Trigger[Trigger Workflow]
    end

    subgraph "Organization Level"
        Central[Central Workflow]
        GHA[GitHub App]
        Auth[Auth Provider]
    end

    subgraph "AI Provider"
        API[Claude API]
        Claude[Claude Model]
    end

    subgraph "Response"
        Review[PR Review]
        Inline[Inline Comments]
        Status[Status Check]
    end

    PR --> Trigger
    Comment --> Trigger
    Trigger -->|"Calls"| Central
    Central -->|"Authenticates"| GHA
    Central -->|"Credentials"| Auth
    Auth -->|"API Call"| API
    API -->|"Invokes"| Claude
    Claude -->|"Returns Analysis"| API
    API -->|"Response"| Central
    Central -->|"Posts"| Review
    Central -->|"Adds"| Inline
    Central -->|"Updates"| Status

    style PR fill:#e1f5fe
    style Comment fill:#e1f5fe
    style Central fill:#fff3e0
    style Auth fill:#e8f5e9
    style Claude fill:#f3e5f5
    style Review fill:#e8f5e9

Tier 1 is a thin YAML stub in each repo. It defines which events trigger a review and delegates everything else.

Tier 2 is a reusable workflow in your org’s .github repo. It decides whether to review and what kind of review to run, based on event type, comment content, and PR size.

Tier 3 is the central review workflow: auth, prompt assembly, claude-code-action invocation, result posting.

If you’ve worked on a platform team, this pattern is familiar. A thin per-repo shim calling centralized pipeline logic. Want to change the review prompt? Update one file. Add a new review type? One file. The repos never know.


The workflows, from repo to review

Tier 1: The repo stub

This is the only file you add to each repository:

name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize, ready_for_review]
  issue_comment:
    types: [created]

jobs:
  claude-review:
    uses: your-org/.github/.github/workflows/claude-review-trigger.yml@main
    secrets: inherit

Thirteen lines. That’s the whole thing. Two event triggers and a uses: call with secrets: inherit.

Tier 2: The trigger router

The router lives in your org’s .github repo. It receives the event from Tier 1, decides if a review should happen, and picks the review type.

The gate condition filters out noise early:

if: |
  (github.event_name == 'pull_request' &&
   github.event.pull_request.draft == false) ||
  (github.event_name == 'issue_comment' &&
   github.event.issue.pull_request &&
   contains(github.event.comment.body, '@claude'))

Draft PRs? Skipped. Comments on issues (not PRs)? Skipped. Only @claude mentions on pull requests pass through.

For automatic PR events, size determines the review type:

ADDITIONS="${{ github.event.pull_request.additions }}"
DELETIONS="${{ github.event.pull_request.deletions }}"
TOTAL_CHANGES=$((ADDITIONS + DELETIONS))

if [ "$TOTAL_CHANGES" -lt 50 ]; then
  REVIEW_TYPE="quick"
elif [ "$TOTAL_CHANGES" -gt 1000 ]; then
  REVIEW_TYPE="comprehensive"
else
  REVIEW_TYPE="comprehensive"
fi

For comment-triggered reviews, the type is parsed from the comment:

COMMENT_LOWER=$(echo "$COMMENT_BODY" | tr '[:upper:]' '[:lower:]')

if [[ "$COMMENT_LOWER" == *"security"* ]]; then
  REVIEW_TYPE="security"
elif [[ "$COMMENT_LOWER" == *"performance"* ]]; then
  REVIEW_TYPE="performance"
elif [[ "$COMMENT_LOWER" == *"quick"* ]]; then
  REVIEW_TYPE="quick"
else
  REVIEW_TYPE="comprehensive"
fi

@claude security gets a security-focused review. @claude quick gets a fast pass. Just @claude defaults to comprehensive.
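Wrapped in a function, the routing above is easy to exercise locally before wiring it into the workflow. A sketch — `resolve_review_type` is a name I'm introducing here, not part of the action:

```shell
#!/usr/bin/env bash
# Sketch: the comment-routing logic as a locally testable function.
# resolve_review_type is a hypothetical helper name, not from claude-code-action.
resolve_review_type() {
  local comment_lower
  comment_lower=$(echo "$1" | tr '[:upper:]' '[:lower:]')
  if [[ "$comment_lower" == *"security"* ]]; then
    echo "security"
  elif [[ "$comment_lower" == *"performance"* ]]; then
    echo "performance"
  elif [[ "$comment_lower" == *"quick"* ]]; then
    echo "quick"
  else
    echo "comprehensive"
  fi
}

resolve_review_type "@claude Security please"  # security
resolve_review_type "@claude quick"            # quick
resolve_review_type "@claude"                  # comprehensive
```

Note that order matters: "security" is checked before "quick", so "@claude quick security check" resolves to a security review.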

Once resolved, the router calls Tier 3 with structured inputs:

trigger-review:
  needs: check-trigger
  if: needs.check-trigger.outputs.should_review == 'true'
  uses: your-org/.github/.github/workflows/claude-review.yml@main
  with:
    pr_number: ${{ needs.check-trigger.outputs.pr_number }}
    repository: ${{ github.repository }}
    trigger_comment: ${{ needs.check-trigger.outputs.trigger_comment }}
    review_type: ${{ needs.check-trigger.outputs.review_type }}
  secrets:
    GCP_WORKLOAD_IDENTITY_PROVIDER: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
    GCP_SERVICE_ACCOUNT: ${{ secrets.GCP_SERVICE_ACCOUNT }}
    APP_ID: ${{ secrets.APP_ID }}
    APP_PRIVATE_KEY: ${{ secrets.APP_PRIVATE_KEY }}

Tier 3: The central review

This is where claude-code-action actually runs. The workflow checks out the PR branch, authenticates, and invokes the action with a dynamic prompt:

- name: Claude Code Review
  uses: anthropics/claude-code-action@v1
  with:
    github_token: ${{ steps.app-token.outputs.token }}
    prompt: |
      You are an expert code reviewer. Review PR #${{ inputs.pr_number }}.

      ## Review Context
      - **Review Type**: ${{ inputs.review_type }}
      - **Focus**: ${{ steps.review-context.outputs.FOCUS_AREAS }}

      ${{ steps.review-context.outputs.REVIEW_DEPTH }}

      Structure your review as:
      ### Summary
      ### Issues Found
      ### Suggestions
      ### Review Metrics
      - **Risk Level**: Low/Medium/High
      - **Code Quality**: 1-10
      - **Recommendation**: Approve / Request Changes / Needs Discussion

    claude_args: |
      --model claude-sonnet-4-5@20250929
      --allowedTools "Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*),Bash(git log:*),Bash(git diff:*),Read,Grep,Glob"

The allowedTools list is deliberate. Claude can read code, run git commands, and post comments, but it can’t modify files, run arbitrary commands, or hit the network. Read-only reviewer by design.

The workflow also posts a failure notice to the PR with common troubleshooting steps when something breaks, so developers aren’t left guessing.
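A minimal version of that failure step might look like this — a sketch assuming the App token from the auth step; the step name and comment body are illustrative:

```yaml
# Sketch: post a troubleshooting comment when the review job fails.
# Step names and the comment body are illustrative, not from the actual workflow.
- name: Notify on failure
  if: failure()
  env:
    GH_TOKEN: ${{ steps.app-token.outputs.token }}
  run: |
    gh pr comment "${{ inputs.pr_number }}" \
      --repo "${{ inputs.repository }}" \
      --body "AI review failed. Common causes: expired cloud credentials, missing GitHub App permissions, or a diff too large for the model. Check the workflow run logs."
```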


The prompt and review types

The prompt is where you actually tune what the reviewer does. Everything else is plumbing.

A Prepare Review Context step maps the review type to focus areas and depth instructions before the action runs. The routing logic:

graph TD
    Start[PR Event / Comment] --> Check{Trigger Type?}

    Check -->|Comment| Parse[Parse Comment]
    Check -->|PR Event| Size[Check PR Size]

    Parse --> Security{Contains 'security'?}

    Security -->|Yes| SecReview[Security Review]
    Security -->|No| Perf{Contains 'performance'?}
    Perf -->|Yes| PerfReview[Performance Review]
    Perf -->|No| Quick{Contains 'quick'?}
    Quick -->|Yes| QuickReview[Quick Review]
    Quick -->|No| CompReview[Comprehensive Review]

    Size --> Small{< 50 lines?}
    Small -->|Yes| QuickReview
    Small -->|No| Large{> 1000 lines?}
    Large -->|Yes| CompReview
    Large -->|No| CompReview

    style Start fill:#e1f5fe
    style SecReview fill:#ffebee
    style PerfReview fill:#fff3e0
    style QuickReview fill:#e8f5e9
    style CompReview fill:#f3e5f5

| Review type | Trigger | Focus areas |
| --- | --- | --- |
| Quick | Auto (<50 lines) or @claude quick | Obvious bugs, basic code quality, critical problems |
| Comprehensive | Auto (default) or @claude review | Code quality, security, performance, testing, docs, architecture |
| Security | @claude security | Vulnerabilities, auth issues, input validation, data exposure, dependencies |
| Performance | @claude performance | Bottlenecks, algorithm efficiency, query optimization, caching, memory |

The mapping in code:

case "${{ inputs.review_type }}" in
  security)
    FOCUS_AREAS="security vulnerabilities, authentication issues, input validation, \
SQL injection, XSS, CSRF, sensitive data exposure, and dependency vulnerabilities"
    REVIEW_DEPTH="Perform a thorough security audit."
    ;;
  performance)
    FOCUS_AREAS="performance bottlenecks, algorithm efficiency, database query \
optimization, caching opportunities, memory usage, and scalability concerns"
    REVIEW_DEPTH="Focus on performance implications."
    ;;
  quick)
    FOCUS_AREAS="obvious bugs, basic code quality issues, and critical problems"
    REVIEW_DEPTH="Provide a quick, high-level review."
    ;;
  *)
    FOCUS_AREAS="code quality, best practices, security vulnerabilities, performance, \
testing, documentation, and architecture"
    REVIEW_DEPTH="Perform a comprehensive review."
    ;;
esac

The structured output format (summary, issues, suggestions, metrics) keeps reviews scannable. Risk level and recommendation fields make triage fast without reading the full review.


Authentication and provider choice

Two things need auth: the AI provider and GitHub.

Claude provider

claude-code-action supports three providers. Pick one:

  • Direct API key: set ANTHROPIC_API_KEY as a secret. Simplest way to get started.
  • Vertex AI via OIDC: keyless, GCP-native. Workload Identity Federation means your workflow never holds a long-lived key.
  • Bedrock: AWS-native. Use IAM roles.

For Vertex AI with keyless auth, the workflow uses Google’s OIDC integration:

- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
    service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}

- name: Claude Code Review
  uses: anthropics/claude-code-action@v1
  env:
    ANTHROPIC_VERTEX_PROJECT_ID: your-gcp-project
    CLOUD_ML_REGION: global
  with:
    use_vertex: "true"

No API keys stored anywhere. GitHub’s OIDC token gets exchanged for short-lived GCP credentials at runtime. The GitHub OIDC docs walk through the Workload Identity Federation setup.

GitHub App

Regardless of provider, you need a GitHub App for PR write permissions. The built-in GITHUB_TOKEN can’t reliably post comments on PRs from reusable workflows across repos. A GitHub App gives you a stable identity, fine-grained permissions, and works across the org.

- name: Generate GitHub App token
  uses: actions/create-github-app-token@v2
  with:
    app-id: ${{ secrets.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}
    repositories: ${{ github.event.repository.name }}
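The generated token then replaces GITHUB_TOKEN wherever the workflow touches the PR. A sketch of how it threads through checkout and the review step:

```yaml
# Sketch: thread the App token through checkout and the review action.
- name: Checkout PR branch
  uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}
    ref: refs/pull/${{ inputs.pr_number }}/head

- name: Claude Code Review
  uses: anthropics/claude-code-action@v1
  with:
    github_token: ${{ steps.app-token.outputs.token }}
```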

Keeping costs under control

This is the real reason to build it yourself.

A single Claude session per review costs roughly $1-3 depending on PR size and review type. Compare that to $15-25 for the native multi-agent system. At 50 PRs/week across an org, that’s $50-150/week vs $750-1,250/week. A 10x difference.

Here’s how to keep it predictable:

Skip conditions. The Tier 2 router already skips draft PRs. I also skip bot-authored PRs (github.actor checks) and WIP branches (title prefix matching). No point burning tokens on Dependabot bumps.
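Extending the Tier 2 gate for those cases might look like this — a sketch; adjust the bot list and prefix convention to your org:

```yaml
# Sketch: additional skip conditions layered onto the Tier 2 gate.
if: |
  github.event.pull_request.draft == false &&
  github.actor != 'dependabot[bot]' &&
  github.actor != 'renovate[bot]' &&
  !startsWith(github.event.pull_request.title, 'WIP')
```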

Size-based routing. Small PRs (<50 lines) get a quick review: shorter prompt, fewer tokens, faster turnaround. Only large or explicitly requested PRs get the full treatment.

Debounce. Reviews fire on ready_for_review, not on every synchronize during development. This alone cut my review volume in half.
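In trigger terms, debouncing just means trimming the stub's event list — a sketch of the variant that skips per-push re-reviews:

```yaml
# Sketch: a debounced Tier 1 stub that doesn't re-review on every push.
on:
  pull_request:
    types: [opened, ready_for_review]
  issue_comment:
    types: [created]
```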

Model selection. Sonnet for routine reviews. Opus for security reviews or anything over 1000 lines where deeper reasoning matters. One-line change in the central workflow.
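In the central workflow, that one-line change can be a small conditional in the context step. A sketch — the function name is mine, and the model IDs are examples to substitute with whatever your provider expects:

```shell
#!/usr/bin/env bash
# Sketch: route security reviews and very large PRs to a stronger model.
# pick_model is a hypothetical helper; model IDs are placeholders.
pick_model() {
  local review_type="$1" total_changes="$2"
  if [ "$review_type" = "security" ] || [ "$total_changes" -gt 1000 ]; then
    echo "claude-opus-4-1"
  else
    echo "claude-sonnet-4-5"
  fi
}

pick_model security 10        # claude-opus-4-1
pick_model comprehensive 300  # claude-sonnet-4-5
```

The result feeds straight into the `--model` flag in claude_args.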

You can also log token usage as workflow artifacts for budget tracking, but honestly, at $1-3 per review, I haven’t needed to watch it closely.
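If you do want tracking, a pair of steps like this can stash a per-run record — a sketch; the file name and JSON shape are invented for illustration:

```yaml
# Sketch: keep a per-run review record as a workflow artifact.
# The file path and JSON fields here are illustrative.
- name: Record review metadata
  if: always()
  run: |
    echo "{\"repo\":\"${{ github.repository }}\",\"pr\":\"${{ inputs.pr_number }}\",\"type\":\"${{ inputs.review_type }}\",\"run\":\"${{ github.run_id }}\"}" > review-usage.json

- name: Upload usage record
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: review-usage-${{ github.run_id }}
    path: review-usage.json
```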


Lessons learned

  1. Start with quick reviews. Get auth, permissions, and routing working before tuning prompts. A quick review that posts “LGTM, no major issues” proves the whole pipeline end-to-end. I spent two days debugging OIDC before writing a single line of prompt.

  2. Make reviews advisory, never blocking. Don’t gate merges on AI reviews. Teams adopt advisory tools faster, and you avoid the politics of “the bot blocked my PR.”

  3. Iterate prompts centrally. The whole point of this architecture. When reviews miss a class of issues, you update the prompt in one place. Every repo benefits immediately.

  4. The 13-line stub is a feature. Repo owners don’t need to understand the review system. They drop in a file and get reviews. When the central team ships an improvement, every repo gets it for free.

  5. Watch for hallucinated line numbers. Claude sometimes references lines that don’t exist in the diff. The allowedTools constraint helps (Claude can gh pr diff to see actual changes), but you’ll still see occasional misses. Inline comments via claude-code-action’s built-in GitHub MCP tools are more reliable than asking the model to format line references manually.




I’m still iterating on prompt quality and thinking about adding model routing based on file types (Opus for infra changes, Sonnet for everything else). If you’ve built something similar, I’d like to hear what worked; find me on GitHub.