🤖 AI‑Powered Code Reviews: How Gleez Leverages LLMs to Boost Code Quality

Posted on June 22, 2025 by Gleez Team
Tags: AI, Code Review, LLM, DevOps, Software Quality, Machine Learning

In a world where AI in software engineering is moving from novelty to necessity, Gleez has taken a pragmatic step: embedding large language models (LLMs) directly into our continuous‑integration (CI) pipeline to act as an intelligent code reviewer. Below we walk through the architecture, the day‑to‑day workflow, the hard numbers we’ve seen, and the lessons we’ve learned along the way.

📚 The Problem We Set Out to Solve

Even with rigorous peer review, a non‑trivial number of defects still slip through, especially subtle issues such as:

  • Inconsistent naming conventions
  • Missed edge‑case handling
  • Boilerplate duplication
  • Security‑related anti‑patterns (e.g., unsafe deserialization)

Manual reviewers can miss these patterns due to cognitive overload, time pressure, or simply because they’re not experts in every domain of the codebase. We wanted a scalable, consistent, and fast safety net that could surface these problems before they reached production.


🏗️ Architecture Overview

git push ──► CI (GitHub Actions / GitLab CI)
                │
                ▼
          LLM‑Reviewer Service
                │
   ┌────────────┴─────────────┐
   │                          │
Prompt Engine            Human‑in‑the‑Loop
   │                          │
   ▼                          ▼
LLM (GPT‑4‑Turbo)        Slack / Email Alerts
   │
   ▼
Review Report (JSON) ──► PR Comment Bot

Key components

  • Prompt Engine – Generates a concise, context‑aware prompt for the LLM, injecting the diff, coding standards, and any relevant metadata (e.g., target runtime, security policy).
  • LLM‑Reviewer Service – Calls the selected LLM (currently OpenAI GPT‑4‑Turbo) with the crafted prompt and receives structured feedback.
  • Human‑in‑the‑Loop – Flags high‑severity findings for senior engineers via Slack and lets them approve, reject, or augment the AI suggestions.
  • PR Comment Bot – Posts the LLM’s suggestions as inline comments on the pull request, clearly marking AI‑generated content.
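
To make the data flow concrete, here is a minimal Python sketch of how the review report that travels from the LLM‑Reviewer Service to the PR Comment Bot could be modeled. The field names mirror the output schema our prompts request (line, issue_type, severity, suggestion, confidence); the class and helper names are illustrative, not our exact implementation.

from dataclasses import dataclass
from typing import List
import json

@dataclass
class Finding:
    """One AI-generated review finding, mirroring the JSON output schema."""
    line: int
    issue_type: str    # e.g. "Security", "Naming", "Duplication"
    severity: str      # "Low" | "Medium" | "High"
    suggestion: str
    confidence: float  # 0.0 to 1.0

def parse_review_report(raw_json: str) -> List[Finding]:
    """Parse the LLM's JSON array into typed findings (hypothetical helper)."""
    return [Finding(**item) for item in json.loads(raw_json)]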

✍️ Prompt Engineering – The Secret Sauce

A good prompt is the difference between a generic “looks fine” response and a pinpointed, actionable recommendation. Our prompt template includes:

  1. Diff snippet – Only the changed lines (max 300 tokens).
  2. Project style guide excerpt – Enforced naming, lint rules, and architectural constraints.
  3. Specific checks – e.g., “Detect any insecure deserialization,” “Identify duplicated logic across modules.”
  4. Output schema – JSON array with fields: line, issue_type, severity, suggestion, confidence.

Example excerpt:

You are an expert software engineer reviewing a Python pull request. 
Only analyze the following diff (max 300 tokens):
--- a/app/auth.py
+++ b/app/auth.py
@@ -12,7 +12,9 @@
 def login(user):
-    token = generate_token(user)
+    token = generate_token(user, expires_in=3600)
     return token
Check for:
- security anti‑patterns,
- violation of the project's naming convention (snake_case),
- duplicated logic.
Return findings as JSON:
[
  {"line": 15, "issue_type": "Security", "severity": "High", "suggestion": "...", "confidence": 0.94},
  ...
]
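
Under the hood, the Prompt Engine simply stitches these four pieces together. The Python sketch below shows one way to assemble such a prompt; the function and variable names are illustrative, and the 300‑token cap is approximated by a crude character budget rather than a real tokenizer.

# Minimal prompt-assembly sketch (illustrative names, not Gleez's exact code).
MAX_DIFF_CHARS = 1200  # rough stand-in for the ~300-token cap on the diff

OUTPUT_SCHEMA_HINT = (
    "Return findings as a JSON array of objects with fields: "
    "line, issue_type, severity, suggestion, confidence."
)

def build_review_prompt(diff: str, style_guide_excerpt: str, checks: list[str]) -> str:
    """Combine the diff, style-guide excerpt, and specific checks into one prompt."""
    truncated_diff = diff[:MAX_DIFF_CHARS]  # keep the prompt within the token budget
    check_lines = "\n".join(f"- {c}" for c in checks)
    return (
        "You are an expert software engineer reviewing a pull request.\n"
        f"Project style guide (excerpt):\n{style_guide_excerpt}\n\n"
        f"Only analyze the following diff:\n{truncated_diff}\n\n"
        f"Check for:\n{check_lines}\n\n"
        + OUTPUT_SCHEMA_HINT
    )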

🔄 CI Integration

  1. Trigger – On every PR opened or updated, the CI job runs run-llm-reviewer.sh.
  2. Execution – The script extracts the diff, calls the Prompt Engine service, and sends the request to the LLM.
  3. Result handling – The JSON response is parsed; high‑severity items (severity = High) fire a Slack alert to the designated reviewer group.
  4. Comment posting – All findings are posted back to the PR using the GitHub/GitLab API, prefixed with an 🤖 AI Review: banner.

The entire process adds ≈ 45 seconds to the CI runtime—well within our acceptable latency budget.
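
Put together, steps 3 and 4 boil down to a short script. The Python sketch below assumes the standard GitHub REST API for comment posting and a Slack incoming webhook for alerts; the environment‑variable names and the helper function are placeholders, not our production code.

import os
import requests  # assumed to be available in the CI image

SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]  # placeholder variable names
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

def handle_review(findings: list[dict], repo: str, pr_number: int) -> None:
    """Alert on high-severity findings and post everything back to the PR."""
    high = [f for f in findings if f["severity"] == "High"]
    if high:
        requests.post(SLACK_WEBHOOK, json={
            "text": f"AI Review: {len(high)} high-severity finding(s) on {repo}#{pr_number}",
        })

    body = "🤖 AI Review:\n" + "\n".join(
        f"- line {f['line']}: [{f['severity']}] {f['suggestion']}" for f in findings
    )
    requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {GITHUB_TOKEN}"},
        json={"body": body},
    )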

📈 Measurable Outcomes

Before the LLM reviewer → after three months:

  • Average bugs per release: 27 → 19 (≈ 30 % reduction)
  • Time to detect a critical security issue: 4.2 days (post‑merge) → 1.1 days (pre‑merge)
  • Review turnaround time per PR: 2.8 hrs (human only) → 3.2 hrs (human + AI, with the AI handling ~60 % of comments)
  • Engineer satisfaction (survey): 78 % happy → 86 % happy (citing “instant feedback”)

Key takeaway: The LLM does not replace human reviewers; it augments them, allowing engineers to focus on architectural decisions while the AI catches repetitive, rule‑based issues.

🧩 Lessons Learned

  1. Prompt stability matters – Small wording changes caused the LLM to drift. We now lock prompts in version control and run regression tests on sample diffs.
  2. Confidence thresholds – Not all suggestions are equally reliable. We filter out anything below a 0.85 confidence score unless a senior engineer explicitly requests a full review (a minimal filter is sketched after this list).
  3. Human‑in‑the‑Loop is non‑negotiable – Fully automated merges led to a few false positives early on. Adding a manual approval step for high‑severity alerts restored trust.
  4. Cost monitoring – LLM API usage can balloon. We cap token usage per PR and batch low‑severity findings for nightly processing.
  5. Continuous feedback loop – Engineers can upvote/downvote AI suggestions directly in the PR comment. This feedback is fed back into prompt refinements and a small fine‑tuning dataset we maintain internally.
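
The confidence filter from lesson 2 is essentially a one‑liner in code. The sketch below uses the 0.85 threshold mentioned above; the function name and the force_full_review flag are illustrative.

CONFIDENCE_THRESHOLD = 0.85  # findings below this are dropped by default

def filter_findings(findings: list[dict], force_full_review: bool = False) -> list[dict]:
    """Drop low-confidence suggestions unless a full review was explicitly requested."""
    if force_full_review:
        return findings
    return [f for f in findings if f["confidence"] >= CONFIDENCE_THRESHOLD]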

🚀 Getting Started at Your Organization

If you’re interested in replicating a similar setup:

  1. Pick an LLM provider – OpenAI, Anthropic, or an open‑source model hosted on your own GPU cluster.
  2. Define your style guide – Encode it as a machine‑readable JSON/YAML so the Prompt Engine can inject relevant sections.
  3. Build a lightweight Prompt Service – A Flask/FastAPI app that receives diffs, assembles prompts, and returns structured responses (a skeleton is sketched after this list).
  4. Integrate with CI – Add a single job step that calls the service and posts results back to the PR.
  5. Iterate – Start with a narrow set of checks (e.g., lint violations) and expand to security, performance, and architectural patterns over time.
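
For step 3, a skeleton Prompt Service can be very small. The FastAPI sketch below only receives a diff and returns an assembled prompt; wiring in the actual LLM call and your machine‑readable style guide is left out, and the route, model, and default checks are illustrative.

# Skeleton Prompt Service (FastAPI) - illustrative, not a reference implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ReviewRequest(BaseModel):
    diff: str
    checks: list[str] = ["security anti-patterns", "naming convention", "duplicated logic"]

@app.post("/prompt")
def make_prompt(req: ReviewRequest) -> dict:
    """Assemble and return the review prompt; the caller forwards it to the LLM."""
    check_lines = "\n".join(f"- {c}" for c in req.checks)
    prompt = (
        "You are an expert software engineer reviewing a pull request.\n"
        f"Only analyze the following diff:\n{req.diff}\n\n"
        f"Check for:\n{check_lines}\n"
        "Return findings as JSON with fields: line, issue_type, severity, suggestion, confidence."
    )
    return {"prompt": prompt}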

🎉 Closing Thoughts

AI‑powered code reviews are no longer a futuristic buzzword—they’re a practical tool that reduces defects, accelerates feedback, and frees engineers to solve higher‑impact problems. At Gleez, the journey taught us that success hinges on thoughtful prompt engineering, tight CI integration, and a respectful human‑in‑the‑loop.

Give it a try, measure your own outcomes, and join the growing community of teams building safer, faster software with LLMs. Happy coding!


Ready to see the AI reviewer in action? Check out our open‑source demo repo: https://github.com/gleez/ai-code-review-demo