Code Review
This guide introduces you to generating code-change suggestions with an LLM, from a resilience and reliability angle.
The changes are proposed as unified diffs that help you visualize what fault suggests you add to or remove from your code.
Prerequisites
-
Install fault
If you haven’t installed fault yet, follow the installation instructions.
-
Get an OpenAI Key
For the purpose of this guide, we will be using OpenAI models. You need to create an API key, then make sure the key is available to fault.
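Most OpenAI clients read the key from the standard `OPENAI_API_KEY` environment variable; assuming fault follows that convention, export it in the shell where you will run the agent commands:

```bash
# assumes fault picks up the standard OpenAI environment variable
export OPENAI_API_KEY="sk-..."
```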
-
Install a local qdrant database
fault uses qdrant as its vector database. You can run a local, free qdrant instance using Docker.
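For example, following qdrant's standard Docker quick start (the default ports 6333/6334 are an assumption about what fault expects):

```bash
# run a local qdrant instance, exposing the default REST (6333) and gRPC (6334) ports
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```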
Windows not supported
Unfortunately, the agent feature is not supported on Windows because the framework used by fault to interact with LLMs does not support that platform.
Experimental feature
This feature is still experimental and is subject to change. Dealing with LLMs requires accepting a level of fuzziness and adjustment. Engineering is still very much a human endeavour!
Review a Python Web Application
In this scenario we take a very basic Python application built with the FastAPI and SQLAlchemy (SQLite) libraries, and see what we can learn about it.
-
Source code of the application
webapp/app.py

```python
#!/usr/bin/env -S uv run --script
# /// script
# dependencies = [
#   "uvicorn",
#   "fastapi[standard]",
#   "sqlalchemy"
# ]
# ///
###############################################################################
#
# Very basic application that expose a couple of endpoints that you can
# use to test fault.
# Once you have installed `uv` https://docs.astral.sh/uv/, simply run the
# application as follows:
#
# uv run --script app.py
#
###############################################################################
from typing import Annotated

import uvicorn
from fastapi import FastAPI, HTTPException, Depends, status, Body
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker, Session
from sqlalchemy.exc import SQLAlchemyError

###############################################################################
# Database configuration
###############################################################################
engine = create_engine("sqlite:///./test.db")
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()


###############################################################################
# Data model
###############################################################################
class User(Base):
    __tablename__ = "users"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, index=True)
    password = Column(String)


Base.metadata.create_all(bind=engine)


###############################################################################
# Dependency injection
###############################################################################
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


###############################################################################
# Our application
###############################################################################
app = FastAPI(servers=[{"url": "http://localhost:9090"}])


@app.get("/")
async def index() -> dict[str, str]:
    return {"message": "Hello, World!"}


@app.post("/users/")
async def create_user(
    name: Annotated[str, Body()],
    password: Annotated[str, Body()],
    db: sessionmaker[Session] = Depends(get_db)
):
    db_user = User(name=name, password=password)
    db.add(db_user)
    db.commit()
    db.refresh(db_user)
    return db_user


@app.get("/users/{user_id}")
async def read_user(
    user_id: int,
    db: sessionmaker[Session] = Depends(get_db)
):
    try:
        user = db.query(User).filter(User.id == user_id).first()
        if user is None:
            raise HTTPException(status_code=status.HTTP_404_NOT_FOUND)
        return user
    except SQLAlchemyError as e:
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR)


if __name__ == "__main__":
    uvicorn.run("app:app", port=9090)
```
You may now install the dependencies needed to run it.
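If you are not relying on uv to manage the script's inline dependencies, you can install them yourself; the package names below are taken from the script's dependency header, and pip is just one option:

```bash
# packages listed in the script's inline dependency header
pip install uvicorn "fastapi[standard]" sqlalchemy
```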
Finally, run the application as follows:
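```bash
# command taken from the script's header comment; uv resolves the
# declared dependencies automatically on first run
uv run --script app.py
```

This is the same invocation given in the script's header comment above.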
This application has only a couple of endpoints and is purposefully not optimised.
-
Generate a scenario for this application
We must first generate and run a scenario so we get a mapping of the application.
-
Run the scenario against this application
-
Review the code and make suggestions
fault reviews the code by chunking it, indexing it and then asking your favourite LLM for its opinion.
```bash
fault agent code-review \
    --results results.json \ # (1)!
    --source-dir webapp/ \ # (2)!
    --source-lang python # (3)!

⠸ [00:00:34] Reviewing! This could take a while...
> index__get [GET http://localhost:9090/]
  create_user_users__post [POST http://localhost:9090/users/]
  read_user_users__user_id__get [GET http://localhost:9090/users/{user_id}]
[↑↓ to move, enter to select, type to filter]
```
- The results from the previous scenario execution
- The top-level directory where the application's code lives
- The application's language (someday a heuristic will guess this)
- Use the arrow keys to select the endpoint you want to review and patch
Assuming we select the first endpoint, fault suggests the following changes:
Generated code review changes report
Below is a set of targeted, small-scope changes you can drop almost verbatim into your `app.py` to dramatically improve resilience, prevent cascading slow-downs, and give you early warning via metrics and tracing. Each change is explained, and I’ve ordered them by immediate impact.
1) Increase and tune your DB-connection pool
Right now you are using SQLite’s default pool (QueuePool size=5). Under even moderate latency or I/O hiccups you’ll saturate your pool and stall new requests forever.
```diff
--- a/app.py
+++ b/app.py
@@
 ### Database configuration
-engine = create_engine("sqlite:///./test.db")
+engine = create_engine(
+    "sqlite:///./test.db",
+    # allow up to 10 concurrent connections
+    pool_size=10,
+    # and burst to 20 before failing
+    max_overflow=10,
+    # wait up to 5s for a free connection
+    pool_timeout=5,
+)
 SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
```
Why: prevents “head of line” blocking when one connection is slow (disk I/O hiccup, GC pause, etc.).
2) Bound every DB call with a per-request timeout
Even with a bigger pool, a hung query will hold its slot indefinitely. Wrapping your sync calls in `asyncio.to_thread` + `asyncio.wait_for` guarantees a hard cap.

```diff
--- a/app.py
+++ b/app.py
 import asyncio
 from fastapi import HTTPException, status
 from sqlalchemy.exc import SQLAlchemyError

+# helper that runs sync code in a thread
+def _sync_read_user(db, user_id: int):
+    user = db.query(User).filter(User.id == user_id).first()
+    if user is None:
+        raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail="user not found")
+    return user

 @app.get("/users/{user_id}")
 async def read_user(
     user_id: int,
     db: sessionmaker[Session] = Depends(get_db)
 ):
-    try:
-        user = db.query(User).filter(User.id == user_id).first()
-        if user is None:
-            raise HTTPException(status_code=status.HTTP_404_NOT_FOUND)
-        return user
-    except SQLAlchemyError as e:
-        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR)
+    try:
+        # cap the entire DB roundtrip at 2s
+        return await asyncio.wait_for(
+            asyncio.to_thread(_sync_read_user, db, user_id),
+            timeout=2.0,
+        )
+    except asyncio.TimeoutError:
+        # fast-fail slow queries
+        raise HTTPException(status_code=status.HTTP_504_GATEWAY_TIMEOUT, detail="DB request timed out")
+    except SQLAlchemyError:
+        # catch transient DB errors
+        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="DB error")
```
Why: prevents a black-hole or very slow query from chewing your entire worker pool and pushing p95 latency off the charts.
3) Add idempotent retries with exponential back-off to your GET
`read_user` is idempotent. A retry on a transient DB or I/O error will hide packet drops, brief database locks, file-system stalls, etc.
- Install `tenacity` (e.g. `pip install tenacity`).
- Wrap the in-thread helper:
```diff
--- a/app.py
+++ b/app.py
 from tenacity import (
     retry,
     retry_if_exception_type,
     wait_exponential,
     stop_after_attempt,
 )

+# retry only on SQLAlchemy transient errors
 @retry(
     retry=retry_if_exception_type(SQLAlchemyError),
     wait=wait_exponential(multiplier=0.1, max=1.0),
     stop=stop_after_attempt(3),
     reraise=True,
 )
 def _sync_read_user(db, user_id: int):
@@
     user = db.query(User).filter(User.id == user_id).first()
     if user is None:
         raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail="user not found")
     return user
```
Why: this will retry on deadlocks, brief locks, and dropped packets, without touching your route signature or calling code.
4) Instrument basic Prometheus metrics + tracing
Early warning is critical: track latencies, error rates and retry counts so you can alert before your users notice.
- Install the required packages, e.g. `pip install starlette-exporter prometheus-client opentelemetry-sdk`.
- Add to the top of `app.py`:
```python
from starlette_exporter import PrometheusMiddleware, handle_metrics
from prometheus_client import Counter, Histogram
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# -- tracing -----------------------------------------------------------------
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer(__name__)

# -- metrics -----------------------------------------------------------------
app.add_middleware(PrometheusMiddleware)
app.add_route("/metrics", handle_metrics)

REQUEST_LATENCY = Histogram(
    "http_request_latency_seconds",
    "Latencies by method and path",
    ["method", "endpoint"],
)
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total requests by method, path and status",
    ["method", "endpoint", "http_status"],
)


@app.middleware("http")
async def metrics_middleware(request, call_next):
    labels = {"method": request.method, "endpoint": request.url.path}
    with REQUEST_LATENCY.labels(**labels).time():
        response = await call_next(request)
    REQUEST_COUNT.labels(
        **labels, http_status=response.status_code
    ).inc()
    return response
```
Why: you’ll immediately see tail-latency spikes (p95/p99), error surges, retry storms or pool-timeouts in your dashboards.
5) (Optional) Circuit-breaker for sustained failures
When your DB goes down for more than a few seconds, retries only amplify pressure. A circuit-breaker will short-circuit and fast-fail until the DB recovers.
```diff
--- a/app.py
+++ b/app.py
 from pybreaker import CircuitBreaker

 # trip breaker after 5 errors, reset after 30s
 db_breaker = CircuitBreaker(fail_max=5, reset_timeout=30)

-@retry(...)
-def _sync_read_user(...):
+@db_breaker
+@retry(...)
+def _sync_read_user(...):
     ...
```
Why: prevents retry avalanches against an already-failing downstream.
Summary of Impact
- Pool tuning + per-call timeouts → no more “stair-step” tail-latency under slow queries.
- Idempotent retries → hide small network/DB blips.
- Metrics & tracing → proactive alerting.
- Circuit-breaker → quick fail-fast during true outages.
With these four “drop-in” changes you’ll eliminate head-of-line blocking, dramatically reduce error SLO violations on `GET /users/{user_id}`, and gain visibility into when—and why—your service is struggling.
-
Optionally Inject Advice from Scenario Execution Analysis
You may improve the generated review by injecting the analysis produced from a scenario executed against the source code under review. This gives the LLM much more context to reason about.
Run an analysis from a past scenario's results:
This will generate a file called `scenario-review-report.md`. Now you can inject this file into the code review command line:

```bash
fault agent code-review \
    --results results.json \
    --source-dir webapp/ \
    --source-lang python \
    --scenario-review-report scenario-review-report.md # (1)!

? Select the OpenAPI operationId to patch:
> read_root__get [GET http://localhost:9090/]
  create_user_users__post [POST http://localhost:9090/users/]
  read_user_users__user_id__get [GET http://localhost:9090/users/{user_id}]
[↑↓ to move, enter to select, type to filter]
```
- Pass the generated report. You can omit this flag; if the file exists in the current directory, it will be read automatically.
Assuming we again select the first endpoint, fault now suggests the following changes:
Generated code review changes report after scenario analysis
Here are four focused, minimally-intrusive changes you can make today to dramatically improve resilience, reliability and observability in your FastAPI/SQLAlchemy app.
-
Wrap every transaction in an explicit context manager and roll back on failure
Right now, `create_user` commits without any error handling: if `commit()` fails, you never roll back, leaving the session in an invalid state. Instead use:
```python
from sqlalchemy.exc import SQLAlchemyError


@app.post("/users/")
async def create_user(
    name: str = Body(...),
    password: str = Body(...),
    db: Session = Depends(get_db),
):
    try:
        # begin() will automatically rollback on exception
        with db.begin():
            user = User(name=name, password=password)
            db.add(user)
        # now safe to refresh
        db.refresh(user)
        return user
    except SQLAlchemyError as e:
        # session.rollback() already called by begin()
        # you can log e here
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="could not create user",
        )
```
-
Add a simple retry with exponential back-off around commits
Transient “database is locked” errors in SQLite (and some cloud-SQL networks) can often be overcome by a retry. The tenacity library gives you a one-liner:
```python
from tenacity import retry, wait_exponential, stop_after_attempt


@retry(wait=wait_exponential(multiplier=0.2, max=2), stop=stop_after_attempt(3))
def safe_commit(db: Session):
    db.commit()


@app.post("/users/")
async def create_user(...):
    try:
        with db.begin():
            user = User(...)
            db.add(user)
            # retry commit if it hits a transient lock
            safe_commit(db)
        db.refresh(user)
        return user
    except SQLAlchemyError:
        raise HTTPException(500, "db error")
```
-
Enforce a per-request timeout
A hung or extremely slow request ties up your worker. Adding a single middleware gives you a hard cap on processing time.
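A minimal sketch of such a middleware, assuming a five-second budget; the middleware name, the timeout value, and the 504 response are illustrative choices, not taken verbatim from the generated report:

```python
import asyncio

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

# illustrative hard cap on total request processing time
REQUEST_TIMEOUT_SECONDS = 5.0


class TimeoutMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        try:
            # fail fast instead of letting a slow handler hold the worker
            return await asyncio.wait_for(
                call_next(request), timeout=REQUEST_TIMEOUT_SECONDS
            )
        except asyncio.TimeoutError:
            return JSONResponse({"detail": "request timed out"}, status_code=504)


app.add_middleware(TimeoutMiddleware)
```

Requests that exceed the budget receive a 504 instead of holding a worker indefinitely.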
-
Add basic metrics and tracing hooks
Knowing “what just broke” is half the battle. Two minutes to add Prometheus metrics:
```python
import time

from prometheus_client import Counter, Histogram, make_asgi_app
from starlette.middleware import Middleware
from starlette.middleware.base import BaseHTTPMiddleware

REQUEST_COUNT = Counter("http_requests_total", "Request count", ["method", "endpoint", "status"])
REQUEST_LATENCY = Histogram("http_request_latency_seconds", "Latency", ["method", "endpoint"])


class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.time()
        response = await call_next(request)
        elapsed = time.time() - start
        key = (request.method, request.url.path, response.status_code)
        REQUEST_COUNT.labels(*key).inc()
        REQUEST_LATENCY.labels(request.method, request.url.path).observe(elapsed)
        return response


app.add_middleware(MetricsMiddleware)

# mount /metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())
```
With these four changes in place you will have:
- safe transactions that always roll back on error
- automatic retries for common transient failures
- a hard deadline for every HTTP call
- real-time metrics you can hook into your alerting system
Tip
In a future release, fault will be able to apply and test the changes to verify they can be used safely.