Reflection

An LLM reviews its own output to identify errors and improvements through self-critique and revision cycles.

●●●○○ Complexity

Overview

Reflection is a pattern in which an LLM reviews its own output to identify errors, gaps, or opportunities for improvement. After generating an initial response, the model (or a separate evaluator model) is prompted to critically examine what it produced, point out specific weaknesses, and then generate an improved version. Reflection can be implemented as self-reflection, where the same model critiques its own output in a follow-up call, or as external reflection, where a second model or a separate prompt acts as the reviewer. The pattern leverages the observation that LLMs are often better at spotting flaws in existing text than at avoiding those flaws during initial generation.

How It Works

  1. Generate the initial output. The LLM produces a first-pass response to the task.
  2. Prompt for reflection. In a follow-up call, present the original task and the generated output, and ask the model to critically review it. The reflection prompt should direct the model to look for specific issues: factual errors, logical gaps, missing edge cases, unclear explanations, or violations of stated constraints.
  3. Produce a critique. The model outputs a structured critique — a list of identified issues with specific references to what is wrong and why.
  4. Revise based on the critique. Feed the original output and the critique back to the model (or a new call) and ask it to produce an improved version that addresses all identified issues.
  5. Optionally repeat. For tasks where quality is paramount, run additional reflection-revision cycles. In practice, one or two reflection rounds capture most of the improvement; returns diminish rapidly after that.
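The steps above can be sketched as a single loop. In this sketch, `fake_llm` is a deterministic stand-in for a real model call (not a real API); the early-exit check mirrors step 5's diminishing returns.

```python
# Minimal reflect-revise loop. `fake_llm` is a stand-in model: it flags a
# missing docstring once, then approves; swap in a real LLM client.

def fake_llm(prompt: str) -> str:
    if prompt.startswith("Review") and '"""' not in prompt:
        return "Issue: the function is missing a docstring."
    if prompt.startswith("Review"):
        return "No issues found."
    if prompt.startswith("Revise"):
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b):\n    return a + b"  # initial generation

def reflect_loop(task: str, max_rounds: int = 2) -> str:
    output = fake_llm(f"Solve: {task}")                                # step 1
    for _ in range(max_rounds):
        critique = fake_llm(f"Review this solution to '{task}':\n{output}")  # steps 2-3
        if "no issues" in critique.lower():                            # stop early
            break
        output = fake_llm(f"Revise given these issues:\n{critique}\n{output}")  # step 4
    return output

result = reflect_loop("add two numbers")
```

With a real client, the critique and revision prompts would carry the full task context, as in the worked example below.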

When to Use

  • The initial generation is often “good enough” but consistently has minor issues that a second look would catch (e.g., code with subtle bugs, text with factual inaccuracies).
  • You want to improve output quality without building a separate evaluation pipeline — reflection is the simplest form of self-improvement.
  • The model can reliably identify the types of errors you care about when explicitly asked to look for them.
  • You can afford modest extra latency beyond a single call but cannot afford a full evaluator-optimizer loop.
  • The task involves generation where correctness matters: code, technical writing, structured data extraction.

When Not to Use

  • The first-pass output is already consistently high quality — reflection adds cost without improvement.
  • The model cannot reliably identify its own errors for this task type (common with deeply specialized domains where the model lacks grounding).
  • You need formal, quantitative evaluation criteria — use Evaluator-Optimizer instead, which separates evaluation into a structured scoring step.
  • Latency constraints are very tight and even one additional round-trip is unacceptable.

Example

# Reflection: Code generation with self-review.

def generate_code(task: str) -> str:
    """Generate an initial code solution."""
    response = llm.call(
        system="You are an expert Python developer. Write clean, correct code.",
        prompt=f"Write a solution for: {task}"
    )
    return response.text

def reflect_on_code(task: str, code: str) -> str:
    """Self-reflect: critique the generated code."""
    response = llm.call(
        system=(
            "You are a meticulous code reviewer. Examine the code for bugs, edge cases, "
            "performance issues, and readability problems. List each issue with a brief explanation."
        ),
        prompt=f"Task: {task}\n\nCode to review:\n```python\n{code}\n```"
    )
    return response.text

def revise_code(task: str, code: str, critique: str) -> str:
    """Revise the code based on the self-reflection critique."""
    response = llm.call(
        system="You are an expert Python developer. Revise the code to fix all identified issues.",
        prompt=(
            f"Task: {task}\n\n"
            f"Original code:\n```python\n{code}\n```\n\n"
            f"Issues found:\n{critique}\n\n"
            "Produce a corrected version."
        )
    )
    return response.text

# Run the reflection cycle
task = "Implement a thread-safe LRU cache with a max size parameter."
initial_code = generate_code(task)
critique = reflect_on_code(task, initial_code)
improved_code = revise_code(task, initial_code, critique)

# Optionally run a second reflection round; the substring check below is a
# crude stop heuristic, and a structured "approved" flag is more robust.
second_critique = reflect_on_code(task, improved_code)
if "no issues" not in second_critique.lower():
    improved_code = revise_code(task, improved_code, second_critique)

Related Patterns

  • Evaluator-Optimizer — A more structured version of reflection with a formal evaluation step, scoring, and explicit termination criteria. Use evaluator-optimizer when you need quantitative quality gates.
  • Prompt Chaining — Reflection can be modeled as a two- or three-step chain: generate, critique, revise.
  • Parallelization — You can run multiple independent reflection rounds in parallel (voting-style) and select the best revised output.
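The parallel variant can be sketched as below. `revise` and `score` are stand-ins I am assuming for illustration, not a real API: one independent critique-and-revise round, and a numeric quality heuristic (in practice, an LLM-as-judge call or a test-suite pass rate).

```python
# Voting-style reflection: run N independent revision rounds in parallel,
# then keep the highest-scoring candidate.
from concurrent.futures import ThreadPoolExecutor
import random

def revise(code: str, seed: int) -> str:
    # Stand-in for one critique-and-revise round; a real implementation
    # would call the model with a different seed or temperature.
    rng = random.Random(seed)
    return code + f"\n# revision {seed}, quality {rng.random():.2f}"

def score(candidate: str) -> float:
    # Stand-in evaluator: parses the quality marker written by revise().
    return float(candidate.rsplit("quality ", 1)[-1])

initial = "def add(a, b):\n    return a + b"
with ThreadPoolExecutor(max_workers=3) as pool:
    candidates = list(pool.map(lambda s: revise(initial, s), range(3)))

best = max(candidates, key=score)
```

Each round uses its own seeded generator, so the workers share no state and the selection is deterministic for a given set of seeds.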