A GEPA Report from DSPy
What is GEPA?
GEPA stands for Genetic-Pareto. It is a DSPy optimizer that automatically improves the prompts/instructions of a multi-module LLM program through reflective, evolutionary search. It iteratively mutates module instructions, evaluates the changes, and keeps the best-performing candidates on a Pareto front.
What’s Happening in This Run
This file captures a GEPA optimization run on a financial news extraction system that classifies M&A (merger/acquisition) articles and extracts structured data from them.
The Program Being Optimized
The system (defined in gepa_optimize.py and extract.py) has two main modules:
- classifier: Determines if an article is a merger, acquisition, or Other (rumored/potential/non-deal)
- merger_extractor / acquisition_extractor: Extracts structured fields like company names, tickers, deal amounts, and currencies
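To make the classify-then-extract structure concrete, here is a minimal sketch in plain Python. It is not the code from gepa_optimize.py or extract.py: the `Deal` fields, `run_pipeline`, and the keyword-based stubs are hypothetical stand-ins for the LLM-backed modules, chosen only to show how an article is routed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Deal:
    """Hypothetical container for the structured fields named above."""
    deal_type: str                  # "merger" or "acquisition"
    companies: List[str]
    tickers: List[str]
    amount: Optional[float] = None
    currency: Optional[str] = None


Classifier = Callable[[str], str]   # article -> "merger" | "acquisition" | "other"
Extractor = Callable[[str], Deal]   # article -> structured fields


def run_pipeline(article: str, classifier: Classifier,
                 extractors: Dict[str, Extractor]) -> Optional[Deal]:
    """Classify first, then route to the extractor registered for that label."""
    label = classifier(article)
    extractor = extractors.get(label)
    # "other" (rumored, potential, or non-deal) has no extractor, so nothing is extracted.
    return extractor(article) if extractor else None


def toy_classifier(article: str) -> str:
    """Keyword stub standing in for the LLM classifier (assumption, not the real logic)."""
    text = article.lower()
    if "rumor" in text or "potential" in text:
        return "other"
    if "acquire" in text:
        return "acquisition"
    if "merge" in text:
        return "merger"
    return "other"


def toy_acquisition_extractor(article: str) -> Deal:
    """Hard-coded stub standing in for the LLM extractor."""
    return Deal(deal_type="acquisition", companies=["Acme", "Beta"], tickers=[])


EXTRACTORS = {"acquisition": toy_acquisition_extractor}
deal = run_pipeline("Acme agrees to acquire Beta", toy_classifier, EXTRACTORS)
skipped = run_pipeline("Acme is rumored to acquire Beta", toy_classifier, EXTRACTORS)
```

In the real system each callable would be a prompted LLM module, and it is the instructions of those modules that GEPA rewrites.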
Key Configuration
| Setting | Value |
|---|---|
| Task LM | |
| Reflection LM | |
| Optimizer mode | |
| Threads | 32 |
| Training set | 14 articles (IDs: 2,5,6,1,7,4,10,9,11,3,12,13,8,14) |
| Validation set | 10 articles (IDs: 33,29,24,34,30,25,35,31,26,36) |
The Optimization Loop
Each iteration follows this cycle:
- Select: Pick a candidate program from the Pareto front (the set of non-dominated candidates)
- Sample: Evaluate it on a small 3-example subsample (minibatch) drawn from the training set
- Reflect: Use the reflection_lm (DeepSeek V4 Pro) to analyze errors and propose improved instructions for underperforming modules
- Propose: Generate a new candidate program with updated prompts
- Validate: If the subsample score improves, run a full 10-example validation and add the candidate to the pool from which the Pareto front is drawn
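The cycle above can be sketched as a toy loop in plain Python. This is an illustration under stated assumptions, not DSPy's implementation: `propose` stands in for the reflection LM, `score` for the task metric, a "prompt" is just a set of keywords, and selection is simplified to "best mean validation score" rather than a true Pareto-based choice.

```python
import random


def gepa_like_loop(seed_prompt, propose, score, trainset, valset,
                   iterations=10, minibatch_size=3, rng=None):
    """Toy select / sample / reflect / propose / validate cycle (hypothetical helper)."""
    rng = rng or random.Random(0)

    def mean(xs):
        return sum(xs) / len(xs)

    # Each pool entry is (prompt, per-example validation scores).
    pool = [(seed_prompt, [score(seed_prompt, ex) for ex in valset])]
    for _ in range(iterations):
        # Select: simplified to the candidate with the best mean validation score.
        parent, _ = max(pool, key=lambda c: mean(c[1]))
        # Sample: score the parent on a small minibatch of training examples.
        batch = rng.sample(trainset, minibatch_size)
        before = [score(parent, ex) for ex in batch]
        if all(s == 1.0 for s in before):
            continue  # a perfect minibatch gives nothing to reflect on
        # Reflect + propose: hand the failures to the (stubbed) reflection step.
        failures = [ex for ex, s in zip(batch, before) if s < 1.0]
        child = propose(parent, failures)
        after = [score(child, ex) for ex in batch]
        # Validate: only a minibatch improvement earns a full validation run.
        if sum(after) > sum(before):
            pool.append((child, [score(child, ex) for ex in valset]))
    return pool


def toy_score(prompt, example):
    """1.0 if the 'prompt' (a set of keywords) covers the example, else 0.0."""
    return 1.0 if example in prompt else 0.0


def toy_propose(prompt, failures):
    """Stub reflection: add every failed keyword to the prompt."""
    return prompt | frozenset(failures)


seed = frozenset({"ticker"})
train = ["ticker", "amount", "currency", "acquirer", "target"]
val = ["ticker", "amount", "currency"]
pool = gepa_like_loop(seed, toy_propose, toy_score, train, val)
best_val = max(sum(scores) / len(scores) for _, scores in pool)
```

The gate on the minibatch score is what keeps the loop cheap: most proposals are rejected after three metric calls instead of a full validation pass.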
Progress Highlights from the Log
| Iteration | Module Updated | Subsample Score | Full Val Score | Pareto Score | Outcome |
|---|---|---|---|---|---|
| 0 (base) | - | - | 84.3% | 84.3% | Starting baseline |
| 1 | | 81.0% | - | - | No improvement, skipped |
| 2 | | 82.7% → 91.1% ✅ | 81.8% | 84.3% | Added to candidate pool (#1) |
| 3 | | 95.2% → 100% ✅ | 82.9% | 84.3% | Added to pool (#2) |
| 4 | | 61.9% | - | - | Regressed, skipped |
| 5 | - | - | - | - | Reflection failed ("No valid predictions") |
| 6 | | 78.0% | - | - | No improvement |
| 7 | | 85.7% | - | - | No improvement |
| 8 | - | 100% on subsample | - | - | Skipped (perfect subsample) |
| 9 | | 61.9% | - | - | No improvement |
| 10 | | 81.0% → 100% ✅ | 88.6% | 88.6% 🏆 | New best! Added to pool (#3) |
| 11 | | 90.5% | - | - | No improvement |
| 12+ | - | continuing… | - | - | Still running |
The Pareto Front Mechanism
GEPA maintains a Pareto front: the set of program candidates that are not dominated, meaning no other candidate scores at least as well on all 10 validation examples and strictly better on at least one. By iteration 10, the Pareto front had expanded to 4 candidates (indices 0–3), and program #3 became the new best with an 88.6% aggregate score, up from the 84.3% baseline (a gain of 4.3 percentage points).
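As a rough illustration of non-dominance, the sketch below computes a Pareto front from per-example scores. The score matrix is made up for the example (it is not taken from the log), and the helper names are hypothetical. The last helper shows one way to aggregate a front, the mean of the best score per example; whether the log's Pareto Score column is computed this way is an assumption.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every example and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        i for i, a in enumerate(scores)
        if not any(dominates(b, a) for j, b in enumerate(scores) if j != i)
    ]


def best_per_example_mean(scores):
    """Mean of the best score any candidate achieves on each example."""
    best = [max(column) for column in zip(*scores)]
    return sum(best) / len(best)


# Made-up scores: 3 candidates (rows) on 4 validation examples (columns).
scores = [
    [1.0, 0.0, 1.0, 0.0],   # candidate 0
    [0.0, 1.0, 1.0, 0.0],   # candidate 1
    [0.0, 0.0, 1.0, 0.0],   # candidate 2, dominated by candidate 0
]
front = pareto_front(scores)
aggregate = best_per_example_mean(scores)
```

Candidates 0 and 1 both stay on the front because each wins an example the other loses, which is why a candidate with a lower average can still be worth keeping.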
Notable Patterns
- "Failed to use structured output format, falling back to JSON mode.": This warning appears frequently because the BAMLAdapter couldn't enforce structured output for the given LM, so it falls back to JSON mode.
- Prompt evolution: The instructions get progressively more detailed and precise across iterations (e.g., the acquisition_extractor prompt grows from a few lines to multi-paragraph instructions with examples by iteration 10).
- "No valid predictions found" at iteration 5: The reflection LM couldn't generate useful error feedback, so no mutation was proposed.
- The optimization was budgeted for 1,195 metric calls (≈50 full evals) but was only ~10% complete when this log was captured.
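The budget figures can be checked with a few lines of arithmetic. The sketch assumes that one "full eval" means one metric call per training and validation example (14 + 10 = 24 calls); that reading is an assumption, though it does reproduce the ≈50 figure quoted above.

```python
train_size, val_size = 14, 10      # from the configuration table
budget = 1195                      # metric calls budgeted for the run

# Assumption: a full eval costs one metric call per train and validation example.
calls_per_full_eval = train_size + val_size
full_evals = budget / calls_per_full_eval

# "~10% complete" corresponds to roughly this many metric calls already spent.
calls_used_estimate = budget * 10 // 100

print(f"{calls_per_full_eval} calls per full eval, about {full_evals:.1f} full evals")
print(f"roughly {calls_used_estimate} of {budget} metric calls used")
```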
Bottom Line
GEPA improved the extraction system from 84.3% to 88.6% accuracy on the validation set by evolving the prompt instructions for the acquisition_extractor module, with the best update found at iteration 10. The optimization was still running when the log ended.