A GEPA Report from DSPy

June 2, 2026 - 3 minutes read - 576 words

What is GEPA?

GEPA (short for Genetic-Pareto) is a DSPy optimizer that automatically improves the prompts and instructions of a multi-module LLM program through reflective evolutionary search. It iteratively mutates module instructions, evaluates the changes, and keeps the best-performing candidates on a Pareto front.

What’s Happening in This Run

This file captures a GEPA optimization run on a financial news extraction system that classifies M&A (merger/acquisition) articles and extracts structured data from them.

The Program Being Optimized

The system (defined in gepa_optimize.py and extract.py) has two main stages, implemented as three modules:

classifier — Determines if an article is a merger, acquisition, or Other (rumored/potential/non-deal)
merger_extractor / acquisition_extractor — Extracts structured fields like company names, tickers, deal amounts, and currencies
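
The classify-then-extract flow can be pictured with a small routing sketch. This is not the code from extract.py, which the log does not show; the `Deal` schema, the `run_pipeline` function, and the stub callables are hypothetical names used only to illustrate how a classifier label might select an extractor.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Deal:
    """Fields named in the text above; the exact schema is an assumption."""
    kind: str                       # "merger" or "acquisition"
    companies: list[str]
    tickers: list[str]
    amount: Optional[float] = None
    currency: Optional[str] = None


def run_pipeline(
    article: str,
    classifier: Callable[[str], str],
    extractors: dict[str, Callable[[str], Deal]],
) -> Optional[Deal]:
    """Classify first, then call the extractor registered for that label."""
    label = classifier(article)
    extractor = extractors.get(label)  # "Other" has no extractor registered
    return extractor(article) if extractor else None
```

In the real program each callable would presumably be an LLM-backed DSPy module, which is what gives GEPA three separate instruction prompts to evolve.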

Key Configuration

Setting	Value
Task LM	`deepseek/deepseek-v4-flash` (generates extractions)
Reflection LM	`deepseek/deepseek-v4-pro` (generates improved instructions)
Optimizer mode	`auto="light"`
Threads	32
Training set	14 articles (IDs: 2,5,6,1,7,4,10,9,11,3,12,13,8,14)
Validation set	10 articles (IDs: 33,29,24,34,30,25,35,31,26,36)
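
For orientation, here is roughly how these settings might map onto DSPy calls. It is a sketch under assumptions, not the contents of gepa_optimize.py: the constant names and the `build_optimizer` helper are invented for illustration, and the metric is left as a parameter because the log does not show it. `dspy` is imported lazily so the constants can be inspected without the library installed.

```python
TASK_LM = "deepseek/deepseek-v4-flash"        # generates extractions
REFLECTION_LM = "deepseek/deepseek-v4-pro"    # rewrites instructions

TRAIN_IDS = [2, 5, 6, 1, 7, 4, 10, 9, 11, 3, 12, 13, 8, 14]
VAL_IDS = [33, 29, 24, 34, 30, 25, 35, 31, 26, 36]

GEPA_KWARGS = {"auto": "light", "num_threads": 32}


def build_optimizer(metric):
    """Hypothetical helper: wire the table's settings into dspy.GEPA."""
    import dspy  # lazy import; requires the dspy package

    dspy.configure(lm=dspy.LM(TASK_LM))
    return dspy.GEPA(
        metric=metric,
        reflection_lm=dspy.LM(REFLECTION_LM),
        **GEPA_KWARGS,
    )
```

The optimizer would then be run with something like `build_optimizer(metric).compile(program, trainset=..., valset=...)`, where the two sets hold the articles listed above.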

The Optimization Loop

Each iteration follows this cycle:

Select: Pick a parent candidate from the Pareto front (the set of non-dominated candidates)
Sample: Evaluate it on a small 3-example subsample (in DSPy's GEPA the subsample is drawn from the training set; the validation set is reserved for Pareto tracking)
Reflect: Use the reflection_lm (DeepSeek V4 Pro) to analyze errors and propose improved instructions for an underperforming module
Propose: Generate a new candidate program with the updated prompt and re-score it on the same subsample
Validate: If the subsample score improves, run a full 10-example validation and add the candidate to the pool
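
The five steps above can be sketched as a toy loop in plain Python. This is a simplified illustration, not DSPy's implementation: `evolve`, `toy_score`, and `toy_propose` are made-up names, the parent is chosen uniformly from the front, and the "reflection" step is replaced by a trivial mutation so the example runs without any LM.

```python
import random
from statistics import mean


def evolve(seed, propose, score, trainset, valset, iterations,
           minibatch_size=3, rng=None):
    """Toy GEPA-style loop. Returns (best_candidate, best_val_score, pool_size)."""
    rng = rng or random.Random(0)
    pool = [seed]
    val_scores = [[score(seed, ex) for ex in valset]]  # one row per candidate

    for _ in range(iterations):
        # Select: any candidate holding the best score on some validation example.
        best = [max(col) for col in zip(*val_scores)]
        front = [i for i, row in enumerate(val_scores)
                 if any(s == b for s, b in zip(row, best))]
        parent = pool[rng.choice(front)]

        # Sample: score the parent on a small training subsample.
        batch = rng.sample(trainset, minibatch_size)
        before = mean(score(parent, ex) for ex in batch)

        # Reflect + Propose: build a child and re-score it on the same subsample.
        child = propose(parent, batch)
        after = mean(score(child, ex) for ex in batch)

        # Validate: only improved children earn a full validation pass.
        if after > before:
            pool.append(child)
            val_scores.append([score(child, ex) for ex in valset])

    totals = [mean(row) for row in val_scores]
    winner = max(range(len(pool)), key=totals.__getitem__)
    return pool[winner], totals[winner], len(pool)


# Toy problem (entirely made up): a candidate is an integer threshold,
# and the "right" threshold is 5.
def toy_score(threshold, example):
    return 1.0 if (example >= threshold) == (example >= 5) else 0.0


def toy_propose(threshold, batch):
    return threshold + 1  # stand-in for an LM-written instruction rewrite


examples = list(range(10))
best, best_score, pool_size = evolve(
    0, toy_propose, toy_score, examples, examples, iterations=50
)
```

The gate on the subsample score is what keeps the loop cheap: most proposals are rejected after three metric calls instead of a full validation pass, which matches the many "skipped" rows in the log.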

Progress Highlights from the Log

Iteration	Module Updated	Subsample Score	Full Val Score	Pareto Score	Outcome
0 (base)	—	—	84.3%	84.3%	Starting baseline
1	`classifier`	81.0%	—	—	No improvement, skipped
2	`merger_extractor`	82.7% → 91.1% ✅	81.8%	84.3%	Added to candidate pool (#1)
3	`acquisition_extractor`	95.2% → 100% ✅	82.9%	84.3%	Added to pool (#2)
4	`classifier`	61.9%	—	—	Regressed, skipped
5	—	—	—	—	Reflection failed ("No valid predictions")
6	`acquisition_extractor`	78.0%	—	—	No improvement
7	`classifier`	85.7%	—	—	No improvement
8	—	100% on subsample	—	—	Skipped (perfect subsample)
9	`merger_extractor`	61.9%	—	—	No improvement
10	`acquisition_extractor`	81.0% → 100% ✅	88.6%	88.6% 🏆	New best! Added to pool (#3)
11	`classifier`	90.5%	—	—	No improvement
12+	—	continuing…	—	—	Still running

The Pareto Front Mechanism

GEPA maintains a Pareto front: a set of program candidates in which no candidate is outperformed by another across all 10 validation examples simultaneously. By iteration 10, the Pareto front had expanded to 4 candidates (indices 0–3), and program #3 became the new best with an 88.6% aggregate score, up from the 84.3% baseline.
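
The non-dominance idea can be made concrete with a short sketch. It implements the textbook definition (a candidate is dropped only if another is at least as good on every example and strictly better on one), which may differ in detail from the bookkeeping DSPy uses internally. The function names and the score matrix are illustrative and not taken from the log.

```python
def dominates(a, b):
    """True if score row `a` is >= `b` everywhere and > somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        i for i, row in enumerate(scores)
        if not any(dominates(other, row)
                   for j, other in enumerate(scores) if j != i)
    ]


def aggregate(row):
    """Mean per-example score, the single number reported per candidate."""
    return sum(row) / len(row)


# Made-up per-example scores for three candidates on three examples.
scores = [
    [1.0, 0.0, 1.0],  # candidate 0: wins on example 0
    [0.0, 1.0, 1.0],  # candidate 1: wins on example 1
    [0.0, 0.0, 1.0],  # candidate 2: never better than candidate 0
]
front = pareto_front(scores)
```

Note that a candidate with a lower aggregate score can still sit on the front if it wins on an example the others miss, which is why a program can be kept even when its full validation score is below the baseline.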

Notable Patterns

"Failed to use structured output format, falling back to JSON mode.": This warning appears frequently because the BAMLAdapter couldn’t enforce a structured-output schema for the given LM, so the request is retried in plain JSON mode and the fields are parsed from the returned JSON object.
Prompt evolution — You can see the instructions getting progressively more detailed and precise across iterations (e.g., the acquisition_extractor prompt goes from a few lines to multi-paragraph instructions with examples by iteration 10).
"No valid predictions found" at iteration 5 — The reflection LM couldn’t generate useful error feedback, so no mutation was proposed.
The optimization was budgeted for 1,195 metric calls (≈50 full evals) but was only ~10% complete when this log was captured.
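
The budget figures can be checked with quick arithmetic. The sketch assumes that one "full eval" means one pass over all 24 articles (14 training plus 10 validation); the log does not state this, but it is the reading under which 1,195 calls comes out to roughly 50.

```python
BUDGET = 1195          # total metric calls allotted
TRAIN, VAL = 14, 10    # dataset sizes from the configuration

# Assumption: one "full eval" scores every training and validation article once.
full_evals = BUDGET / (TRAIN + VAL)

# Roughly 10% of the budget, using integer arithmetic.
calls_used_estimate = BUDGET * 10 // 100
```

Under that assumption the run had spent on the order of 119 metric calls, about five full evaluations' worth, by the time the log was captured.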

Bottom Line

GEPA improved the extraction system's aggregate validation score from 84.3% to 88.6% by evolving the prompt instructions for the acquisition_extractor module, with the best update found at iteration 10. The optimization was still running when the log ended.