● 08 · recipe

Column-aware values with --engine spec

intermediate 9 min LLMquality

Let an LLM write a per-column generator spec once, cache it forever, and get semantically correct values for unusual column names.

The default heuristic engine covers ~80% of columns correctly via pattern matching — it knows email means an email address and phone means a phone number. For domain-specific columns (risk_tier, account_lifecycle_state, coverage_band) the heuristic falls back to generic strings.

Problem: you need semantically correct values for columns that don’t match any built-in pattern, without writing custom generators by hand.

How —engine spec works

When you run with --engine spec, DBSprout sends column names and types (never row values) to an LLM once per schema. The LLM returns a DataSpec — a structured JSON document describing the generator, value distribution, and constraints for each column. That spec is cached under .dbsprout/cache/ keyed by the schema hash; subsequent runs read from cache without calling the LLM again.

Prerequisites

  • Python 3.10+ with dbsprout installed
  • An LLM provider configured — the embedded Qwen2.5-1.5B model (zero config, offline) or a cloud provider via OPENAI_API_KEY / ANTHROPIC_API_KEY
  • A schema initialised with dbsprout init

Steps

1. Initialise the schema

Point DBSprout at your database or schema file:

dbsprout init --db postgresql://localhost/myapp
# or a schema file
dbsprout init --file schema.sql

2. Generate with the spec engine

dbsprout generate --engine spec --rows 500

On the first run you will see a one-time message:

Generating DataSpec via LLM (embedded model)…
Cached → .dbsprout/cache/spec_a3f9c12.json

All subsequent runs skip the LLM call entirely.

3. Verify the cached spec

Inspect what the LLM produced:

cat .dbsprout/cache/spec_*.json | python3 -m json.tool | head -60

Each column entry looks like:

{
  "column": "coverage_band",
  "generator": "choice",
  "values": ["bronze", "silver", "gold", "platinum"],
  "distribution": "uniform"
}

4. Tune a column (optional)

If the spec chose wrong values for a column, edit the cached JSON directly and re-run — DBSprout will use your edited spec without calling the LLM again.

5. Force a spec refresh after a schema change

When you add or rename columns, invalidate the cache:

dbsprout generate --engine spec --refresh-spec

This regenerates the spec for the changed columns only.

Result

Unusual column names now produce domain-correct values — risk_tier generates ["low", "medium", "high", "critical"] instead of random strings, coverage_band generates insurance tier names, account_lifecycle_state generates ["trial", "active", "churned", "paused"]. The LLM call happens once; every subsequent run is as fast as the heuristic engine.

For more on provider configuration and privacy tiers, see LLM configuration.