Core Concepts
DBSprout is built around a handful of ideas that, once understood, make every other option and command obvious. This page walks through them all.
The five-stage pipeline
Every dbsprout generate run travels through five discrete stages. Each stage
is independently pluggable, and each one produces a typed artifact consumed by
the next.
SCHEMA INPUT → SPEC GENERATION → DATA GENERATION → QUALITY CHECK → OUTPUT
Schema Input reads your database or schema file and produces a
DatabaseSchema — a unified, dialect-neutral description of every table,
column, relationship, and constraint. Seven parsers cover live databases
(PostgreSQL, MySQL, SQLite, MSSQL via SQLAlchemy) and static files (SQL DDL,
DBML, Mermaid, PlantUML, Prisma, Django models). Regardless of source, the
output is the same frozen Pydantic model.
Spec Generation maps each column to a generation strategy. The heuristic
engine does this with regex and fuzzy column-name matching and is accurate for
roughly 80% of real-world schemas — with no network calls and in milliseconds.
The LLM engine hands the full DatabaseSchema to a language model once and
gets back a DataSpec: per-column generator configs, statistical distributions,
correlation rules, and derived-column formulas. That result is cached by schema
hash, so you pay the LLM cost exactly once per schema version.
Data Generation walks the dependency graph, generates values table-by-table in topological order (parents before children), enforces every constraint at generation time, and produces in-memory row batches. FK columns always sample from real parent primary keys — values are never invented.
Quality Validation runs four metric families across the generated data:
integrity (FK referential correctness, uniqueness, NOT NULL, CHECK violations),
fidelity (statistical distance to a reference sample when provided), detection
(can a classifier tell synthetic from real?), and ML utility (can a model
trained on synthetic data perform on real data?). Validation can run as a
post-generation check or as a standalone dbsprout validate call.
Output serialises the validated row batches to the format you need. The
writers cover SQL INSERT, CSV, JSON, JSONL, Parquet, PostgreSQL COPY
(sustaining 100 K+ rows/sec), MySQL LOAD DATA, and UPSERT variants. Direct
insertion into a live target database is also supported via --output-format direct.
The schema model
DBSprout represents every schema as a tree of immutable Pydantic v2 models
(all declared frozen=True). The key types are:
| Model | What it holds |
|---|---|
DatabaseSchema | Full schema: ordered table list, enum types, SQL dialect, a content hash used for caching |
TableSchema | Columns, primary key columns, foreign keys, indexes; helpers is_junction_table and fk_parent_tables |
ColumnSchema | Name, normalised type, nullable, PK flag, unique flag, CHECK expression, allowed enum values |
ForeignKeySchema | Source columns, referenced table and columns, ON DELETE action, deferrability |
When DBSprout introspects a live database it uses SQLAlchemy’s Inspector API,
which normalises type names across dialects. When it parses a static file, the
parser produces the same model tree. Downstream code only ever sees
DatabaseSchema — it is independent of how the schema arrived.
Because all models are frozen, every transformation returns a new copy
(model_copy(update=...)) rather than mutating in place. This makes the
pipeline safe for concurrent use and straightforward to debug.
Foreign-key resolution
Generating data with foreign keys requires that parent rows exist before child rows reference them. DBSprout handles this automatically:
- Build a directed dependency graph — each table is a node, each FK is a directed edge pointing from child to parent.
- Separate self-referencing FKs (a table that references itself) for special handling.
- Topologically sort the graph via Python’s stdlib
graphlib.TopologicalSorter. The result is a generation order where every parent appears before its children. - If cycles exist (table A → B → A), run Tarjan’s strongly-connected-component algorithm (via NetworkX) to identify the cycle, find a nullable FK within it, and defer that edge.
- Execute a two-pass insert: the first pass generates all rows with the
deferred FK set to
NULL; the second pass updates those rows with the correct parent PK values once the parents have been inserted.
FK columns are always populated by sampling from the real primary key pool of the parent table. DBSprout never asks a language model to invent FK values; doing so would silently break referential integrity.
Self-referencing FKs (e.g. an employees.manager_id that points to
employees.id) are handled as a post-generation update step, where leaf nodes
in the self-reference tree are assigned before parents.
The four generation engines
All engines produce rows that satisfy the schema’s constraints. The difference is in how they decide what value to put in each cell.
Heuristic (default)
The heuristic engine matches column names and types against a curated library of
Mimesis generators using regex and fuzzy matching. A column named email gets
an email address; first_name gets a realistic first name; created_at gets a
timestamp. It runs at 100 K+ rows/sec, needs no model, no API key, and no
internet connection. Use it for fast fixture generation, CI pipelines, and any
schema where standard naming conventions are followed.
Spec-driven
The spec engine asks a language model to read the full DatabaseSchema and
produce a DataSpec — a structured document describing, for each column, which
Mimesis generator to use, any distribution parameters, correlations with other
columns, and derivation formulas for computed columns. The model runs once
per schema version; the result is cached. After that, row generation is
deterministic and model-free. Use the spec engine when the heuristic engine
produces values that are semantically wrong for your domain, or when you need
cross-column correlations.
Statistical
The statistical engine fits distributions to a real data sample you supply with
--reference-data. It learns the marginal distribution of each numeric column,
the empirical frequencies of categorical columns, and simple correlations. Use
it when you have a small real dataset and want synthetic data with the same
statistical shape.
Finetuned
The finetuned engine loads a QLoRA adapter (trained with dbsprout train) on
top of the embedded GGUF base model. It is the most domain-aware engine but also
the most expensive to set up. Use it when spec-driven generation still produces
semantically implausible values for a highly specialised domain.
The hybrid “LLM-as-Architect” pattern
DBSprout’s LLM usage is deliberately narrow. The model acts as an architect —
it reads the schema once, thinks about what data makes sense, and writes a
DataSpec. After that, the model is not involved. Every row is generated by
Mimesis and NumPy following the spec’s instructions.
This has two important consequences. First, the LLM cost is constant regardless
of how many rows you generate — generating 10 rows or 10 million rows costs the
same number of LLM tokens. Second, generation is deterministic and auditable:
the spec is stored in .dbsprout/cache/ as a human-readable JSON file you can
inspect, edit, and version-control.
The spec is keyed by DatabaseSchema.schema_hash(), a SHA-256 of the schema
structure. If you add a column and re-run dbsprout init, the hash changes and
the spec is regenerated. If nothing changed, the cached spec is reused.
Determinism
DBSprout defaults to --seed 42. Given the same schema snapshot and the same
seed, every run produces byte-for-byte identical output across machines and
across time. This matters for:
- CI pipelines — fixture files can be committed and diffed; unexpected changes fail the build.
- Debugging — reproduce a failure by re-running with the same seed.
- Collaboration — teammates generate the same fixtures without coordinating.
Seeding works at the per-cell level via a hash of the table name, column name, row index, and the global seed. This means adding a new table does not shift the values generated for existing tables.
Constraint enforcement
DBSprout enforces schema constraints during generation, not as a post-hoc check.
- UNIQUE — a deduplication set is maintained per column (or per composite unique constraint). If a generated value collides, a new one is generated.
- NOT NULL — non-nullable columns with no other generator instruction always
produce a value; nullable columns produce
NULLat the configured rate. - CHECK — constraints expressed as SQL expressions are modelled as AC-3 arc-consistency problems and solved during generation. Generated values satisfy the CHECK before they enter the row batch.
- Enums — columns with an enum type sample uniformly from the allowed values
list recorded in
ColumnSchema.enum_values.
After generation, dbsprout validate re-checks all of the above against the
final output and reports any violations. A clean validate means 100% referential
and structural integrity.
Privacy gradient
DBSprout has four privacy tiers, controlled by --privacy or [privacy] tier
in dbsprout.toml. Every tier is logged to the audit trail visible via
dbsprout audit.
| Tier | What happens |
|---|---|
local | All generation runs on your machine; no data leaves; default |
redacted | If you supply reference data, PII columns are detected and stripped before any processing |
cloud | The schema and DataSpec are sent to a cloud LLM provider; row data stays local |
training | Reference data is serialised for fine-tuning; full audit trail mandatory |
The default local tier means DBSprout works with no internet connection, no
API key, and no account. You can move up the gradient incrementally by
installing the relevant extras ([privacy] for redaction, [cloud] for cloud
LLM providers).
Plugin system
Every major component in the pipeline is replaceable via Python entry points. Install a package that declares the right entry-point group and DBSprout will discover it automatically at startup.
| Entry-point group | What it replaces |
|---|---|
dbsprout.parsers | Schema parser (add a new file format) |
dbsprout.generators | Row generator (add a new generation engine) |
dbsprout.outputs | Output format writer |
dbsprout.llm_providers | LLM backend |
dbsprout.migration_frameworks | Migration framework integration |
Run dbsprout plugins list to see discovered plugins and their status. Run
dbsprout plugins check <group>:<name> to validate that a specific plugin
loads and conforms to its interface.
Where next
- Schema Input guide — all seven parsers, with examples
- Data Generation guide — engine comparison and configuration
- CLI Reference — every flag explained
- Configuration Reference —
dbsprout.tomlkeys