● Getting started

Core Concepts

10 min

DBSprout is built around a handful of ideas that, once understood, make every other option and command obvious. This page walks through them all.

The five-stage pipeline

Every dbsprout generate run travels through five discrete stages. Each stage is independently pluggable, and each one produces a typed artifact consumed by the next.

SCHEMA INPUT → SPEC GENERATION → DATA GENERATION → QUALITY CHECK → OUTPUT

Schema Input reads your database or schema file and produces a DatabaseSchema — a unified, dialect-neutral description of every table, column, relationship, and constraint. Seven parsers cover live databases (PostgreSQL, MySQL, SQLite, MSSQL via SQLAlchemy) and static files (SQL DDL, DBML, Mermaid, PlantUML, Prisma, Django models). Regardless of source, the output is the same frozen Pydantic model.

Spec Generation maps each column to a generation strategy. The heuristic engine does this with regex and fuzzy column-name matching and is accurate for roughly 80% of real-world schemas — with no network calls and in milliseconds. The LLM engine hands the full DatabaseSchema to a language model once and gets back a DataSpec: per-column generator configs, statistical distributions, correlation rules, and derived-column formulas. That result is cached by schema hash, so you pay the LLM cost exactly once per schema version.

Data Generation walks the dependency graph, generates values table-by-table in topological order (parents before children), enforces every constraint at generation time, and produces in-memory row batches. FK columns always sample from real parent primary keys — values are never invented.

Quality Validation runs four metric families across the generated data: integrity (FK referential correctness, uniqueness, NOT NULL, CHECK violations), fidelity (statistical distance to a reference sample when provided), detection (can a classifier tell synthetic from real?), and ML utility (can a model trained on synthetic data perform on real data?). Validation can run as a post-generation check or as a standalone dbsprout validate call.

Output serialises the validated row batches to the format you need. The writers cover SQL INSERT, CSV, JSON, JSONL, Parquet, PostgreSQL COPY (sustaining 100 K+ rows/sec), MySQL LOAD DATA, and UPSERT variants. Direct insertion into a live target database is also supported via --output-format direct.

The schema model

DBSprout represents every schema as a tree of immutable Pydantic v2 models (all declared frozen=True). The key types are:

ModelWhat it holds
DatabaseSchemaFull schema: ordered table list, enum types, SQL dialect, a content hash used for caching
TableSchemaColumns, primary key columns, foreign keys, indexes; helpers is_junction_table and fk_parent_tables
ColumnSchemaName, normalised type, nullable, PK flag, unique flag, CHECK expression, allowed enum values
ForeignKeySchemaSource columns, referenced table and columns, ON DELETE action, deferrability

When DBSprout introspects a live database it uses SQLAlchemy’s Inspector API, which normalises type names across dialects. When it parses a static file, the parser produces the same model tree. Downstream code only ever sees DatabaseSchema — it is independent of how the schema arrived.

Because all models are frozen, every transformation returns a new copy (model_copy(update=...)) rather than mutating in place. This makes the pipeline safe for concurrent use and straightforward to debug.

Foreign-key resolution

Generating data with foreign keys requires that parent rows exist before child rows reference them. DBSprout handles this automatically:

  1. Build a directed dependency graph — each table is a node, each FK is a directed edge pointing from child to parent.
  2. Separate self-referencing FKs (a table that references itself) for special handling.
  3. Topologically sort the graph via Python’s stdlib graphlib.TopologicalSorter. The result is a generation order where every parent appears before its children.
  4. If cycles exist (table A → B → A), run Tarjan’s strongly-connected-component algorithm (via NetworkX) to identify the cycle, find a nullable FK within it, and defer that edge.
  5. Execute a two-pass insert: the first pass generates all rows with the deferred FK set to NULL; the second pass updates those rows with the correct parent PK values once the parents have been inserted.

FK columns are always populated by sampling from the real primary key pool of the parent table. DBSprout never asks a language model to invent FK values; doing so would silently break referential integrity.

Self-referencing FKs (e.g. an employees.manager_id that points to employees.id) are handled as a post-generation update step, where leaf nodes in the self-reference tree are assigned before parents.

The four generation engines

All engines produce rows that satisfy the schema’s constraints. The difference is in how they decide what value to put in each cell.

Heuristic (default)

The heuristic engine matches column names and types against a curated library of Mimesis generators using regex and fuzzy matching. A column named email gets an email address; first_name gets a realistic first name; created_at gets a timestamp. It runs at 100 K+ rows/sec, needs no model, no API key, and no internet connection. Use it for fast fixture generation, CI pipelines, and any schema where standard naming conventions are followed.

Spec-driven

The spec engine asks a language model to read the full DatabaseSchema and produce a DataSpec — a structured document describing, for each column, which Mimesis generator to use, any distribution parameters, correlations with other columns, and derivation formulas for computed columns. The model runs once per schema version; the result is cached. After that, row generation is deterministic and model-free. Use the spec engine when the heuristic engine produces values that are semantically wrong for your domain, or when you need cross-column correlations.

Statistical

The statistical engine fits distributions to a real data sample you supply with --reference-data. It learns the marginal distribution of each numeric column, the empirical frequencies of categorical columns, and simple correlations. Use it when you have a small real dataset and want synthetic data with the same statistical shape.

Finetuned

The finetuned engine loads a QLoRA adapter (trained with dbsprout train) on top of the embedded GGUF base model. It is the most domain-aware engine but also the most expensive to set up. Use it when spec-driven generation still produces semantically implausible values for a highly specialised domain.

The hybrid “LLM-as-Architect” pattern

DBSprout’s LLM usage is deliberately narrow. The model acts as an architect — it reads the schema once, thinks about what data makes sense, and writes a DataSpec. After that, the model is not involved. Every row is generated by Mimesis and NumPy following the spec’s instructions.

This has two important consequences. First, the LLM cost is constant regardless of how many rows you generate — generating 10 rows or 10 million rows costs the same number of LLM tokens. Second, generation is deterministic and auditable: the spec is stored in .dbsprout/cache/ as a human-readable JSON file you can inspect, edit, and version-control.

The spec is keyed by DatabaseSchema.schema_hash(), a SHA-256 of the schema structure. If you add a column and re-run dbsprout init, the hash changes and the spec is regenerated. If nothing changed, the cached spec is reused.

Determinism

DBSprout defaults to --seed 42. Given the same schema snapshot and the same seed, every run produces byte-for-byte identical output across machines and across time. This matters for:

  • CI pipelines — fixture files can be committed and diffed; unexpected changes fail the build.
  • Debugging — reproduce a failure by re-running with the same seed.
  • Collaboration — teammates generate the same fixtures without coordinating.

Seeding works at the per-cell level via a hash of the table name, column name, row index, and the global seed. This means adding a new table does not shift the values generated for existing tables.

Constraint enforcement

DBSprout enforces schema constraints during generation, not as a post-hoc check.

  • UNIQUE — a deduplication set is maintained per column (or per composite unique constraint). If a generated value collides, a new one is generated.
  • NOT NULL — non-nullable columns with no other generator instruction always produce a value; nullable columns produce NULL at the configured rate.
  • CHECK — constraints expressed as SQL expressions are modelled as AC-3 arc-consistency problems and solved during generation. Generated values satisfy the CHECK before they enter the row batch.
  • Enums — columns with an enum type sample uniformly from the allowed values list recorded in ColumnSchema.enum_values.

After generation, dbsprout validate re-checks all of the above against the final output and reports any violations. A clean validate means 100% referential and structural integrity.

Privacy gradient

DBSprout has four privacy tiers, controlled by --privacy or [privacy] tier in dbsprout.toml. Every tier is logged to the audit trail visible via dbsprout audit.

TierWhat happens
localAll generation runs on your machine; no data leaves; default
redactedIf you supply reference data, PII columns are detected and stripped before any processing
cloudThe schema and DataSpec are sent to a cloud LLM provider; row data stays local
trainingReference data is serialised for fine-tuning; full audit trail mandatory

The default local tier means DBSprout works with no internet connection, no API key, and no account. You can move up the gradient incrementally by installing the relevant extras ([privacy] for redaction, [cloud] for cloud LLM providers).

Plugin system

Every major component in the pipeline is replaceable via Python entry points. Install a package that declares the right entry-point group and DBSprout will discover it automatically at startup.

Entry-point groupWhat it replaces
dbsprout.parsersSchema parser (add a new file format)
dbsprout.generatorsRow generator (add a new generation engine)
dbsprout.outputsOutput format writer
dbsprout.llm_providersLLM backend
dbsprout.migration_frameworksMigration framework integration

Run dbsprout plugins list to see discovered plugins and their status. Run dbsprout plugins check <group>:<name> to validate that a specific plugin loads and conforms to its interface.

Where next