Generate synthetic data from a database

Generating synthetic data from an existing database means using that database's schema and data patterns as a blueprint to produce net-new records that preserve its structure and relationships — without copying any real values. A tool connects to the source, reads the tables, columns, types, and foreign keys — and, in value-aware approaches, the actual value distributions — then generates statistically faithful data at whatever volume you need. Done well, referential integrity stays intact across tables, so the synthetic database behaves like the original while containing none of its sensitive records.

What "seeding from an existing database" actually means

Seeding from an existing database means pointing a generator at a real database and using it as a reference for producing new data, rather than copying or masking the rows already there. The word seed is close to literal: the source database is what the generator grows new data from — its structure, and optionally its statistical patterns, guide what gets produced, but the output is a fresh set of records that never existed. That distinction separates seeding from the two things people most often confuse it with. Masking transforms the real rows in place; copying duplicates them outright. Seeding does neither — it reads the source and writes something new.

Two flavors of seeding sit underneath that definition. Schema-only seeding reads just the structure of the source — the tables, columns, types, and the relationships between them — and generates values from rules or generic patterns, learning nothing about the real data inside. Value-aware modeling goes a step further: it profiles the actual data, capturing how values are distributed and how columns correlate, then samples new records that reproduce those patterns. Schema-only is faster and carries no information about real individuals at all; value-aware is more faithful to how the source behaves, at the cost of reading more of it.

The goal in both cases is the same: a dataset that behaves like the source while containing none of its records. A well-seeded database has the same tables and relationships, values that fall in the same ranges, and enough volume to stand in for production, yet no row traces back to a real person or transaction. If you're new to the concept, it helps to understand what synthetic data is first. And while this page covers one specific source, a database you already have, synthetic data spans a much wider range of software and AI workflows.

Reading the schema: tables, types, and the foreign-key problem

The hard part of seeding a database isn't generating values that look right in a single column — it's keeping the relationships between tables intact. Before a generator produces anything, it reads the source's structure to learn what a valid record looks like and how records connect. That structure includes:

Tables and columns — the entities the database tracks and the attributes of each.
Data types and formats — whether a field is an integer, a timestamp, a fixed-length code, or free text.
Nullability and constraints — which fields are required, which must be unique, which carry check constraints or defaults.
Primary keys — the column or columns that uniquely identify each row.
Foreign keys — the references that tie a row in one table to a row in another, the backbone of a relational database.
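As a rough illustration of that structure-reading step, the sketch below introspects a small SQLite database using its built-in PRAGMA statements. The `read_schema` helper and the two example tables are hypothetical, invented for this example; other engines expose the same information through their own catalogs, and nothing here is meant to describe how any particular tool does it.

```python
import sqlite3

def read_schema(conn):
    """Collect columns and foreign keys for every user table in a SQLite database."""
    schema = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": r[1], "type": r[2], "nullable": not r[3], "primary_key": bool(r[5])}
            for r in conn.execute(f"PRAGMA table_info('{table}')")
        ]
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        foreign_keys = [
            {"column": r[3], "ref_table": r[2], "ref_column": r[4]}
            for r in conn.execute(f"PRAGMA foreign_key_list('{table}')")
        ]
        schema[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return schema

# Hypothetical two-table source, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        placed_at TEXT
    );
""")
schema = read_schema(conn)
for table, info in schema.items():
    print(table, [c["name"] for c in info["columns"]], info["foreign_keys"])
```

The foreign-key entries are the part a generator needs most: they say which tables must exist before others can be filled.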

Those foreign keys are where seeding succeeds or fails. Referential integrity is the guarantee that every reference points to something real: an order row that names customer 4471 is only valid if customer 4471 actually exists in the customers table. A relational database is, in effect, a set of cross-referenced spreadsheets — an order points to a customer by ID, a line item points to an order, a payment points to an invoice. The whole structure depends on those pointers resolving to real rows.

This is exactly what naive generation breaks. Generate each table independently — the approach most general-purpose faker libraries take — and you get orphaned keys: child rows that reference parents that were never created. Regenerate the orders table and the customers table on their own, and you end up with orders addressed to customers who don't exist and line items for orders that were never placed. The data survives a glance at any single table and falls apart the moment a query joins two of them. Avoiding that is the central problem any serious database-seeding tool has to solve, and it's one dimension of the broader set of generation approaches that produce synthetic data.
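The difference between independent and relationship-aware generation can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (two tables, a single foreign key, made-up function names), not a description of any specific library or product: the naive version invents `customer_id` values on its own, while the parent-first version draws them only from customers that were actually generated.

```python
import random

def generate_naive(n_customers, n_orders, rng):
    """Generate each table independently: customer_id is just a plausible integer."""
    customers = [{"id": rng.randrange(1, 10_000)} for _ in range(n_customers)]
    orders = [{"id": i, "customer_id": rng.randrange(1, 10_000)}
              for i in range(1, n_orders + 1)]
    return customers, orders

def generate_parent_first(n_customers, n_orders, rng):
    """Generate parents first, then draw every child key from the parent keys."""
    customers = [{"id": i} for i in range(1, n_customers + 1)]
    customer_ids = [c["id"] for c in customers]
    orders = [{"id": i, "customer_id": rng.choice(customer_ids)}
              for i in range(1, n_orders + 1)]
    return customers, orders

def orphaned(children, fk, parents, pk="id"):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {p[pk] for p in parents}
    return [c for c in children if c[fk] not in parent_keys]

rng = random.Random(0)
naive_customers, naive_orders = generate_naive(50, 200, rng)
safe_customers, safe_orders = generate_parent_first(50, 200, rng)

naive_orphans = orphaned(naive_orders, "customer_id", naive_customers)
safe_orphans = orphaned(safe_orders, "customer_id", safe_customers)
print("naive orphans:", len(naive_orphans))
print("parent-first orphans:", len(safe_orphans))
```

With more than two tables, the same idea generalizes to generating tables in dependency order, parents before children, and the `orphaned` check doubles as a validation step on any generated output.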

Modeling real values vs. generating from structure alone: fidelity and its limits

How faithful the result is depends on which flavor of seeding you use. Schema-only seeding is exactly as realistic as the rules you give it: tell the generator a column holds U.S. ZIP codes and it produces valid-looking ZIPs, but it won't know that most of your customers cluster in three states unless you encode that yourself. Value-aware modeling closes that gap by learning the real distributions and correlations from the source — the actual frequencies, the way one column moves with another — so the output reflects how the data behaves, not just what shape it takes.
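To make the contrast concrete for a single column, here is a small sketch. The state list, the source frequencies, and both function names are invented for illustration. The schema-only version knows only the column's domain and samples it uniformly; the value-aware version samples in proportion to the frequencies observed in the source. It models one column's marginal distribution only, so it says nothing about correlations between columns.

```python
import random
from collections import Counter

# Hypothetical column domain: what a schema-only rule might be told.
STATE_DOMAIN = ["CA", "NY", "TX", "WA", "OH", "FL"]

def schema_only_values(n, rng):
    """Sample uniformly from the declared domain, knowing nothing about the source."""
    return [rng.choice(STATE_DOMAIN) for _ in range(n)]

def value_aware_values(source_values, n, rng):
    """Sample in proportion to the frequencies observed in the source column."""
    counts = Counter(source_values)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)

rng = random.Random(42)
# Hypothetical source column in which most customers cluster in three states.
source_states = ["CA"] * 600 + ["NY"] * 250 + ["TX"] * 120 + ["WA"] * 30

uniform = schema_only_values(10_000, rng)
modeled = value_aware_values(source_states, 10_000, rng)

for label, sample in (("schema-only", uniform), ("value-aware", modeled)):
    share = Counter(sample)["CA"] / len(sample)
    print(f"{label}: CA share = {share:.2f}")
```

The schema-only output is valid but flat; the value-aware output reproduces the clustering because it read the real frequencies, which is also why it carries more privacy responsibility.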

That higher fidelity has a real ceiling, and it's worth understanding before expecting perfection from a single pass. Modeling an entire multi-table database through one statistical engine runs into the curse of dimensionality: as you add columns and tables, the number of possible combinations of values explodes, and no realistic sample contains enough examples to model every joint relationship accurately. An engine that tries to capture a whole database at once tends to get the common cases right and the interactions among rare values wrong. This is why modern tools don't promise whole-database fidelity in one shot — they profile each table, plan the generation, and validate the output table by table, rather than treating the database as one enormous distribution to learn all at once.
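A little arithmetic shows how quickly the combinations outrun the data. The column names and distinct-value counts below are hypothetical, chosen only to illustrate the growth; real schemas are usually wider.

```python
from math import prod

# Hypothetical distinct-value counts per column, invented for illustration.
cardinalities = {
    "state": 50,
    "plan": 4,
    "signup_month": 36,
    "device": 5,
    "referrer": 20,
    "status": 6,
}
source_rows = 1_000_000

# Track how the joint space grows as each column is added.
combinations = 1
for column, distinct in cardinalities.items():
    combinations *= distinct
    rows_per_combination = source_rows / combinations
    print(f"after {column:<13} {combinations:>10,} combinations, "
          f"{rows_per_combination:>12,.2f} rows per combination")

total_combinations = prod(cardinalities.values())
```

With just these six modest columns there are 4,320,000 possible combinations, more than the one million rows available to learn from, so most joint combinations are never observed even once. Every added column or joined table multiplies the gap further.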

Knowing the ceiling exists tells you to check your results rather than assume them. Fidelity is something you measure, not something a tool can promise in the abstract. In practice, measuring synthetic data fidelity means comparing distributions between the synthetic and real sets, confirming that relationships hold across tables, and applying the train-on-synthetic, test-on-real method: checking that a model trained on generated data performs on real data about as well as one trained on the original.
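One simple way to compare a categorical column between the two sets is total variation distance: half the sum of the absolute differences in category shares, ranging from 0 (identical distributions) to 1 (no overlap). This is one metric among many, offered as a sketch; the function name and sample values are illustrative assumptions.

```python
from collections import Counter

def total_variation_distance(real, synthetic):
    """Half the L1 distance between the category shares of two samples."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    n_real, n_synth = len(real), len(synthetic)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )

# Hypothetical column values from a real set and a synthetic set.
real_plans = ["basic", "basic", "pro", "pro"]
synthetic_plans = ["basic", "pro", "pro", "pro"]

distance = total_variation_distance(real_plans, synthetic_plans)
print(f"TVD = {distance:.2f}")
```

A per-column score like this catches drift in individual distributions but not broken relationships, so it complements cross-table checks rather than replacing them.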

How Tonic Fabricate seeds from a live database

Tonic Fabricate, an agentic synthetic data generation platform, handles database seeding by connecting directly to a live source and modeling it. Using Live Connect, Fabricate connects to a running database, and its Data Agent profiles both the schema and the actual value distributions inside it, then generates relationally intact data across the source's tables — preserving the foreign-key relationships that naive per-table generation breaks. The result is net-new records modeled on the real patterns, not masked or copied production rows.

On a complex schema, Fabricate's Data Agent drafts a generation plan you can review before anything runs, so you can see how it intends to handle each table and its relationships rather than trusting a black box. An optional Validation Agent then reviews and refines the generated output, which keeps quality reasonable even when the initial prompt is imprecise. Once the data is right, you can operationalize it through automated workflows and mock APIs, so the same step that fills a database can also stand up the services a system expects to call.

Fabricate connects to the databases most teams actually run on, including PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, and BigQuery, so the source can be a dev instance, a staging copy, or a locked-down production database you only read from. Because the output is generated rather than copied, volume is decoupled from the source: profile a 1,000-row sample and generate 100,000 rows shaped like it, scaling a dataset well beyond the volume the source holds while keeping the relationships intact.

The Tonic Advantage: model the source, keep the relationships. Connect Fabricate to a live database and its Data Agent profiles both the schema and the real value distributions, then generates relationally intact data across your tables. Model a 1,000-row sample and generate 100,000 rows shaped like it, with the Validation Agent refining the output as it goes — and none of the source records are copied into the result.

Does seeding from real data leak real data?

Synthetic data isn't automatically private, and treating it as private by default is the most common mistake teams make. The risk depends entirely on which flavor of seeding produced it. Schema-only and from-scratch output has no real individual behind any record — there's nothing to leak, because no real values were ever read. Value-aware output is different: a model that learned from real data can, if trained without controls, memorize a rare real record and reproduce it almost verbatim in the output. The exposure is concentrated in the outliers — the unusual record the model effectively copies because it saw too few like it.

Managing that risk is a matter of controls and measurement rather than faith. On the generation side, you limit how much any single source record can shape the model, so no one row drives a recognizable output. On the verification side, you measure the result instead of assuming it: near-copy detection scans the synthetic set for records that sit too close to a real one, and membership-inference testing — checking whether an attacker could tell that a specific real record was part of the training data — quantifies how much the output reveals about its source. Synthetic data earns the label private by passing those checks, not by being generated.
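Near-copy detection can be sketched as a nearest-neighbor check: for each synthetic record, find the closest real record and flag the pair if the distance falls under a threshold. The version below is a minimal illustration for numeric columns only, with range-scaled Euclidean distance, an arbitrary threshold, and made-up records; a production check would also need to handle categorical fields and choose its threshold empirically.

```python
import math

def scale_factors(real_rows):
    """Per-column value range in the real data, used to put columns on one scale."""
    columns = list(zip(*real_rows))
    return [(max(col) - min(col)) or 1.0 for col in columns]

def distance(a, b, scales):
    """Euclidean distance after dividing each column difference by its range."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))

def near_copies(synthetic_rows, real_rows, threshold):
    """Return (synthetic_row, distance) pairs that sit too close to some real row."""
    scales = scale_factors(real_rows)
    flagged = []
    for row in synthetic_rows:
        closest = min(distance(row, real, scales) for real in real_rows)
        if closest <= threshold:
            flagged.append((row, closest))
    return flagged

# Hypothetical (age, income) records; the last real row is a rare outlier.
real = [(34, 52_000.0), (41, 61_500.0), (29, 48_000.0), (87, 940_000.0)]
synthetic = [(36, 55_000.0), (30, 47_200.0), (87, 939_500.0)]

flagged = near_copies(synthetic, real, threshold=0.01)
for row, dist in flagged:
    print(f"near copy: {row} (distance {dist:.5f})")
```

In this toy data the only flagged record is the one that mirrors the outlier, which matches where the exposure tends to concentrate.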

There's an honest boundary worth naming. When the blocker is access or volume — you can't get production data, or you need far more of it than exists — synthesis is the right tool. But when you specifically need the exact tangle of one real record's history, with its real edge cases intact, modeling it from a distribution is the wrong instrument, and de-identifying the production data in place tends to fit better. The two aren't really in competition; they suit different problems, and which is safer for a given dataset turns on the re-identification risk and the safeguards each approach brings.

Seeding vs. generating from scratch vs. de-identifying production: choosing the approach

There are three routes to non-production data, and seeding from an existing database is only one of them. Each fits a different starting point. Seeding from an existing database models a real source to produce high-fidelity, net-new records — the right choice when you have a database and need more safe data shaped like it. Generating from scratch needs no source at all: you describe what you want and the tool produces it, which fits greenfield features and scenarios your real data has never recorded. De-identifying production takes the real records and transforms them in place — masking or replacing sensitive values while keeping everything else exactly as it was — the approach a tool like Tonic Structural takes, and the one to reach for when the real-world shape of specific records is the whole point.

The deciding question is what's actually blocking you. An access or volume problem points to synthesis, in either form. A need for the precise, irreplaceable detail of real records points to de-identification. And the two are often used together rather than chosen between: de-identify a production dataset so it's safe to handle, then seed a generator from the de-identified version to scale it up well beyond the original volume, without reintroducing the sensitive records you just removed.

Criterion	Seed from an existing database	Generate from scratch	De-identify production
Source data required	Yes — an existing schema and data to model	No — works from a specification alone	Yes — the production records themselves
Preserves exact real-world edge cases	Approximates them statistically; rare specifics can smooth out	No — only the scenarios you specify	Yes — keeps the exact real shape, transformed in place
Net-new records, no real individual behind them	Yes	Yes	No — the same records, with sensitive values replaced
Scales beyond production volume	Yes — generate far more than the source holds	Yes — volume is whatever you specify	Limited — roughly mirrors the source volume
Best-fit scenario	You have a real database and need more safe data shaped like it	No usable or safe source data exists yet	You need the precise tangle of real records, made safe

None of the three is universally best; they map to different gaps. The trade-offs between synthetic and real production data run deeper than seeding alone, but for the specific job of turning a database you already have into safe data that behaves like it, seeding is the most direct route, and it's where a generator that respects relationships earns its keep.
