How is synthetic data generated? Methods explained

Synthetic data is generated from one of two starting points — from scratch against a schema or a prompt, or modeled on an existing dataset — using one of a few category-level approaches: rule-based generation, model-based generation (statistical and machine-learning), simulation, and agentic generation that orchestrates these techniques automatically. Which approach fits depends on how much real data you have to learn from and how much realism the job demands. Modern tools increasingly put an AI agent in charge of the workflow — profiling your data or reading your prompt, drafting a generation plan, producing the data, and validating it — turning what used to be a scripting project into a conversation.

Two starting points: from scratch or modeled on existing data

Every method of synthetic data generation begins from one of two places, and the choice between them is settled before you pick a technique. Either you generate from scratch — defining the data with a schema or describing it in a prompt, with no source dataset behind it — or you generate by modeling an existing dataset, learning its distributions, correlations, and structure, then sampling new records that share those properties. The deciding question is plain: do you already hold usable real data to learn from, or not?

From-scratch generation is the cold-start path. It's what you reach for on a greenfield feature with no history yet, for a domain you've never logged, or when the real data is too sensitive to touch. Because nothing constrains the output except your specification, you set the schema, the coverage, and the edge cases directly — the trade is that the data only knows what you tell it.

Modeling existing data is the opposite starting point: you have a real dataset and need a faithful, higher-volume stand-in for it, whether that means more records than production can safely supply or a version that's safe to share. Here the source data carries the realism, and the generator's job is to reproduce its statistical shape without copying its rows. This split, between building data from a specification and deriving it from data you already have, goes back to the prior question of what synthetic data is and where it comes from. It is also what makes the later methods differ, mainly in how much they learn from real data versus how much they invent.

Rule-based generation

Rule-based generation is the original approach, and it does what the name says: you write the rules, and the generator fills them in. You specify constraints field by field (a format, a numeric range, a set of allowed values, a simple distribution to sample from), and the generator produces records that satisfy them, deterministically once any random seed is fixed. A status column might be restricted to a fixed set of valid codes; a transaction amount might draw uniformly from a known range; an account identifier might follow a fixed pattern. Nothing is learned; everything is declared.
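A minimal sketch of the idea in Python, using only the standard library. The field names, the status codes, the amount range, and the `AC-` identifier pattern are all invented for illustration; they stand in for whatever rules your own schema declares.

```python
import random
import string

# Hypothetical field rules; none of these come from a real tool or dataset.
STATUS_CODES = ["ACTIVE", "SUSPENDED", "CLOSED"]  # fixed set of allowed values
AMOUNT_RANGE = (1.00, 500.00)                     # known numeric range


def make_account_id(rng):
    """Fixed pattern: 'AC-' followed by six digits."""
    return "AC-" + "".join(rng.choice(string.digits) for _ in range(6))


def generate_records(n, seed=0):
    """Fill the declared rules n times; the same seed yields the same rows."""
    rng = random.Random(seed)
    return [
        {
            "account_id": make_account_id(rng),
            "status": rng.choice(STATUS_CODES),
            "amount": round(rng.uniform(*AMOUNT_RANGE), 2),
        }
        for _ in range(n)
    ]


records = generate_records(5, seed=42)
for row in records:
    print(row)
```

Every value traces back to one of the three rules, and nothing in the output depends on a source dataset.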

That makes rule-based generation precise and completely controllable, which is the source of both its strength and its ceiling. Where the fields are bounded and well understood, and the output has to obey exact rules every time, it's hard to beat — you get exactly the shape you asked for, with no dependency on a source dataset and no surprises. It's transparent, too: every value traces back to a rule you can read.

The limits show up as the data grows more realistic. A rule-based generator can't reproduce a correlation you didn't encode by hand — if income and credit limit move together in the real world, you have to know that and write it in, and the same goes for every other relationship between fields. Across a large or evolving schema, that hand-encoding becomes a substantial maintenance project, and the realism never exceeds what you thought to specify. Rule-based generation stays genuinely useful for bounded, well-defined needs; it simply stops scaling once "realistic" means capturing structure you can't easily enumerate.
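To make the hand-encoding concrete, here is a small Python sketch of the income and credit-limit case. The 20% ratio, the noise band, and the income range are assumptions made up for the example, not figures from any real portfolio; the point is only that the relationship exists in the output because someone wrote it into the rule.

```python
import random


def generate_customer(rng):
    """One record where the income/credit-limit link is written in by hand."""
    income = rng.uniform(20_000, 150_000)
    # Hand-encoded relationship (an invented rule for illustration):
    # credit limit is about 20% of income, give or take 10%.
    credit_limit = income * 0.20 * rng.uniform(0.9, 1.1)
    return {"income": round(income, 2), "credit_limit": round(credit_limit, 2)}


def pearson(xs, ys):
    """Pearson correlation, written out so the sketch stays stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
customers = [generate_customer(rng) for _ in range(1000)]
r = pearson([c["income"] for c in customers],
            [c["credit_limit"] for c in customers])
print(f"income vs credit_limit correlation: {r:.2f}")
```

Drawing `credit_limit` from its own independent range instead would satisfy every per-field rule and still lose the relationship, which is the maintenance burden in miniature: each dependency between fields needs its own line of code.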

Statistical and machine-learning (model-based) generation

Model-based generation turns the rule-based approach around: instead of declaring the structure, you let a model learn it. A statistical or machine-learning model studies a source dataset — its distributions, the correlations between columns, the overall shape — and then samples new records from what it learned. You don't tell it that income and credit limit move together; it picks that relationship up from the data and carries it into the synthetic output.

The techniques fall into two broad families. Statistical models fit explicit mathematical structure to the data — Gaussian copulas to capture how variables move together, or fitted parametric distributions for individual columns — and suit tabular data with relationships you can model compactly. Deep generative models go further: GANs (generative adversarial networks, where one network generates candidates and another learns to tell them from real data, the two improving in competition), VAEs (variational autoencoders, which compress data into a learned representation and sample new points from it), and diffusion models learn far more complex, high-dimensional distributions than a statistical model can. The deeper the model, the more subtle the patterns it can reproduce — and the more source data it needs to learn them.
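As a rough illustration of the statistical family, the sketch below fits a two-column Gaussian copula using only the Python standard library. It is a simplified teaching version under stated assumptions: the "source" data is invented for the demo, the marginals are empirical quantiles rather than fitted parametric distributions, and the function names (`fit_and_sample`, `normal_scores`, `quantile`) are hypothetical. Production libraries handle many columns, mixed types, and smoothing that this omits.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(xs, ys):
    """Pearson correlation, written out so the sketch stays stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def normal_scores(values):
    """Replace each value with the standard-normal score of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def quantile(sorted_values, u):
    """Empirical quantile at u in [0, 1], interpolating between neighbours."""
    pos = u * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def fit_and_sample(xs, ys, n_samples, seed=0):
    """Fit a two-column Gaussian copula and sample new (x, y) pairs from it."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(xs), normal_scores(ys))  # learned dependence
    residual = max(0.0, 1.0 - rho * rho) ** 0.5
    sorted_x, sorted_y = sorted(xs), sorted(ys)          # learned marginals
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual * rng.gauss(0, 1)       # correlated with z1
        samples.append((quantile(sorted_x, STD_NORMAL.cdf(z1)),
                        quantile(sorted_y, STD_NORMAL.cdf(z2))))
    return samples


# Stand-in "source" data, invented for the demo: credit limit tracks income.
src_rng = random.Random(1)
incomes = [src_rng.uniform(20_000, 150_000) for _ in range(500)]
limits = [0.2 * inc * src_rng.uniform(0.8, 1.2) for inc in incomes]

synthetic = fit_and_sample(incomes, limits, n_samples=1000, seed=2)
syn_x = [x for x, _ in synthetic]
syn_y = [y for _, y in synthetic]
print("source correlation:   ", round(pearson(incomes, limits), 2))
print("synthetic correlation:", round(pearson(syn_x, syn_y), 2))
```

Nobody told the sampler that the two columns move together; the dependence comes from `rho`, which was estimated from the source. The copula separates that dependence from each column's own shape, which is why the approach stays compact for tabular data.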

The honest trade-off is that model-based generation is only as good as the data behind it. It needs a quality source dataset, and the correlations in the output are only as faithful as those in the training set, bias and gaps included. A carelessly trained model can also memorize fragments of its source and reproduce them, which turns a privacy-safe exercise into a leak. That is why fidelity (how closely the synthetic data reproduces the patterns of the real data) and privacy in model-based generation are measured rather than assumed, a topic covered in how to measure synthetic data quality and fidelity.
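Two deliberately crude measurements give a feel for what "measured rather than assumed" can mean. The Python sketch below compares one correlation between source and synthetic data as a fidelity signal, and counts synthetic rows that appear verbatim in the source as a privacy signal. The tiny datasets and the function names are made up for the example, and real evaluations use far richer metrics than these two.

```python
def pearson(xs, ys):
    """Pearson correlation, written out so the sketch stays stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_gap(source, synthetic):
    """Fidelity signal: how far the synthetic correlation drifts from the source."""
    src = pearson([r[0] for r in source], [r[1] for r in source])
    syn = pearson([r[0] for r in synthetic], [r[1] for r in synthetic])
    return abs(src - syn)


def exact_copy_rate(source, synthetic):
    """Privacy signal: share of synthetic rows that are verbatim source rows."""
    source_rows = set(source)
    return sum(row in source_rows for row in synthetic) / len(synthetic)


# Toy (x, y) rows, invented for the demo.
source = [(30, 3), (40, 5), (50, 4), (60, 7), (70, 8)]
fresh = [(35, 4), (45, 4), (55, 6), (65, 7), (75, 9)]   # new rows, similar shape
leaky = [(30, 3), (40, 5), (55, 6), (60, 7), (70, 8)]   # mostly memorized rows

print(exact_copy_rate(source, fresh))   # 0.0
print(exact_copy_rate(source, leaky))   # 0.8
print(round(correlation_gap(source, fresh), 3))
```

The `leaky` set would score well on fidelity precisely because it copies the source, which is why the two signals need to be read together.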

Simulation and agent-based generation

Simulation takes a different route to synthetic data: rather than learn from a dataset, you model a process or a system and let it run, and the data falls out as a byproduct of what happens. Agent-based modeling (ABM) is the most common form — you define a population of autonomous agents, give each one simple rules for how it behaves and interacts, and let the system play forward. Cars following local driving rules produce traffic-flow data; simulated traders reacting to prices produce market activity; modeled individuals making contact produce epidemiological spread. The data is whatever the simulated world generates as it evolves.

Despite the shared word, agent-based modeling is a different idea from agentic generation: the agents here are the simulated entities inside the model — cars, traders, households — not an AI system orchestrating a generation toolchain.
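A toy agent-based model shows how the data falls out of the run. This Python sketch simulates contact-driven spread in a small population and records each infection and recovery as an event. Every parameter (population size, contacts per day, transmission and recovery probabilities) is an arbitrary assumption chosen for the demo, not an epidemiological estimate.

```python
import random
from dataclasses import dataclass


@dataclass
class Person:
    pid: int
    state: str = "S"  # S = susceptible, I = infected, R = recovered


def simulate(n_people=200, n_days=30, contacts_per_day=4,
             p_transmit=0.05, p_recover=0.1, seed=0):
    """Run the toy outbreak and return the event log it emits."""
    rng = random.Random(seed)
    people = [Person(pid) for pid in range(n_people)]
    people[0].state = "I"  # start with a single infected agent
    events = []
    for day in range(n_days):
        # Snapshot, so agents infected today start spreading tomorrow.
        for person in [p for p in people if p.state == "I"]:
            # Each infected agent meets a few randomly chosen others.
            for other in rng.sample(people, contacts_per_day):
                if other.state == "S" and rng.random() < p_transmit:
                    other.state = "I"
                    events.append({"day": day, "event": "infection",
                                   "agent": other.pid, "infected_by": person.pid})
            if rng.random() < p_recover:
                person.state = "R"
                events.append({"day": day, "event": "recovery",
                               "agent": person.pid, "infected_by": None})
    return events


events = simulate(seed=3)
print(len(events), "events; first few:")
for event in events[:3]:
    print(event)
```

No dataset was modeled and no per-field rule describes the output. The event log is simply what the simulated population did, with the ordering and causal chain (who infected whom, and when) preserved.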

Simulation earns its place where the other approaches can't reach. When the system doesn't exist yet, there's no dataset to model and no production behavior to encode rules from, but you can still simulate it. When behavior is dynamic and unfolds over time, a simulation captures the temporal and causal structure that a static statistical model flattens. It's also the standard way to produce complete simulated worlds, including the environments used to train and evaluate reinforcement-learning agents, where you need long, coherent stretches of realistic activity rather than independent rows. The cost is that you have to be able to describe the system's rules, and the realism of the output rises and falls with how well your model captures the real dynamics. Increasingly, standing up that kind of simulated environment is something an agentic tool can do directly from a prompt.

Agentic generation: orchestrating the approaches

Agentic generation is the newest approach, and it changes who does the configuring. Instead of you hand-building rules or training a model, an AI agent takes the wheel: it profiles your source data or reads your prompt and schema, drafts a plan for generating the data, produces it, and checks its own work — orchestrating rule-based and model-based techniques underneath so you don't configure them yourself. What used to be a scripting project becomes a conversation about what you need.
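The control flow of that loop (plan, generate, validate, refine) can be sketched without any AI in it at all. The Python below is a schematic stand-in, not how any particular product works: the schema format, the function names, and the plausibility check are all hypothetical, and in a real agentic tool a language model would do the planning and reviewing that plain functions do here.

```python
import random

# Hypothetical schema: field name -> (kind, arguments). Not any tool's format.
SCHEMA = {
    "age": ("int", 18, 90),
    "status": ("choice", ["employed", "student", "retired"]),
}


def draft_plan(schema):
    """Planning step: turn the schema into an ordered list of generation steps."""
    return [{"field": name, "kind": kind, "args": args}
            for name, (kind, *args) in schema.items()]


def generate_row(plan, rng):
    """Generation step: apply each planned rule to produce one record."""
    row = {}
    for step in plan:
        if step["kind"] == "int":
            row[step["field"]] = rng.randint(*step["args"])
        elif step["kind"] == "choice":
            row[step["field"]] = rng.choice(step["args"][0])
    return row


def is_plausible(row):
    """Validation step: a cross-field check the per-field rules cannot express."""
    return not (row["status"] == "retired" and row["age"] < 55)


def run(schema, n, seed=0, max_rounds=20):
    """Plan, generate, then regenerate whatever the validator rejects."""
    rng = random.Random(seed)
    plan = draft_plan(schema)
    rows = [generate_row(plan, rng) for _ in range(n)]
    for _ in range(max_rounds):
        rejected = [i for i, row in enumerate(rows) if not is_plausible(row)]
        if not rejected:
            break
        for i in rejected:  # refine only the rows that failed review
            rows[i] = generate_row(plan, rng)
    return rows


rows = run(SCHEMA, n=100, seed=1)
print(rows[:3])
```

The plan is an inspectable object that exists before any data does, which is the hook for reviewing and adjusting it, and the validator sits outside the generator so it can catch what the generation rules missed.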

Tonic Fabricate is a current example of how this works in practice. It generates from any of three starting points:

a natural-language prompt describing the data you want,
a schema you upload, or
a live database it connects to, profiles, and models — the path for generating synthetic data from an existing database.

Those starting points aren't mutually exclusive: within a single conversation you can model from a live source and generate additional records from scratch, combining the two in one dataset. For complex schemas, Fabricate drafts a generation plan you can review and adjust before any data is produced, so you keep control over the structure without hand-specifying every field. Two agents divide the work: a Data Agent generates the data, and a Validation Agent reviews and refines it, which keeps quality reasonable even when the initial prompt is imprecise. The output isn't limited to relational tables — it spans relational data, free text, and files, including nested and semi-structured shapes, across multiple databases at once, with referential integrity (the property that keys and references stay consistent across related tables and files) maintained throughout. The results can be operationalized through automated workflows and mock APIs that slot into an existing pipeline. Agentic generation is the modern, state-of-the-art approach rather than the prevailing default — rule-based and model-based methods remain the established techniques — but it's the direction the tooling is moving.
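Referential integrity is easy to state and easy to check. The following Python sketch generates two related tables and verifies that every child row points at a parent that exists. It illustrates the property only; the table names, columns, and helper functions are hypothetical, and it says nothing about how Fabricate maintains integrity internally.

```python
import random


def generate_tables(n_customers, n_orders, seed=0):
    """Generate parent rows first, then children that reference real parents."""
    rng = random.Random(seed)
    customers = [
        {"customer_id": i, "segment": rng.choice(["retail", "business"])}
        for i in range(1, n_customers + 1)
    ]
    customer_ids = [c["customer_id"] for c in customers]
    orders = [
        {
            "order_id": j,
            "customer_id": rng.choice(customer_ids),  # foreign key
            "amount": round(rng.uniform(5, 500), 2),
        }
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


def orphaned_orders(customers, orders):
    """Return orders whose customer_id matches no customer row."""
    known = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known]


customers, orders = generate_tables(n_customers=10, n_orders=50, seed=0)
print("orphaned orders:", len(orphaned_orders(customers, orders)))  # 0
```

Generating each table independently would satisfy every column's own rules and still produce orphans, so the ordering (parents first, children sampled from existing keys) is what carries the guarantee in this sketch.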

The Tonic Advantage: orchestration instead of hand-configuration. The older methods make you choose and configure a technique; an agentic layer chooses and combines them for you. In Fabricate, the Data Agent generates relational data, free text, and mock APIs across multiple databases and files with referential integrity maintained throughout, while the Validation Agent reviews what it produces and flags data that looks unrealistic or incorrect. For complex schemas, Fabricate drafts a generation plan you control step by step. Rule-based precision and model-based fidelity are still doing the work underneath — you describe and review instead of hand-building.

How to choose an approach

Choosing an approach comes down to two questions, and they map onto the two axes running through every method here. The first is how much usable real data you have: with none, you're generating from scratch, which points toward rule-based or simulation; with a representative dataset in hand, modeling it becomes the higher-fidelity option. The second is how much realism and control the job demands: tightly bounded data with exact rules favors rule-based generation, realism learned from real records favors model-based, and dynamic systems that have to unfold over time favor simulation.

The practical wrinkle is that modern agentic tools fold rule-based and model-based generation together under one workflow, so the real decision is shifting — less often "which single technique do I implement," and more "how much do I want to hand-configure versus describe and review." Another input is how synthetic data compares to real production data for your specific job — whether you need to generate at all, or to de-identify real data instead. The technique you reach for also tends to follow the type of data you're generating, from bounded tabular records to free text and files. Whichever you choose, the starting-point question still frames it: generate from scratch when you have nothing to learn from, model what you have when you do, and lean on an agentic layer when you'd rather not wire the techniques together yourself.

Approach	How it works	Best when	Trade-offs
Rule-based	Fills hand-written field constraints — formats, ranges, allowed values, simple distributions — deterministically	Fields are bounded and well understood, and output must obey exact rules every time	Can't infer correlations you don't encode; labor-intensive across large or changing schemas
Model-based (statistical, deep generative)	A statistical or deep model learns a source dataset's distributions and correlations, then samples new records	You hold a representative real dataset and need a higher-fidelity, higher-volume stand-in	Needs a quality source; fidelity and privacy must be measured, not assumed
Simulation / agent-based	Models a system or autonomous agents and lets their interactions emit data over time	The system doesn't exist yet or is too complex for a statistical model; behavior is dynamic or temporal	You must be able to specify the system's rules; realism depends on how well the model captures them
Agentic	An AI agent profiles your source or prompt, plans the generation, orchestrates rule-based and model-based techniques underneath, then validates the output	You want high-fidelity, mixed-format data without hand-configuring the underlying techniques	A newer approach, not yet the established default; you review the agent's plan rather than set every parameter by hand
