Synthetic data for software testing and QA

Synthetic data lets engineering and QA teams test applications against realistic, privacy-safe data without exposing production records or waiting on data-provisioning tickets. There are two ways to get safe test data: generate synthetic data from scratch — or seed it from a real schema — to cover edge cases and scale on demand, or de-identify production data to preserve real-world complexity. The right choice, often a combination, depends on whether the data you need already exists and whether it can be used safely.

Why teams use synthetic data for testing

Synthetic test data removes the two blockers that slow testing down the most: the privacy risk of putting production records into lower environments, and the wait for someone to provision a usable dataset. Production data carries real names, account numbers, and health details, so every copy that lands in a dev, staging, or QA environment widens your exposure and pulls compliance into every release. Real datasets also tend to arrive through a ticket — a request to a data team, a masking job, an export that takes days — and testing stalls until it clears. This is the gap engineering and QA teams turn to synthetic data to fill: data built for the test rather than borrowed from production.

The payoff is not only about safety. Production data only contains what already happened, which means the rare and awkward cases you most need to test — the ones that actually break things — are often missing or buried under ordinary records. Generating data lets you create those cases on purpose and produce as many of them as a test demands.

In practice, the reasons teams adopt synthetic test data cluster into a few clear wins:

No PII in lower environments. Generated records describe no real person, so dev, staging, and QA never hold sensitive data in the first place.
No waiting on data-as-a-ticket. Engineers generate what they need on demand instead of queuing behind a provisioning request.
Volume on demand. Scale a dataset from a handful of rows to millions to see how the system behaves under load.
Deliberate coverage. Produce the edge cases, negative inputs, and rare combinations that ordinary production traffic underrepresents.

Two ways to get safe test data: generate it or de-identify production

Safe test data comes from two distinct sources, and they produce genuinely different categories of data — a distinction worth getting right before you pick a tool. You can generate synthetic data, or you can de-identify production data. The two solve the same problem from opposite directions, and plenty of teams use both.

Synthetic data is net-new data created by a model or algorithm rather than collected from real-world events. For testing, you generate it either from scratch — from a schema, a set of rules, or a prompt — or by seeding it from a real schema so the output mirrors the shape of a system you already run. Because no record traces back to a real person, synthetic data sidesteps privacy exposure by construction, and because you define it, you control its coverage and its volume. With Tonic Fabricate, you describe what you need and it generates a relationally intact dataset, from scratch or modeled on an existing source.

De-identified data is a different thing. It starts from real production records and transforms them in place by masking, tokenizing, or generalizing the sensitive fields, so the dataset keeps production's real-world structure while the personal information is removed or replaced. Tonic Structural takes this approach: it transforms production data into safe test data and maintains referential integrity across the de-identified schema. The output is real production data that has been made safe, not a set of generated records.

This is the distinction teams most often get wrong: de-identified data is not synthetic data. One is generated; the other is transformed real data. Each has a natural fit, and the right answer is frequently a combination — de-identify the data you already hold and are allowed to use, and generate the data you don't have. Seeing how synthetic data compares to using real production data is the clearest way to understand which job each one does.

	Generate synthetic data — Tonic Fabricate	De-identify production data — Tonic Structural
Data origin	Net-new records, generated from scratch or seeded from a real schema	Existing production records, transformed in place
Data category	Synthetic data	De-identified data (masked/transformed real data)
Best when	The data doesn't exist yet, can't be touched, or you need specific edge cases and volume	You need production's real-world complexity and have data you're allowed to transform
Edge-case coverage	Deliberately generate rare/negative scenarios	Limited to what already exists in production
Referential integrity	Maintained across generated tables, files, and APIs	Preserved across the de-identified schema
Privacy basis	No real records involved	Sensitive fields removed/replaced before leaving prod
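
The difference between the two sources can be made concrete with a minimal Python sketch. Everything here is illustrative: the record, the field names, and the `deidentify` and `generate` helpers are hypothetical and do not represent how Tonic Structural or Tonic Fabricate work internally. A salted hash stands in for a tokenizer and should not be read as production-grade de-identification.

```python
import hashlib
import random

# Hypothetical production record, invented for this sketch.
PRODUCTION_ROW = {
    "customer_id": 1017,
    "name": "Dana Whitfield",
    "email": "dana@example.com",
    "plan": "pro",
}


def deidentify(row: dict, secret: str = "rotate-me") -> dict:
    """Transform a real row: keep its structure and non-sensitive values,
    replace the sensitive fields."""
    token = hashlib.sha256((secret + row["email"]).encode()).hexdigest()[:12]
    return {
        **row,
        "name": "REDACTED",
        "email": f"user_{token}@example.test",
    }


def generate(rng: random.Random, customer_id: int) -> dict:
    """Create a net-new row from rules alone, with no production input."""
    return {
        "customer_id": customer_id,
        "name": f"Test User {customer_id}",
        "email": f"user{customer_id}@example.test",
        "plan": rng.choice(["free", "pro", "enterprise"]),
    }


masked = deidentify(PRODUCTION_ROW)                    # transformed real data
synthetic = generate(random.Random(7), customer_id=1)  # generated data
```

The masked row still carries the real `customer_id` and `plan`, which is where its real-world complexity comes from. The generated row has the same shape but no link to any production record.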

What makes test data good enough to trust

Test data is only useful if it behaves like the system under test, and three properties decide whether it does: referential integrity across tables and services, production-like complexity and distributions, and coverage of the specific scenarios you need to exercise. Miss any one and the tests pass against data that doesn't resemble production, which is worse than no test at all because it builds false confidence.

This is where a sharp, well-known critique of synthetic data lands: a naïve synthetic replica of production inherits production's gaps. If you generate data that simply mimics the distribution of what you already have, it won't contain the edge cases that were never in production to begin with — the malformed address, the account in three currencies, the order that was refunded twice. A replica reproduces the ordinary and the missing alike.

Good generation answers that critique directly by treating coverage as something you design rather than inherit. Instead of only mirroring production, you deliberately inject the rare, negative, and boundary scenarios a test needs, and you keep control over distributions so the dataset isn't lopsided toward the common case. Referential integrity is the other half of trustworthy test data: when a generated order references a generated customer that references a generated account, those keys have to line up across every table, file, and service, or a multi-table query or a cross-service call breaks for reasons that have nothing to do with the code you meant to test. Deciding how you measure whether synthetic data is good enough — against real distributions and against the scenarios you actually care about — is what separates data you can trust from data that merely looks plausible.
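
As a rough illustration of designed coverage plus referential integrity, the sketch below generates two related tables, injects two edge cases on purpose, and checks that every order points at a customer that exists. The table shapes, field names, and the `generate_dataset` and `orphaned_orders` helpers are assumptions made for this example, not part of any Tonic product.

```python
import random


def generate_dataset(n_customers: int, n_orders: int, seed: int = 0):
    """Generate customers and orders whose foreign keys always line up."""
    if n_customers < 1 or n_orders < 2:
        raise ValueError("need at least 1 customer and 2 orders")
    rng = random.Random(seed)
    customers = [
        {"customer_id": i, "currency": rng.choice(["USD", "EUR", "GBP"])}
        for i in range(1, n_customers + 1)
    ]
    orders = []
    for order_id in range(1, n_orders + 1):
        # Pick the key from customers that were actually generated.
        customer = rng.choice(customers)
        orders.append({
            "order_id": order_id,
            "customer_id": customer["customer_id"],
            "amount_cents": rng.randint(100, 50_000),
            "refund_count": 0,
        })
    # Coverage by design: inject cases ordinary traffic rarely contains.
    orders[0]["amount_cents"] = 0    # boundary value: zero-value order
    orders[-1]["refund_count"] = 2   # the order that was refunded twice
    return customers, orders


def orphaned_orders(customers, orders):
    """Return orders whose customer_id matches no generated customer."""
    known = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known]


customers, orders = generate_dataset(n_customers=5, n_orders=20)
```

Running `orphaned_orders(customers, orders)` on this dataset returns an empty list. A check like this is cheap to keep in a test suite, so a broken key shows up as a data failure instead of a confusing application failure.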

The Tonic Advantage: quality that holds up under test

Generating realistic test data usually means hand-tuning a generator until the output stops breaking your tests. Tonic Fabricate splits that work between two agents: a Data Agent generates the dataset from your description, and a Validation Agent reviews and refines what it produces, so quality holds even when the initial prompt is imprecise. Fabricate maintains referential integrity not just within one table but across multiple databases, files, and APIs at once. That is the structure multi-table and multi-service tests depend on, so the data holds together instead of falling apart at the first join.

The testing scenarios synthetic data unlocks

Synthetic data supports the full range of testing, not just a single use case, and the value shows up differently in each. Because you can generate exactly the data a test needs — in the volume it needs, with the edge cases built in — it slots into the scenarios that real data tends to block:

Functional and feature testing before real data exists. When you're building a feature, there's often no production data for it yet. Generate a realistic dataset to develop and test against from day one instead of waiting for real usage to accumulate.
Regression testing with consistent, reusable datasets. Regenerate the same dataset on demand so every regression run tests against identical, known data — no drift between runs, no flaky failures from a shifting fixture.
Integration testing across services. Generate referentially consistent data that spans the services a workflow touches, so the IDs and relationships line up end to end rather than breaking at the first service boundary.
Performance and load testing. Generate the volume production can't safely hand you — millions of rows, realistic distributions — to find where the system slows or fails under load.
Edge-case and negative testing. Produce the malformed inputs, boundary values, and "chaos" cases that rarely appear in production but cause the worst failures when they do.
CI/CD test data refresh. Regenerate or re-provision data on every commit so test environments are never stale and never depend on a manual data pull to stay current.

These scenarios map directly onto the problems that slow testing down: shipping blind for lack of data, hand-scripting mocks, and failures at scale and at the edge. That is why teams increasingly back their QA and test environments with generated data.
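
For regression runs and CI refreshes, the property that matters is that the same inputs regenerate the same dataset. One common way to get this, sketched here under the assumption of a seeded pseudo-random generator, is to derive every value from a fixed seed and compare a fingerprint of the result. The `build_fixture` and `fingerprint` names and the row shape are hypothetical.

```python
import hashlib
import json
import random


def build_fixture(seed: int, rows: int) -> list[dict]:
    """Generate a dataset that is fully determined by its seed and size."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "score": rng.randint(0, 100),
            "status": rng.choice(["active", "suspended", "closed"]),
        }
        for i in range(rows)
    ]


def fingerprint(fixture: list[dict]) -> str:
    """Hash a canonical JSON form so two runs can be compared cheaply."""
    payload = json.dumps(fixture, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


first_run = build_fixture(seed=42, rows=100)
second_run = build_fixture(seed=42, rows=100)
```

The two runs above produce identical rows and therefore identical fingerprints, while a different seed yields a different dataset. Exact values can differ between Python versions, so a pipeline that relies on this would pin the interpreter version as well as the seed.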

Generating test data from an existing database

When you want production-like structure without using production records, you generate synthetic data modeled on a real database rather than copied from it. You connect to the live source, the generator reads its schema, relationships, and value distributions, and it produces new records that mirror those patterns — the column relationships, value frequencies, and cross-table structure that make the data realistic — at whatever scale you need. This is the seeding path, and it stays distinct from de-identifying the real records themselves: you are creating net-new data shaped by real patterns, not transforming the production rows you started with.
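
The seeding idea can be sketched with the standard library alone. The example below profiles value frequencies in a small SQLite table and samples new rows from that profile. It is a deliberately naive illustration, not a description of how any product models a database: it samples each column independently, so it ignores cross-column and cross-table relationships, and it only suits low-cardinality, non-identifying columns because it reuses observed values. The `accounts` table and both helper functions are made up for the example.

```python
import random
import sqlite3


def column_frequencies(conn, table: str, column: str):
    """Return the distinct values of a column and how often each occurs.
    Identifiers are interpolated into SQL, so pass trusted names only."""
    rows = conn.execute(
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY "{column}"'
    ).fetchall()
    values = [r[0] for r in rows]
    weights = [r[1] for r in rows]
    return values, weights


def synthesize(conn, table: str, columns: list[str], n: int, seed: int = 0):
    """Sample n new rows from the observed per-column frequencies."""
    rng = random.Random(seed)
    profiles = {c: column_frequencies(conn, table, c) for c in columns}
    return [
        {c: rng.choices(vals, weights=w, k=1)[0]
         for c, (vals, w) in profiles.items()}
        for _ in range(n)
    ]


# A small in-memory stand-in for the "existing database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO accounts (plan, region) VALUES (?, ?)",
    [("free", "EU"), ("free", "US"), ("free", "US"),
     ("pro", "US"), ("pro", "EU"), ("enterprise", "US")],
)

# Expand six source rows into a much larger generated set.
synthetic_rows = synthesize(conn, "accounts", ["plan", "region"], n=1000)
```

Six source rows become a thousand generated ones with roughly the same mix of plans and regions. Preserving relationships between columns and tables is the hard part that this sketch leaves out.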

Tonic Fabricate handles this through Live Connect, which connects to a live data source and generates data that reflects its real schemas and distributions. A common use is expanding a small sample into a large, referentially intact dataset: point Fabricate at a modest slice of a real database and have it generate a far larger set that keeps the same shape, so you can load-test or populate a full environment without ever moving production rows. Because the output is generated rather than masked, it carries no real records; the realism comes from the model of the data, not the data itself. The details of connecting and configuring a source are covered in the Live Connect documentation. This is the practical way to generate synthetic data from an existing database when production-like structure matters but copying production records is not an option.

Mock data, synthetic APIs, and keeping PII out of lower environments

The same generation that fills a database can also stand up the services around it, which matters when the thing you need to test against isn't a table but an API. Frontend and backend teams routinely block on data that doesn't exist yet or services that aren't ready, and mock data and synthetic APIs clear both at once. Tonic Fabricate generates data and the mock APIs that serve it, so a team can test against realistic responses without waiting on a live service to come online.

The practical uses cluster into a few patterns:

Mock data for development. Realistic generated datasets let teams build and test before any real data is available.
Synthetic APIs for integration and edge-case testing. Functional mock endpoints return realistic responses — including the error and boundary cases a live service won't reliably produce on demand — so integration tests can cover paths that are hard to trigger otherwise.
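
To show what a mock endpoint with on-demand error cases might look like, here is a minimal sketch written as a plain function, so it can back any test harness or HTTP framework. The endpoint, the customer records, and the `"timeout"` trigger value are all invented for illustration and do not reflect the API that Tonic Fabricate generates.

```python
import json

# Hypothetical generated records served by the mock.
CUSTOMERS = {
    1: {"customer_id": 1, "name": "Test User 1", "plan": "pro"},
    2: {"customer_id": 2, "name": "Test User 2", "plan": "free"},
}


def mock_get_customer(customer_id: str) -> tuple[int, str]:
    """Return (status_code, json_body) for a GET on a customer resource."""
    # A reserved trigger value forces a failure a live service
    # will not reliably produce on demand.
    if customer_id == "timeout":
        return 504, json.dumps({"error": "upstream timeout"})
    try:
        key = int(customer_id)
    except ValueError:
        return 400, json.dumps({"error": "customer_id must be an integer"})
    record = CUSTOMERS.get(key)
    if record is None:
        return 404, json.dumps({"error": "customer not found"})
    return 200, json.dumps(record)


status, body = mock_get_customer("1")
```

An integration test can then request `"999"`, `"abc"`, or `"timeout"` to exercise the not-found, bad-request, and gateway-timeout paths deterministically.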

Underneath all of this sits the reason to do any of it: keeping PII and PHI out of dev, staging, and QA entirely. Generated data is privacy-safe at the root because it uses no real records, so there is no sensitive data to leak and no re-identification risk to manage. De-identification can also make production data safe to test with, but it has to be done carefully: transform too little and some re-identification risk remains. Generation avoids that failure mode by never touching a real record.
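
One simple way to see what "transform too little" means is to count how many records share each combination of the fields left untouched. The sketch below does that for a hypothetical dataset where names are masked but ZIP code and birth year are not. The rows, the field names, and the generalization rule are assumptions for illustration, and a real risk assessment involves much more than this single count.

```python
from collections import Counter


def smallest_group(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest group of rows sharing the same values for the
    given fields. A result of 1 means at least one row is unique."""
    if not rows:
        return 0
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())


def generalize(row: dict) -> dict:
    """Coarsen the remaining fields: 3-digit ZIP prefix, birth decade."""
    return {
        **row,
        "zip": row["zip"][:3] + "**",
        "birth_year": row["birth_year"] // 10 * 10,
    }


# Names are masked, but ZIP and birth year were left as-is.
masked_rows = [
    {"name": "REDACTED", "zip": "94107", "birth_year": 1984},
    {"name": "REDACTED", "zip": "94110", "birth_year": 1986},
    {"name": "REDACTED", "zip": "94121", "birth_year": 1981},
    {"name": "REDACTED", "zip": "10001", "birth_year": 1975},
    {"name": "REDACTED", "zip": "10013", "birth_year": 1979},
]

before = smallest_group(masked_rows, ["zip", "birth_year"])
after = smallest_group([generalize(r) for r in masked_rows], ["zip", "birth_year"])
```

Here `before` is 1, meaning every masked row is still unique on ZIP and birth year, and `after` is 2 once those fields are generalized.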