Synthetic data and privacy: re-identification risk

Synthetic data is not automatically private. Because generative models learn from real records, a synthetic dataset can still expose information about real people — through memorized outliers, membership inference, or attribute linkage — unless it is generated and evaluated with privacy in mind. What makes synthetic data private is how it is produced (from scratch versus modeled on real data), whether sensitive source data is de-identified first, and whether re-identification risk is measured before release — not the label alone.

Is synthetic data automatically private?

"Synthetic" is not a synonym for "private" or "anonymous," though the two are constantly treated as the same thing. The assumption behind the confusion is intuitive: if a dataset copies no records verbatim from production — every value was produced by a model or an algorithm — then there is no real person left in the data to expose. That reasoning is incomplete. The risk in a synthetic data set is rarely that it reproduces a row word for word; it's that a model trained on real records can carry information about those records into what it generates, in ways a quick look at the output won't reveal. Re-identification — determining that a supposedly anonymous record actually corresponds to a specific real individual — is the risk that survives, and it tends to survive quietly.

What decides whether a synthetic dataset protects the people behind its source is the process, not the label. Three things do most of the work: how the data was generated, whether the sensitive source was de-identified before a model ever saw it, and whether anyone measured re-identification risk before release. A dataset built from a schema and a prompt with no real values in the loop sits at one end of that range; a dataset modeled closely on a small set of real records sits at the other, and the two aren't equally safe just because both are called synthetic. Understanding what synthetic data is gives you the starting point; understanding how it can still leak tells you whether a given dataset belongs in your pipeline.

How synthetic data can still expose real people

Information about real individuals survives into synthetic data through a few well-studied mechanisms that share one root cause: a model that learns a distribution well enough to reproduce it can also reproduce the specific people inside it. Three account for most of the real-world risk.

Memorization (overfitting). A generator that has overfit its training data can emit near-copies of the records it learned from — reproducing an actual individual closely enough that the "synthetic" record is effectively a lightly disguised real one.
Membership inference. In a membership inference attack, an adversary determines whether a specific person's record was part of the source data — not by reading the record, but by probing how the model behaves. Confirming that someone was in, say, a dataset of addiction-treatment patients can itself be the sensitive disclosure.
Attribute and linkage inference. An adversary recovers a sensitive attribute that wasn't meant to be inferable, or matches a synthetic record back to a real person by joining it against an auxiliary dataset. A record that looks anonymous in isolation can become identifying once lined up against outside information.
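
To make the linkage mechanism concrete, here is a minimal sketch of how an adversary might join released records against outside data. Every record, field name, and the choice of quasi-identifiers (`zip`, `birth_year`, `sex`) is hypothetical and chosen only for illustration; real attacks typically use fuzzier matching than the exact join shown here.

```python
# Hypothetical released records: no names, but quasi-identifiers remain.
released = [
    {"zip": "02139", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1991, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94110", "birth_year": 1975, "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical auxiliary data an adversary might already hold.
auxiliary = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1991, "sex": "M"},
    {"name": "Carol Example", "zip": "94110", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "zip": "94110", "birth_year": 1975, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(released, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for released records that match
    exactly one person in the auxiliary data on the given keys."""
    index = {}
    for person in auxiliary:
        combo = tuple(person[k] for k in keys)
        index.setdefault(combo, []).append(person["name"])

    linked = []
    for record in released:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match singles out one individual
            linked.append((names[0], record["diagnosis"]))
    return linked


matches = link_records(released, auxiliary)
```

In this toy data the first two records link to a single named person each, while the third stays ambiguous because two people share its quasi-identifiers. That ambiguity is what generalizing or suppressing rare combinations is meant to create.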

The common thread is the outlier. Rare or unique records — the patient with an unusual combination of diagnoses, the single very-high-value account — are the hardest to synthesize convincingly without effectively copying them, and they are exactly the records whose exposure causes the most harm. Regulators have reached the same conclusion: reviewing the research, the Office of the Privacy Commissioner of Canada described synthetic data as not a silver bullet, noting that a dataset faithful enough to stay useful can, at the same time, let an adversary extract sensitive information about the individuals in the original. Fidelity and privacy pull against each other, and the tension concentrates in exactly the records that are hardest to fake.
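
One way to find these records before they reach a generator is to count how many source rows share each combination of quasi-identifying fields and flag the combinations that fall below a threshold. The sketch below assumes hypothetical field names and a hypothetical threshold `k`; choosing the fields and the threshold is a judgment call that depends on the dataset.

```python
from collections import Counter


def flag_rare_records(records, keys, k):
    """Return records whose combination of `keys` occurs fewer than k times."""

    def combo(record):
        return tuple(record[key] for key in keys)

    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] < k]


# Hypothetical source rows: three ordinary accounts and one very-high-value one.
accounts = [
    {"region": "EU", "tier": "standard", "balance_band": "0-10k"},
    {"region": "EU", "tier": "standard", "balance_band": "0-10k"},
    {"region": "EU", "tier": "standard", "balance_band": "0-10k"},
    {"region": "EU", "tier": "private", "balance_band": "10M+"},
]

rare = flag_rare_records(accounts, keys=("region", "tier", "balance_band"), k=2)
```

Here `rare` contains only the single high-value account, which is the kind of record to suppress, aggregate, or generalize rather than hand to a model as-is.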

Why the generation method sets the privacy ceiling

The single biggest determinant of how private a synthetic dataset can be is where its values come from. Data generated from scratch — from a schema, a set of rules, or a natural-language prompt, with no real records in the loop — carries essentially no re-identification risk, because there is no real individual behind any value to re-identify. Data modeled on real records inherits risk in proportion to how faithfully it reproduces them: the more closely the model hews to its source, the more of that source's individual detail can survive into the output. The method sets a ceiling on privacy before any downstream safeguard is applied.

Tonic Fabricate illustrates both ends of that range in one tool. You can generate a fully relational dataset from scratch: describe the schema, rules, or outcome you need in plain language, and Fabricate produces net-new records with referential integrity across tables, none of which trace to a real person. Or, when you need data that behaves like a system you already run, Fabricate can seed from an existing source and model its statistical patterns. What matters for privacy is that even when seeding, it generates net-new records that reproduce the shape of the data rather than transforming or copying the real rows, so synthetic data generated from an existing database is modeled on the source, not lifted from it.

Modeling on real data is not risk-free, though, and it's worth being precise about that. Seeding from production narrows the gap between synthetic and real, and a model that fits its source too tightly can reproduce that source's outliers. The residual risk doesn't disappear because the output is labeled synthetic; it has to be measured before release. From-scratch generation is the stronger privacy posture wherever the use case allows it, because it removes the question rather than managing it.

The Tonic Advantage: When you generate from scratch with Tonic Fabricate, no real values enter the pipeline — the records are built from your specification, not derived from anyone's data. There is no source individual to re-identify, because the data never described a real person. For greenfield features, sensitive domains you can't touch, or any case that doesn't need real-record fidelity, from-scratch generation turns a privacy risk you'd otherwise have to measure and mitigate into one that simply isn't there.

Safeguards that make synthetic data privacy-preserving

When you do model on real data, privacy becomes something you engineer in rather than assume. Four safeguards, running roughly from upstream to downstream, do most of the work.

De-identify the sensitive source before or during generation. If a model never sees raw identifiers, it can't learn or reproduce them. For unstructured sources — clinical notes, transcripts, tickets — that means detecting sensitive values with named-entity recognition (NER), the task of locating and classifying spans of text such as names, dates, and account numbers, then redacting or replacing them before the text is used to generate anything. Tonic Textual is built for this step: it uses proprietary NER models to detect PII across free text and documents, then redacts those values or swaps in realistic synthetic substitutes, so the surrounding context stays usable while the real identifiers are gone.
Apply differential privacy during model training. Differential privacy is a mathematical guarantee that bounds how much any single record can influence a model's output — informally, the result is nearly the same whether or not any one individual was in the training data. Applied to a generator, it limits how much any one person's data can shape what comes out, constraining both memorization and membership inference. The trade-off is fidelity: tighter privacy budgets blur fine detail, so the setting has to be tuned against the utility the dataset needs.
Prefer from-scratch generation where the use case allows. Data produced with no real records in the loop removes re-identification risk at the source rather than mitigating it afterward — the strongest posture for any use case that doesn't strictly need real-record fidelity.
Don't faithfully reproduce outliers. Because rare, unique records are both the hardest to synthesize safely and the most damaging to expose, handle them deliberately — suppress, aggregate, or generalize them rather than letting a model reproduce them one for one.
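
The differential privacy trade-off is easiest to see on a single query rather than a full generator. The sketch below is not differentially private model training (which usually means adding calibrated noise during optimization); it is the simpler Laplace mechanism applied to a count, shown here as an assumption-laden illustration of the same guarantee. The function names, the toy records, and the chosen `epsilon` are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d.
    exponentials, each with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)  # seeded only so the sketch is repeatable
patients = [{"age": a} for a in (23, 35, 41, 67, 72, 80)]
noisy = dp_count(patients, lambda p: p["age"] >= 65, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means a larger noise scale: stronger protection for any one person, and a blurrier answer. The same tension applies when the budget is spent on training a generator instead of answering a query.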

The detection step in the first safeguard is where much of real-world privacy is won or lost, because anything the detector misses passes straight into the "de-identified" output. Tonic.ai's PrivacyBench benchmark measures exactly this on generated email and Slack data: Textual detected 95% of the sensitive spans, where a general-purpose LLM doing the same job caught closer to 89%, and Textual did it faster and at over 60% lower cost. Any undetected span is a real identifier left behind, so that recall gap is a privacy gap: the more a detector catches, the less leakage survives into whatever you generate from the result.
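
Recall here means the share of truly sensitive spans that the detector found. A minimal sketch of that calculation follows, assuming spans are represented as hypothetical `(start, end, label)` tuples and that a detection only counts when it matches a labeled span exactly. The source does not describe how PrivacyBench matches spans, so the exact-match rule is an assumption of this example.

```python
def span_recall(gold_spans, detected_spans):
    """Fraction of gold (truly sensitive) spans that were detected.

    Spans are (start, end, label) tuples; only exact matches count.
    """
    gold = set(gold_spans)
    if not gold:
        return 1.0  # nothing sensitive to miss
    return len(gold & set(detected_spans)) / len(gold)


# Hypothetical annotations for one message.
gold = [(0, 12, "NAME"), (30, 40, "PHONE"), (55, 75, "EMAIL"), (90, 100, "DATE")]
detected = [(0, 12, "NAME"), (30, 40, "PHONE"), (55, 75, "EMAIL"), (110, 115, "NAME")]

recall = span_recall(gold, detected)
missed = sorted(set(gold) - set(detected))
```

In this toy case the detector finds three of four sensitive spans, and `missed` lists the date that would pass through into the output. The extra detection at position 110 hurts precision but not recall, which is why recall is the number that tracks leakage.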

The Tonic Advantage: For scarce, sensitive unstructured data, the strongest pattern is to de-identify first, then generate. Use Tonic Textual to detect and synthesize the sensitive entities in the source text, then point Fabricate at that de-identified set as the model for generating more. Because the identifiers are removed before generation begins, they never reach the synthetic output — you expand a small, safe foundation into a larger set without carrying the original sensitive content forward.

Measuring re-identification risk before you release

Privacy in synthetic data is measured, not asserted — which means the claim "this dataset is safe" should rest on a check you ran, not on a generator having produced it. The practical checks run after generation and before release, and they look at the output that will actually leave the building rather than the process that made it. A few metric types do most of the work:

Distance-to-closest-record (nearest-neighbor distance). For each synthetic record, how close is its nearest real record? Synthetic points that sit almost on top of a real individual are the memorized copies you most need to catch.
Membership-inference risk scores. Empirical tests of whether an attacker could determine that a given real record was in the source data, quantifying the mechanism described earlier.
Singling-out, linkability, and inference tests. Structured probes of whether an individual can be isolated in the dataset, whether records can be linked across datasets, and whether sensitive attributes can be inferred.
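
The first of these checks can be sketched in a few lines. The example below assumes purely numeric features that have already been scaled to a comparable range, uses Euclidean distance, and applies a hypothetical cutoff of 0.05. In practice the cutoff is usually set by comparison with a baseline (for example, how close held-out real records sit to the training records) rather than picked by hand.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Hypothetical rows, already scaled to the range [0, 1].
real = [(0.20, 0.10), (0.80, 0.95), (0.35, 0.15)]
synthetic = [(0.21, 0.10), (0.50, 0.50), (0.80, 0.95)]

dcr = distance_to_closest_record(synthetic, real)

THRESHOLD = 0.05  # hypothetical cutoff for "suspiciously close"
flagged = [i for i, d in enumerate(dcr) if d < THRESHOLD]
```

Here the first and third synthetic rows sit on or almost on a real record and are flagged for review, while the second sits well away from all of them. This brute-force comparison is quadratic in the number of rows, so larger datasets would need a nearest-neighbor index.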

Running these isn't a separate privacy project bolted on at the end. Privacy is one axis of overall synthetic data quality, sitting alongside fidelity (how well the data reproduces real patterns) and utility (how well a model trained on it performs) — and the three trade against one another, so they belong in one review. Treating a privacy check as part of evaluating overall synthetic data quality is what turns "it's synthetic, so it's fine" into a defensible release gate.

What synthetic data does and doesn't do for compliance

Well-generated, properly evaluated synthetic data can genuinely reduce your regulatory exposure. Under regimes like GDPR and HIPAA, obligations attach to personal data, so a dataset that provably describes no real individuals can fall largely outside that scope — which is much of why teams turn to generation when privacy constraints block them from using real data directly. Removing the personal-data footprint is a real, defensible benefit when the generation and evaluation behind it hold up.

What synthetic data does not do is make the word "synthetic" mean "anonymous" in the legal sense on its own. Whether a dataset is treated as anonymized, de-identified, or still personal depends on residual re-identification risk, how the process is documented, and the context in which the data is used — not on the label. The distinction between anonymized and merely de-identified data is itself contested in law and evolving through regulation, and it can turn on exactly the residual risk that the metrics above are meant to quantify. The sound approach is to generate carefully, measure re-identification risk explicitly, document what you did, and consult authoritative regulatory guidance for your jurisdiction and sector rather than relying on the synthetic label to settle the question. This page is educational, not legal advice; treat a genuine anonymization determination as a decision to make with qualified counsel, informed by the evidence your evaluation produces.
