Synthetic data quality is measured across three dimensions: fidelity (how closely the synthetic data reproduces the statistical patterns and relationships of the real data), utility (how well it performs in the downstream task it was generated for), and privacy (how well it resists attempts to recover the original records). No single score captures all three, and the three trade off against one another. The practical way to evaluate a synthetic dataset is to measure each dimension against the requirements of your specific use case, not to chase one universal benchmark.

The three dimensions of synthetic data quality: fidelity, utility, and privacy

Quality is not one property of a synthetic dataset — it is three, and each answers a different question. Measuring it assumes you already have a working definition of what synthetic data is and a generated dataset in hand; the open question is whether that dataset is good enough to rely on. A dataset that excels on one dimension can fail badly on another, so "good enough" only means something once you name the dimension you have in mind.

The three dimensions are fidelity, utility, and privacy:

  • Fidelity — does the synthetic data look like the real data? It captures statistical resemblance: whether the distributions, correlations, and structure of the generated data match the source it was modeled on.
  • Utility — does the synthetic data work like the real data? It captures whether a model or analysis built on the synthetic data performs as well as one built on the real thing.
  • Privacy — does the synthetic data protect the real data? It captures how well the dataset resists attempts to recover or re-identify the original records it was derived from.

There is no single "quality score" because these three pull against one another. Pushing fidelity as high as it will go (reproducing every value and edge case in the source) tends to drag privacy down, because a synthetic record that mirrors a real one too closely is, in effect, a near-copy of it. Loosening fidelity to protect privacy can cost utility, if the patterns a model needed lived in the detail you smoothed away. Any synthetic dataset tuned hard for one dimension is making a quiet concession on another, whether or not the team measured it. In practice, teams reach for fidelity first, because it is the most intuitive to picture and the easiest to plot, and treat utility and privacy as afterthoughts. That ordering is backwards for most use cases: the dimension that should lead is the one the task can least afford to lose.

How to measure fidelity

Fidelity is statistical resemblance to the source data — the degree to which the synthetic dataset reproduces the patterns, relationships, and shape of the real data it was modeled on. You measure it by comparing the synthetic data against a real reference set, and the right comparison depends on the data type, moving from single-column checks up to whole-structure and relational ones.

Start with univariate distribution checks, which compare one column at a time. The Kolmogorov–Smirnov test measures the largest gap between the cumulative distributions of a real and a synthetic column, which suits continuous values like age or transaction amount. The Chi-square test does the analogous job for categorical columns, comparing observed against expected category frequencies. Each one tells you whether a single field's distribution has drifted.
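As a rough illustration of the two univariate checks, here is a minimal sketch in plain Python. It computes only the test statistics, not p-values; in practice a library such as SciPy (`scipy.stats.ks_2samp`, `scipy.stats.chisquare`) would usually do this work. The column names and values are hypothetical, and treating the real column's proportions as the chi-square expectation is an assumption of this sketch.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at those values.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def chi_square_statistic(real, synthetic):
    """Pearson chi-square statistic with real proportions as the expectation.

    Returns (statistic, categories seen only in the synthetic column).
    """
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    statistic = 0.0
    for category, real_count in real_counts.items():
        expected = real_count / len(real) * len(synthetic)
        observed = synth_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    # Categories the real data never contained have no expected count,
    # so they are reported separately instead of entering the sum.
    unseen = sorted(set(synth_counts) - set(real_counts))
    return statistic, unseen


# Hypothetical columns, for illustration only.
real_age = [23, 35, 41, 52, 60, 29, 38, 47]
synth_age = [25, 33, 40, 55, 61, 30, 36, 45]
real_plan = ["basic", "basic", "pro", "pro"]
synth_plan = ["basic", "basic", "basic", "pro"]

print("KS statistic:", ks_statistic(real_age, synth_age))
print("Chi-square:", chi_square_statistic(real_plan, synth_plan))
```

A KS statistic near 0 means the two cumulative distributions almost coincide; a value near 1 means they barely overlap. What counts as acceptable depends on the use case.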

Univariate tests flag whether a column has drifted, but they say little about how far apart two distributions are and nothing about how columns relate, which is where divergence and distance measures come in. KL divergence quantifies how much one distribution diverges from another; it is asymmetric, and it becomes infinite when the second distribution assigns zero probability to a value the first contains. Total variation distance captures the largest difference in probability the two distributions assign to any event (any set of outcomes). Wasserstein distance measures the "work" needed to reshape one distribution into the other, so it is sensitive to how far off a mismatch is, not just that one exists. Beyond single columns, correlation and joint-distribution checks confirm that relationships between fields (income and credit limit, say) survive generation, since a model often learns from those relationships more than from any column in isolation.
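The measures in this paragraph can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the data is hypothetical, the `eps` floor in the KL computation is an arbitrary smoothing choice, and the Wasserstein function handles only one-dimensional samples of equal size. SciPy's `scipy.stats.wasserstein_distance` and `scipy.stats.entropy` cover the general cases.

```python
import math
from collections import Counter


def category_probs(values, categories):
    """Relative frequency of each category, in a fixed order."""
    counts = Counter(values)
    return [counts.get(c, 0) / len(values) for c in categories]


def total_variation(p, q):
    """Half the L1 distance, equal to the largest probability gap over any event."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats. eps is an illustrative floor that keeps the result
    finite when q assigns zero probability to something p contains."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def wasserstein_1d(real, synthetic):
    """1-D Wasserstein-1 distance for two equal-size samples."""
    if len(real) != len(synthetic):
        raise ValueError("this sketch only handles equal-size samples")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, for illustration only.
categories = ["basic", "pro", "enterprise"]
p = category_probs(["basic", "basic", "pro", "enterprise"], categories)
q = category_probs(["basic", "pro", "pro", "pro"], categories)
print("TV distance:", total_variation(p, q))
print("KL(real || synthetic):", kl_divergence(p, q))

real_income, real_limit = [30, 45, 60, 90], [2, 4, 5, 9]
synth_income, synth_limit = [32, 44, 63, 88], [3, 3, 6, 8]
print("Wasserstein (income):", wasserstein_1d(real_income, synth_income))
print("Correlation gap:", abs(pearson(real_income, real_limit)
                               - pearson(synth_income, synth_limit)))
```

Note how the KL result is dominated by the "enterprise" category, which the hypothetical synthetic column never produces; that sensitivity to missing categories is one reason to report more than one measure.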

For relational or time-series data, fidelity has a structural dimension too: whether foreign-key relationships, cardinalities, and event ordering hold across tables and over time. A dataset can match every column distribution and still be useless if its referential integrity is broken. This matters most when you generate synthetic data from an existing database, where fidelity is judged against the source schema and its relationships, not only its values.
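A minimal sketch of these structural checks, assuming tables are loaded as lists of dictionaries. The `customers` and `orders` tables, their column names, and the three helper functions are all hypothetical and exist only for this example.

```python
from collections import Counter


def orphaned_rows(child_rows, fk, parent_rows, pk):
    """Child rows whose foreign key points at no existing parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def children_per_parent(child_rows, fk, parent_rows, pk):
    """Cardinality histogram: number of children -> number of parents."""
    per_parent = Counter(row[fk] for row in child_rows)
    return Counter(per_parent.get(row[pk], 0) for row in parent_rows)


def out_of_order(rows, earlier, later):
    """Rows where an event that should come first is stamped after the later one."""
    return [row for row in rows if row[earlier] > row[later]]


# Hypothetical synthetic tables, for illustration only.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1, "created_day": 1, "shipped_day": 3},
    {"order_id": 11, "customer_id": 1, "created_day": 2, "shipped_day": 2},
    {"order_id": 12, "customer_id": 2, "created_day": 5, "shipped_day": 4},
    {"order_id": 13, "customer_id": 9, "created_day": 6, "shipped_day": 7},
]

print("orphans:", orphaned_rows(orders, "customer_id", customers, "id"))
print("cardinality:", dict(children_per_parent(orders, "customer_id", customers, "id")))
print("bad ordering:", out_of_order(orders, "created_day", "shipped_day"))
```

Running the same cardinality histogram on the real database and comparing the two gives a quick read on whether one-to-many relationships kept their shape.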

A complementary, model-based approach is detection-based evaluation: train a classifier to tell real records from synthetic ones. If a well-tuned classifier can do no better than a coin flip, the two are statistically hard to separate — strong evidence of high fidelity. A classifier that easily picks out the synthetic records is pointing you straight to the features that gave them away. These checks usually run after generation, as a separate validation pass — but fidelity checking does not have to wait until the data exists. It can be built into the generation step itself.
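To make the detection idea concrete, here is a small sketch that uses a 1-nearest-neighbour classifier written in plain Python. The choice of classifier is an assumption made to keep the example dependency-free; a gradient-boosted or linear model would be a more typical choice, and only those expose which features gave the synthetic records away. The data is randomly generated for illustration.

```python
import random


def detection_accuracy(real, synthetic, seed=0):
    """Held-out accuracy of a 1-nearest-neighbour real-vs-synthetic classifier.

    Rows are tuples of numeric features, assumed to be on comparable scales.
    """
    labelled = [(row, 0) for row in real] + [(row, 1) for row in synthetic]
    random.Random(seed).shuffle(labelled)
    split = len(labelled) // 2
    train, test = labelled[:split], labelled[split:]

    def predict(row):
        nearest = min(
            train,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], row)),
        )
        return nearest[1]

    correct = sum(predict(row) == label for row, label in test)
    return correct / len(test)


# Hypothetical data: two samples from the same distribution, and one shifted.
rng = random.Random(42)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
good_synth = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
bad_synth = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]

print("same distribution:", detection_accuracy(real, good_synth))
print("shifted synthetic:", detection_accuracy(real, bad_synth))
```

An accuracy close to 0.5 on balanced data is the coin-flip outcome described above; an accuracy close to 1.0 means the classifier separates the two sets easily.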

The Tonic Advantage: build fidelity checking into generation. The synthetic data generation platform Tonic Fabricate pairs a Data Agent, which generates the data from your prompt or schema, with a Validation Agent that reviews the output and prompts refinements until it matches the request — the two running in a loop. Because the review happens during generation, quality holds even when the initial prompt is imprecise, and the dataset you hand to your own distribution and detection tests has already been corrected against the spec that produced it.

How to measure utility

Utility asks a different question than fidelity: not whether the synthetic data looks like the real data, but whether it does the same job — most often, whether you can train a model on synthetic data and have it hold up on real-world inputs. A dataset can pass every distribution check and still train a worse model, and a dataset that looks statistically rough can still be perfectly serviceable for the task at hand. Fidelity and utility are correlated, but they are not the same measurement, and a high score on one does not guarantee a high score on the other.

The canonical utility test is train-synthetic-test-real (TSTR). You train a model on the synthetic data, evaluate it on a held-out set of real data, and compare its score against a model trained on real data and tested on the same real holdout. The metric is whatever the task uses — accuracy, precision, recall, F1, AUC. If the synthetic-trained model lands close to the real-trained one, the synthetic data carries the signal the task needs; a wide gap tells you it does not, however good its distributions looked. TSTR is the most trustworthy utility check precisely because it measures the thing you actually care about — downstream performance — rather than a proxy for it.
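The TSTR comparison can be sketched as follows. The nearest-centroid classifier and accuracy metric are stand-ins chosen to keep the example short and dependency-free; in a real evaluation you would substitute the task's own model and metric. All data here is hypothetical.

```python
def fit_centroids(rows, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Predict the class whose centroid is closest to the row."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], row)),
    )


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def tstr_gap(synth_X, synth_y, real_X, real_y, holdout_X, holdout_y):
    """Return (train-real score, train-synthetic score, gap) on the same real holdout."""
    real_score = accuracy(fit_centroids(real_X, real_y), holdout_X, holdout_y)
    synth_score = accuracy(fit_centroids(synth_X, synth_y), holdout_X, holdout_y)
    return real_score, synth_score, real_score - synth_score


# Hypothetical data, for illustration only.
real_X = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]
real_y = [0, 0, 1, 1]
synth_X = [(0.1, 0.0), (0.0, 0.2), (1.1, 1.0), (0.8, 0.9)]
synth_y = [0, 0, 1, 1]
holdout_X = [(0.1, 0.1), (0.95, 1.0), (0.3, 0.2), (0.7, 0.8)]
holdout_y = [0, 1, 0, 1]

print(tstr_gap(synth_X, synth_y, real_X, real_y, holdout_X, holdout_y))
```

The key design point is that both models are scored on the same real holdout, which neither the generator nor either model saw during training, so the gap reflects the training data and nothing else.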

Lighter-weight checks are useful when a full training run is expensive. Analytics or query parity runs the same aggregate queries — group-bys, segment counts, summary statistics — against the real and synthetic sets and compares the answers, catching gross utility failures fast. Feature-importance agreement trains a model on each set and checks whether they rank the same features as important; when the rankings diverge, the synthetic data is teaching a model to lean on the wrong signals.
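A query-parity check might look something like the sketch below, which runs one group-by average against both datasets and reports the relative error per segment. The table layout, the `segment` and `amount` columns, and the helper names are hypothetical; the same idea applies to any aggregate you would run in SQL.

```python
from collections import defaultdict


def group_mean(rows, key, value):
    """Equivalent of SELECT key, AVG(value) ... GROUP BY key."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def parity_report(real_rows, synth_rows, key, value):
    """Relative error of each segment's mean; None if the segment is missing."""
    real_means = group_mean(real_rows, key, value)
    synth_means = group_mean(synth_rows, key, value)
    report = {}
    for group, real_mean in real_means.items():
        if group not in synth_means:
            report[group] = None
        else:
            # The tiny floor only guards against division by zero.
            denom = max(abs(real_mean), 1e-12)
            report[group] = abs(synth_means[group] - real_mean) / denom
    return report


# Hypothetical transactions, for illustration only.
real_rows = [
    {"segment": "retail", "amount": 100.0},
    {"segment": "retail", "amount": 200.0},
    {"segment": "business", "amount": 1000.0},
]
synth_rows = [
    {"segment": "retail", "amount": 120.0},
    {"segment": "retail", "amount": 210.0},
]

report = parity_report(real_rows, synth_rows, "segment", "amount")
print(report)
```

A missing segment, as with "business" here, is often a more important finding than a numeric drift, because it means a whole slice of the population was not generated at all.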

The evidence that high-utility synthetic data is achievable is concrete, and it shows up as exactly this kind of TSTR result. In a Tonic.ai benchmark, an open-source model (Qwen3.5-35B-A3B) fine-tuned only on Fabricate-generated synthetic email data improved on the real-world Enron email benchmark from 80.5% to 86% — outperforming o3 and gpt-4.1-mini, which scored about 85% — without training on a single real email. Trained on synthetic, tested on real, measured against stronger baselines: that is utility demonstrated on the dimension that counts. Deciding whether generation is worth it at all comes down to this same comparison — how synthetic data compares to real production data for a given task, not in the abstract.

How to measure privacy

Synthetic data is not automatically private. Because a generative model learns from real records, it can — especially when overfit — reproduce or closely approximate some of them, which means privacy is a property you measure, not one you assume holds because the data is labeled "synthetic." The same generation process that creates the risk can be evaluated for it directly, with a handful of established tests:

  1. Exact-match and duplicate detection. The most basic check: does any synthetic record copy a real one verbatim, or near-verbatim on its identifying fields? Exact copies are an outright leak and the first thing to screen for.
  2. Nearest-neighbor distance. For each synthetic record, measure the distance to the closest real record. If synthetic points sit systematically too close to real ones, the generator is memorizing rather than generalizing — a distribution of distances that hugs zero is the warning sign.
  3. Membership inference. Frame privacy as an attack: given a record and access to the synthetic data, can an adversary determine whether that record was in the original training set? A model that has memorized its inputs leaks membership; one that has generalized gives the adversary little more than a guess.
  4. Attribute inference. A subtler attack: given a few known fields of a real individual, can the synthetic data be used to predict their unknown sensitive fields? This tests whether the dataset leaks correlations precise enough to fill in private values.

The technique that ties these together is the holdout comparison. You set aside a holdout of real records the generator never saw, then compare two sets of scores: original-versus-holdout (real training data against other real data it did not train on) and original-versus-synthetic. If the synthetic data resembles its training records noticeably more than the holdout does, that gap is the signature of memorization — the model is reproducing what it saw rather than learning the underlying distribution. When the two score distributions look alike, the synthetic data is generalizing, which is what you want. Read this way, re-identification risk is something you measure and bound on every dataset, not a hazard you wave away because the output is synthetic.
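One way to sketch the holdout comparison is to summarize each set of nearest-neighbour distances by its median and compare the two. Using the median as the summary, and Euclidean distance as the score, are assumptions of this example; comparing the full distributions is more informative. The data is hypothetical.

```python
import math
from statistics import median


def nearest_distances(reference, candidates):
    """Distance from each candidate to its closest record in the reference set."""
    return [min(math.dist(c, r) for r in reference) for c in candidates]


def memorization_gap(train, holdout, synthetic):
    """Return (holdout median, synthetic median, gap).

    A clearly positive gap means synthetic records sit closer to the training
    data than unseen real records do, which is the signature of memorization.
    """
    holdout_median = median(nearest_distances(train, holdout))
    synth_median = median(nearest_distances(train, synthetic))
    return holdout_median, synth_median, holdout_median - synth_median


# Hypothetical records, for illustration only.
train = [(0, 0), (1, 1), (2, 2), (3, 3)]
holdout = [(0.5, 0.5), (2.5, 2.5)]
memorized = [(0, 0), (1, 1), (2, 2)]
generalizing = [(0.5, 0.5), (1.5, 1.5)]

print("memorized:", memorization_gap(train, holdout, memorized))
print("generalizing:", memorization_gap(train, holdout, generalizing))
```

In this toy case the memorized set reproduces training records exactly, so its median distance is zero, while the generalizing set sits as far from the training data as the holdout does.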

Choosing the right thresholds for your use case

There is no global pass/fail threshold for synthetic data quality, because the right thresholds are set by the use case, not by the metric. The same fidelity score that is excellent for one application is unacceptable for another, and the only way to choose well is to decide which of the three dimensions the task cannot afford to compromise, then measure all three against that priority.

Two contrasting cases make the point. A fraud-detection dataset lives or dies on its rare events: fraudulent transactions are a tiny fraction of the data, and they sit in the tails of the distribution. For that task you need high fidelity precisely where it is hardest to achieve — on the outliers — because smoothing them away destroys the signal the model exists to learn. A healthcare dataset built from patient records often inverts the priority. There, an outlier can be a single identifiable patient, so faithfully reproducing the tails becomes a privacy liability rather than a feature, and teams routinely accept lower fidelity on rare cases in exchange for stronger privacy guarantees. Same three dimensions, opposite thresholds — driven entirely by what each dataset is for.

The practical discipline follows from that. Name the non-negotiable dimension for your task, set explicit thresholds on all three, measure against them, and write down the tradeoff you accepted; a documented decision is what lets a teammate or an auditor understand why the data was judged good enough. And whatever you measure, measure it against a holdout the generator never saw: fidelity and privacy scores computed against the training data alone will flatter a model that has simply memorized it. UK regulators and researchers reach the same conclusion: the joint work on synthetic data validation by the FCA, the ICO, and the Alan Turing Institute likewise finds there is no one-size-fits-all standard and that privacy, utility, and fidelity have to be balanced against the specific use case.
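Writing the thresholds down can be as simple as a small, versioned configuration that is checked against each generated dataset. The sketch below is one possible shape for that. The metric names, the two profiles, and every number in them are hypothetical placeholders chosen to mirror the fraud and healthcare contrast above; they are not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


# Illustrative numbers only, not recommended values.
# Fraud detection: tight fidelity on the tails, tight utility gap.
FRAUD_PROFILE = [
    Threshold("tail_ks_statistic", 0.05, higher_is_better=False),
    Threshold("tstr_recall_gap", 0.02, higher_is_better=False),
    Threshold("median_nn_distance_ratio", 0.8, higher_is_better=True),
]
# Healthcare: looser tail fidelity, stricter privacy distance.
HEALTHCARE_PROFILE = [
    Threshold("tail_ks_statistic", 0.20, higher_is_better=False),
    Threshold("tstr_recall_gap", 0.05, higher_is_better=False),
    Threshold("median_nn_distance_ratio", 1.0, higher_is_better=True),
]


def evaluate(measured, profile):
    """Return {metric: (value, limit, passed)} for every threshold in the profile."""
    report = {}
    for t in profile:
        value = measured[t.metric]
        passed = value >= t.limit if t.higher_is_better else value <= t.limit
        report[t.metric] = (value, t.limit, passed)
    return report


# Hypothetical measurements for one synthetic dataset.
measured = {
    "tail_ks_statistic": 0.12,
    "tstr_recall_gap": 0.03,
    "median_nn_distance_ratio": 1.05,
}

for name, profile in [("fraud", FRAUD_PROFILE), ("healthcare", HEALTHCARE_PROFILE)]:
    print(name, evaluate(measured, profile))
```

The same measurements fail two of the fraud thresholds and pass all of the healthcare ones, which is the point of the section: the dataset did not change, only the requirements did.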

| Dimension | What it measures | Example metrics | When it matters most |
| --- | --- | --- | --- |
| Fidelity | Statistical resemblance to the source data | Kolmogorov–Smirnov, Chi-square, KL divergence, Wasserstein distance, correlation preservation, detection-based AUC | When the task depends on faithfully reproduced patterns and rare cases, e.g., fraud detection |
| Utility | Whether it performs in the downstream task | Train-synthetic-test-real (TSTR) score gap, analytics/query parity, feature-importance agreement | When the data is generated to train a model or run an analysis |
| Privacy | Resistance to recovering the original records | Exact-match/duplicate rate, nearest-neighbor distance, membership-inference and attribute-inference success | When the source data is sensitive or regulated, e.g., healthcare, finance |