Types of synthetic data: a practical taxonomy

Synthetic data is most often classified by the structure of the real data it imitates: tabular (rows-and-columns and relational records), time-series (sequential, time-stamped data), and unstructured (free text, documents, images, and audio). A second axis describes how it's generated — built from scratch from a specification, or modeled on existing data you already hold. Both the method you use to generate each type and the way you judge its quality change with the type, which is why the category you're working in is the first decision to make.

The three principal types of synthetic data

Synthetic data is sorted first by the structure of the real data it imitates. That structure drives almost everything downstream — the technique that generates the data well, the failure modes to watch for, and the checks that tell you whether the result is good enough to use. Three categories cover the large majority of real workloads, and most teams can place a problem into one of them on sight.

Tabular: rows and columns, often spread across related tables — the shape of a relational database or a spreadsheet.
Time-series: sequential, time-stamped records where the order carries meaning — sensor readings, financial ticks, server logs.
Unstructured: data with no fixed schema — free text, documents, images, and audio, with text the dominant form in software and AI work.

A second axis cuts across all three. It describes not the shape of the data but how it's generated. Some synthetic data is built from scratch: it comes from a specification rather than any real source, which is the path you take when you have no usable data or when the real data is too sensitive to work with directly. Other synthetic data is modeled on existing data: you start from records you already hold and generate new data that mirrors their statistical shape, which requires a representative sample to learn from. Synthetic data is, at its core, artificially generated data that stands in for real records, and the two axes are independent, so either approach can apply to any type: tabular, time-series, and unstructured data can each be built from scratch or modeled on real data. A churn model short on examples might generate new cases modeled on the few it has; a brand-new product with no history at all has to start from scratch.
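The two generation paths can be contrasted in a minimal sketch. Everything here is an illustrative assumption: the function names are hypothetical, the "real" sample is made up, and a normal distribution stands in for whatever model a real generator would fit.

```python
import random
import statistics

def generate_from_spec(n, mean, stdev, rng):
    """Build values from a written specification; no real data is involved."""
    return [rng.gauss(mean, stdev) for _ in range(n)]

def generate_modeled_on(sample, n, rng):
    """Fit simple parameters to an existing sample, then draw new values from the fit."""
    mean = statistics.mean(sample)
    stdev = statistics.stdev(sample)
    return [rng.gauss(mean, stdev) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# From scratch: a hypothetical spec says monthly spend averages 80 with spread 15.
from_scratch = generate_from_spec(1000, mean=80.0, stdev=15.0, rng=rng)

# Modeled on existing data: a small, made-up sample supplies the parameters.
real_sample = [62.0, 75.5, 81.0, 90.2, 70.3, 88.8, 95.1, 67.4]
modeled = generate_modeled_on(real_sample, 1000, rng)
```

The second function is only as good as its sample: with eight records, the fitted mean and spread are rough estimates, which is why the modeled path needs representative data to learn from.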

Tabular synthetic data

Tabular synthetic data is data shaped as rows and columns — the form most enterprise data takes, and the most common kind of synthetic data teams generate. In its simplest form it's a single flat table, but most real tabular data is relational: spread across multiple tables connected by keys, where a row in one table points to a row in another. Generating it well means satisfying two requirements at once, and they pull in different directions.

The first is statistical fidelity within each column. The generated values have to follow the same distribution as the real data — the same spread of ages, the same frequency of product categories, the same correlation between income and credit limit — so that a model or a test exercises the data the way production would. The second is referential integrity: the rule that a reference from one table to another must always resolve to a record that actually exists. A generated order has to point to a customer who is present in the customer table; an order line has to reference a real order. Get the per-column statistics right but break the links between tables, and the dataset falls apart the moment anything tries to join across it.
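As a rough illustration of the second requirement, the sketch below generates a child table whose foreign keys are drawn only from keys that exist in the parent, then checks for orphans. The table and column names are hypothetical, and this is not a description of how any particular product does it.

```python
import random

rng = random.Random(42)

# Parent table: customers with a primary key.
customers = [{"customer_id": i, "segment": rng.choice(["smb", "enterprise"])}
             for i in range(1, 6)]
customer_ids = [c["customer_id"] for c in customers]

# Child table: each order draws its foreign key from keys that actually exist.
orders = [{"order_id": 100 + i, "customer_id": rng.choice(customer_ids)}
          for i in range(10)]

def orphaned_rows(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key resolves to no parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]

intact = orphaned_rows(orders, "customer_id", customers, "customer_id")

# A plausible-looking key that no customer holds breaks the join.
broken_orders = orders + [{"order_id": 999, "customer_id": 6}]
broken = orphaned_rows(broken_orders, "customer_id", customers, "customer_id")
```

Here `intact` is empty and `broken` holds the single order that points at a customer who is not in the parent table.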

Tonic Fabricate is built around this problem, and it generates the data agentically rather than through manual configuration. It produces relationally intact structured data (multiple related tables, with keys that resolve and referential integrity maintained throughout) either from scratch or modeled on an existing schema. From scratch, you describe the tables, columns, and relationships you need in plain language; a Data Agent builds the data to match, and a Validation Agent reviews and refines the result, so quality holds even when the prompt is imprecise. To model an existing source instead, you connect Fabricate to a live database (PostgreSQL, MySQL, Oracle, SQL Server, Snowflake, or BigQuery). It learns the schemas, distributions, and cross-table relationships already there and uses them as a seed, producing new records that behave like the originals without copying them. Either way the output is a working relational dataset rather than a pile of disconnected tables.

Time-series synthetic data

Time-series synthetic data is a sequence of time-stamped records where the order of the points is part of the information. Sensor telemetry from industrial equipment, financial price ticks, application server logs, and user event timelines are all time-series: each reading means something only in relation to the ones before and after it. The defining property is temporal dependency — the value at any moment is shaped by recent history and by where it sits in a larger cycle.

That dependency shows up in a few specific ways the synthetic data has to reproduce. There's trend, the long-run direction a series drifts; seasonality, the regular cycles that repeat over a day, a week, or a year; and autocorrelation, the statistical relationship between a value and its own recent past — a server that's busy this minute is likely to be busy the next. A temperature sensor in a cold-chain warehouse, for instance, should drift slowly and cycle with the building's HVAC, not jump at random from one reading to the next; synthetic data that ignores that pattern gives itself away immediately.
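The three components can be made concrete with a small sketch that sums a linear trend, a sine-wave daily cycle, and AR(1) noise. The parameter values and the 4.0-degree baseline are assumptions chosen to resemble the cold-chain example, not measurements from a real sensor.

```python
import math
import random

def synth_series(n, rng, slope=0.01, period=24, amplitude=2.0, phi=0.8, noise_sd=0.3):
    """Return n hourly readings: baseline + linear trend + sine cycle + AR(1) noise."""
    values, ar = [], 0.0
    for t in range(n):
        trend = slope * t                                        # long-run drift
        season = amplitude * math.sin(2 * math.pi * t / period)  # repeating daily cycle
        ar = phi * ar + rng.gauss(0.0, noise_sd)                 # noise carries over from the last step
        values.append(4.0 + trend + season + ar)
    return values

# Two weeks of hourly readings, seeded for reproducibility.
series = synth_series(24 * 14, random.Random(7))
```

Because the noise term carries over from the previous step, consecutive readings stay close together instead of jumping independently.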

This is what makes time-series harder to generate than plain tabular data. With tabular data you can often treat rows as independent draws from a distribution, but the moment you do that to a time-series — sampling each point on its own, or shuffling the order — you destroy the very signal that makes the data useful. A forecasting model trained on shuffled data learns nothing about what comes next; an anomaly detector trained on points stripped of their sequence has no notion of what normal behavior over time looks like. Generating time-series well means producing whole sequences whose temporal structure holds together, which is why it leans on methods built to model order rather than treat each record in isolation.
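The effect of shuffling can be checked directly. This sketch builds an AR(1) series (an assumed toy process, chosen for simplicity), shuffles a copy, and compares lag-1 autocorrelation before and after.

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation: how strongly each value tracks the one before it."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

rng = random.Random(1)

# AR(1) series: each point is 0.9 times the previous one plus fresh noise.
series, prev = [], 0.0
for _ in range(2000):
    prev = 0.9 * prev + rng.gauss(0.0, 1.0)
    series.append(prev)

shuffled = series[:]
rng.shuffle(shuffled)  # same values, same histogram, order destroyed

ordered_r = lag1_autocorr(series)     # close to 0.9
shuffled_r = lag1_autocorr(shuffled)  # close to 0
```

The two lists contain exactly the same values, so any per-column distribution check would pass on both. Only a check that looks at order can tell them apart.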

Unstructured synthetic data

Unstructured synthetic data is generated content that has no fixed schema — free text, documents, images, or audio. Text is the form that dominates software and AI workflows: emails, support tickets, clinical notes, chat transcripts, and contracts are where much of an organization's most useful and most sensitive information lives. Because the data has no columns to constrain it, the fidelity bar is different from the other two types. What matters is semantic coherence — the text has to read naturally and carry the right meaning, intent, and signal, not just match a surface statistic. A synthetic support ticket that's grammatically clean but describes an impossible product problem teaches a model the wrong thing.

There are two distinct paths to useful unstructured synthetic data, and they fall on opposite sides of the distinction between fully and partially synthetic data, which turns on how much real data survives in the result. The first generates net-new content from a specification: fully synthetic text with no real source behind it. Tonic Fabricate produces unstructured outputs this way: free text and document files such as PDFs, Word documents, and email messages, generated alongside structured data and kept referentially consistent with it, so a name in a generated document matches the customer record it belongs to. This is the route when you need realistic text that never existed, often as training material for LLM and RAG systems where real examples are scarce.

The second path starts from real unstructured data that's too sensitive to use directly, and it produces partially synthetic data — the real content kept, only a sensitive subset replaced, so real and synthetic values sit side by side. Tonic Textual is the exemplar here. It extracts free text from wherever it's stored and detects sensitive values using proprietary NER models — named entity recognition, the task of locating and classifying spans of text such as names, dates, account numbers, and medical terms — then either redacts those values or synthesizes them, replacing each with a realistic fake of the same type. With synthesis, a real clinical note keeps its original language, structure, and statistical properties, and only the names, dates, and identifiers become plausible invented ones.
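The difference between redaction and synthesis can be shown with a toy sketch. It is not Tonic Textual's API, and it does not use NER: two regular expressions stand in for the detection step, and the identifier format, replacement pools, and sample note are all made up for illustration.

```python
import re

# Stand-in detectors. A real system would use NER models; these regexes are
# illustrative assumptions that only catch two easy formats.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MRN": re.compile(r"\bMRN-\d{6}\b"),
}

# Hypothetical replacement values, one small pool per entity type.
FAKES = {
    "DATE": ["2021-03-14", "2020-11-02"],
    "MRN": ["MRN-100001", "MRN-100002"],
}

def redact(text):
    """Replace each detected value with its type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def synthesize(text):
    """Replace each detected value with a fake of the same type.
    Within one call, the same real value always maps to the same fake.
    The pools are tiny, so fakes repeat once a pool is exhausted."""
    for label, pattern in PATTERNS.items():
        mapping = {}
        def swap(match, label=label, mapping=mapping):
            real = match.group(0)
            if real not in mapping:
                mapping[real] = FAKES[label][len(mapping) % len(FAKES[label])]
            return mapping[real]
        text = pattern.sub(swap, text)
    return text

note = "Patient MRN-483920 seen 2023-06-01. Follow-up for MRN-483920 on 2023-06-15."
redacted = redact(note)
synthetic = synthesize(note)
```

The redacted note reads `Patient [MRN] seen [DATE]. Follow-up for [MRN] on [DATE].`, while the synthesized one keeps the sentence natural and maps both mentions of the identifier to the same fake value.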

How generation and fidelity differ by type

Because each type has its own structure, both the way you generate it and the way you judge it shift with the category. The main approaches to generating synthetic data line up with the three types in a fairly consistent pattern:

Tabular → rule-based and statistical methods, which sample from specified or learned distributions, are the workhorses: fast, controllable, and well-suited to schema-bound data.
Time-series → sequence-aware models that explicitly represent temporal dependency, so the generated points carry trend, seasonality, and autocorrelation rather than being drawn independently.
Unstructured → LLM-based generation, which produces fluent free text and documents from a prompt and is the natural fit for data whose fidelity bar is semantic.

The fit isn't arbitrary. Rule-based and statistical methods do well where the structure is explicit and the constraints are known, which describes tabular data exactly — you can encode the schema and the value ranges directly. Sequence-aware models earn their place on time-series because they carry state from one step to the next, the only way to reproduce dependencies that span time. And LLMs suit unstructured text because language is what they model natively, generating prose that holds together semantically in a way a rule-based template never could. A project that mixes types ends up mixing methods too, and the combined dataset is only as trustworthy as its weakest part.

The checks that tell you whether the data is good enough move the same way. For tabular data, measuring quality and fidelity comes down to comparing column distributions between the real and synthetic tables and confirming that referential integrity holds across tables. For time-series, the test is temporal fidelity: whether the synthetic sequences reproduce the trends, cycles, and autocorrelation of the real ones, not just the right average value. For unstructured text, the bar is semantic: whether the generated content reads naturally and carries the right meaning, which is harder to score with a single number and often needs task-based evaluation or human review. The practical consequence is that the type you're working in is not a label you attach after the fact. It decides your generation method and your evaluation plan from the start.
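For the tabular check, one simple way to compare a categorical column across real and synthetic data is total variation distance. This is one possible metric, picked here for illustration rather than named by the source, and the category counts below are invented.

```python
from collections import Counter

def total_variation(real_values, synth_values):
    """Half the summed absolute difference between two category frequency
    tables: 0.0 means identical shares, 1.0 means no overlap at all."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synth_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synth_values))
        for c in categories
    )

real = ["books"] * 50 + ["toys"] * 30 + ["games"] * 20
close = ["books"] * 48 + ["toys"] * 32 + ["games"] * 20
skewed = ["books"] * 90 + ["toys"] * 10  # drops one category entirely

tv_close = total_variation(real, close)
tv_skewed = total_variation(real, skewed)
```

The near-match scores roughly 0.02 and the skewed set roughly 0.4. What counts as an acceptable threshold depends on the use case and is not something this sketch can settle.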

Choosing the right type for your use case

The fastest way to choose a type is to start from the workload, because most map cleanly onto one. Application and integration testing need relational data that behaves like the production database, which calls for tabular synthetic data with referential integrity intact. Forecasting, anomaly detection, and IoT monitoring depend on how values move over time, so they need time-series data that preserves trend, seasonality, and autocorrelation. LLM and RAG training, document understanding, and other text-heavy AI tasks need unstructured data with natural-language fidelity.

The mapping is rarely exclusive. A single application often needs all three at once — a relational database, the time-stamped event stream it emits, and the documents it produces — and the realistic case is a project that spans more than one type rather than living neatly inside a single category. When that happens, naming the dominant type tells you where to start, but you'll usually need the others to follow, and the data across them has to stay consistent to be worth anything. Whether synthetic data should stand in for real data at all is a separate question that applies within every type — but once you've decided to generate, the type you're in is what sets both your method and your fidelity bar.

Workload	Type you need	Why
App and integration testing	Tabular / relational	Production-like records with keys that resolve across tables
Forecasting, anomaly detection, IoT monitoring	Time-series	Depends on order, trend, seasonality, and autocorrelation
LLM / RAG training, document AI	Unstructured	Natural-language text and documents with the right semantics
Multi-component applications	More than one	A real system spans relational data, event streams, and documents at once
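For the multi-component case, the consistency requirement can be sketched by deriving every output from one shared entity record. This is a toy illustration under assumed names and values, not a description of how Tonic Fabricate works internally.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(3)

# One source of truth for the entity (made-up values); every output reads from it.
customer = {"customer_id": 17, "name": "Dana Whitfield"}

# Tabular: a row keyed by the shared ID.
order_row = {"order_id": 5001, "customer_id": customer["customer_id"], "total": 42.50}

# Time-series: events for the same ID, with strictly increasing timestamps.
events, t = [], datetime(2024, 1, 1, 9, 0)
for kind in ["login", "add_to_cart", "checkout"]:
    t += timedelta(seconds=rng.randint(5, 120))
    events.append({"customer_id": customer["customer_id"], "ts": t, "event": kind})

# Unstructured: a document whose text cites the same name and order.
email = (f"Hi {customer['name']}, your order {order_row['order_id']} "
         f"for ${order_row['total']:.2f} has shipped.")
```

Generating the three outputs independently and reconciling them afterward is the stitching exercise this avoids: the row, the event stream, and the email agree because they were never allowed to disagree.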

The Tonic Advantage: one workflow across the whole taxonomy. Most teams reach for a different tool per type — one for relational data, another for time-aware events, a third for text. Tonic Fabricate generates across your entire data ecosystem in a single agentic workflow: relationally intact tabular data, time-aware event and timeline data, and unstructured free text, all kept referentially consistent with one another. The same entities line up across a database row, an event log, and a document, so a project that spans more than one type doesn't turn into a stitching-together exercise.
