
Building the Best Synthetic Data Generator in Python for 2026: Why I Am Building Misata and How to Use It
An honest attempt to break into Synthetic Data Generation in the LLM Era
If you’ve ever needed realistic fake data for a dev database, a product demo, or an analytics dashboard in a niche you don’t yet have data for, you already know how painful this is.
You either spend days hand-crafting a script, wrestle with libraries that need real data to generate fake data, or ask an LLM to produce a CSV and watch it hallucinate 50 rows before giving up.
I know this pain personally. My first project at my first job was writing a Python script to generate synthetic data for a real use case. I worked on that script for two months, every single day: hand-coding for loops, managing foreign key consistency, and sitting through stand-ups where the requirements changed almost weekly. Two months. One script. I was new to the job, and the experience left a scar.

The problem wasn’t hard. It was just unsolved.
That’s why I am building Misata, an open-source Python library for generating realistic, multi-table synthetic datasets from a plain English description. No real data required. No ML training. One sentence in, thousands of rows out.
This post covers what Misata does, how to use it, why it’s different from everything else out there, and where I’m taking it.
The state of synthetic data generation in 2026
Before getting into how Misata works, it’s worth being honest about the current landscape, because the tools that exist are genuinely good at some things.
Faker is the go-to for generating individual fake values. Names, emails, phone numbers, addresses; it handles all of that well. But it has no concept of a dataset. It generates one row’s worth of values at a time, with no awareness of relationships between tables. If you want 20,000 transactions that reference valid customer IDs with a 2% fraud rate and a log-normal amount distribution, you’re writing all of that yourself.
SDV (Synthetic Data Vault) is impressive engineering. It learns the statistical structure of your real data and generates synthetic copies that preserve it. The problem: you need real data to train it. When the whole point is that you don’t have real data (a very common situation when you’re prototyping, building a dev environment, or training a model on GDPR-restricted data), SDV can’t help you.
LLMs feel like the obvious answer until you try them. Ask any major model to generate a 10,000-row dataset and you’ll hit context limits, truncation, or hallucinated CSV structure. Even when they work, you’re burning tokens at a rate that makes large-scale generation economically absurd. And critically, LLMs have no mechanism to guarantee that your monthly revenue totals hit $320k in June; they’ll get close-ish at best.
The gap: there is no tool that takes your intent as input and produces statistically calibrated, relationally consistent data as output. That’s what Misata is built to fill.
Installing Misata
pip install misata pandas numpy
That’s it. No ML dependencies, no database required to get started, no API key for the core functionality.
The basics: generating your first dataset
The simplest way to use Misata is the generate() function. Pass it a plain English description and a random seed:
import misata

tables = misata.generate(
    "A SaaS company with 2000 users.",
    seed=42,
)

print(tables.keys())
# dict_keys(['users', 'subscriptions'])
print(tables['subscriptions'].head())
You get back a dictionary of Pandas DataFrames, one per table, with realistic column names, types, and values. Don’t settle for prompts this bare in practice, though; the richer descriptions below are where Misata actually earns its keep.
The seed parameter makes generation fully deterministic. Same description, same seed, same dataset every time. This matters for reproducible research, stable CI test fixtures, and sharing datasets with collaborators.
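The property is easy to see outside Misata too; NumPy’s seeded generators behave the same way, and a quick sketch makes it concrete:

```python
import numpy as np

# Two independent generators built from the same seed produce identical draws --
# the same principle that makes seeded dataset generation reproducible.
a = np.random.default_rng(42).lognormal(mean=5.0, sigma=1.5, size=1000)
b = np.random.default_rng(42).lognormal(mean=5.0, sigma=1.5, size=1000)
assert np.array_equal(a, b)
```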
Pinning business targets
Here’s where Misata does something no other library does: it lets you describe specific business dynamics, and the generated data will satisfy them exactly.
tables = misata.generate(
    "A SaaS company with 2000 users. "
    "MRR rises from 80k in January to 320k in June, "
    "drops to 180k in August due to churn, "
    "then recovers to 400k in December.",
    seed=42,
)
When you sum the monthly MRR from the generated rows:
Jan $80,000 ✓
Feb $128,000 ✓
Mar $176,000 ✓
Apr $224,000 ✓
May $272,000 ✓
Jun $320,000 ✓
Jul $250,000 ✓
Aug $180,000 ✓ <- churn dip, as described
Sep $235,000 ✓
Oct $290,000 ✓
Nov $345,000 ✓
Dec $400,000 ✓
Not approximate. Exact. The individual rows still follow a log-normal distribution (median MRR $126, mean $150, p90 $291), because that’s what real SaaS revenue looks like at the row level. But the monthly totals are treated as hard constraints. Misata pins the aggregates first, then samples rows that satisfy them.
This is the key architectural difference from tools that just try to produce data that looks reasonable. GROUP BY on Misata data gives you what you asked for.
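As a rough illustration of the “pin the aggregate, then sample rows” idea (my own sketch, not Misata’s actual solver): draw log-normal row amounts, then rescale them so the monthly total lands exactly on target.

```python
import numpy as np

def rows_with_exact_total(n_rows: int, target_total: float, seed: int = 42):
    """Sample log-normal row amounts, then rescale so they sum to target_total."""
    rng = np.random.default_rng(seed)
    # sigma ~0.65 roughly matches the row-level stats quoted above (median $126, p90 $291)
    rows = rng.lognormal(mean=np.log(126), sigma=0.65, size=n_rows)
    return rows * (target_total / rows.sum())  # pin the aggregate exactly

june = rows_with_exact_total(n_rows=2000, target_total=320_000)
assert np.isclose(june.sum(), 320_000)
```

The rescaling trick preserves the distribution’s shape (every row is multiplied by the same constant) while making the GROUP BY total a hard guarantee rather than a hope.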
Calibrated domain distributions
The part that cost me two months of research and hand-coding at my first job is exactly what Misata ships pre-calibrated: seven domains, each with distribution priors tuned to real-world statistics.
Fintech
tables = misata.generate(
    "A fintech company with 2000 customers and banking transactions.",
    seed=42,
)
transactions = tables["transactions"]
print(f"Fraud rate: {transactions['is_fraud'].mean() * 100:.2f}%")
# Fraud rate: 2.00%
The real-world baseline for card fraud is ~2%. That’s what you get; not a random number, a calibrated one. 400 fraudulent transactions out of 20,000, distributed across accounts in a way that reflects how fraud actually clusters.
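One simple way to get an exact rate rather than a coin flip per row (an illustrative sketch, not necessarily Misata’s internals) is to flag exactly round(n * rate) rows:

```python
import numpy as np

def exact_rate_flags(n: int, rate: float, seed: int = 42) -> np.ndarray:
    """Flag exactly round(n * rate) rows -- an exact count, not a per-row Bernoulli draw."""
    flags = np.zeros(n, dtype=bool)
    k = round(n * rate)
    idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    flags[idx] = True
    return flags

is_fraud = exact_rate_flags(20_000, 0.02)
assert is_fraud.sum() == 400  # exactly 2% of 20,000
```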
Credit scores follow the actual US distribution:
mean: 679 (real US average: 680–720)
std: 80 (real range: 70–90)
min: 328
max: 850
Account balances are log-normal because real bank balances are:
median $1,976
mean $6,128
p90 $14,260
p99 $62,565
Why does this matter? If you train a fraud detection model on uniformly distributed balances, you’re teaching it that everyone has the same balance. Real fraud disproportionately targets high-value accounts. A model trained on flat data will miss exactly the cases you most need it to catch.
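Those median and mean figures fully determine a log-normal: mu = ln(median), and since mean = exp(mu + sigma^2/2), sigma = sqrt(2 ln(mean/median)). A quick NumPy check with the numbers above (my recreation, not Misata’s code):

```python
import numpy as np

median, mean = 1_976, 6_128
mu = np.log(median)
sigma = np.sqrt(2 * np.log(mean / median))  # from mean = exp(mu + sigma**2 / 2)

balances = np.random.default_rng(0).lognormal(mu, sigma, size=200_000)
# The sample median should land near $1,976 and the sample mean near $6,128.
```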
Healthcare
tables = misata.generate(
    "A hospital with 500 patients and doctors.",
    seed=42,
)
Blood type frequencies match actual ABO/Rh epidemiology:
Blood type Generated Real-world
O+ 37.9% 38.0% ✓
A+ 33.9% 34.0% ✓
B+ 9.6% 9.0% ✓
AB+ 3.0% 3.0% ✓
O- 6.5% 7.0% ✓
A- 6.1% 6.0% ✓
B- 2.0% 2.0% ✓
AB- 0.9% 1.0% ✓
All eight types within 0.6% of real frequencies. Patient ages center on 45 with standard deviation 18, matching a chronic-care population. Nobody configured this; it’s what the healthcare domain prior knows.
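Reproducing a categorical prior like this takes one np.random.choice call with the real-world frequency vector; a minimal sketch, assuming the frequencies from the table above:

```python
import numpy as np

blood_types = ["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"]
real_freqs  = [0.38, 0.34, 0.09, 0.03, 0.07, 0.06, 0.02, 0.01]  # ABO/Rh epidemiology

rng = np.random.default_rng(42)
sample = rng.choice(blood_types, size=500, p=real_freqs)
# With 500 patients, each type's share lands within a few points of its prior.
```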
Ecommerce
schema = misata.parse(
    "An ecommerce store with 5000 customers and orders. "
    "Revenue grows from 100k in January to 300k in November, "
    "then 350k in December.",
    rows=5000,
)
tables = misata.generate_from_schema(schema)
Product categories follow Zipf’s law, because real shopping behavior does:
electronics 47.1%
clothing 20.0%
home & garden 12.3%
sports 8.7%
books 6.5%
beauty 5.5%
Uniform would give ~17% per category. Real shopping isn’t uniform. Neither is Misata.
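Zipf weights are just 1/rank^s, normalized; with an exponent near 1 over six categories you land close to the shares above. A sketch (illustrative exponent, not Misata’s tuned value):

```python
import numpy as np

categories = ["electronics", "clothing", "home & garden", "sports", "books", "beauty"]
s = 1.0                                          # Zipf exponent (illustrative)
weights = 1 / np.arange(1, len(categories) + 1) ** s
probs = weights / weights.sum()                  # top category dominates, long tail follows

picks = np.random.default_rng(42).choice(categories, size=5_000, p=probs)
```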
Order statuses at realistic rates:
completed 71.5%
shipped 12.4%
pending 8.2%
returned 5.0%
cancelled 3.0%
Real e-commerce return rates run 8–10%. Recommendation systems, inventory models, and return prediction pipelines all need this to be right to produce meaningful results.

Multi-table generation with guaranteed referential integrity
Misata generates entire relational schemas, not just single tables. And it guarantees that foreign keys are always valid; not by checking after the fact, but by generating in the correct dependency order.
tables = misata.generate(
    "A fintech company with 2000 customers and banking transactions.",
    seed=42,
)
customers = tables["customers"] # 2,000 rows
accounts = tables["accounts"] # 2,600 rows
transactions = tables["transactions"] # 20,000 rows
# Verify FK integrity
orphan_accounts = (~accounts["customer_id"].isin(customers["customer_id"])).sum()
orphan_txns = (~transactions["account_id"].isin(accounts["account_id"])).sum()
print(orphan_accounts) # 0
print(orphan_txns) # 0
Tables generate in topological dependency order. Parents first, children sample from the completed parent pool. Orphan keys are architecturally impossible.
This matters because referential integrity errors in test data produce silent false negatives. Your pipeline looks like it handles JOINs correctly, until it meets real data.
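The parents-first idea is easy to sketch outside Misata: build the parent table, then let every child row draw its foreign key from the completed parent pool, so an orphan can never be produced in the first place.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Parents first: the customer pool exists before anything references it.
customers = pd.DataFrame({"customer_id": np.arange(1, 2001)})

# Children sample their FK from the completed parent pool -- orphans can't occur.
accounts = pd.DataFrame({
    "account_id": np.arange(1, 2601),
    "customer_id": rng.choice(customers["customer_id"], size=2600),
})

orphans = (~accounts["customer_id"].isin(customers["customer_id"])).sum()
assert orphans == 0
```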
The two-step flow: inspect before you generate
For more control, separate the parsing step from the generation step:
schema = misata.parse("A hospital with 500 patients and doctors.")
print(schema.summary())

Schema: Healthcare Dataset
Domain: healthcare

Tables (3)
  doctors        25 rows  [doctor_id, first_name, last_name, specialty, years_experience]
  patients      500 rows  [patient_id, name, age, gender, blood_type, registered_at]
  appointments 1500 rows  [appointment_id, patient_id, doctor_id, type, duration_minutes]

Relationships (2)
  patients.patient_id -> appointments.patient_id
  doctors.doctor_id -> appointments.doctor_id
You can inspect the schema, adjust row counts, add columns, or change relationship cardinality before committing to generation. The schema object is also serializable; useful for teams where a data engineer defines the schema and developers generate against it, versioned in git.
Seeding databases directly
Generated DataFrames go directly into any database via SQLAlchemy:
import misata
from misata import seed_database

tables = misata.generate("A SaaS company with 1000 users.", seed=42)
report = seed_database(tables, "postgresql://user:pass@localhost/mydb", create=True)
print(report.total_rows)  # 12,400
Or from the CLI, which is handy in Makefile targets and docker-compose setups:
# PostgreSQL
misata generate \
  --story "A SaaS company with 1000 users" \
  --db-url postgresql://user:pass@localhost/mydb \
  --db-create --db-truncate

# SQLite (local dev)
misata generate \
  --story "A SaaS company with 1000 users" \
  --db-url sqlite:///./dev.db \
  --db-create --db-truncate
A new developer runs make seed-db and has a working, realistic dataset in their environment in under 10 seconds.
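Loading DataFrames into a database is the kind of thing pandas’ to_sql handles; here is a minimal, stdlib-only sketch of the same idea (my illustration, not seed_database’s implementation), using an in-memory SQLite database:

```python
import sqlite3
import pandas as pd

# A stand-in for one generated table.
users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["free", "pro", "pro"]})

conn = sqlite3.connect(":memory:")        # swap for a file path in a real dev setup
users.to_sql("users", conn, index=False)  # pandas creates the table and inserts rows

n = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
assert n == 3
```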
LLM-powered generation for custom domains
The rule-based parser covers seven domains. For anything outside that (B2B marketplace vendor tiers, clinical trial schemas, commodity trading systems), the LLM backend handles schema inference:
import misata
from misata import LLMSchemaGenerator

gen = LLMSchemaGenerator(provider="groq")  # or openai, ollama
schema = gen.generate_from_story(
    "A B2B marketplace with vendor tiers, SLA contracts, and quarterly invoices"
)
tables = misata.generate_from_schema(schema)
Requires GROQ_API_KEY or OPENAI_API_KEY. The LLM infers schema, column types, and row count ratios from your description; you get back the same DataFrames as the rule-based path.
How Misata compares to Faker and SDV

The core distinction: SDV is a synthetic data replication tool; it mirrors real data. Misata is a synthetic data generation tool; it generates from intent. They solve different problems. If you have real data and want a privacy-safe copy, use SDV. If you’re starting from scratch, use Misata. Other tools are gaining traction in the industry too, but none of them are built around Misata’s core aspiration: calibrated, intent-driven generation with no real data at all.
Common use cases
Seeding development and staging environments. Every developer joining a team hits the same wall: the dev database is empty. Misata seeds it with realistic, relationally consistent data in seconds.
Product demos. Sales teams demoing analytics dashboards need data that tells a plausible story and not random numbers. Misata generates internally consistent business narratives on demand.
ETL and pipeline testing. Calibrated distributions naturally produce the p99 outlier values that expose pipeline brittleness, the cases hand-crafted test data systematically misses.
Reproducible research benchmarks. Share a description and a seed. Anyone can reproduce your exact dataset with a single Python call.
What’s coming
Misata is under active development. The current version is the foundation; the goal is to be the most sophisticated intent-driven synthetic data tool that exists.
The roadmap includes broader domain coverage, time-series generation with configurable seasonality and anomalies, natural language column constraints (“30% of enterprise accounts have balances above $50k”), a schema DSL for version-controlled dataset definitions, and better LLM schema inference for unusual entity types.
If you’ve ever spent two months on a synthetic data script, you know exactly what this is trying to solve. And if you haven’t, now you don’t have to.
Get started
pip install misata pandas numpy
The fastest path is the Colab notebook: no install required, and it runs in under a minute:
Open the quickstart notebook →
Or run the included examples locally:
python examples/saas_revenue_curve.py
python examples/fintech_fraud_detection.py
python examples/healthcare_multi_table.py
python examples/ecommerce_seasonal.py
All four produce fully verified output in under 3 seconds.
GitHub: github.com/rasinmuhammed/misata
PyPI: pypi.org/project/misata
Docs: QUICKSTART.md
MIT licensed. Contributions, issues, and feedback are very welcome; this is early days, and the direction will be shaped by what people actually need.
Building the Best Synthetic Data Generator in Python for 2026: Why I Am Building Misata and How to… was originally published in Towards AI on Medium.