TabGAN: Generate Synthetic Tabular Data
with GANs, Diffusion & LLMs — in 3 Lines of Python

High-quality synthetic tabular data using GANs, Forest Diffusion, or LLMs — with built-in quality reports, privacy metrics, AutoSynth, and one-click synthesis for any HuggingFace dataset.

synthetic-data · GAN · diffusion · privacy · open-source
by InsafQ · March 29, 2026

The Problem

You have tabular data that's too sensitive to share, too small to train on, or too imbalanced to model well. You need synthetic data that:

- looks statistically like the real thing,
- preserves the correlations and target relationships your models rely on, and
- doesn't leak information about real individuals.

The Solution: TabGAN

pip install tabgan

3 Lines to Synthetic Data

from tabgan import GANGenerator
import pandas as pd

df = pd.read_csv("your_data.csv")
gen = GANGenerator(gen_x_times=1.1, cat_cols=["gender", "city"])
synthetic, _ = gen.generate_data_pipe(df, None, df, only_generated_data=True)

That's it. synthetic is a DataFrame with realistic rows that never existed in the original data.
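A quick way to convince yourself that the generated rows really are new is to left-join them against the originals. This is a standalone pandas sketch on toy frames, not part of the TabGAN API:

```python
import pandas as pd

# Toy stand-ins for the real and generated frames (illustrative only).
original = pd.DataFrame({"age": [25, 34, 41], "city": ["NY", "SF", "NY"]})
synthetic = pd.DataFrame({"age": [27, 33, 40], "city": ["SF", "NY", "NY"]})

# Join synthetic rows against the originals on all shared columns;
# "left_only" means that exact row never appeared in the real data.
merged = synthetic.merge(original, how="left", indicator=True)
leaked = (merged["_merge"] == "both").sum()
print(f"Exact copies of real rows: {leaked}")
```

If `leaked` is nonzero, the generator has memorized rows and you should look at the privacy metrics below.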

One API, Multiple Generators

Switch between state-of-the-art methods with a single parameter change:

| Generator | Best For | Speed |
|---|---|---|
| CTGAN (GAN) | General purpose, mixed types | Fast |
| Forest Diffusion | Tree-friendly structured data | Medium |
| LLM (GReaT) | Text-rich, semantic dependencies | Slow |
| Random Baseline | Quick benchmarking | Instant |

from tabgan import GANGenerator, ForestDiffusionGenerator, LLMGenerator

# Just swap the class — same API!
gen = ForestDiffusionGenerator(gen_x_times=1.0, cat_cols=["category"])
synthetic, _ = gen.generate_data_pipe(df, target, df, only_generated_data=True)

NEW: AutoSynth — Let the Library Choose

Don't know which generator works best for your data? AutoSynth runs all of them and picks the winner:

from tabgan import AutoSynth

result = AutoSynth(df, target_col="label").run()

print(result.report)
#   Generator          Status  Score  Quality  Privacy  Rows  Time (s)
# 0 GAN (CTGAN)        OK      0.847  0.891    0.743    165   12.3
# 1 Forest Diffusion   OK      0.812  0.834    0.761    165   45.1
# 2 Random Baseline    OK      0.654  0.621    0.732    165   0.1

best_synthetic = result.best_data  # Best generator's output
print(f"Winner: {result.best_name}")  # "GAN (CTGAN)"

AutoSynth scores each generator on a weighted combination of quality (distribution fidelity, ML utility) and privacy (distance to closest record, membership inference risk).
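The exact weighting AutoSynth uses isn't shown here, but the ranking idea can be sketched as a simple convex combination of the two sub-scores. The 0.6/0.4 weights and the `autosynth_score` helper are illustrative assumptions, not the library's internals:

```python
# Hypothetical weights -- TabGAN's actual values may differ.
W_QUALITY, W_PRIVACY = 0.6, 0.4

def autosynth_score(quality: float, privacy: float) -> float:
    """Combine distribution fidelity and privacy into one ranking score."""
    return W_QUALITY * quality + W_PRIVACY * privacy

# Quality/privacy pairs taken from the report above.
candidates = {
    "GAN (CTGAN)": (0.891, 0.743),
    "Forest Diffusion": (0.834, 0.761),
    "Random Baseline": (0.621, 0.732),
}
best = max(candidates, key=lambda name: autosynth_score(*candidates[name]))
print(best)  # GAN (CTGAN) with these numbers
```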

NEW: One-Click Synthesis for Any HuggingFace Dataset

from tabgan import synthesize_hf_dataset

# Load → Generate → Evaluate in one call
result = synthesize_hf_dataset(
    "scikit-learn/iris",
    target_col="target",
)

# Push synthetic version to your HF account
result = synthesize_hf_dataset(
    "scikit-learn/iris",
    target_col="target",
    push_to_hub=True,
    hub_repo_id="your-username/iris-synthetic",
)

Key Features

Quality Reports

PSI distribution divergence, correlation comparison, ML utility (train-on-synthetic, test-on-real).
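PSI (Population Stability Index) is a standard divergence measure between two samples of one column. A minimal NumPy sketch, binning on the real data's quantiles (TabGAN's own binning choices may differ):

```python
import numpy as np

def psi(real: np.ndarray, synth: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two numeric samples.
    Bin edges come from the real distribution's quantiles."""
    edges = np.quantile(real, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range values
    p = np.histogram(real, bins=edges)[0] / len(real) + eps
    q = np.histogram(synth, bins=edges)[0] / len(synth) + eps
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
same = psi(a, rng.normal(size=5000))          # small: same distribution
shifted = psi(a, rng.normal(1.0, 1.0, 5000))  # large: mean shifted by 1 sigma
print(round(same, 3), round(shifted, 3))
```

A common rule of thumb: PSI below 0.1 means the synthetic column tracks the real one well; above 0.25 signals a real distribution shift.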

Privacy Metrics

Distance to Closest Record, Nearest Neighbor Distance Ratio, Membership Inference Risk.
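Distance to Closest Record (DCR) asks, for each synthetic row, how far away the nearest real row is; values near zero flag memorized records. A pure-NumPy sketch on numeric data (TabGAN's actual preprocessing of scales and categoricals may differ):

```python
import numpy as np

def distance_to_closest_record(real: np.ndarray, synth: np.ndarray) -> np.ndarray:
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    # Pairwise distances via broadcasting: shape (n_synth, n_real).
    diffs = synth[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)

rng = np.random.default_rng(42)
real = rng.normal(size=(200, 4))
copied = real[:5]                        # a leaked block of real rows
fresh = rng.normal(size=(5, 4))          # genuinely new rows
dcr = distance_to_closest_record(real, np.vstack([copied, fresh]))
print(dcr[:5])   # zeros: exact copies of real records
print(dcr[5:])   # positive: safe distance from any real record
```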

Business Constraints

Enforce domain rules: RangeConstraint, FormulaConstraint on generated data.

sklearn Integration

Drop TabGANTransformer into any sklearn pipeline for synthetic augmentation.

Quality & Privacy Reports

from tabgan import QualityReport

report = QualityReport(original_df, synthetic_df, cat_cols=["gender"], target_col="label")
report.compute()
report.to_html("quality_report.html")  # Self-contained HTML with plots

from tabgan import PrivacyMetrics

pm = PrivacyMetrics(original_df, synthetic_df, cat_cols=["gender"])
summary = pm.summary()
print(f"Privacy score: {summary['overall_privacy_score']}")  # 0 = leaked, 1 = private

Business Constraints

from tabgan import GANGenerator, RangeConstraint, FormulaConstraint

gen = GANGenerator(
    gen_x_times=1.5,
    cat_cols=["department"],
    constraints=[
        RangeConstraint("age", min_val=18, max_val=65),
        RangeConstraint("salary", min_val=0),
        FormulaConstraint("end_date > start_date"),
    ],
)

sklearn Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from tabgan import TabGANTransformer

pipe = Pipeline([
    ("augment", TabGANTransformer(gen_x_times=2.0, cat_cols=["gender"])),
    ("model", RandomForestClassifier()),
])
pipe.fit(X_train, y_train)

Benchmarks

Quality (Normalized ROC AUC)

| Dataset | CTGAN | Forest Diffusion | Random |
|---|---|---|---|
| Credit | 0.752 | 0.781 | 0.501 |
| Adult Census | 0.689 | 0.712 | 0.523 |
| Telecom | 0.814 | 0.799 | 0.548 |

Higher is better.
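The metric behind these numbers is ROC AUC of a model trained on synthetic data and tested on real data; how TabGAN normalizes it isn't spelled out here. AUC itself needs no ML framework, since it equals the Mann-Whitney probability that a random positive outscores a random negative:

```python
import numpy as np

def roc_auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """ROC AUC via the Mann-Whitney U statistic (ties ignored for brevity):
    the probability that a random positive is scored above a random negative."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1])
auc = roc_auc(y, np.array([0.1, 0.4, 0.35, 0.8]))
print(auc)  # 0.75
```

Chance performance is 0.5, which is why the Random baseline column hovers near that value.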

Speed (generation time, 1000 rows, 8 features)

| Generator | Time | Notes |
|---|---|---|
| Random Baseline | ~0.1s | Instant: just resampling |
| CTGAN (GAN) | ~1–10s | Fast, depends on epochs |
| Forest Diffusion | ~30–120s | High quality, but slower |
| LLM (GReaT) | ~5–30min | Best for text columns, GPU recommended |

Execution Timing

gen = GANGenerator(gen_x_times=1.1)
synthetic, _ = gen.generate_data_pipe(train, target, test)
print(gen.last_timing_)
# {'preprocess': 0.001, 'generation': 2.3, 'postprocess': 0.01,
#  'adversarial_filtering': 0.15, 'total': 2.46}

What's Next

pip install tabgan · GitHub · Interactive Demo