DataGen API Documentation - Complete Guide

Getting Started

DataGen provides a powerful REST API for generating synthetic datasets using advanced AI. Simply describe your data requirements in natural language, and our AI will create realistic, structured datasets with proper relationships and constraints.

Quick Start

To get started, you'll need:

An OpenAI API key (get one at platform.openai.com)
A clear description of the data you need

💡 Pro Tip: Be specific in your dataset descriptions. The more detail you provide about entities, relationships, and business domain, the better the AI can generate realistic data.

Authentication

DataGen uses your OpenAI API key for authentication and AI processing. The key is passed in the request payload and is encrypted in transit. We never store your API keys.

API Key Requirements

Must be a valid OpenAI API key starting with sk-
Should have sufficient credits for processing requests

⚠️ Security Notice: Your API key is encrypted in transit and decrypted on the server only for processing. It is never stored.
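
A client-side format check can catch an obviously malformed key before a request is sent. The sketch below covers only the documented sk- prefix; the function name check_api_key_format is hypothetical, and the check cannot tell whether a key is actually valid or has credits.

```python
def check_api_key_format(api_key):
    """Format check only: a string that starts with "sk-" and has something after it."""
    return (
        isinstance(api_key, str)
        and api_key.startswith("sk-")
        and len(api_key) > len("sk-")
    )


print(check_api_key_format("sk-your-openai-api-key"))  # True
print(check_api_key_format("my-key"))                  # False
```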

API Reference

Base URL

https://syntheticdatagen.xyz/api

Generate Custom Datasets

POST /generate

Generates synthetic datasets based on your natural language description. Use this when you need fully custom datasets.

Features: Custom prompts, flexible schema generation

Limitations: Does not support deterministic generation (no seed parameter)

Generate from Template

POST /generate/template

Generates synthetic datasets using predefined templates. Pass a template_id obtained from the /templates endpoint.

Features: Predefined schemas, deterministic generation with seed parameter, faster processing

Best for: Reproducible datasets, testing, standardized data structures

Content-Type

Content-Type: application/json

Templates

DataGen provides pre-built templates for common data scenarios. Templates include predefined schemas, relationships, and realistic data patterns for various business domains.

List All Templates

GET /templates

Returns all available templates with their descriptions and metadata.

GET /api/templates

Optional Parameters

category (string): Filter templates by category (e.g., "E-commerce", "Healthcare")
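
Several category names contain spaces and ampersands (for example "Sales & Marketing"), so the value should be URL-encoded when it is placed in the query string. A minimal sketch using only the standard library; the helper name templates_url is hypothetical:

```python
from urllib.parse import urlencode

API_BASE = "https://syntheticdatagen.xyz/api"


def templates_url(category=None):
    """Build the /templates URL, encoding the optional category filter."""
    if category is None:
        return f"{API_BASE}/templates"
    return f"{API_BASE}/templates?{urlencode({'category': category})}"


print(templates_url("Sales & Marketing"))
# https://syntheticdatagen.xyz/api/templates?category=Sales+%26+Marketing
```

With the requests library, passing params={"category": "Sales & Marketing"} to requests.get performs the same encoding.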

Example Response

[
  {
    "id": "sales_crm",
    "name": "Sales CRM",
    "description": "Sales CRM system with leads, companies, and activities...",
    "category": "Sales & Marketing",
    "datasets": 3,
    "suggested_size": 1000,
    "suggested_reality": 4,
    "preview_schema": {
      "companies": ["id", "name", "industry", "employee_count"],
      "leads": ["id", "name", "email", "company_id", "stage"],
      "activities": ["id", "type", "lead_id", "date", "outcome"]
    }
  },
  {
    "id": "custom",
    "name": "Custom Dataset", 
    "description": "Create your own custom dataset by describing your data requirements...",
    "category": "Custom",
    "datasets": 1,
    "suggested_size": 1000,
    "suggested_reality": 4,
    "preview_schema": {
      "example": ["Depends on your description - AI will generate appropriate schema"]
    }
  }
]

Get Template Details

GET /templates/{template_id}

Returns detailed information about a specific template.

GET /api/templates/sales_crm

Get Template Categories

GET /templates/categories

Returns all available template categories.

Example Response

{
  "categories": [
    "API Testing",
    "Custom", 
    "E-commerce",
    "Events",
    "Finance",
    "Healthcare",
    "Human Resources",
    "IoT & Sensors",
    "Logistics",
    "Marketing",
    "SaaS",
    "Sales & Marketing",
    "Social Media",
    "User Management"
  ]
}

Custom Datasets

When existing templates don't fit your needs, use the custom dataset feature. Simply describe your data requirements in natural language, and AI will parse your description to generate appropriate schemas and relationships.

How Custom Datasets Work

Description Analysis: AI parses your natural language description
Schema Generation: Creates technical prompts and preview schemas
Data Creation: Generates realistic data matching your requirements
Relationship Building: Establishes proper foreign key relationships

Custom Dataset Guidelines

Example Description:
"E-commerce platform with customers, orders, and products. Include customer profiles, order history with line items, product catalog with inventory tracking, and customer reviews. Track order status, payment methods, and shipping addresses."

Writing Effective Custom Prompts

Describe your business domain: E-commerce, healthcare, finance, etc.
List main entities: customers, orders, products, etc.
Mention key relationships: how entities connect to each other
Include important attributes: specific fields you need

⚠️ Custom Template Requirements: When using template_id: "custom", the prompt parameter becomes required and must be at least 10 characters long.
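
The requirement above can be checked on the client before sending a request. This is a sketch, not the server's validation logic: the helper name prompt_problem is hypothetical, and stripping whitespace before measuring the length is an assumption.

```python
MIN_PROMPT_LENGTH = 10  # documented minimum when template_id is "custom"


def prompt_problem(template_id, prompt):
    """Return a message describing the problem, or None if nothing looks wrong."""
    if template_id != "custom":
        return None  # prompt is optional for predefined templates
    if not prompt or not prompt.strip():
        return "prompt is required when template_id is 'custom'"
    # Assumption: surrounding whitespace does not count toward the minimum.
    if len(prompt.strip()) < MIN_PROMPT_LENGTH:
        return f"prompt must be at least {MIN_PROMPT_LENGTH} characters"
    return None


print(prompt_problem("custom", "too short"))
print(prompt_problem("custom", "E-commerce platform with customers and orders"))
```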

Request Parameters

Parameter	Type	Required	Description
api_key	string	Required	Your OpenAI API key (sk-...)
template_id	string	Optional	Template ID to use (e.g., "sales_crm", "custom"). Use "custom" for custom descriptions.
prompt	string	Optional*	Natural language description. Required only when template_id is "custom".
datasets	integer	Optional	Number of datasets to generate (1-5, default: 2)
dataset_size	integer	Optional	Records per dataset (1-10,000, default: 1000)
reality	integer	Optional	Reality level 0-10 (0=clean, 10=messy, default: 3)
output_format	string	Optional	Format: "json" (default). Returns direct JSON response.
seed	integer	Optional	Seed for deterministic generation (≥1). Only available for the /generate/template endpoint. Ensures that identical input produces identical output.
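
The ranges in the table can be verified locally before a request is made. The sketch below mirrors only the documented limits; the names LIMITS and find_parameter_problems are hypothetical, and the server may apply further checks.

```python
# Documented (min, max) ranges from the parameter table.
LIMITS = {
    "datasets": (1, 5),
    "dataset_size": (1, 10_000),
    "reality": (0, 10),
}


def find_parameter_problems(params, endpoint="/generate"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name, (low, high) in LIMITS.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}, got {value}")

    seed = params.get("seed")
    if seed is not None:
        if endpoint != "/generate/template":
            problems.append("seed is only available for /generate/template")
        elif seed < 1:
            problems.append("seed must be >= 1")
    return problems


print(find_parameter_problems({"datasets": 6, "reality": 3}))
print(find_parameter_problems({"seed": 42}, endpoint="/generate/template"))
```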

Response Format

Success Response (200 OK)

Response Structure

{
  "success": true,
  "message": "Successfully generated 2 dataset(s)",
  "datasets": {
    "customers": {
      "format": "json",
      "size": 1000,
      "columns": ["id", "name", "email", "signup_date"],
      "content": [...] // Array of 1000 customer records
    },
    "orders": {
      "format": "json", 
      "size": 1000,
      "columns": ["order_id", "customer_id", "total", "order_date"],
      "content": [...] // Array of 1000 order records
    }
  },
  "metadata": {
    "total_datasets": 2,
    "total_records": 2000,
    "reality_level": 3,
    "output_format": "json",
    "prompt": "E-commerce platform...",
    "generation_timestamp": "2024-01-15T10:30:00Z",
    "validation": {
      "entity_types_generated": ["customers", "orders"],
      "validation_status": "completed",
      "validation_score": 9
    }
  },
  "download_links": null // Only used by web UI
}
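
A quick sanity check on a successful response is to compare the record counts in datasets against the totals reported in metadata. A minimal sketch, assuming the response shape shown above; summarize_response is a hypothetical helper and the sample data is made up for illustration.

```python
def summarize_response(result):
    """Return ({dataset_name: record_count}, totals_match_metadata)."""
    counts = {name: len(ds["content"]) for name, ds in result["datasets"].items()}
    meta = result["metadata"]
    consistent = (
        meta["total_datasets"] == len(counts)
        and meta["total_records"] == sum(counts.values())
    )
    return counts, consistent


# Tiny illustrative response with the same structure as the example above.
sample = {
    "datasets": {
        "customers": {"size": 2, "content": [{"id": 1}, {"id": 2}]},
        "orders": {"size": 1, "content": [{"order_id": 10, "customer_id": 1}]},
    },
    "metadata": {"total_datasets": 2, "total_records": 3},
}
print(summarize_response(sample))
```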

Error Response (4xx/5xx)

Error Structure

{
  "error": "Invalid API key format",
  "detail": "API key must start with 'sk-'"
}

Code Examples

Complete Template Workflow (Python)

This example shows the complete workflow: listing templates, selecting one, and generating data.

import requests
import json

API_BASE = "https://syntheticdatagen.xyz/api"
API_KEY = "sk-your-openai-api-key"
headers = {"Content-Type": "application/json"}

# Step 1: List all available templates
print("📋 Fetching available templates...")
templates_response = requests.get(f"{API_BASE}/templates")

if templates_response.status_code == 200:
    templates = templates_response.json()
    
    # Display templates
    print(f"Found {len(templates)} templates:")
    for template in templates[:5]:  # Show first 5
        print(f"  • {template['id']}: {template['name']} ({template['category']})")
        print(f"    {template['description'][:80]}...")
        print()
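else:
    print(f"Failed to fetch templates: HTTP {templates_response.status_code}")
    raise SystemExit(1)  # exit() is only guaranteed in interactive sessions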

# Step 2: Generate data using a predefined template
print("🎯 Using predefined template...")
template_data = {
    "api_key": API_KEY,
    "template_id": "sales_crm",  # No prompt needed
    "dataset_size": 1000,
    "reality": 4,
    "output_format": "json",
    "seed": 42  # Optional: For deterministic/reproducible results
}

response = requests.post(f"{API_BASE}/generate/template", headers=headers, json=template_data)

if response.status_code == 200:
    result = response.json()
    print(f"✅ Generated {result['metadata']['total_datasets']} datasets")
    
    # Process datasets
    for name, dataset in result["datasets"].items():
        print(f"  📊 {name}: {dataset['size']} records")
        
        # Save to file
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)
else:
    print(f"❌ Error: {response.status_code}")
    print(response.json())

# Step 3: Generate custom dataset
print("\n🎨 Creating custom dataset...")
custom_data = {
    "api_key": API_KEY,
    "template_id": "custom",  # Triggers custom mode
    "prompt": "Restaurant management system with customers, reservations, menu items, and orders. Include table management, wait staff assignments, and kitchen order tracking.",
    "datasets": 4,
    "dataset_size": 800,
    "reality": 3,
    "output_format": "json"
}

custom_response = requests.post(f"{API_BASE}/generate", headers=headers, json=custom_data)

if custom_response.status_code == 200:
    custom_result = custom_response.json()
    print(f"✅ Custom dataset created with {custom_result['metadata']['total_datasets']} tables")
    
    for name, dataset in custom_result["datasets"].items():
        print(f"  📊 {name}: {dataset['size']} records")
else:
    print(f"❌ Custom generation failed: {custom_response.status_code}")
    print(custom_response.json())

Using Predefined Templates (cURL)

# First, list available templates
curl -X GET https://syntheticdatagen.xyz/api/templates | jq '.[] | {id: .id, name: .name, category: .category}'

# Then use a specific template (no prompt needed)
curl -X POST https://syntheticdatagen.xyz/api/generate/template \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "sk-your-openai-api-key",
    "template_id": "ecommerce_platform",
    "dataset_size": 1000,
    "reality": 4,
    "output_format": "json",
    "seed": 12345
  }' \
  | jq '.'

Custom Dataset Creation (cURL)

# Create custom dataset with detailed description
curl -X POST https://syntheticdatagen.xyz/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "sk-your-openai-api-key",
    "template_id": "custom",
    "prompt": "University management system with students, courses, professors, and enrollments. Include GPA tracking, course prerequisites, and semester scheduling.",
    "datasets": 4,
    "dataset_size": 2000,
    "reality": 3,
    "output_format": "json"
  }' \
  | jq '.'

Template Categories Filtering

# Get templates by category
import requests

# List all categories
categories = requests.get("https://syntheticdatagen.xyz/api/templates/categories").json()
print("Available categories:", categories['categories'])

# Get templates for a specific category
healthcare_templates = requests.get(
    "https://syntheticdatagen.xyz/api/templates?category=Healthcare"
).json()

print(f"Healthcare templates: {len(healthcare_templates)}")
for template in healthcare_templates:
    print(f"  • {template['id']}: {template['name']}")
    print(f"    Datasets: {template['datasets']}, Size: {template['suggested_size']}")
    print(f"    Schema: {list(template['preview_schema'].keys())}")

API Usage Example

Simple, direct JSON response - no encoding or decoding required:

# Direct JSON API access
import requests
import json

template_data = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "digital_marketing",
    "dataset_size": 1000,
    "reality": 4,
    "output_format": "json"  # Returns direct JSON
}

response = requests.post("https://syntheticdatagen.xyz/api/generate/template", 
                        headers={"Content-Type": "application/json"}, 
                        json=template_data)

if response.status_code == 200:
    result = response.json()
    
    # Direct access to datasets
    for name, dataset in result["datasets"].items():
        print(f"📊 {name}: {dataset['size']} records")
        
        # dataset['content'] is already a list of dictionaries
        for record in dataset['content'][:3]:  # Show first 3 records
            print(f"  {record}")
        
        # Save individual JSON files
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)

Deterministic Generation

Use the seed parameter to generate identical datasets for testing and reproducibility. Only available with the /generate/template endpoint.

# Generate identical datasets every time with same seed
import requests

API_BASE = "https://syntheticdatagen.xyz/api"

# This will always produce identical results
deterministic_request = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "sales_crm",
    "dataset_size": 500,
    "reality": 3,
    "seed": 2024  # Fixed seed for reproducible results
}

# First generation
response1 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result1 = response1.json()

# Second generation with same seed - will be identical
response2 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result2 = response2.json()

print(f"Results are identical: {result1['datasets'] == result2['datasets']}")

# Different seed produces different data
deterministic_request["seed"] = 2025
response3 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result3 = response3.json()

print(f"Different seed produces different data: {result1['datasets'] != result3['datasets']}")

💡 Deterministic Generation Use Cases:
• Consistent test datasets across environments
• Reproducible machine learning experiments
• Debugging data processing pipelines
• Demo environments with predictable data

Data Formats

JSON Format: Returns clean, direct JSON response with individual datasets. Each dataset contains an array of objects with consistent schema - ready for immediate use in your applications.
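
Because each dataset is a list of objects plus a columns list, converting it to another format on the client is straightforward. The sketch below writes CSV with the standard library; dataset_to_csv is a hypothetical helper, not part of the API, and it assumes the dataset shape shown under Response Format.

```python
import csv
import io


def dataset_to_csv(dataset):
    """Render one dataset (with "columns" and "content") as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=dataset["columns"],
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # fields not listed in "columns" are skipped
    )
    writer.writeheader()
    writer.writerows(dataset["content"])
    return buffer.getvalue()


example = {
    "columns": ["id", "name"],
    "content": [{"id": 1, "name": "Ada"}, {"id": 2}],
}
print(dataset_to_csv(example))
```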

Reality Levels

Control the "messiness" of your synthetic data to simulate real-world conditions:

Level	Description	Use Cases
0-2	Perfect data with no inconsistencies	Testing ideal scenarios, demos
3-5	Minor inconsistencies (formatting variations, occasional nulls)	General testing, development
6-8	Moderate issues (duplicates, typos, missing values)	Data cleaning workflows, ETL testing
9-10	Highly messy (inconsistent formats, many nulls, outliers)	Stress testing, data quality tools
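
When a test suite picks reality levels programmatically, it can help to map a level back to its band in the table above. A small sketch; REALITY_BANDS and describe_reality are hypothetical names.

```python
# Bands copied from the Reality Levels table (inclusive ranges).
REALITY_BANDS = [
    (range(0, 3), "Perfect data with no inconsistencies"),
    (range(3, 6), "Minor inconsistencies (formatting variations, occasional nulls)"),
    (range(6, 9), "Moderate issues (duplicates, typos, missing values)"),
    (range(9, 11), "Highly messy (inconsistent formats, many nulls, outliers)"),
]


def describe_reality(level):
    """Return the table description for a reality level between 0 and 10."""
    for band, description in REALITY_BANDS:
        if level in band:
            return description
    raise ValueError(f"reality must be between 0 and 10, got {level}")


print(describe_reality(4))
```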

Best Practices

Choosing Templates vs Custom

Use Predefined Templates When:

Your use case matches a common business scenario
You want standardized, tested schemas
You need faster generation times
You're building prototypes or demos

Use Custom Datasets When:

Your domain is highly specialized
You need specific entity relationships
Existing templates don't fit your requirements
You want full control over the schema

Use Deterministic Generation (seed parameter) When:

Building reproducible test suites or demos
Running machine learning experiments that need consistent training data
Debugging data processing pipelines with predictable inputs
Sharing datasets that others need to recreate exactly
Creating consistent environments across dev/staging/prod

Template Workflow Best Practices

Explore first: Always check /templates before creating custom datasets
Filter by category: Use category filtering to find relevant templates faster
Check preview schemas: Review template schemas before generation
Override settings: Adjust size and reality levels for your needs
Test small first: Generate small samples before creating large datasets

Writing Effective Custom Prompts

Business context first: Start with domain (e.g., "Healthcare system", "E-commerce platform")
List main entities: Enumerate key data types (customers, orders, products)
Describe relationships: Explain how entities connect
Mention key attributes: Include important fields and constraints
Be specific: More detail leads to better schemas

Excellent custom prompt example:
"Healthcare clinic management system with patients, doctors, appointments, and medical records. Include patient demographics and insurance information, doctor specializations and schedules, appointment booking with time slots, and medical records with diagnosis codes, prescribed medications, and treatment notes. Track appointment status (scheduled, completed, cancelled) and insurance claim processing."

⚠️ Avoid vague prompts: "Generate some business data" or "Create a database" are too general and will result in poor schemas.

Performance Optimization

Templates are faster: Predefined templates generate data more quickly than custom
Start small: Test with 100-1000 records before scaling up
Reality levels impact speed: Higher reality levels take more processing time
Batch requests: Generate multiple datasets in one request when possible
Cache generated data: Store and reuse datasets for testing
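
One way to cache generated data is to key stored responses by the request parameters. The sketch below derives a stable key from the payload while leaving the API key out of it; cache_key is a hypothetical helper, and where the cached responses are stored is up to you.

```python
import hashlib
import json


def cache_key(payload):
    """Hash the request payload (without api_key) into a stable hex string."""
    cacheable = {k: v for k, v in payload.items() if k != "api_key"}
    # sort_keys makes the key independent of dictionary ordering.
    canonical = json.dumps(cacheable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


request_a = {"api_key": "sk-first", "template_id": "sales_crm", "seed": 42}
request_b = {"seed": 42, "template_id": "sales_crm", "api_key": "sk-second"}
print(cache_key(request_a) == cache_key(request_b))  # True
```

Reusing a stored response is always safe. Re-requesting the same payload is only documented to reproduce identical data when a seed is used with /generate/template.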

Deterministic Generation Best Practices

Use consistent seeds: Document seed values for important datasets
Seed management: Use meaningful seeds (dates, version numbers) for traceability
Version control: Store seed values alongside your test code
Environment consistency: Use same seeds across dev/test/prod for comparable datasets
Regression testing: Change seeds when you want to test with different data patterns

💡 Seed Strategy Example:
Use dates as seeds: seed: 20240315 for datasets created on March 15, 2024. This makes it easy to track when specific test data was generated and recreate it later.
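
The date-based seed convention can be sketched in a few lines; seed_from_date is a hypothetical helper name.

```python
from datetime import date


def seed_from_date(day):
    """Turn a date into a YYYYMMDD integer usable as the seed parameter."""
    return int(day.strftime("%Y%m%d"))


print(seed_from_date(date(2024, 3, 15)))  # 20240315
```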

Data Quality

Validate schemas: Check that generated structure matches expectations
Verify relationships: Ensure foreign keys reference correct entities
Test reality levels: Find the optimal messiness for your use case
Inspect samples: Review generated data before using in production tests
Check data types: Verify field types match your database schema
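
Relationship checks can be automated on the returned content arrays. The sketch below lists child records whose foreign key has no matching parent; find_orphans is a hypothetical helper, the field names follow the customers/orders example under Response Format, and skipping null foreign keys is an assumption (higher reality levels may produce nulls on purpose).

```python
def find_orphans(child_rows, fk_field, parent_rows, pk_field):
    """Return child rows whose foreign key does not match any parent key."""
    parent_keys = {row.get(pk_field) for row in parent_rows}
    return [
        row
        for row in child_rows
        if row.get(fk_field) is not None and row[fk_field] not in parent_keys
    ]


customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},    # no such customer
    {"order_id": 12, "customer_id": None},  # null key, skipped
]
print(find_orphans(orders, "customer_id", customers, "id"))
```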

API Integration Patterns

# Pattern: Template discovery and selection
import requests

API_BASE = "https://syntheticdatagen.xyz/api"

def select_best_template(domain_keywords):
    """Return the template whose description matches the most keywords."""
    templates = requests.get(f"{API_BASE}/templates").json()

    # Score templates by keyword matches in description
    scored = []
    for template in templates:
        description = template['description'].lower()
        score = sum(1 for keyword in domain_keywords
                    if keyword.lower() in description)
        if score > 0:
            scored.append((score, template))

    # Return best match, or fall back to the "custom" template entry
    if scored:
        return max(scored, key=lambda x: x[0])[1]
    # Include "name" so callers can treat the fallback like any other template
    return {"id": "custom", "name": "Custom Dataset",
            "message": "No matching template, use custom"}

# Usage
best_template = select_best_template(["healthcare", "patient", "appointment"])
print(f"Recommended: {best_template['name']}")

Error Handling

Common Error Types

Error Code	Description	Common Causes
400	Bad Request	Invalid payload, inconsistent parameters, constraint violations
401	Unauthorized	Invalid or expired OpenAI API key
429	Rate Limited	Exceeded request limits or OpenAI quota
500	Internal Server Error	Service unavailable, AI model issues
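
Of the codes above, 429 and 500 are usually worth retrying after a pause, while 400 and 401 will fail again until the request itself is fixed. The sketch below shows one possible retry loop with exponential backoff. It is an assumption rather than documented API behavior: call_with_retry and RETRYABLE_STATUSES are hypothetical names, and send stands for any zero-argument callable that performs the request and returns (status_code, body).

```python
import time

RETRYABLE_STATUSES = {429, 500}  # assumption: treated as transient


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or last_attempt:
            return status, body
        sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Simulated server: rate limited once, then successful.
responses = iter([(429, {"error": "Rate limited"}), (200, {"success": True})])
print(call_with_retry(lambda: next(responses), sleep=lambda seconds: None))
```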

Payload Validation Errors

The API performs intelligent validation of your request payload:

Example: Entity Count Mismatch

{
  "error": "Invalid request payload",
  "issues": [
    "Prompt mentions 4 entities (customers, products, orders, reviews) but datasets=3"
  ],
  "suggestions": [
    "Set datasets=4 to match entities in prompt",
    "Or modify prompt to only mention 3 entities"
  ]
}

Example: Dataset Size Exceeded

{
  "error": "Invalid request payload",
  "issues": [
    "Dataset size 2,000,000 exceeds limit. With 3 datasets, max size per dataset is 1,666,666"
  ],
  "suggestions": []
}

Custom Template Errors

When using custom datasets, you may encounter specific validation errors:

Example: Missing Prompt for Custom Template

{
  "error": "Validation error",
  "detail": [
    {
      "type": "value_error", 
      "msg": "Prompt is required when template_id is 'custom'"
    }
  ]
}

Example: Custom Template Generation Failed

{
  "error": "Please provide a clearer description. Include: business domain, key entities/tables, and relationships between data",
  "type": "custom_template_error",
  "suggestions": [
    "Be more specific about your business domain",
    "Include the main entities/tables you need", 
    "Describe relationships between different data types",
    "Example: 'E-commerce platform with customers, orders, and products...'"
  ]
}

Example: Template Not Found

{
  "error": "Template 'invalid_template_id' not found",
  "suggestion": "Use GET /templates to see available templates"
}

💡 Auto-Correction: When possible, the API will automatically adjust your dataset count to match the entities mentioned in your prompt, with a warning message.
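
The error bodies shown in this section do not all share one shape: detail may be a string or a list of objects, and hints arrive as issues, suggestions, or a single suggestion. A minimal sketch that flattens these variants into one message, based only on the examples above; describe_error is a hypothetical helper.

```python
def describe_error(body):
    """Flatten the documented error shapes into a multi-line message."""
    lines = [str(body.get("error", "Unknown error"))]

    detail = body.get("detail")
    if isinstance(detail, str):
        lines.append(detail)
    elif isinstance(detail, list):
        for item in detail:
            # Validation errors carry their text in "msg".
            lines.append(item.get("msg", str(item)) if isinstance(item, dict) else str(item))

    lines.extend(body.get("issues", []))

    suggestions = list(body.get("suggestions", []))
    if "suggestion" in body:
        suggestions.append(body["suggestion"])
    lines.extend(f"Suggestion: {text}" for text in suggestions)
    return "\n".join(lines)


print(describe_error({
    "error": "Template 'invalid_template_id' not found",
    "suggestion": "Use GET /templates to see available templates",
}))
```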

Rate Limits

Datasets per request: Maximum 5
Records per dataset: Maximum 10,000
Total records per request: Maximum 100,000
Request timeout: 5 minutes

⚠️ Note: Large datasets may take several minutes to generate. Consider using webhooks or polling for very large requests.

The system automatically calculates maximum dataset size based on the number of datasets:

max_size_per_dataset = min(10,000, 100,000 / number_of_datasets)

# Examples:
# 1 dataset: max 10,000 records
# 2 datasets: max 10,000 records each (total: 20,000)
# 5 datasets: max 10,000 records each (total: 50,000)
# 10+ datasets: max 100,000 ÷ datasets records each (above the current maximum of 5 datasets per request)
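
The formula above translates directly into code. In this sketch, rounding down to a whole number of records is an assumption, and the function name simply mirrors the formula.

```python
def max_size_per_dataset(number_of_datasets):
    """Apply min(10,000, 100,000 / number_of_datasets), rounded down."""
    if number_of_datasets < 1:
        raise ValueError("number_of_datasets must be at least 1")
    return min(10_000, 100_000 // number_of_datasets)


for n in (1, 2, 5, 20):
    print(n, max_size_per_dataset(n))
```

Within the documented limit of 5 datasets per request, the result is always 10,000; the total-records cap only becomes the tighter bound above 10 datasets.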