DataGen API Documentation

Complete guide to generating synthetic datasets with AI-powered intelligence

Getting Started

DataGen provides a powerful REST API for generating synthetic datasets using advanced AI. Simply describe your data requirements in natural language, and our AI will create realistic, structured datasets with proper relationships and constraints.

Quick Start

To get started, you'll need a valid OpenAI API key with sufficient credits (see Authentication below).

💡 Pro Tip: Be specific in your dataset descriptions. The more detail you provide about entities, relationships, and business domain, the better the AI can generate realistic data.

Authentication

DataGen uses your OpenAI API key for authentication and AI processing. The key is passed in the request payload and is encrypted in transit. We never store your API keys.

API Key Requirements

  • Must be a valid OpenAI API key starting with sk-
  • Should have sufficient credits for processing requests
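A quick client-side check can catch a malformed key before any request is sent. The sketch below only tests the documented sk- prefix. The helper name looks_like_openai_key is illustrative and not part of the DataGen API, and passing this check does not mean the key is valid or has credits.

```python
def looks_like_openai_key(api_key) -> bool:
    """Return True if the value has the documented 'sk-' prefix and a non-empty body."""
    if not isinstance(api_key, str):
        return False
    key = api_key.strip()
    return key.startswith("sk-") and len(key) > len("sk-")


print(looks_like_openai_key("sk-your-openai-api-key"))  # True
print(looks_like_openai_key("my-key"))                  # False
```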

⚠️ Security Notice: Your API key is encrypted in transit and decrypted on the server for processing, but it is never stored.

API Reference

Base URL

https://syntheticdatagen.xyz/api

Generate Custom Datasets

POST /generate

Generates synthetic datasets based on your natural language description. Use this when you need fully custom datasets.

Features: Custom prompts, flexible schema generation

Limitations: Does not support deterministic generation (no seed parameter)

Generate from Template

POST /generate/template

Generates synthetic datasets using predefined templates. Use this with a template_id returned by the /templates endpoint.

Features: Predefined schemas, deterministic generation with seed parameter, faster processing

Best for: Reproducible datasets, testing, standardized data structures
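As a rough illustration of how the two endpoints divide the work, the helper below picks a URL from the request payload. It assumes, as the examples in this guide do, that template_id "custom" is sent to /generate and every other template to /generate/template. Treating a missing template_id as custom is also an assumption, and the name choose_endpoint is hypothetical.

```python
API_BASE = "https://syntheticdatagen.xyz/api"


def choose_endpoint(payload: dict) -> str:
    """Pick the generation endpoint for a request payload."""
    # Assumption: a payload without template_id is treated as a custom request.
    is_custom = payload.get("template_id", "custom") == "custom"
    if is_custom and "seed" in payload:
        # The seed parameter is documented only for /generate/template.
        raise ValueError("seed is only supported by /generate/template")
    path = "/generate" if is_custom else "/generate/template"
    return f"{API_BASE}{path}"
```

Rejecting a seed on custom requests early avoids sending a request that cannot be reproducible anyway.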

Content-Type

Content-Type: application/json

Templates

DataGen provides pre-built templates for common data scenarios. Templates include predefined schemas, relationships, and realistic data patterns for various business domains.

List All Templates

GET /templates

Returns all available templates with their descriptions and metadata.

GET /api/templates

Optional Parameters

  • category (string): Filter templates by category (e.g., "E-commerce", "Healthcare")

Example Response
[
  {
    "id": "sales_crm",
    "name": "Sales CRM",
    "description": "Sales CRM system with leads, companies, and activities...",
    "category": "Sales & Marketing",
    "datasets": 3,
    "suggested_size": 1000,
    "suggested_reality": 4,
    "preview_schema": {
      "companies": ["id", "name", "industry", "employee_count"],
      "leads": ["id", "name", "email", "company_id", "stage"],
      "activities": ["id", "type", "lead_id", "date", "outcome"]
    }
  },
  {
    "id": "custom",
    "name": "Custom Dataset", 
    "description": "Create your own custom dataset by describing your data requirements...",
    "category": "Custom",
    "datasets": 1,
    "suggested_size": 1000,
    "suggested_reality": 4,
    "preview_schema": {
      "example": ["Depends on your description - AI will generate appropriate schema"]
    }
  }
]

Get Template Details

GET /templates/{template_id}

Returns detailed information about a specific template.

GET /api/templates/sales_crm

Get Template Categories

GET /templates/categories

Returns all available template categories.

Example Response
{
  "categories": [
    "API Testing",
    "Custom", 
    "E-commerce",
    "Events",
    "Finance",
    "Healthcare",
    "Human Resources",
    "IoT & Sensors",
    "Logistics",
    "Marketing",
    "SaaS",
    "Sales & Marketing",
    "Social Media",
    "User Management"
  ]
}

Custom Datasets

When existing templates don't fit your needs, use the custom dataset feature. Simply describe your data requirements in natural language, and AI will parse your description to generate appropriate schemas and relationships.

How Custom Datasets Work

  1. Description Analysis: AI parses your natural language description
  2. Schema Generation: Creates technical prompts and preview schemas
  3. Data Creation: Generates realistic data matching your requirements
  4. Relationship Building: Establishes proper foreign key relationships

Custom Dataset Guidelines

Example Description:
"E-commerce platform with customers, orders, and products. Include customer profiles, order history with line items, product catalog with inventory tracking, and customer reviews. Track order status, payment methods, and shipping addresses."

Writing Effective Custom Prompts

  • Describe your business domain: E-commerce, healthcare, finance, etc.
  • List main entities: customers, orders, products, etc.
  • Mention key relationships: how entities connect to each other
  • Include important attributes: specific fields you need

⚠️ Custom Template Requirements: When using template_id: "custom", the prompt parameter becomes required and must be at least 10 characters long.

Request Parameters

Parameter     | Type    | Required  | Description
--------------|---------|-----------|------------
api_key       | string  | Required  | Your OpenAI API key (sk-...)
template_id   | string  | Optional  | Template ID to use (e.g., "sales_crm", "custom"). Use "custom" for custom descriptions.
prompt        | string  | Optional* | Natural language description. Required only when template_id is "custom".
datasets      | integer | Optional  | Number of datasets to generate (1-5, default: 2)
dataset_size  | integer | Optional  | Records per dataset (1-10,000, default: 1000)
reality       | integer | Optional  | Reality level 0-10 (0=clean, 10=messy, default: 3)
output_format | string  | Optional  | Format: "json" (default). Returns direct JSON response.
seed          | integer | Optional  | Seed for deterministic generation (≥1). Only available for /generate/template endpoint. Ensures identical input produces identical output.
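The ranges in the parameter table can be checked locally before a request is sent. The sketch below is a client-side convenience under stated assumptions: validate_payload is a hypothetical helper, its messages are made up for illustration, and the server remains the authority on what it accepts.

```python
def validate_payload(payload: dict) -> list:
    """Return a list of problems found in a request payload (empty if none)."""
    problems = []
    if not str(payload.get("api_key", "")).startswith("sk-"):
        problems.append("api_key must start with 'sk-'")
    if payload.get("template_id") == "custom":
        if len(payload.get("prompt") or "") < 10:
            problems.append("prompt must be at least 10 characters for custom")
    # (field, minimum, maximum) from the parameter table; None means no upper bound
    ranges = [("datasets", 1, 5), ("dataset_size", 1, 10_000),
              ("reality", 0, 10), ("seed", 1, None)]
    for field, low, high in ranges:
        if field not in payload:
            continue  # optional parameters may be omitted
        value = payload[field]
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{field} must be an integer")
        elif value < low or (high is not None and value > high):
            problems.append(f"{field} is out of range")
    return problems


print(validate_payload({"api_key": "sk-demo", "template_id": "custom", "prompt": "short"}))
```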

Response Format

Success Response (200 OK)

Response Structure
{
  "success": true,
  "message": "Successfully generated 2 dataset(s)",
  "datasets": {
    "customers": {
      "format": "json",
      "size": 1000,
      "columns": ["id", "name", "email", "signup_date"],
      "content": [...] // Array of 1000 customer records
    },
    "orders": {
      "format": "json", 
      "size": 1000,
      "columns": ["order_id", "customer_id", "total", "order_date"],
      "content": [...] // Array of 1000 order records
    }
  },
  "metadata": {
    "total_datasets": 2,
    "total_records": 2000,
    "reality_level": 3,
    "output_format": "json",
    "prompt": "E-commerce platform...",
    "generation_timestamp": "2024-01-15T10:30:00Z",
    "validation": {
      "entity_types_generated": ["customers", "orders"],
      "validation_status": "completed",
      "validation_score": 9
    }
  },
  "download_links": null // Only used by web UI
}
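A response of this shape can be sanity-checked by comparing each dataset's declared size and columns with its content. The sketch below uses only the fields shown in the structure above. The helper name check_response is hypothetical, and whether every row always carries every column (especially at higher reality levels) is an assumption to verify against real output.

```python
def check_response(result: dict) -> list:
    """Compare each dataset's declared size and columns with its content."""
    problems = []
    total = 0
    for name, dataset in result.get("datasets", {}).items():
        rows = dataset.get("content", [])
        total += len(rows)
        if len(rows) != dataset.get("size"):
            problems.append(f"{name}: size says {dataset.get('size')}, found {len(rows)} rows")
        expected = set(dataset.get("columns", []))
        for index, row in enumerate(rows):
            if set(row) != expected:
                problems.append(f"{name}: row {index} keys differ from 'columns'")
                break  # one report per dataset is enough
    declared = result.get("metadata", {}).get("total_records")
    if declared is not None and declared != total:
        problems.append(f"metadata.total_records is {declared}, found {total}")
    return problems
```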

Error Response (4xx/5xx)

Error Structure
{
  "error": "Invalid API key format",
  "detail": "API key must start with 'sk-'"
}

Code Examples

Complete Template Workflow (Python)

This example shows the complete workflow: listing templates, selecting one, and generating data.

import requests
import json

API_BASE = "https://syntheticdatagen.xyz/api"
API_KEY = "sk-your-openai-api-key"
headers = {"Content-Type": "application/json"}

# Step 1: List all available templates
print("📋 Fetching available templates...")
templates_response = requests.get(f"{API_BASE}/templates")

if templates_response.status_code == 200:
    templates = templates_response.json()

    # Display templates
    print(f"Found {len(templates)} templates:")
    for template in templates[:5]:  # Show first 5
        print(f"  • {template['id']}: {template['name']} ({template['category']})")
        print(f"    {template['description'][:80]}...")
        print()
else:
    print("Failed to fetch templates")
    raise SystemExit(1)  # Stop with a non-zero exit code

# Step 2: Generate data using a predefined template
print("🎯 Using predefined template...")
template_data = {
    "api_key": API_KEY,
    "template_id": "sales_crm",  # No prompt needed
    "dataset_size": 1000,
    "reality": 4,
    "output_format": "json",
    "seed": 42  # Optional: For deterministic/reproducible results
}

response = requests.post(f"{API_BASE}/generate/template", headers=headers, json=template_data)

if response.status_code == 200:
    result = response.json()
    print(f"✅ Generated {result['metadata']['total_datasets']} datasets")

    # Process datasets
    for name, dataset in result["datasets"].items():
        print(f"  📊 {name}: {dataset['size']} records")

        # Save to file
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)
else:
    print(f"❌ Error: {response.status_code}")
    print(response.json())

# Step 3: Generate custom dataset
print("\n🎨 Creating custom dataset...")
custom_data = {
    "api_key": API_KEY,
    "template_id": "custom",  # Triggers custom mode
    "prompt": "Restaurant management system with customers, reservations, menu items, and orders. Include table management, wait staff assignments, and kitchen order tracking.",
    "datasets": 4,
    "dataset_size": 800,
    "reality": 3,
    "output_format": "json"
}

custom_response = requests.post(f"{API_BASE}/generate", headers=headers, json=custom_data)

if custom_response.status_code == 200:
    custom_result = custom_response.json()
    print(f"✅ Custom dataset created with {custom_result['metadata']['total_datasets']} tables")

    for name, dataset in custom_result["datasets"].items():
        print(f"  📊 {name}: {dataset['size']} records")
else:
    print(f"❌ Custom generation failed: {custom_response.status_code}")
    print(custom_response.json())

Using Predefined Templates (cURL)

# First, list available templates
curl -X GET https://syntheticdatagen.xyz/api/templates | jq '.[] | {id: .id, name: .name, category: .category}'

# Then use a specific template (no prompt needed)
curl -X POST https://syntheticdatagen.xyz/api/generate/template \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "sk-your-openai-api-key",
    "template_id": "ecommerce_platform",
    "dataset_size": 1000,
    "reality": 4,
    "output_format": "json",
    "seed": 12345
  }' \
  | jq '.'

Custom Dataset Creation (cURL)

# Create custom dataset with detailed description
curl -X POST https://syntheticdatagen.xyz/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "sk-your-openai-api-key",
    "template_id": "custom",
    "prompt": "University management system with students, courses, professors, and enrollments. Include GPA tracking, course prerequisites, and semester scheduling.",
    "datasets": 4,
    "dataset_size": 2000,
    "reality": 3,
    "output_format": "json"
  }' \
  | jq '.'

Template Categories Filtering

# Get templates by category
import requests

# List all categories
categories = requests.get("https://syntheticdatagen.xyz/api/templates/categories").json()
print("Available categories:", categories['categories'])

# Get templates for a specific category
healthcare_templates = requests.get(
    "https://syntheticdatagen.xyz/api/templates?category=Healthcare"
).json()

print(f"Healthcare templates: {len(healthcare_templates)}")
for template in healthcare_templates:
    print(f"  • {template['id']}: {template['name']}")
    print(f"    Datasets: {template['datasets']}, Size: {template['suggested_size']}")
    print(f"    Schema: {list(template['preview_schema'].keys())}")

API Usage Example

Simple, direct JSON response - no encoding or decoding required:

# Direct JSON API access
import requests
import json

template_data = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "digital_marketing",
    "dataset_size": 1000,
    "reality": 4,
    "output_format": "json"  # Returns direct JSON
}

response = requests.post("https://syntheticdatagen.xyz/api/generate/template", 
                        headers={"Content-Type": "application/json"}, 
                        json=template_data)

if response.status_code == 200:
    result = response.json()
    
    # Direct access to datasets
    for name, dataset in result["datasets"].items():
        print(f"📊 {name}: {dataset['size']} records")
        
        # dataset['content'] is already a list of dictionaries
        for record in dataset['content'][:3]:  # Show first 3 records
            print(f"  {record}")
        
        # Save individual JSON files
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)

Deterministic Generation

Use the seed parameter to generate identical datasets for testing and reproducibility. Only available with the /generate/template endpoint.

# Generate identical datasets every time with same seed
import requests

API_BASE = "https://syntheticdatagen.xyz/api"

# This will always produce identical results
deterministic_request = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "sales_crm",
    "dataset_size": 500,
    "reality": 3,
    "seed": 2024  # Fixed seed for reproducible results
}

# First generation
response1 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result1 = response1.json()

# Second generation with same seed - will be identical
response2 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result2 = response2.json()

print(f"Results are identical: {result1['datasets'] == result2['datasets']}")

# Different seed produces different data
deterministic_request["seed"] = 2025
response3 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result3 = response3.json()

print(f"Different seed produces different data: {result1['datasets'] != result3['datasets']}")

💡 Deterministic Generation Use Cases:

  • Consistent test datasets across environments
  • Reproducible machine learning experiments
  • Debugging data processing pipelines
  • Demo environments with predictable data
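When seeded datasets are compared across environments, a short fingerprint is often easier to store and diff than the full payload. The sketch below hashes the datasets object after serialising it with sorted keys. The helper name dataset_fingerprint is hypothetical, and it assumes the response content is plain JSON data.

```python
import hashlib
import json


def dataset_fingerprint(datasets: dict) -> str:
    """Hash the datasets with sorted keys so equal data gives equal digests."""
    canonical = json.dumps(datasets, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


sample = {"customers": {"content": [{"id": 1, "name": "Ada"}]}}
print(dataset_fingerprint(sample)[:12])
```

Recording the fingerprint next to the seed gives a cheap way to detect when a regenerated dataset no longer matches the one a test was written against.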

Data Formats

JSON Format: Returns a clean, direct JSON response with individual datasets. Each dataset contains an array of objects with a consistent schema, ready for immediate use in your applications.
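Since JSON is the only documented output format, tools that expect CSV need a conversion step on the client. A minimal sketch using the standard csv module follows. It relies on the columns and content fields shown in the response structure, and dataset_to_csv is a hypothetical helper name.

```python
import csv
import io


def dataset_to_csv(dataset: dict) -> str:
    """Render one dataset from the response as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=dataset["columns"],
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in dataset["content"]:
        writer.writerow(row)  # keys missing from a row become empty cells
    return buffer.getvalue()


example = {"columns": ["id", "name"], "content": [{"id": 1, "name": "Ada"}, {"id": 2}]}
print(dataset_to_csv(example))
```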

Reality Levels

Control the "messiness" of your synthetic data to simulate real-world conditions:

Level | Description                                                     | Use Cases
------|-----------------------------------------------------------------|------------------------------------
0-2   | Perfect data with no inconsistencies                            | Testing ideal scenarios, demos
3-5   | Minor inconsistencies (formatting variations, occasional nulls) | General testing, development
6-8   | Moderate issues (duplicates, typos, missing values)             | Data cleaning workflows, ETL testing
9-10  | Highly messy (inconsistent formats, many nulls, outliers)       | Stress testing, data quality tools
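For logging or test naming it can help to translate a numeric reality level into the band it falls in. The sketch below encodes the four bands from the table above. The function name reality_band and the short labels it returns are illustrative only.

```python
def reality_band(level: int) -> str:
    """Map a reality level (0-10) to the band described in the table."""
    if not 0 <= level <= 10:
        raise ValueError("reality must be between 0 and 10")
    if level <= 2:
        return "perfect"
    if level <= 5:
        return "minor inconsistencies"
    if level <= 8:
        return "moderate issues"
    return "highly messy"


print(reality_band(4))  # minor inconsistencies
```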

Best Practices

Choosing Templates vs Custom

Use Predefined Templates When:

  • Your use case matches a common business scenario
  • You want standardized, tested schemas
  • You need faster generation times
  • You're building prototypes or demos

Use Custom Datasets When:

  • Your domain is highly specialized
  • You need specific entity relationships
  • Existing templates don't fit your requirements
  • You want full control over the schema

Use Deterministic Generation (seed parameter) When:

  • Building reproducible test suites or demos
  • Running machine learning experiments that need consistent training data
  • Debugging data processing pipelines with predictable inputs
  • Sharing datasets that others need to recreate exactly
  • Creating consistent environments across dev/staging/prod

Template Workflow Best Practices

  1. Explore first: Always check /templates before creating custom datasets
  2. Filter by category: Use category filtering to find relevant templates faster
  3. Check preview schemas: Review template schemas before generation
  4. Override settings: Adjust size and reality levels for your needs
  5. Test small first: Generate small samples before creating large datasets
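Steps 4 and 5 can be combined in a small payload builder that starts from a template's suggested settings and optionally shrinks the size for a trial run. This is a sketch under assumptions: payload_from_template is a hypothetical helper, the sample cap of 100 records is an arbitrary choice, and the template fields are those shown in the /templates example response.

```python
def payload_from_template(template: dict, api_key: str, sample: bool = False, **overrides) -> dict:
    """Build a /generate/template payload from a template's suggested settings."""
    payload = {
        "api_key": api_key,
        "template_id": template["id"],
        "dataset_size": template["suggested_size"],
        "reality": template["suggested_reality"],
        "output_format": "json",
    }
    if sample:
        # Assumed cap for a quick trial run before generating the full size.
        payload["dataset_size"] = min(payload["dataset_size"], 100)
    payload.update(overrides)  # explicit overrides win over suggestions
    return payload


template = {"id": "sales_crm", "suggested_size": 1000, "suggested_reality": 4}
print(payload_from_template(template, "sk-your-openai-api-key", sample=True, seed=42))
```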

Writing Effective Custom Prompts

  • Business context first: Start with domain (e.g., "Healthcare system", "E-commerce platform")
  • List main entities: Enumerate key data types (customers, orders, products)
  • Describe relationships: Explain how entities connect
  • Mention key attributes: Include important fields and constraints
  • Be specific: More detail leads to better schemas

Excellent custom prompt example:
"Healthcare clinic management system with patients, doctors, appointments, and medical records. Include patient demographics and insurance information, doctor specializations and schedules, appointment booking with time slots, and medical records with diagnosis codes, prescribed medications, and treatment notes. Track appointment status (scheduled, completed, cancelled) and insurance claim processing."

⚠️ Avoid vague prompts: "Generate some business data" or "Create a database" are too general and will result in poor schemas.

Performance Optimization

  • Templates are faster: Predefined templates generate data more quickly than custom
  • Start small: Test with 100-1000 records before scaling up
  • Reality levels impact speed: Higher reality levels take more processing time
  • Batch requests: Generate multiple datasets in one request when possible
  • Cache generated data: Store and reuse datasets for testing
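The caching advice above can be implemented with a file cache keyed on the request payload. The sketch below is one possible approach, not a DataGen feature: the helper names and the .datagen_cache directory are hypothetical, and the API key is deliberately left out of the cache key so it never influences or appears in file names.

```python
import hashlib
import json
from pathlib import Path


def cache_path(payload: dict, cache_dir: str = ".datagen_cache") -> Path:
    """Map a request payload to a cache file, ignoring the API key."""
    keyed = {k: v for k, v in payload.items() if k != "api_key"}
    digest = hashlib.sha256(json.dumps(keyed, sort_keys=True).encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest[:16]}.json"


def load_cached(payload: dict, cache_dir: str = ".datagen_cache"):
    """Return the cached response for this payload, or None if absent."""
    path = cache_path(payload, cache_dir)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return None


def save_cached(payload: dict, result: dict, cache_dir: str = ".datagen_cache") -> None:
    """Store a response so later runs can reuse it without a new request."""
    path = cache_path(payload, cache_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
```

A typical flow is to call load_cached first and only send the request, followed by save_cached, when it returns None.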

Deterministic Generation Best Practices

  • Use consistent seeds: Document seed values for important datasets
  • Seed management: Use meaningful seeds (dates, version numbers) for traceability
  • Version control: Store seed values alongside your test code
  • Environment consistency: Use the same seeds across dev/test/prod for comparable datasets
  • Regression testing: Change seeds when you want to test with different data patterns

💡 Seed Strategy Example:
Use dates as seeds: seed: 20240315 for datasets created on March 15, 2024. This makes it easy to track when specific test data was generated and recreate it later.
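The date-based scheme is easy to automate. The helper below (seed_from_date is a hypothetical name) encodes a date as a YYYYMMDD integer, which always satisfies the documented seed minimum of 1.

```python
from datetime import date


def seed_from_date(day: date) -> int:
    """Encode a date as a YYYYMMDD integer seed."""
    return day.year * 10_000 + day.month * 100 + day.day


print(seed_from_date(date(2024, 3, 15)))  # 20240315
```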

Data Quality

  • Validate schemas: Check that generated structure matches expectations
  • Verify relationships: Ensure foreign keys reference correct entities
  • Test reality levels: Find the optimal messiness for your use case
  • Inspect samples: Review generated data before using in production tests
  • Check data types: Verify field types match your database schema
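The relationship check in particular is simple to script. The sketch below lists child rows whose foreign key has no matching parent. It assumes the content arrays are lists of dictionaries as in the response examples, treats null foreign keys as acceptable, and find_orphans is a hypothetical helper name.

```python
def find_orphans(child_rows, foreign_key, parent_rows, primary_key="id"):
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {
        row[primary_key] for row in parent_rows if row.get(primary_key) is not None
    }
    return [
        row for row in child_rows
        # Null foreign keys are skipped here; messy data may contain them on purpose.
        if row.get(foreign_key) is not None and row[foreign_key] not in parent_ids
    ]


customers = [{"id": 1}, {"id": 2}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 3}]
print(find_orphans(orders, "customer_id", customers))
```

At higher reality levels some orphans may be intentional, so the result is better read as a report than as a pass/fail gate.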

API Integration Patterns

# Pattern: Template discovery and selection
import requests

API_BASE = "https://syntheticdatagen.xyz/api"

def select_best_template(domain_keywords):
    templates = requests.get(f"{API_BASE}/templates").json()

    # Score templates by keyword matches in description
    scored = []
    for template in templates:
        description = template['description'].lower()
        score = sum(1 for keyword in domain_keywords
                    if keyword.lower() in description)
        if score > 0:
            scored.append((score, template))

    # Return best match or suggest custom
    if scored:
        return max(scored, key=lambda x: x[0])[1]
    else:
        # Include "name" so callers can print the result either way
        return {"id": "custom", "name": "Custom Dataset",
                "message": "No matching template, use custom"}

# Usage
best_template = select_best_template(["healthcare", "patient", "appointment"])
print(f"Recommended: {best_template['name']}")

Error Handling

Common Error Types

Error Code | Description           | Common Causes
-----------|-----------------------|----------------------------------------------------------------
400        | Bad Request           | Invalid payload, inconsistent parameters, constraint violations
401        | Unauthorized          | Invalid or expired OpenAI API key
429        | Rate Limited          | Exceeded request limits or OpenAI quota
500        | Internal Server Error | Service unavailable, AI model issues
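Of the codes above, 429 and 500 describe conditions that may clear on their own, while 400 and 401 need a corrected request. A minimal retry sketch with exponential backoff follows. It is an assumption that retrying is appropriate for your workload, the delays are arbitrary, and call_with_retry is a hypothetical helper that accepts any zero-argument function returning an object with a status_code attribute.

```python
import time

RETRYABLE = {429, 500}  # rate limited or server error, per the table above


def call_with_retry(send, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    response = None
    for attempt in range(attempts):
        response = send()
        if response.status_code not in RETRYABLE:
            break
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 2s, then 4s, between tries
    return response
```

With the requests library used elsewhere in this guide, a call could look like call_with_retry(lambda: requests.post(url, json=payload, timeout=300)), where the 300-second timeout mirrors the 5-minute request timeout listed under Rate Limits.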

Payload Validation Errors

The API performs intelligent validation of your request payload:

Example: Entity Count Mismatch
{
  "error": "Invalid request payload",
  "issues": [
    "Prompt mentions 4 entities (customers, products, orders, reviews) but datasets=3"
  ],
  "suggestions": [
    "Set datasets=4 to match entities in prompt",
    "Or modify prompt to only mention 3 entities"
  ]
}

Example: Dataset Size Exceeded
{
  "error": "Invalid request payload",
  "issues": [
    "Dataset size 2,000,000 exceeds limit. With 3 datasets, max size per dataset is 1,666,666"
  ],
  "suggestions": []
}

Custom Template Errors

When using custom datasets, you may encounter specific validation errors:

Example: Missing Prompt for Custom Template
{
  "error": "Validation error",
  "detail": [
    {
      "type": "value_error", 
      "msg": "Prompt is required when template_id is 'custom'"
    }
  ]
}

Example: Custom Template Generation Failed
{
  "error": "Please provide a clearer description. Include: business domain, key entities/tables, and relationships between data",
  "type": "custom_template_error",
  "suggestions": [
    "Be more specific about your business domain",
    "Include the main entities/tables you need", 
    "Describe relationships between different data types",
    "Example: 'E-commerce platform with customers, orders, and products...'"
  ]
}

Example: Template Not Found
{
  "error": "Template 'invalid_template_id' not found",
  "suggestion": "Use GET /templates to see available templates"
}
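The error bodies shown in this guide vary in shape: detail may be a string or a list of objects, and hints arrive as either suggestions or suggestion. The sketch below flattens those variants into one message. It covers only the shapes documented here, and describe_error is a hypothetical helper name.

```python
def describe_error(body: dict) -> str:
    """Flatten the documented error shapes into one readable message."""
    lines = [str(body.get("error", "Unknown error"))]
    detail = body.get("detail")
    if isinstance(detail, str):
        lines.append(detail)
    elif isinstance(detail, list):
        # Validation errors carry a list of objects with a "msg" field.
        lines.extend(str(item.get("msg", item)) for item in detail if isinstance(item, dict))
    lines.extend(f"Issue: {issue}" for issue in body.get("issues", []))
    hints = list(body.get("suggestions", []))
    if "suggestion" in body:  # the "not found" error uses a singular key
        hints.append(body["suggestion"])
    lines.extend(f"Suggestion: {hint}" for hint in hints)
    return "\n".join(lines)


print(describe_error({
    "error": "Template 'invalid_template_id' not found",
    "suggestion": "Use GET /templates to see available templates",
}))
```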

💡 Auto-Correction: When possible, the API will automatically adjust your dataset count to match the entities mentioned in your prompt, with a warning message.

Rate Limits

  • Datasets per request: Maximum 5
  • Records per dataset: Maximum 10,000
  • Total records per request: Maximum 100,000
  • Request timeout: 5 minutes

⚠️ Note: Large datasets may take several minutes to generate. Consider using webhooks or polling for very large requests.

The system automatically calculates maximum dataset size based on the number of datasets:

max_size_per_dataset = min(10,000, 100,000 / number_of_datasets)

# Examples:
# 1 dataset: max 10,000 records
# 2 datasets: max 10,000 records each (total: 20,000)
# 5 datasets: max 10,000 records each (total: 50,000)
# 10+ datasets: max 100,000 ÷ datasets records each
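The same calculation can be written as a small runnable function for use in client-side checks. The two limits are taken from the list above. Rounding down to a whole number of records is an assumption, and the constant and function names are illustrative.

```python
MAX_PER_DATASET = 10_000   # records per dataset
MAX_TOTAL = 100_000        # total records per request


def max_size_per_dataset(number_of_datasets: int) -> int:
    """Largest dataset_size allowed for the given number of datasets."""
    if number_of_datasets < 1:
        raise ValueError("number_of_datasets must be at least 1")
    # Assumption: fractional results are rounded down.
    return min(MAX_PER_DATASET, MAX_TOTAL // number_of_datasets)


for count in (1, 2, 5):
    print(count, max_size_per_dataset(count))
```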