Complete guide to generating synthetic datasets with AI-powered intelligence
DataGen provides a powerful REST API for generating synthetic datasets using advanced AI. Simply describe your data requirements in natural language, and our AI will create realistic, structured datasets with proper relationships and constraints.
To get started, you'll need an OpenAI API key, which is passed with every request as described in the authentication section below.
💡 Pro Tip: Be specific in your dataset descriptions. The more detail you provide about entities, relationships, and business domain, the better the AI can generate realistic data.
DataGen uses your OpenAI API key for authentication and AI processing. The key is passed in the request payload and is encrypted in transit. We never store your API keys.
OpenAI API keys start with the sk- prefix.
⚠️ Security Notice: Your API key is encrypted and decrypted on the server but never stored.
https://syntheticdatagen.xyz/api
POST /generate
Generates synthetic datasets based on your natural language description. Use this when you need fully custom datasets.
Features: Custom prompts, flexible schema generation
Limitations: Does not support deterministic generation (no seed parameter)
POST /generate/template
Generates synthetic datasets using predefined templates. Use this with template_id from /templates endpoint.
Features: Predefined schemas, deterministic generation with seed parameter, faster processing
Best for: Reproducible datasets, testing, standardized data structures
Content-Type: application/json
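A minimal sketch of how the two endpoints are called in practice (payload values are illustrative; complete examples appear later in this guide):

import requests

API_BASE = "https://syntheticdatagen.xyz/api"
API_KEY = "sk-your-openai-api-key"

# /generate: fully custom data described in natural language (no seed support)
custom = requests.post(f"{API_BASE}/generate", json={
    "api_key": API_KEY,
    "template_id": "custom",
    "prompt": "E-commerce platform with customers, orders, and products"
})

# /generate/template: predefined schema, optionally deterministic via a seed
templated = requests.post(f"{API_BASE}/generate/template", json={
    "api_key": API_KEY,
    "template_id": "sales_crm",
    "seed": 42
})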
DataGen provides pre-built templates for common data scenarios. Templates include predefined schemas, relationships, and realistic data patterns for various business domains.
GET /templates
Returns all available templates with their descriptions and metadata.
GET /api/templates
category
(string): Filter templates by category (e.g., "E-commerce", "Healthcare")
[
{
"id": "sales_crm",
"name": "Sales CRM",
"description": "Sales CRM system with leads, companies, and activities...",
"category": "Sales & Marketing",
"datasets": 3,
"suggested_size": 1000,
"suggested_reality": 4,
"preview_schema": {
"companies": ["id", "name", "industry", "employee_count"],
"leads": ["id", "name", "email", "company_id", "stage"],
"activities": ["id", "type", "lead_id", "date", "outcome"]
}
},
{
"id": "custom",
"name": "Custom Dataset",
"description": "Create your own custom dataset by describing your data requirements...",
"category": "Custom",
"datasets": 1,
"suggested_size": 1000,
"suggested_reality": 4,
"preview_schema": {
"example": ["Depends on your description - AI will generate appropriate schema"]
}
}
]
GET /templates/{template_id}
Returns detailed information about a specific template.
GET /api/templates/sales_crm
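For example, a quick Python sketch for fetching one template's details (assuming the detail response mirrors the fields shown in the list response above):

import requests

detail = requests.get("https://syntheticdatagen.xyz/api/templates/sales_crm").json()
print(detail["name"], "-", detail["category"])
print("Preview schema tables:", list(detail["preview_schema"].keys()))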
GET /templates/categories
Returns all available template categories.
{
"categories": [
"API Testing",
"Custom",
"E-commerce",
"Events",
"Finance",
"Healthcare",
"Human Resources",
"IoT & Sensors",
"Logistics",
"Marketing",
"SaaS",
"Sales & Marketing",
"Social Media",
"User Management"
]
}
When existing templates don't fit your needs, use the custom dataset feature. Simply describe your data requirements in natural language, and AI will parse your description to generate appropriate schemas and relationships.
Example Description:
"E-commerce platform with customers, orders, and products. Include customer profiles, order history with line items, product catalog with inventory tracking, and customer reviews. Track order status, payment methods, and shipping addresses."
⚠️ Custom Template Requirements: When using template_id: "custom", the prompt parameter becomes required and must be at least 10 characters long.
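A minimal custom request body might look like the following sketch, reusing the example description above:

custom_request = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "custom",   # triggers custom mode, so "prompt" is required
    "prompt": "E-commerce platform with customers, orders, and products. "
              "Include customer profiles, order history with line items, "
              "product catalog with inventory tracking, and customer reviews.",
    "datasets": 4,
    "dataset_size": 1000
}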
Parameter | Type | Required | Description |
---|---|---|---|
api_key | string | Required | Your OpenAI API key (sk-...) |
template_id | string | Optional | Template ID to use (e.g., "sales_crm", "custom"). Use "custom" for custom descriptions. |
prompt | string | Optional* | Natural language description. Required only when template_id is "custom". |
datasets | integer | Optional | Number of datasets to generate (1-5, default: 2) |
dataset_size | integer | Optional | Records per dataset (1-10,000, default: 1000) |
reality | integer | Optional | Reality level 0-10 (0=clean, 10=messy, default: 3) |
output_format | string | Optional | Format: "json" (default). Returns direct JSON response. |
seed | integer | Optional | Seed for deterministic generation (≥1). Only available for /generate/template endpoint. Ensures identical input produces identical output. |
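Putting these parameters together, a template-based request to /generate/template might look like this sketch (values are illustrative):

template_request = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "sales_crm",
    "dataset_size": 1000,
    "reality": 3,
    "output_format": "json",
    "seed": 42   # only honored by the /generate/template endpoint
}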
{
"success": true,
"message": "Successfully generated 2 dataset(s)",
"datasets": {
"customers": {
"format": "json",
"size": 1000,
"columns": ["id", "name", "email", "signup_date"],
"content": [...] // Array of 1000 customer records
},
"orders": {
"format": "json",
"size": 1000,
"columns": ["order_id", "customer_id", "total", "order_date"],
"content": [...] // Array of 1000 order records
}
},
"metadata": {
"total_datasets": 2,
"total_records": 2000,
"reality_level": 3,
"output_format": "json",
"prompt": "E-commerce platform...",
"generation_timestamp": "2024-01-15T10:30:00Z",
"validation": {
"entity_types_generated": ["customers", "orders"],
"validation_status": "completed",
"validation_score": 9
}
},
"download_links": null // Only used by web UI
}
{
"error": "Invalid API key format",
"detail": "API key must start with 'sk-'"
}
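You can catch this particular error before sending a request with a simple client-side check (a sketch mirroring the server's format rule):

def validate_api_key(api_key: str) -> None:
    # The server rejects keys that do not start with the 'sk-' prefix
    if not api_key.startswith("sk-"):
        raise ValueError("API key must start with 'sk-'")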
This example shows the complete workflow: listing templates, selecting one, and generating data.
import requests
import json
API_BASE = "https://syntheticdatagen.xyz/api"
API_KEY = "sk-your-openai-api-key"
headers = {"Content-Type": "application/json"}
# Step 1: List all available templates
print("📋 Fetching available templates...")
templates_response = requests.get(f"{API_BASE}/templates")
if templates_response.status_code == 200:
templates = templates_response.json()
# Display templates
print(f"Found {len(templates)} templates:")
for template in templates[:5]: # Show first 5
        print(f" • {template['id']}: {template['name']} ({template['category']})")
print(f" {template['description'][:80]}...")
print()
else:
print("Failed to fetch templates")
exit()
# Step 2: Generate data using a predefined template
print("🎯 Using predefined template...")
template_data = {
"api_key": API_KEY,
"template_id": "sales_crm", # No prompt needed
"dataset_size": 1000,
"reality": 4,
"output_format": "json",
"seed": 42 # Optional: For deterministic/reproducible results
}
response = requests.post(f"{API_BASE}/generate/template", headers=headers, json=template_data)
if response.status_code == 200:
result = response.json()
    print(f"✅ Generated {result['metadata']['total_datasets']} datasets")
# Process datasets
for name, dataset in result["datasets"].items():
        print(f" 📊 {name}: {dataset['size']} records")
# Save to file
with open(f"{name}.json", "w") as f:
json.dump(dataset["content"], f, indent=2)
else:
    print(f"❌ Error: {response.status_code}")
print(response.json())
# Step 3: Generate custom dataset
print("\n🎨 Creating custom dataset...")
custom_data = {
"api_key": API_KEY,
"template_id": "custom", # Triggers custom mode
"prompt": "Restaurant management system with customers, reservations, menu items, and orders. Include table management, wait staff assignments, and kitchen order tracking.",
"datasets": 4,
"dataset_size": 800,
"reality": 3,
"output_format": "json"
}
custom_response = requests.post(f"{API_BASE}/generate", headers=headers, json=custom_data)
if custom_response.status_code == 200:
custom_result = custom_response.json()
    print(f"✅ Custom dataset created with {custom_result['metadata']['total_datasets']} tables")
for name, dataset in custom_result["datasets"].items():
        print(f" 📊 {name}: {dataset['size']} records")
else:
    print(f"❌ Custom generation failed: {custom_response.status_code}")
print(custom_response.json())
# First, list available templates
curl -X GET https://syntheticdatagen.xyz/api/templates | jq '.[] | {id: .id, name: .name, category: .category}'
# Then use a specific template (no prompt needed)
curl -X POST https://syntheticdatagen.xyz/api/generate/template \
-H "Content-Type: application/json" \
-d '{
"api_key": "sk-your-openai-api-key",
"template_id": "ecommerce_platform",
"dataset_size": 1000,
"reality": 4,
"output_format": "json",
"seed": 12345
}' \
| jq '.'
# Create custom dataset with detailed description
curl -X POST https://syntheticdatagen.xyz/api/generate \
-H "Content-Type: application/json" \
-d '{
"api_key": "sk-your-openai-api-key",
"template_id": "custom",
"prompt": "University management system with students, courses, professors, and enrollments. Include GPA tracking, course prerequisites, and semester scheduling.",
"datasets": 4,
"dataset_size": 2000,
"reality": 3,
"output_format": "json"
}' \
| jq '.'
# Get templates by category
import requests
# List all categories
categories = requests.get("https://syntheticdatagen.xyz/api/templates/categories").json()
print("Available categories:", categories['categories'])
# Get templates for a specific category
healthcare_templates = requests.get(
"https://syntheticdatagen.xyz/api/templates?category=Healthcare"
).json()
print(f"Healthcare templates: {len(healthcare_templates)}")
for template in healthcare_templates:
    print(f" • {template['id']}: {template['name']}")
print(f" Datasets: {template['datasets']}, Size: {template['suggested_size']}")
print(f" Schema: {list(template['preview_schema'].keys())}")
Simple, direct JSON response - no encoding or decoding required:
# Direct JSON API access
import requests
import json
template_data = {
"api_key": "sk-your-openai-api-key",
"template_id": "digital_marketing",
"dataset_size": 1000,
"reality": 4,
"output_format": "json" # Returns direct JSON
}
response = requests.post("https://syntheticdatagen.xyz/api/generate/template",
headers={"Content-Type": "application/json"},
json=template_data)
if response.status_code == 200:
result = response.json()
# Direct access to datasets
for name, dataset in result["datasets"].items():
        print(f"📊 {name}: {dataset['size']} records")
# dataset['content'] is already a list of dictionaries
for record in dataset['content'][:3]: # Show first 3 records
print(f" {record}")
# Save individual JSON files
with open(f"{name}.json", "w") as f:
json.dump(dataset["content"], f, indent=2)
Use the seed parameter to generate identical datasets for testing and reproducibility. Only available with the /generate/template endpoint.
# Generate identical datasets every time with same seed
import requests
API_BASE = "https://syntheticdatagen.xyz/api"
# This will always produce identical results
deterministic_request = {
"api_key": "sk-your-openai-api-key",
"template_id": "sales_crm",
"dataset_size": 500,
"reality": 3,
"seed": 2024 # Fixed seed for reproducible results
}
# First generation
response1 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result1 = response1.json()
# Second generation with same seed - will be identical
response2 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result2 = response2.json()
print(f"Results are identical: {result1['datasets'] == result2['datasets']}")
# Different seed produces different data
deterministic_request["seed"] = 2025
response3 = requests.post(f"{API_BASE}/generate/template", json=deterministic_request)
result3 = response3.json()
print(f"Different seed produces different data: {result1['datasets'] != result3['datasets']}")
💡 Deterministic Generation Use Cases:
• Consistent test datasets across environments
• Reproducible machine learning experiments
• Debugging data processing pipelines
• Demo environments with predictable data
JSON Format: Returns clean, direct JSON response with individual datasets. Each dataset contains an array of objects with consistent schema - ready for immediate use in your applications.
Control the "messiness" of your synthetic data to simulate real-world conditions:
Level | Description | Use Cases |
---|---|---|
0-2 | Perfect data with no inconsistencies | Testing ideal scenarios, demos |
3-5 | Minor inconsistencies (formatting variations, occasional nulls) | General testing, development |
6-8 | Moderate issues (duplicates, typos, missing values) | Data cleaning workflows, ETL testing |
9-10 | Highly messy (inconsistent formats, many nulls, outliers) | Stress testing, data quality tools |
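For example, to produce deliberately messy data for an ETL or data-cleaning test, you might raise the reality level (a sketch using the sales_crm template):

messy_request = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "sales_crm",
    "dataset_size": 1000,
    "reality": 7   # moderate issues: duplicates, typos, missing values
}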
Check the available templates via GET /templates before creating custom datasets.
Excellent custom prompt example:
"Healthcare clinic management system with patients, doctors, appointments, and medical records. Include patient demographics and insurance information, doctor specializations and schedules, appointment booking with time slots, and medical records with diagnosis codes, prescribed medications, and treatment notes. Track appointment status (scheduled, completed, cancelled) and insurance claim processing."
⚠️ Avoid vague prompts: "Generate some business data" or "Create a database" are too general and will result in poor schemas.
💡 Seed Strategy Example:
Use dates as seeds, e.g. seed: 20240315 for datasets created on March 15, 2024. This makes it easy to track when specific test data was generated and recreate it later.
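A small Python sketch of that date-as-seed convention:

from datetime import date

# e.g. 20240315 for a dataset generated on March 15, 2024
seed = int(date.today().strftime("%Y%m%d"))

request_body = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "sales_crm",
    "seed": seed
}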
# Pattern: Template discovery and selection
def select_best_template(domain_keywords):
templates = requests.get(f"{API_BASE}/templates").json()
# Score templates by keyword matches in description
scored = []
for template in templates:
score = sum(1 for keyword in domain_keywords
if keyword.lower() in template['description'].lower())
if score > 0:
scored.append((score, template))
# Return best match or suggest custom
if scored:
return max(scored, key=lambda x: x[0])[1]
else:
return {"id": "custom", "message": "No matching template, use custom"}
# Usage
best_template = select_best_template(["healthcare", "patient", "appointment"])
print(f"Recommended: {best_template['name']}")
Error Code | Description | Common Causes |
---|---|---|
400 | Bad Request | Invalid payload, inconsistent parameters, constraint violations |
401 | Unauthorized | Invalid or expired OpenAI API key |
429 | Rate Limited | Exceeded request limits or OpenAI quota |
500 | Internal Server Error | Service unavailable, AI model issues |
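A hedged sketch of handling these status codes in Python; the retry and backoff policy here is an assumption, not something the API prescribes:

import time
import requests

def generate_with_retry(payload, retries=3):
    for attempt in range(retries):
        resp = requests.post("https://syntheticdatagen.xyz/api/generate/template", json=payload)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:
            # Rate limited: back off exponentially, then try again
            time.sleep(2 ** attempt)
            continue
        # 400/401: fix the payload or key; 500: service-side issue
        raise RuntimeError(f"Generation failed ({resp.status_code}): {resp.text}")
    raise RuntimeError("Still rate limited after all retries")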
The API performs intelligent validation of your request payload:
{
"error": "Invalid request payload",
"issues": [
"Prompt mentions 4 entities (customers, products, orders, reviews) but datasets=3"
],
"suggestions": [
"Set datasets=4 to match entities in prompt",
"Or modify prompt to only mention 3 entities"
]
}
{
"error": "Invalid request payload",
"issues": [
"Dataset size 2,000,000 exceeds limit. With 3 datasets, max size per dataset is 1,666,666"
],
"suggestions": []
}
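When a 400 response carries issues and suggestions like the examples above, you can surface them directly to the caller (sketch; assumes the error body follows that shape):

def print_validation_feedback(response):
    # Assumes a 400 error body shaped like the examples above
    if response.status_code != 400:
        return
    body = response.json()
    print("Error:", body.get("error"))
    for issue in body.get("issues", []):
        print("  issue:", issue)
    for suggestion in body.get("suggestions", []):
        print("  suggestion:", suggestion)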
When using custom datasets, you may encounter specific validation errors:
{
"error": "Validation error",
"detail": [
{
"type": "value_error",
"msg": "Prompt is required when template_id is 'custom'"
}
]
}
{
"error": "Please provide a clearer description. Include: business domain, key entities/tables, and relationships between data",
"type": "custom_template_error",
"suggestions": [
"Be more specific about your business domain",
"Include the main entities/tables you need",
"Describe relationships between different data types",
"Example: 'E-commerce platform with customers, orders, and products...'"
]
}
{
"error": "Template 'invalid_template_id' not found",
"suggestion": "Use GET /templates to see available templates"
}
💡 Auto-Correction: When possible, the API will automatically adjust your dataset count to match the entities mentioned in your prompt, with a warning message.
⚠️ Note: Large datasets may take several minutes to generate. Consider using webhooks or polling for very large requests.
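Because generation can take several minutes, it is also worth raising your HTTP client's timeout; a sketch with requests (the 10-minute value is an assumption, not a documented limit):

import requests

large_request = {
    "api_key": "sk-your-openai-api-key",
    "template_id": "sales_crm",
    "dataset_size": 10000   # at the per-dataset limit, so generation is slow
}

response = requests.post(
    "https://syntheticdatagen.xyz/api/generate/template",
    json=large_request,
    timeout=600   # allow up to 10 minutes before the client gives up
)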
The system automatically calculates maximum dataset size based on the number of datasets:
max_size_per_dataset = min(10,000, 100,000 / number_of_datasets)
# Examples:
# 1 dataset: max 10,000 records
# 2 datasets: max 10,000 records each (total: 20,000)
# 5 datasets: max 10,000 records each (total: 50,000)
# 10+ datasets: max 100,000 ÷ datasets records each
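A direct Python translation of this rule (a sketch mirroring the formula above):

def max_size_per_dataset(number_of_datasets: int) -> int:
    # Per-dataset cap of 10,000 records, with 100,000 records total across all datasets
    return min(10_000, 100_000 // number_of_datasets)

print(max_size_per_dataset(1))    # 10000
print(max_size_per_dataset(5))    # 10000
print(max_size_per_dataset(20))   # 5000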