A complete guide to generating synthetic datasets with AI
DataGen provides a powerful REST API for generating synthetic datasets using advanced AI. Simply describe your data requirements in natural language, and our AI will create realistic, structured datasets with proper relationships and constraints.
To get started, you'll need an OpenAI API key (sent with each request) and any HTTP client, such as curl or Python's requests library.
💡 Pro Tip: Be specific in your dataset descriptions. The more detail you provide about entities, relationships, and business domain, the better the AI can generate realistic data.
DataGen uses your OpenAI API key for authentication and AI processing. The key is passed in the request payload and is encrypted in transit. We never store your API keys.
⚠️ Security Notice: Your API key is encrypted in transit, decrypted only on the server to process your request, and never stored.
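For scripts, a common pattern is to read the key from an environment variable rather than hard-coding it. Below is a minimal sketch; the OPENAI_API_KEY variable name and the example prompt are illustrative choices, not something DataGen requires:

```python
import os
import requests

# Read the OpenAI key from the environment instead of hard-coding it.
# OPENAI_API_KEY is an illustrative variable name, not required by DataGen.
api_key = os.environ["OPENAI_API_KEY"]

response = requests.post(
    "https://syntheticdatagen.xyz/api/generate",
    headers={"Content-Type": "application/json"},
    json={
        "api_key": api_key,
        "prompt": "Retail store with customers and purchases",  # example prompt
    },
)
```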
Base URL: https://syntheticdatagen.xyz/api
POST /generate
Generates synthetic datasets based on your natural language description.
Content-Type: application/json
Parameter | Type | Required | Description
---|---|---|---
api_key | string | Required | Your OpenAI API key (sk-...)
prompt | string | Required | Natural language description of your dataset requirements
datasets | integer | Optional | Number of datasets to generate (1-5, default: 2)
dataset_size | integer | Optional | Records per dataset (1-5,000,000, default: 1000)
reality | integer | Optional | Reality level 0-10 (0 = clean, 10 = messy, default: 3)
output_format | string | Optional | Format: "json", "csv", or "sql" (default: "json")
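Only api_key and prompt are required; every other field falls back to its documented default (2 datasets, 1,000 records each, reality level 3, JSON output). A minimal payload therefore looks like the sketch below (the prompt text is illustrative):

```python
# Minimal request payload: only api_key and prompt are required.
# Omitted fields use their defaults: datasets=2, dataset_size=1000,
# reality=3, output_format="json".
payload = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "Library system with books and members",
}
```

A successful call returns a 200 response shaped like the example below.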
{
"success": true,
"message": "Successfully generated 2 dataset(s)",
"datasets": {
"customers": {
"format": "json",
"size": 1000,
"columns": ["id", "name", "email", "signup_date"],
"content": [...] // Array of 1000 customer records
},
"orders": {
"format": "json",
"size": 1000,
"columns": ["order_id", "customer_id", "total", "order_date"],
"content": [...] // Array of 1000 order records
}
},
"metadata": {
"total_datasets": 2,
"total_records": 2000,
"reality_level": 3,
"output_format": "json",
"prompt": "E-commerce platform...",
"generation_timestamp": "2024-01-15T10:30:00Z",
"validation": {
"entity_types_generated": ["customers", "orders"],
"validation_status": "completed",
"validation_score": 9
}
},
"download_links": null // Only used by web UI
}
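Beyond the datasets themselves, the metadata.validation block is useful for automated sanity checks. A small sketch, assuming result holds the parsed success response shown above (the score threshold of 7 is an arbitrary example, not an API rule):

```python
# 'result' is the parsed JSON success response from /generate.
validation = result["metadata"]["validation"]

print("Entities generated:", ", ".join(validation["entity_types_generated"]))
print("Validation status:", validation["validation_status"])

# Treat a low validation score as a hint to refine the prompt.
if validation["validation_score"] < 7:
    print("Warning: low validation score, consider refining the prompt.")
```

If the request is rejected instead, the API returns an error payload, for example: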
{
"error": "Invalid API key format",
"detail": "API key must start with 'sk-'"
}
Python example:

import requests
import json

# API configuration
url = "https://syntheticdatagen.xyz/api/generate"
headers = {"Content-Type": "application/json"}

# Request payload (4 datasets to match the 4 entities named in the prompt)
data = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "E-commerce platform with customers, products, orders, and reviews",
    "datasets": 4,
    "dataset_size": 1500,
    "reality": 4,
    "output_format": "json"
}

# Make request
response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    result = response.json()

    # Process each dataset
    for name, dataset in result["datasets"].items():
        print(f"Dataset: {name}")
        print(f"Records: {dataset['size']}")
        print(f"Format: {dataset['format']}")

        # Save each dataset to its own JSON file
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)

    print(f"Total records generated: {result['metadata']['total_records']}")
else:
    print(f"Error: {response.status_code}")
    print(response.json())
cURL example:

curl -X POST https://syntheticdatagen.xyz/api/generate \
-H "Content-Type: application/json" \
-d '{
"api_key": "sk-your-openai-api-key",
"prompt": "Healthcare system with patients and appointments",
"datasets": 2,
"dataset_size": 800,
"reality": 2,
"output_format": "json"
}' \
| jq '.'
Format | Description
---|---
JSON | Default format. Returns an array of objects with a consistent schema.
CSV | Comma-separated values with headers. Ready for import into spreadsheets or databases.
SQL | Complete SQL script with CREATE TABLE and INSERT statements.
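When you request csv or sql output, write the returned content to disk as plain text rather than re-serializing it as JSON. A minimal sketch for CSV; it assumes the content field arrives as the raw CSV text (if your response returns rows as a list instead, join them before writing):

```python
import requests

# Request CSV output; other parameters as documented above.
payload = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "SaaS product with users and subscriptions",
    "datasets": 2,
    "output_format": "csv",
}

response = requests.post("https://syntheticdatagen.xyz/api/generate", json=payload)
response.raise_for_status()

for name, dataset in response.json()["datasets"].items():
    # Assumption: for CSV output, 'content' is the raw CSV text.
    with open(f"{name}.csv", "w") as f:
        f.write(dataset["content"])
```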
Control the "messiness" of your synthetic data to simulate real-world conditions:
Level | Description | Use Cases
---|---|---
0-2 | Perfect data with no inconsistencies | Testing ideal scenarios, demos
3-5 | Minor inconsistencies (formatting variations, occasional nulls) | General testing, development
6-8 | Moderate issues (duplicates, typos, missing values) | Data cleaning workflows, ETL testing
9-10 | Highly messy (inconsistent formats, many nulls, outliers) | Stress testing, data quality tools
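One practical pattern is to generate the same schema at a low and a high reality level and run your cleaning or ETL code against both. A sketch of that loop (the prompt, sizes, and chosen levels are illustrative):

```python
import requests

URL = "https://syntheticdatagen.xyz/api/generate"
API_KEY = "sk-your-openai-api-key"
PROMPT = "Hospital with patients and appointments"

# Generate the same schema twice: once clean, once messy,
# so data-cleaning logic can be tested against both.
for reality in (1, 8):
    payload = {
        "api_key": API_KEY,
        "prompt": PROMPT,
        "datasets": 2,
        "dataset_size": 500,
        "reality": reality,
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    total = response.json()["metadata"]["total_records"]
    print(f"reality={reality}: generated {total} records")
```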
Good prompt example: "E-commerce platform with customers (id, name, email, signup_date), products (id, name, price, category_id), and orders (id, customer_id, product_id, quantity, order_date, status). Include 5 product categories and realistic order patterns."
Error Code | Description | Common Causes
---|---|---
400 | Bad Request | Invalid payload, inconsistent parameters, constraint violations
401 | Unauthorized | Invalid or expired OpenAI API key
429 | Rate Limited | Exceeded request limits or OpenAI quota
500 | Internal Server Error | Service unavailable, AI model issues
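429 and 500 responses are typically transient (rate limits, temporary model issues), so a client-side retry with backoff is a reasonable way to handle them. This is purely a client-side sketch; the API itself does not provide a retry mechanism:

```python
import time
import requests

def generate_with_retry(payload, retries=3, backoff=5):
    """POST to /generate, retrying transient 429/500 responses with backoff."""
    url = "https://syntheticdatagen.xyz/api/generate"
    for attempt in range(1, retries + 1):
        response = requests.post(url, json=payload)
        if response.status_code not in (429, 500) or attempt == retries:
            return response
        # Back off a little longer after each failed attempt.
        time.sleep(backoff * attempt)
```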
The API performs intelligent validation of your request payload; for example, it flags entity-count mismatches between the prompt and the datasets parameter, and it enforces the total record limit:
{
"error": "Invalid request payload",
"issues": [
"Prompt mentions 4 entities (customers, products, orders, reviews) but datasets=3"
],
"suggestions": [
"Set datasets=4 to match entities in prompt",
"Or modify prompt to only mention 3 entities"
]
}
{
"error": "Invalid request payload",
"issues": [
"Dataset size 2,000,000 exceeds limit. With 3 datasets, max size per dataset is 1,666,666"
],
"suggestions": []
}
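Because validation failures include machine-readable issues and suggestions arrays, a client can surface them directly instead of logging a generic 400. A small sketch, assuming response is a failed requests response with a body like the examples above:

```python
# 'response' is a non-200 requests.Response from /generate.
body = response.json()

print(f"Request rejected: {body.get('error')}")
for issue in body.get("issues", []):
    print(f"  issue: {issue}")
for suggestion in body.get("suggestions", []):
    print(f"  suggestion: {suggestion}")
```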
💡 Auto-Correction: When possible, the API will automatically adjust your dataset count to match the entities mentioned in your prompt, with a warning message.
⚠️ Note: Large datasets may take several minutes to generate. Consider using webhooks or polling for very large requests.
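In the meantime, make sure your HTTP client does not give up before a long generation finishes. With requests that means raising the read timeout; the payload and timeout values below are illustrative only:

```python
import requests

payload = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "Bank with accounts and transactions",
    "dataset_size": 500000,
}

# Allow up to 10 seconds to connect and 15 minutes to read the response.
# These values are illustrative; tune them to your dataset sizes.
response = requests.post(
    "https://syntheticdatagen.xyz/api/generate",
    json=payload,
    timeout=(10, 900),
)
```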
The system automatically calculates maximum dataset size based on the number of datasets:
max_size_per_dataset = 5,000,000 / number_of_datasets
# Examples:
# 1 dataset: max 5,000,000 records
# 2 datasets: max 2,500,000 records each
# 5 datasets: max 1,000,000 records each
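You can apply the same check client-side before sending a request and avoid a round trip for payloads that would be rejected anyway. A minimal sketch of that pre-flight check (integer division mirrors the limits shown in the error example above):

```python
TOTAL_RECORD_LIMIT = 5_000_000

def max_size_per_dataset(number_of_datasets: int) -> int:
    """Largest allowed dataset_size for a given dataset count."""
    return TOTAL_RECORD_LIMIT // number_of_datasets

def validate_request(datasets: int, dataset_size: int) -> None:
    limit = max_size_per_dataset(datasets)
    if dataset_size > limit:
        raise ValueError(
            f"dataset_size {dataset_size:,} exceeds limit; "
            f"with {datasets} datasets the max per dataset is {limit:,}"
        )

# 3 datasets -> max 1,666,666 records each, matching the error example above.
validate_request(datasets=3, dataset_size=1_500_000)
```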