DataGen API Documentation

A complete guide to generating synthetic datasets with AI

Getting Started

DataGen provides a powerful REST API for generating synthetic datasets using advanced AI. Simply describe your data requirements in natural language, and our AI will create realistic, structured datasets with proper relationships and constraints.

Quick Start

To get started, you'll need:

  • An OpenAI API key (get one at platform.openai.com)
  • A clear description of the data you need
  • Your preferred output format (JSON, CSV, or SQL)

💡 Pro Tip: Be specific in your dataset descriptions. The more detail you provide about entities, relationships, and business domain, the better the AI can generate realistic data.

Authentication

DataGen uses your OpenAI API key for authentication and AI processing. The key is passed in the request payload and is encrypted in transit. We never store your API keys.

API Key Requirements

  • Must be a valid OpenAI API key starting with sk-
  • Should have sufficient credits for processing requests

⚠️ Security Notice: Your API key is encrypted in transit and decrypted on the server only to process your request; it is never stored.
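
Rather than hardcoding the key, it is safer to load it from an environment variable so it stays out of source control. A minimal sketch (the OPENAI_API_KEY variable name is a common convention, not a DataGen requirement):

import os

# Read the key from the environment instead of hardcoding it.
# Raises KeyError early if the variable is not set.
api_key = os.environ["OPENAI_API_KEY"]

payload = {
    "api_key": api_key,
    "prompt": "Customers with id, name, and email",
}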

API Reference

Base URL

https://syntheticdatagen.xyz/api

Generate Datasets

POST /generate

Generates synthetic datasets based on your natural language description.

Headers

Content-Type: application/json

Request Parameters

  • api_key (string, required): Your OpenAI API key (sk-...)
  • prompt (string, required): Natural language description of your dataset requirements
  • datasets (integer, optional): Number of datasets to generate (1-5, default: 2)
  • dataset_size (integer, optional): Records per dataset (1-5,000,000, default: 1000)
  • reality (integer, optional): Reality level from 0 (clean) to 10 (messy), default: 3
  • output_format (string, optional): Output format: "json", "csv", or "sql" (default: "json")
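
Only api_key and prompt are required; omitted parameters fall back to their defaults. A minimal request body therefore looks like this (defaults noted in the comments for reference):

# Minimal payload: just the two required fields. Omitted parameters
# use their defaults: datasets=2, dataset_size=1000, reality=3,
# output_format="json".
payload = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "Library system with books and members",
}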

Response Format

Success Response (200 OK)

Response Structure
{
  "success": true,
  "message": "Successfully generated 2 dataset(s)",
  "datasets": {
    "customers": {
      "format": "json",
      "size": 1000,
      "columns": ["id", "name", "email", "signup_date"],
      "content": [...] // Array of 1000 customer records
    },
    "orders": {
      "format": "json", 
      "size": 1000,
      "columns": ["order_id", "customer_id", "total", "order_date"],
      "content": [...] // Array of 1000 order records
    }
  },
  "metadata": {
    "total_datasets": 2,
    "total_records": 2000,
    "reality_level": 3,
    "output_format": "json",
    "prompt": "E-commerce platform...",
    "generation_timestamp": "2024-01-15T10:30:00Z",
    "validation": {
      "entity_types_generated": ["customers", "orders"],
      "validation_status": "completed",
      "validation_score": 9
    }
  },
  "download_links": null // Only used by web UI
}
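
The metadata block makes automated checks easy; for example, you might flag responses whose validation score falls below a threshold of your choosing (the cutoff of 7 below is arbitrary):

# Given a parsed success response `result`, inspect the validation metadata.
validation = result["metadata"]["validation"]
if validation["validation_score"] < 7:  # Arbitrary illustrative threshold
    print(f"Warning: low validation score: {validation}")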

Error Response (4xx/5xx)

Error Structure
{
  "error": "Invalid API key format",
  "detail": "API key must start with 'sk-'"
}

Code Examples

Python with requests

import requests
import json

# API configuration
url = "https://syntheticdatagen.xyz/api/generate"
headers = {"Content-Type": "application/json"}

# Request payload
data = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "E-commerce platform with customers, products, orders, and reviews",
    "datasets": 3,
    "dataset_size": 1500,
    "reality": 4,
    "output_format": "json"
}

# Make request (the API allows up to 5 minutes per request)
response = requests.post(url, headers=headers, json=data, timeout=300)

if response.status_code == 200:
    result = response.json()
    
    # Process each dataset
    for name, dataset in result["datasets"].items():
        print(f"Dataset: {name}")
        print(f"Records: {dataset['size']}")
        print(f"Format: {dataset['format']}")
        
        # Save to file
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)
            
    print(f"Total records generated: {result['metadata']['total_records']}")
else:
    print(f"Error: {response.status_code}")
    print(response.json())

cURL

curl -X POST https://syntheticdatagen.xyz/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "sk-your-openai-api-key",
    "prompt": "Healthcare system with patients and appointments",
    "datasets": 2,
    "dataset_size": 800,
    "reality": 2,
    "output_format": "json"
  }' \
  | jq '.'

Data Formats

  • JSON: Default format. Returns an array of objects with a consistent schema.
  • CSV: Comma-separated values with headers. Ready for import into spreadsheets or databases.
  • SQL: Complete SQL script with CREATE TABLE and INSERT statements.
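
The JSON response example above shows content as an array of records; for "csv" and "sql" it is reasonable to assume content arrives as a single generated string, but verify this against your own responses. A sketch that writes each dataset to a file with a matching extension, under that assumption:

import json

def save_dataset(name: str, dataset: dict) -> None:
    """Write one dataset from the response to disk, by format."""
    fmt = dataset["format"]
    if fmt == "json":
        # JSON content is a list of records, as in the response example.
        with open(f"{name}.json", "w") as f:
            json.dump(dataset["content"], f, indent=2)
    else:
        # Assumption: "csv"/"sql" content is the generated text as a string.
        with open(f"{name}.{fmt}", "w") as f:
            f.write(dataset["content"])

# Usage, given a successful response parsed into `result`:
# for name, dataset in result["datasets"].items():
#     save_dataset(name, dataset)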

Reality Levels

Control the "messiness" of your synthetic data to simulate real-world conditions:

  • 0-2: Perfect data with no inconsistencies. Use cases: testing ideal scenarios, demos.
  • 3-5: Minor inconsistencies (formatting variations, occasional nulls). Use cases: general testing, development.
  • 6-8: Moderate issues (duplicates, typos, missing values). Use cases: data cleaning workflows, ETL testing.
  • 9-10: Highly messy data (inconsistent formats, many nulls, outliers). Use cases: stress testing, data quality tools.
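
One way to pick a level is to generate the same prompt at a clean and a messy setting and compare samples. A minimal sketch reusing the request pattern from the Python example above (the "customers" dataset key is assumed from the prompt wording):

import requests

url = "https://syntheticdatagen.xyz/api/generate"
base = {
    "api_key": "sk-your-openai-api-key",
    "prompt": "Customers with id, name, email, and signup_date",
    "datasets": 1,
    "dataset_size": 100,
}

# Compare a clean run against a messy one (levels from the table above).
for level in (1, 7):
    resp = requests.post(url, json={**base, "reality": level}, timeout=300)
    resp.raise_for_status()
    sample = resp.json()["datasets"]["customers"]["content"][:3]
    print(f"reality={level}: {sample}")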

Best Practices

Writing Effective Prompts

  • Be specific: Include entity names, relationships, and business context
  • Mention constraints: Specify data types, ranges, and validation rules
  • Include examples: Provide sample values for complex fields
  • Define relationships: Explain how entities connect (foreign keys, hierarchies)

Good prompt example: "E-commerce platform with customers (id, name, email, signup_date), products (id, name, price, category_id), and orders (id, customer_id, product_id, quantity, order_date, status). Include 5 product categories and realistic order patterns."
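
Note that this prompt names three entities (customers, products, and orders), so datasets should be set to 3 to satisfy payload validation (see Payload Validation Errors below). For example:

# Three entities in the prompt, so datasets=3 avoids an
# entity-count mismatch during payload validation.
payload = {
    "api_key": "sk-your-openai-api-key",
    "prompt": (
        "E-commerce platform with customers (id, name, email, signup_date), "
        "products (id, name, price, category_id), and orders (id, customer_id, "
        "product_id, quantity, order_date, status). Include 5 product "
        "categories and realistic order patterns."
    ),
    "datasets": 3,
}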

Performance Optimization

  • Start with smaller datasets (1,000-10,000 records) for testing
  • Use appropriate reality levels; higher levels take more processing time
  • Consider generating multiple smaller datasets instead of one large dataset
  • Cache generated data for repeated use in testing (see the caching sketch below)
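
One way to cache results is to key a local file on a hash of the request payload, so repeated identical requests are served from disk. A minimal sketch (the cache layout is illustrative, not part of the API):

import hashlib
import json
import os

import requests

URL = "https://syntheticdatagen.xyz/api/generate"

def generate_cached(payload: dict, cache_dir: str = ".datagen_cache") -> dict:
    """Return a cached response for this payload, or fetch and cache it."""
    os.makedirs(cache_dir, exist_ok=True)
    # Key the cache on everything except the API key.
    key_fields = {k: v for k, v in payload.items() if k != "api_key"}
    digest = hashlib.sha256(
        json.dumps(key_fields, sort_keys=True).encode()
    ).hexdigest()
    path = os.path.join(cache_dir, f"{digest}.json")

    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)

    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    result = resp.json()
    with open(path, "w") as f:
        json.dump(result, f)
    return result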

Data Quality

  • Validate that the generated data structure matches your expectations
  • Check for proper relationships between entities (see the sketch below)
  • Verify data types and constraints
  • Test with different reality levels to find the optimal messiness
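
As an example of a relationship check, the snippet below verifies that every customer_id in an orders dataset points to an existing customer, using the column names from the success-response example above:

# Assumes `result` is a parsed success response containing "customers"
# and "orders" datasets shaped like the Response Format example.
customers = result["datasets"]["customers"]["content"]
orders = result["datasets"]["orders"]["content"]

customer_ids = {c["id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in customer_ids]

if orphans:
    print(f"{len(orphans)} order(s) reference missing customers")
else:
    print("All orders reference valid customers")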

Error Handling

Common Error Types

  • 400 Bad Request: Invalid payload, inconsistent parameters, constraint violations
  • 401 Unauthorized: Invalid or expired OpenAI API key
  • 429 Rate Limited: Exceeded request limits or OpenAI quota
  • 500 Internal Server Error: Service unavailable, AI model issues
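
400 and 401 errors indicate a problem with the request itself and will not succeed on retry, while 429 and 5xx responses are often transient. A sketch of retrying those with exponential backoff:

import time

import requests

def post_with_retries(url: str, payload: dict, attempts: int = 4) -> dict:
    """POST with exponential backoff on 429 and 5xx responses."""
    for attempt in range(attempts):
        resp = requests.post(url, json=payload, timeout=300)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code in (400, 401):
            # Client errors will not succeed on retry; surface them now.
            raise ValueError(resp.json())
        # 429 or 5xx: back off and try again (2s, 4s, 8s, ...).
        time.sleep(2 ** (attempt + 1))
    # Retries exhausted; raise whatever error came back last.
    resp.raise_for_status()
    raise RuntimeError("Request failed after retries")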

Payload Validation Errors

The API performs intelligent validation of your request payload:

Example: Entity Count Mismatch
{
  "error": "Invalid request payload",
  "issues": [
    "Prompt mentions 4 entities (customers, products, orders, reviews) but datasets=3"
  ],
  "suggestions": [
    "Set datasets=4 to match entities in prompt",
    "Or modify prompt to only mention 3 entities"
  ]
}
Example: Dataset Size Exceeded
{
  "error": "Invalid request payload",
  "issues": [
    "Dataset size 2,000,000 exceeds limit. With 3 datasets, max size per dataset is 1,666,666"
  ],
  "suggestions": []
}

💡 Auto-Correction: When possible, the API will automatically adjust your dataset count to match the entities mentioned in your prompt, with a warning message.

Rate Limits

  • Datasets per request: Maximum 5
  • Records per dataset: Maximum 5,000,000
  • Total records per request: Maximum 5,000,000
  • Request timeout: 5 minutes

⚠️ Note: Large datasets may take several minutes to generate. Consider using webhooks or polling for very large requests.

The system automatically calculates the maximum dataset size based on the number of datasets:

max_size_per_dataset = 5_000_000 // number_of_datasets

# Examples:
# 1 dataset:  max 5,000,000 records
# 2 datasets: max 2,500,000 records each
# 5 datasets: max 1,000,000 records each
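
A client-side pre-check against these documented limits can save a round trip. A small sketch:

def check_limits(datasets: int, dataset_size: int) -> None:
    """Validate a payload against the documented limits before sending."""
    if not 1 <= datasets <= 5:
        raise ValueError("datasets must be between 1 and 5")
    max_size = 5_000_000 // datasets
    if not 1 <= dataset_size <= max_size:
        raise ValueError(
            f"dataset_size must be <= {max_size:,} with {datasets} dataset(s)"
        )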