Fair Usage & System Stability

Neosantara AI implements throughput limits to ensure platform stability and fair resource allocation across all users.

Overview

Rate limits control the speed at which you can make API requests and process tokens. These are technical constraints designed to prevent system overload and ensure consistent performance for all users.
Looking for pricing and billing information? Check the Token Credits & Pricing documentation.

How Rate Limits Work

  • RPM (Requests Per Minute): the maximum number of API calls you can make per minute.
  • ITPM (Input Tokens Per Minute): the maximum number of input tokens you can send per minute.
  • OTPM (Output Tokens Per Minute): the maximum number of output tokens that can be generated per minute.
All three limits apply simultaneously. You must stay within all three to avoid rate limit errors.

Example scenario:
  • You’re on Basic tier (50 RPM, 20K ITPM, 5K OTPM)
  • You make 45 requests in one minute (within RPM ✅)
  • But each request uses 1,000 input tokens = 45,000 total (exceeds ITPM ❌)
  • Result: Rate limit error even though RPM wasn’t exceeded
Best Practice: Monitor all three metrics, not just request count!

Throughput Tiers

Your rate limits are determined by your account tier, which automatically upgrades based on your total lifetime deposits.
| Tier | Min. Deposit | RPM | ITPM | OTPM | Best For |
| --- | --- | --- | --- | --- | --- |
| Free | Rp 0 | 3 | 5,000 | 2,000 | Testing & Learning |
| Basic | Rp 85,000 | 50 | 20,000 | 5,000 | Small Projects |
| Standard | Rp 670,000 | 1,000 | 100,000 | 25,000 | Production Apps |
| Pro | Rp 3,350,000 | 2,000 | 200,000 | 50,000 | High-Volume Apps |
| Enterprise | Rp 6,700,000 | 4,000 | 500,000 | 125,000 | Large Scale Operations |
Automatic Upgrades: Your tier upgrades automatically when you reach the deposit threshold. See Token Credits & Pricing for details.
Need custom limits? Contact our sales team for enterprise plans with dedicated infrastructure and higher throughput.

Monitoring Rate Limits

Response Headers

Every API response includes headers showing your current rate limit status. Use these to proactively avoid hitting limits.
x-neosantara-ratelimit-requests-limit: 50
x-neosantara-ratelimit-requests-remaining: 35
x-neosantara-ratelimit-requests-reset: 2025-12-19T10:30:00.000Z
| Header | Description |
| --- | --- |
| requests-limit | Your maximum RPM |
| requests-remaining | Requests left in the current window |
| requests-reset | When the limit resets (ISO 8601) |
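
As a sketch (assuming the requests library and a placeholder payload and API key), you can read the remaining allowance for all three limits after every call:
import requests

response = requests.post(
    "https://api.neosantara.xyz/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # Placeholder key
    json={"model": "nusantara-base", "messages": [{"role": "user", "content": "Halo"}]},
)

# Remaining allowance for each of the three limits
remaining = {
    "requests": int(response.headers["x-neosantara-ratelimit-requests-remaining"]),
    "input_tokens": int(response.headers["x-neosantara-ratelimit-input-tokens-remaining"]),
    "output_tokens": int(response.headers["x-neosantara-ratelimit-output-tokens-remaining"]),
}

# Throttle before any single limit is exhausted
if min(remaining.values()) < 5:
    print("Approaching a rate limit:", remaining)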

Complete Header Reference

| Header | Type | Description |
| --- | --- | --- |
| x-neosantara-ratelimit-requests-limit | integer | Maximum requests per minute |
| x-neosantara-ratelimit-requests-remaining | integer | Requests remaining in current window |
| x-neosantara-ratelimit-requests-reset | string | ISO 8601 timestamp when request limit resets |
| x-neosantara-ratelimit-input-tokens-limit | integer | Maximum input tokens per minute |
| x-neosantara-ratelimit-input-tokens-remaining | integer | Input tokens remaining in current window |
| x-neosantara-ratelimit-input-tokens-reset | string | ISO 8601 timestamp when input limit resets |
| x-neosantara-ratelimit-output-tokens-limit | integer | Maximum output tokens per minute |
| x-neosantara-ratelimit-output-tokens-remaining | integer | Output tokens remaining in current window |
| x-neosantara-ratelimit-output-tokens-reset | string | ISO 8601 timestamp when output limit resets |
| x-neosantara-tier | string | Current account tier (Free, Basic, Standard, Pro, Enterprise) |

Error Handling

429 Too Many Requests

This error occurs when you exceed any of the three throughput limits (RPM, ITPM, or OTPM).
{
  "error": {
    "message": "Too many requests. Please try again later.",
    "type": "rate_limit_exceeded",
    "code": "rpm_exceeded",
    "details": {
      "retry_after": 5,
      "limit": 50,
      "remaining": 0,
      "reset": "2025-12-19T10:30:05.000Z"
    }
  }
}

Error Response Fields

| Field | Type | Description |
| --- | --- | --- |
| error.code | string | The specific limit exceeded: rpm_exceeded, itpm_exceeded, or otpm_exceeded |
| error.details.retry_after | integer | Seconds to wait before retrying |
| error.details.limit | integer | The limit value that was exceeded |
| error.details.remaining | integer | Always 0 when rate limited |
| error.details.reset | string | ISO 8601 timestamp when limit resets |

HTTP Headers

| Header | Description |
| --- | --- |
| Retry-After | Seconds to wait before making a new request |
| Rate limit headers | Show which limit was exceeded (see Monitoring section) |
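
A minimal recovery sketch, assuming the requests library; it prefers the Retry-After header and falls back to error.details.retry_after from the response body (the helper name is ours):
import time
import requests

def post_once_with_retry(url, payload, api_key):
    headers = {"Authorization": f"Bearer {api_key}"}
    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 429:
        body = response.json()
        # error.code identifies which limit was hit (rpm/itpm/otpm_exceeded)
        print("Rate limited:", body["error"]["code"])
        # Prefer the Retry-After header; fall back to the body field
        wait = int(response.headers.get("Retry-After",
                                        body["error"]["details"]["retry_after"]))
        time.sleep(wait)
        response = requests.post(url, json=payload, headers=headers)
    return response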

Best Practices

When you receive a 429 error, implement exponential backoff with jitter:
import time
import random

def make_request_with_retry(max_retries=5):
    # make_api_request() and RateLimitError are placeholders for your own
    # request function and the rate limit exception it raises.
    for attempt in range(max_retries):
        try:
            return make_api_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)
Why this works: Spreading out retries prevents thundering herd problems.
Check rate limit headers before you hit the limit:
const response = await fetch('https://api.neosantara.xyz/v1/chat/completions', {
  // ... request config
});

const remaining = parseInt(response.headers.get('x-neosantara-ratelimit-requests-remaining'));
const resetTime = response.headers.get('x-neosantara-ratelimit-requests-reset');

if (remaining < 5) {
  console.warn(`Only ${remaining} requests remaining until ${resetTime}`);
  // Slow down or queue requests
}
For large workloads, use the Batch API, which:
  • Has separate, higher rate limits
  • Costs 50% less
  • Is better suited for non-urgent processing
When to use Batch:
  • Processing 100+ requests
  • Non-time-sensitive tasks
  • Overnight or background jobs
Instead of sending all requests at once, spread them evenly across the minute:
import asyncio

async def rate_limited_requests(requests, rpm_limit):
    # make_request() is a placeholder for your own async API call
    delay = 60 / rpm_limit  # Seconds between requests

    for request in requests:
        await make_request(request)
        await asyncio.sleep(delay)
This keeps you well within limits and provides smoother processing.
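A hypothetical invocation, pacing a list of your own prompts at the Basic tier's 50 RPM:
asyncio.run(rate_limited_requests(pending_prompts, rpm_limit=50))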
Build a queue system to manage high-volume scenarios:
from queue import Queue
import threading
import time

class RateLimitedQueue:
    def __init__(self, rpm_limit):
        self.queue = Queue()
        self.rpm_limit = rpm_limit
        self.delay = 60 / rpm_limit
        # Drain the queue on a background worker thread
        worker = threading.Thread(target=self.process_queue, daemon=True)
        worker.start()

    def add_request(self, request):
        self.queue.put(request)

    def process_queue(self):
        # Runs forever on the worker thread, pacing requests to the RPM limit
        while True:
            request = self.queue.get()
            make_api_request(request)  # Placeholder for your own API call
            time.sleep(self.delay)
            self.queue.task_done()
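A hypothetical usage, enqueuing work and blocking until it drains:
queue = RateLimitedQueue(rpm_limit=50)  # Basic tier
for prompt in prompts:  # prompts is your own list of requests
    queue.add_request(prompt)
queue.queue.join()  # Wait until every queued request has been processed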
Reduce API calls by caching responses for identical requests:
from functools import lru_cache
import json

@lru_cache(maxsize=1000)
def cached_api_call(prompt_json):
    # Deserialize back to the original prompt before calling the API
    return make_api_request(json.loads(prompt_json))

def make_cached_request(prompt):
    # A canonical JSON string is hashable, so identical prompts
    # map to the same cache entry
    return cached_api_call(json.dumps(prompt, sort_keys=True))
Note that lru_cache is in-memory and per-process; long-running or multi-process deployments may need an external cache instead.
If you consistently hit rate limits:
  1. Check your current tier in the dashboard
  2. Calculate the tier you need based on your usage patterns
  3. Top up your balance to reach the next threshold
  4. Tier upgrades are automatic and immediate
ROI Example: Upgrading from Free → Basic costs Rp 85,000 but gives you:
  • 16× more requests (3 → 50 RPM)
  • 4× more input tokens
  • 2.5× more output tokens
  • Access to Batch API (50% savings)

Rate Limit Calculation Examples

Scenario: Real-time chat with nusantara-base

Configuration:
  • Tier: Basic (50 RPM, 20K ITPM, 5K OTPM)
  • Average input: 100 tokens/request
  • Average output: 50 tokens/request
Calculation:
RPM bottleneck: 50 requests/minute
ITPM bottleneck: 20,000 / 100 = 200 requests/minute
OTPM bottleneck: 5,000 / 50 = 100 requests/minute

Actual limit: min(50, 200, 100) = 50 RPM ✅
Result: RPM is the limiting factor. You can safely make 50 requests/minute.
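The same arithmetic as a small Python helper (the function and parameter names are ours, not part of the API):
def effective_rpm(rpm, itpm, otpm, avg_input_tokens, avg_output_tokens):
    # Throughput is capped by the tightest of the three constraints
    return min(rpm, itpm // avg_input_tokens, otpm // avg_output_tokens)

print(effective_rpm(50, 20_000, 5_000, 100, 50))  # -> 50, so RPM is the bottleneck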

Frequently Asked Questions

What happens when I hit a rate limit?
You’ll receive a 429 Too Many Requests error with a Retry-After header. Your request is not processed, and no tokens are charged. Wait for the specified time and retry.
Do rate limits reset every minute?
Yes. Limits are enforced per one-minute window: each window is independent, and your allowance resets at the start of the next window (the *-reset headers give the exact time).
Does the Batch API count against my standard rate limits?
No. The Batch API has separate rate-limiting mechanisms and doesn’t count against your standard RPM/ITPM/OTPM limits.
Can I get a temporary rate limit increase?
Temporary increases are not available. However, tier upgrades are instant once you reach the deposit threshold. For permanent custom limits, contact our enterprise team.
Why was I rate limited even though I stayed under my RPM limit?
Rate limiting considers all three metrics simultaneously. You may have exceeded the ITPM or OTPM limit even if RPM is fine. Check the error.code field to see which limit was hit.
Do streaming requests count differently?
No. Streaming requests count the same as non-streaming requests: they consume one RPM slot, and all input and output tokens count against your limits.
Do concurrent requests count differently from sequential ones?
No. All requests within a one-minute window count toward your limits, regardless of whether they’re sequential or concurrent. Be mindful when making parallel requests; see the sketch below.
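One way to stay mindful is to cap in-flight requests with a semaphore. A minimal asyncio sketch, where make_request stands in for your own async API call:
import asyncio

async def bounded_requests(prompts, max_concurrent=5):
    # Allow at most max_concurrent requests in flight at any moment
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(prompt):
        async with semaphore:
            return await make_request(prompt)  # Placeholder API call

    return await asyncio.gather(*(run_one(p) for p in prompts))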

Troubleshooting

1. Identify Which Limit Was Hit

Check the error.code field in the 429 response:
  • rpm_exceeded - Too many requests
  • itpm_exceeded - Too many input tokens
  • otpm_exceeded - Too many output tokens
2. Review Your Usage Pattern

Analyze your request patterns:
  • Are requests evenly distributed?
  • Are you sending large batches at once?
  • What’s the average token count per request?
3. Implement Rate Limiting Logic

Add client-side rate limiting (a minimal sketch follows this list):
  • Track your own request count
  • Monitor response headers
  • Implement queuing or throttling
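For example, a client-side limiter that tracks its own request timestamps and waits when the last minute is full (a sketch; pair it with your actual API call):
import time
from collections import deque

class ClientRateLimiter:
    """Tracks request timestamps and waits when the last minute is full."""

    def __init__(self, rpm_limit):
        self.rpm_limit = rpm_limit
        self.timestamps = deque()

    def wait_for_slot(self):
        now = time.monotonic()
        # Drop timestamps older than the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm_limit:
            # Sleep until the oldest request ages out of the window
            time.sleep(60 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

limiter = ClientRateLimiter(rpm_limit=50)  # Basic tier
limiter.wait_for_slot()  # Call before each request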
4. Consider Architecture Changes

If you consistently hit limits:
  • Use Batch API for bulk operations
  • Implement caching for repeated queries
  • Upgrade to a higher tier
  • Split workload across multiple API keys (if appropriate)
Using multiple API keys to circumvent rate limits violates our terms of service and may result in account suspension. If you need higher limits, please upgrade your tier or contact support.

Need Higher Limits?

If standard tiers don’t meet your needs, contact our enterprise team for custom rate limits, dedicated infrastructure, and priority support.
Last modified on December 18, 2025