To handle rate limits or throughput limits in AWS Bedrock and avoid throttling in a production system, you need to design your application to respect service quotas and implement strategies to manage request volume effectively. Here’s a structured approach:
1. Understand and Monitor Service Quotas
First, review Bedrock’s rate limits (TPS, transactions per second) and throughput limits (input/output tokens per minute) for your specific model and region via the AWS Service Quotas console. Limits vary by model (e.g., Claude, Titan) and operation (e.g., `InvokeModel` vs. `InvokeModelWithResponseStream`). Use Amazon CloudWatch to track metrics like `ThrottledRequests` and `SuccessfulRequests` to identify patterns. For example, if Bedrock allows 100 TPS for a model, ensure your application stays below this threshold. If you’re close to the limit, request a quota increase via AWS Support, but design your system to work within existing limits as the primary strategy.
2. Implement Retry Logic with Backoff
When a request is throttled (HTTP 429 error), retry it with exponential backoff and jitter. For example, use the AWS SDK’s built-in retry mechanisms (configured via the `retries` settings on botocore’s `Config`), which wait increasingly longer between retries (e.g., 1s, 2s, 4s). Add jitter (a random delay) to prevent synchronized retries across multiple clients. Avoid tight loops: use asynchronous processing or queues to decouple retries from the main application flow. For instance, if a batch job processes 1,000 requests, use a queue (Amazon SQS) to retry failed tasks with backoff instead of blocking the entire batch.
3. Control Request Concurrency and Distribution
Use token bucket or leaky bucket algorithms to limit request rates. For example, if Bedrock allows 100 TPS, use a token bucket that refills 100 tokens per second and blocks requests when tokens are exhausted. Distribute load across multiple Bedrock model IDs or regions if your use case allows; for instance, route 50% of traffic to `us-east-1` and 50% to `us-west-2` if quotas are region-specific. For high-volume workloads, partition requests across multiple AWS accounts (each with its own quotas). Additionally, cache repeated inputs (e.g., common prompts) using Amazon ElastiCache to reduce redundant calls.
Example Code Snippet (Exponential Backoff):
```python
import boto3
from botocore.config import Config

retry_config = Config(
    retries={
        'max_attempts': 5,
        'mode': 'adaptive'  # backoff with jitter, plus client-side rate limiting
    }
)

bedrock = boto3.client('bedrock-runtime', config=retry_config)

try:
    response = bedrock.invoke_model(...)
except bedrock.exceptions.ThrottlingException:
    # Raised once the SDK's retries are exhausted;
    # handle fallback logic (queue, alternate region) here
    pass
```
By combining quota awareness, retries, and concurrency control, you can minimize throttling while maintaining responsiveness in production.