Amazon Bedrock enforces specific quotas and limitations on model usage, request rates, and payload sizes to ensure service stability and fair resource allocation. These constraints vary by model, region, and API type (synchronous vs. asynchronous), and they typically fall into three categories: throughput, request frequency, and input/output size.
First, throughput limits are defined in tokens per minute (TPM) and vary across models. For example, Anthropic's Claude v2 has a default quota of 100,000 TPM, while Amazon Titan Text Lite might start at 10,000 TPM. These limits cap how much data a model can process in a given time, and exceeding a TPM quota triggers throttling, returning a `429 Too Many Requests` error. Similarly, request rate limits restrict the number of API calls per second. Synchronous APIs (like `InvokeModel`) often have lower default limits (e.g., 10 requests per second), while asynchronous APIs (like `StartModelInvocationJob`) may allow higher throughput but impose stricter payload size constraints. These limits are applied per AWS account and region, so multi-region deployments require separate quota management.
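As a rough sketch of what hitting these limits looks like in application code, the following Python snippet uses boto3 to make a single synchronous `InvokeModel` call and surfaces throttling so the caller can react. The Claude 3 Sonnet model ID and the Anthropic messages request schema shown here are assumptions to verify against current documentation.

```python
import json

import boto3
from botocore.exceptions import ClientError

# Runtime client for synchronous inference; the region determines which quota pool applies.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_claude(prompt: str) -> str:
    """Make one synchronous call and surface throttling so the caller can back off."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    })
    try:
        response = bedrock_runtime.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID; check availability in your region
            contentType="application/json",
            accept="application/json",
            body=body,
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ThrottlingException":
            # TPM or request-rate quota exceeded (HTTP 429): retry with backoff or shed load.
            raise
        raise
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```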
Second, payload size limits are tied to the maximum input and output tokens supported by each model. For instance, Claude 3 supports up to 200,000 input tokens, while Cohere Command limits inputs to 4,096 tokens. Outputs are also capped: Claude 3 allows 4,096 tokens per response by default. Exceeding these limits results in `ValidationException` errors. Additionally, some models restrict specific parameters, like temperature or top-p ranges. For example, AI21 Jurassic-2 enforces a temperature range of 0–1, while others might default to narrower bands. These constraints require developers to validate inputs and truncate or chunk data before sending requests.
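Because the service rejects oversized or out-of-range requests rather than trimming them, it can help to validate client-side first. The sketch below is illustrative only: `MAX_INPUT_TOKENS`, the 4-characters-per-token estimate, and the 0–1 temperature range are stand-in assumptions for whatever the target model actually documents, not real tokenizer logic.

```python
MAX_INPUT_TOKENS = 4096   # assumed limit for the target model; check its documentation
CHARS_PER_TOKEN = 4       # rough heuristic, not a real tokenizer

def clamp_temperature(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Keep temperature inside the range the target model accepts (assumed 0-1 here)."""
    return min(max(value, low), high)

def truncate_prompt(prompt: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Cut the prompt down to an estimated token budget instead of letting the call fail."""
    return prompt[: max_tokens * CHARS_PER_TOKEN]

def chunk_prompt(prompt: str, max_tokens: int = MAX_INPUT_TOKENS) -> list[str]:
    """Alternatively, split long input into pieces that each fit under the budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return [prompt[i:i + max_chars] for i in range(0, len(prompt), max_chars)]
```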
To manage these limits, AWS allows users to view quotas in the Service Quotas console and request increases via the AWS Support Center. For example, a team needing higher TPM for Claude could submit a request detailing their use case and required throughput. Best practices include implementing retries with exponential backoff for throttling errors and monitoring usage via CloudWatch metrics like `CallCount` and `TokenCount`.
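A minimal backoff wrapper might look like the sketch below; it assumes the `invoke_claude` helper from the earlier snippet, and the attempt count and delays are arbitrary starting points rather than recommended values.

```python
import random
import time

from botocore.exceptions import ClientError

def invoke_with_backoff(prompt: str, max_attempts: int = 5) -> str:
    """Retry throttled calls with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke_claude(prompt)  # helper from the earlier sketch
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException" or attempt == max_attempts:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter so concurrent clients do not retry in lockstep.
            time.sleep(2 ** (attempt - 1) + random.random())
```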
Proactive testing (e.g., load testing within quota bounds) and fallback strategies (e.g., switching models during spikes) help avoid disruptions. Always check the latest model-specific documentation, as these limits evolve with service updates.
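As one possible shape for such a fallback, the sketch below tries models in a preferred order and only falls through on throttling. The model IDs are placeholders, `invoke_model_by_id` is a hypothetical helper, and whether a secondary model is an acceptable substitute is use-case specific.

```python
from botocore.exceptions import ClientError

# Ordered by preference; the IDs are placeholders, and each model has its own quota pool.
MODEL_PREFERENCE = [
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "amazon.titan-text-lite-v1",
]

def invoke_with_fallback(prompt: str) -> str:
    """Try each configured model in turn, moving on only when the current one is throttled."""
    last_error = None
    for model_id in MODEL_PREFERENCE:
        try:
            # invoke_model_by_id is a hypothetical helper that builds the right request
            # body for the given model and calls bedrock-runtime InvokeModel.
            return invoke_model_by_id(prompt, model_id)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            last_error = err  # this model is throttled; try the next one
    raise last_error
```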