Typical Throughput in AWS Bedrock
AWS Bedrock provides managed access to a range of foundation models, but throughput (requests or tokens per second) varies significantly with the model, input/output size, and workload pattern. For example, smaller models such as Jurassic-2 Light might handle 50-100 requests per second (RPS) with short prompts, while larger models such as Claude or Titan may process fewer (e.g., 10-30 RPS) because of their higher computational demands. Token generation speed also depends on context length: models processing 512-token inputs might emit 100-500 output tokens per second. AWS does not publish fixed benchmarks, since performance is influenced by region, concurrent load, and the default rate quotas applied to each account.
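Because AWS publishes no fixed numbers, the practical way to learn your throughput is to measure it. The sketch below times a single on-demand invocation and derives output tokens per second; it assumes a Titan-style request/response body and the `amazon.titan-text-express-v1` model ID for illustration (other model families use different schemas).

```python
import json
import time


def tokens_per_second(output_tokens: int, elapsed_s: float) -> float:
    """Effective generation speed for one call."""
    return output_tokens / elapsed_s


def benchmark(client, model_id: str, prompt: str) -> float:
    """Time one on-demand invocation and return output tokens/sec.

    Assumes a Titan-style body; Claude and Jurassic-2 use
    different request/response schemas.
    """
    body = json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {"maxTokenCount": 256},
    })
    start = time.monotonic()
    resp = client.invoke_model(modelId=model_id, body=body)
    elapsed = time.monotonic() - start
    result = json.loads(resp["body"].read())
    # Titan reports the generated token count per result.
    out_tokens = result["results"][0]["tokenCount"]
    return tokens_per_second(out_tokens, elapsed)


# Example usage (requires AWS credentials and Bedrock model access):
#   import boto3
#   bedrock = boto3.client("bedrock-runtime")
#   print(benchmark(bedrock, "amazon.titan-text-express-v1",
#                   "Explain AWS Bedrock in one sentence."))
```

Running this against your own region and prompts, at realistic concurrency, gives a far better baseline than any published figure.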
Increasing Throughput via Configuration
Bedrock lets you scale throughput primarily through Provisioned Throughput, a paid tier that reserves dedicated capacity for a model. Capacity is purchased in model units, each guaranteeing a defined rate of input and output tokens per minute, so performance stays consistent rather than being subject to the default on-demand quotas. Additionally, optimizing requests (e.g., shorter prompts, batched inputs) reduces processing time per call. Asynchronous patterns such as Bedrock's batch inference jobs can also improve aggregate throughput by decoupling submission from result retrieval, allowing many tasks to be queued in parallel. Developers should reuse clients and HTTP connections (keep-alive) to avoid per-request TLS/TCP setup, which indirectly boosts effective throughput.
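Connection reuse and parallelism are easy to combine in code. The sketch below fans prompts out over a thread pool while sharing a single client (boto3 clients are thread-safe, and botocore pools keep-alive connections under the hood). The Titan-style body is an assumption for illustration; adjust it per model family.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def invoke_one(client, model_id: str, prompt: str) -> str:
    """Single on-demand call; Titan-style body assumed for illustration."""
    body = json.dumps({"inputText": prompt})
    resp = client.invoke_model(modelId=model_id, body=body)
    return json.loads(resp["body"].read())["results"][0]["outputText"]


def invoke_parallel(client, model_id: str, prompts: list[str],
                    workers: int = 8) -> list[str]:
    """Fan prompts out over one shared client.

    Reusing a single boto3 client keeps HTTP connections alive
    (botocore pools them), so TLS/TCP setup is paid once, not per call.
    Keep `workers` below your account's concurrency quota to avoid
    throttling errors.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: invoke_one(client, model_id, p),
                             prompts))


# Example usage (requires AWS credentials and Bedrock model access):
#   import boto3
#   bedrock = boto3.client("bedrock-runtime")
#   outputs = invoke_parallel(bedrock, "amazon.titan-text-express-v1",
#                             ["Summarize A.", "Summarize B."])
```

Tuning `workers` against observed throttling is usually more effective than raising it blindly, since requests rejected with throttling errors still consume client-side time.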
Practical Considerations and Examples
Throughput gains depend on the use case. For instance, a text summarization task using Titan with 100-token inputs might achieve 80 RPS with Provisioned Throughput, while a chatbot using Claude with 1k-token conversations might see 15 RPS. AWS recommends starting with on-demand testing to establish baselines, then purchasing Provisioned Throughput sized to peak needs. Monitoring via CloudWatch metrics in the AWS/Bedrock namespace (e.g., InvocationLatency, Invocations, InputTokenCount, OutputTokenCount) helps identify bottlenecks. For token-heavy workloads, selecting a model with faster token generation (e.g., Jurassic-2 over Claude) or capping unnecessary output with max-token limits can improve efficiency. While Bedrock manages and scales the underlying infrastructure, combining these strategies ensures optimal performance for critical applications.
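The CloudWatch check described above can be scripted. The sketch below builds a `GetMetricStatistics` request for Bedrock's per-model InvocationLatency metric; the model ID shown in the usage comment is an assumption for illustration.

```python
import datetime


def latency_query(model_id: str, hours: int = 1) -> dict:
    """Build a GetMetricStatistics request for Bedrock's per-model
    InvocationLatency metric (AWS/Bedrock namespace)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - datetime.timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Average", "Maximum"],
    }


# Example usage (requires AWS credentials):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   stats = cw.get_metric_statistics(**latency_query("anthropic.claude-v2"))
#   for point in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
#       print(point["Timestamp"], point["Average"], point["Maximum"])
```

Swapping `MetricName` for `Invocations` or `OutputTokenCount` (with `Statistics: ["Sum"]`) turns the same query into an effective-throughput check.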
