When evaluating generative models on AWS Bedrock, consider three main categories beyond speed: output quality, cost efficiency, and task-specific alignment. Each category includes measurable metrics to help you assess performance holistically.
Output Quality Metrics
Focus on how well the model generates coherent, accurate, and relevant outputs. For text models:
- Perplexity (how well the model predicts a held-out sample; lower is better) can indicate fluency, but it is less reliable for creative or open-ended tasks.
- BLEU/ROUGE scores compare generated text to reference outputs, useful for summarization or translation.
- BERTScore uses embeddings to measure semantic similarity between outputs and ground-truth text.
- Factual accuracy checks if claims are correct (e.g., using retrieval-augmented validation).
- Diversity measures variation in outputs (e.g., unique n-grams or entropy scores) to avoid repetitive or generic responses; a simple sketch follows below.

For example, a customer support chatbot might prioritize factual accuracy and coherence, while a creative writing tool might emphasize diversity and fluency.
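As a rough illustration of the diversity metric, the sketch below computes distinct-1 and distinct-2 ratios (unique n-grams divided by total n-grams) over a batch of generated responses. The whitespace tokenization and the sample outputs are simplifying assumptions for illustration, not part of any Bedrock API.

```python
from collections import Counter

def distinct_n(texts, n=1):
    """Ratio of unique n-grams to total n-grams across a batch of outputs.

    Higher values suggest more varied (less repetitive) generations.
    Whitespace tokenization is an assumption; swap in a real tokenizer
    for anything beyond a quick check.
    """
    ngrams = Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0

# Hypothetical batch of model outputs for the same prompt
samples = [
    "You can reset your password from the account settings page.",
    "Go to account settings and choose 'Reset password'.",
    "You can reset your password from the account settings page.",
]

print(f"distinct-1: {distinct_n(samples, 1):.2f}")
print(f"distinct-2: {distinct_n(samples, 2):.2f}")
```

A low distinct-2 score across many samples of the same prompt is a quick signal that the model is falling back on generic, templated responses.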
Cost and Operational Efficiency
Bedrock charges based on input/output tokens and model type. Key metrics include:
- Cost per request: Multiply input and output token counts by their respective per-token rates and sum the two (input and output tokens are priced differently); a worked example follows this list. For instance, Claude 3 Opus lists at $15 per million input tokens, while Haiku is $0.25.
- Error rates: Track API failure rates or retries caused by throttling (Bedrock's default on-demand quotas are relatively low and vary by model and region).
- Token efficiency: Optimize prompts to remove unnecessary tokens (e.g., concise instructions instead of verbose examples).

A cost-sensitive application might choose a smaller model like Claude Haiku, accepting slightly lower quality for roughly 98% cost savings versus Opus.
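To make the cost arithmetic concrete, here is a minimal sketch of the per-request calculation. The per-million-token prices are hard-coded assumptions based on published list prices at the time of writing; verify them against the current Bedrock pricing page for your region and model.

```python
# Assumed on-demand list prices (USD per million tokens) -- illustrative only.
PRICING = {
    "claude-3-opus": {"input": 15.00, "output": 75.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Sum input and output token counts times their respective rates."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 1,500-token prompt that produces a 500-token answer
for model in PRICING:
    print(f"{model}: ${cost_per_request(model, 1500, 500):.5f} per request")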
Task-Specific and Ethical Metrics
Tailor metrics to your use case and ethical requirements:
- Instruction adherence: For chatbots, measure whether responses follow your rules (e.g., “refused to answer” rates for unsafe queries).
- Bias/toxicity: Use tools like Amazon Bedrock Guardrails, Amazon Titan's built-in content filters, or the Perspective API to score outputs for harmful content.
- User feedback: Track thumbs-up/down rates or A/B test engagement metrics (e.g., time spent reading generated content).
- Privacy: Audit outputs for accidental data leaks (e.g., PII detection in summarization tasks); a simple sketch follows below.

For instance, a medical advice tool would prioritize accuracy and safety, while a marketing copy generator might focus on click-through rates and brand alignment.
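As an illustration of the privacy audit, the following sketch scans generated text for a few common PII patterns with regular expressions. The patterns are simplistic assumptions for demonstration; a production audit would more likely rely on a dedicated detector such as Amazon Comprehend's PII detection or Bedrock Guardrails.

```python
import re

# Simplistic, assumed PII patterns for illustration only; a real audit
# should use a purpose-built detector rather than hand-rolled regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return any matches per PII category found in a generated output."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits

# Hypothetical generated summary to audit
summary = "Contact the patient at jane.doe@example.com or 555-123-4567."
print(find_pii(summary))  # {'email': [...], 'us_phone': [...]}
```

Running a check like this over a sample of production outputs gives a simple leak rate you can track alongside the quality and cost metrics above.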
By combining these metrics, you can balance quality, cost, and reliability for your specific application on Bedrock.