How do I prepare and format my training data for fine-tuning a foundation model on Bedrock (for example, using JSONL files with prompt-completion pairs)?

To prepare training data for fine-tuning a foundation model on Amazon Bedrock with JSONL files, structure your data as prompt-completion pairs that meet Bedrock's requirements. For models that use this format, each line in the JSONL file must be a single valid JSON object with two keys: prompt (the input text) and completion (the desired model output). For example, a line might look like {"prompt": "Translate to French: Hello", "completion": "Bonjour"}. Because each record occupies exactly one line, newlines inside a prompt or completion must be escaped as \n, and the JSON must contain no trailing commas or other syntax errors. The exact schema depends on the base model (some models expect a conversational format instead), so check the Bedrock documentation for the model you are customizing. Save the file with UTF-8 encoding and avoid non-printable characters, which can cause processing issues during training.
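As a minimal sketch of the format described above, the following Python snippet serializes prompt-completion pairs into JSONL text using only the standard library. The helper names to_jsonl and write_jsonl and the sample records are illustrative assumptions, not part of any Bedrock SDK.

```python
import json

def to_jsonl(records):
    """Serialize prompt-completion records to JSONL text, one JSON object per line."""
    lines = []
    for rec in records:
        # json.dumps escapes embedded newlines as \n, so each record stays on one line.
        # ensure_ascii=False keeps non-ASCII characters readable in the UTF-8 output.
        line = json.dumps(
            {"prompt": rec["prompt"], "completion": rec["completion"]},
            ensure_ascii=False,
        )
        lines.append(line)
    return "\n".join(lines) + "\n"

def write_jsonl(records, path):
    """Write the records to a UTF-8 encoded JSONL file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(to_jsonl(records))

# Illustrative sample data; replace with your own curated examples.
examples = [
    {"prompt": "Translate to French: Hello", "completion": "Bonjour"},
    {"prompt": "Translate to French: Thank you", "completion": "Merci"},
]

jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
```

Calling write_jsonl(examples, "train.jsonl") would then produce a file in this layout. Building each line with json.dumps rather than string formatting avoids hand-written quoting mistakes such as unescaped quotes or trailing commas.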

When preparing the data, focus on relevance, consistency, and quality. Start by collecting or curating a dataset that matches your use case, for instance customer service dialogues for a chatbot model. Clean the data by removing duplicates, correcting typos, and redacting sensitive information. Split the dataset into training and validation subsets (e.g., 80% training, 20% validation) so you can evaluate the model on examples it was not trained on. As a rough guideline, aim for a minimum of 100–200 high-quality examples, though larger datasets (1,000+ examples) often yield better results. Make sure each completion directly addresses its prompt and reflects the style or tone you want the model to learn. For example, if you are training a code-generation model, pair clear natural language prompts with precise code snippets.
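The deduplication and 80/20 split described above can be sketched in a few lines of Python. This is only one way to do it, under stated assumptions: duplicates are exact matches after trimming whitespace, the split is a seeded random shuffle, and the function names and sample records are hypothetical.

```python
import random

def dedupe(records):
    """Drop exact duplicate prompt-completion pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        # Assumption: two records are duplicates if both fields match after trimming.
        key = (rec["prompt"].strip(), rec["completion"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def split_train_validation(records, validation_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Illustrative data: ten distinct records plus one exact duplicate.
records = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(10)]
records.append(dict(records[0]))

unique = dedupe(records)
train, validation = split_train_validation(unique)
print(len(unique), len(train), len(validation))  # prints: 10 8 2
```

Fixing the seed makes the split reproducible, which helps when you compare fine-tuning runs against the same validation set.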

Formatting best practices include using consistent delimiters and avoiding overly long examples. For text-based models, some practitioners add a fixed separator such as \n\n###\n\n at the end of each prompt to mark where the completion begins; if you do, apply it to every example in the same way. Trim prompts and completions to stay within the model's token limit (for example, 2,048 tokens; the actual limit varies by model, so check the quotas for the model you are customizing). Validate your JSONL file with a tool such as jq or Python's json module to catch formatting errors. Finally, store the file in an Amazon S3 bucket, because Bedrock reads training data from S3. Before starting a full fine-tuning job, test a small subset of your data to confirm that Bedrock accepts the format.
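The validation step mentioned above can be sketched with Python's json module. The function name validate_jsonl, the max_chars parameter, and the sample lines are assumptions made for illustration; the character count is only a crude stand-in for a real token count, which depends on the model's tokenizer.

```python
import json
import os
import tempfile

def validate_jsonl(path, max_chars=None):
    """Return a list of (line_number, message) problems found in a JSONL file."""
    problems = []
    # A UnicodeDecodeError raised while reading means the file is not valid UTF-8.
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(obj, dict):
                problems.append((line_number, "line is not a JSON object"))
                continue
            ok = True
            for key in ("prompt", "completion"):
                value = obj.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{key}'"))
                    ok = False
            # Assumption: character length as a rough proxy for the token limit.
            if ok and max_chars is not None:
                if len(obj["prompt"]) + len(obj["completion"]) > max_chars:
                    problems.append((line_number, "example exceeds max_chars"))
    return problems

# Illustrative file: line 1 is valid, line 2 lacks a completion,
# and line 3 has a trailing comma.
sample = (
    '{"prompt": "Translate to French: Hello", "completion": "Bonjour"}\n'
    '{"prompt": "Missing completion"}\n'
    '{"prompt": "Trailing comma", "completion": "x",}\n'
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample)
    problems = validate_jsonl(sample_path)

for line_number, message in problems:
    print(f"line {line_number}: {message}")
```

Running a check like this locally before uploading the file to S3 catches formatting errors early, rather than after a fine-tuning job has been submitted.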
