Nano Banana 2 accepts up to four reference images in a single generation request. These references are passed alongside the text prompt and guide style, composition, color palette, or character appearance, depending on how you describe their role in the prompt. References can be provided as base64-encoded image strings or as URLs pointing to publicly accessible images. Each reference image counts against the request's input token budget: four large references consume far more tokens than one, which raises cost and leaves the model less attention bandwidth for the prompt text.
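As a sketch of what assembling such a request might look like, the helper below accepts both base64 and URL references and enforces the four-image limit. The field names (`reference_images`, `type`, `data`, `url`) are illustrative assumptions, not the model's documented request schema:

```python
import base64
from pathlib import Path

MAX_REFERENCES = 4  # the model accepts up to four reference images per request


def encode_reference(path: str) -> dict:
    """Read a local image file and wrap it as a base64 reference entry.
    The entry shape is a hypothetical example, not a documented schema."""
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"type": "base64", "data": data}


def url_reference(url: str) -> dict:
    """Wrap a publicly accessible image URL as a reference entry."""
    return {"type": "url", "url": url}


def build_request(prompt: str, references: list[dict]) -> dict:
    """Assemble a generation request payload, rejecting more than
    four references before the request is ever sent."""
    if len(references) > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} reference images allowed")
    return {"prompt": prompt, "reference_images": references}
```

Validating the reference count client-side keeps an over-limit request from consuming tokens on a rejected call.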
The effect of reference images on the output depends heavily on how you frame their role in the prompt. If you want the references to influence overall style (lighting, color treatment, artistic medium), describing them as style guides in the prompt produces better results than attaching them without explanation. If you want a specific character's appearance to carry into the generated image, describing the reference as a character reference and pairing it with a detailed textual description gives the model a stronger signal than the image alone. Using all four reference slots for different purposes at once (one for style, one for character, one for environment, one for composition) can produce inconsistent results, because the model may struggle to balance all four influences simultaneously.
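One way to make each reference's role explicit is to prepend a short role sentence per attached image, in attachment order. This is a minimal sketch of that prompt-assembly pattern; the exact wording is an assumption, and the premise that reference order in the prompt matches attachment order is also an assumption worth verifying for your client:

```python
def prompt_with_roles(base_prompt: str, roles: list[str]) -> str:
    """Prefix the prompt with one role sentence per attached reference.
    Roles are assumed to be listed in the same order the images are attached."""
    role_lines = [
        f"Reference image {i + 1} is a {role}." for i, role in enumerate(roles)
    ]
    return " ".join(role_lines + [base_prompt])
```

For example, `prompt_with_roles("A knight resting by a campfire.", ["style guide", "character reference"])` yields a prompt that tells the model what each attachment is for before describing the scene.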
In practice, most production use cases work well with one or two reference images per request. Starting with a single style reference and a detailed prompt is the recommended baseline; add more references incrementally while evaluating whether each addition improves or complicates the output. For workflows that select reference images dynamically based on user input, store a curated set of references and their embeddings in a vector database such as Zilliz Cloud; similarity search can then retrieve the most relevant references for a given prompt before you assemble the generation request.
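The retrieval step above can be sketched as a top-k cosine-similarity search over the curated reference embeddings. For clarity this runs in-process with NumPy; a production workflow would issue the equivalent search against the vector database (e.g. Zilliz Cloud) instead, and the embedding dimensions here are toy values:

```python
import numpy as np


def top_k_references(query_emb: np.ndarray, ref_embs: np.ndarray, k: int = 2) -> list[int]:
    """Return the indices of the k reference embeddings most similar to the
    query embedding, ranked by cosine similarity (highest first)."""
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q  # cosine similarity of each reference against the query
    return np.argsort(-sims)[:k].tolist()
```

The returned indices identify which stored reference images to attach to the generation request, keeping the request within the one-or-two-reference baseline even when the curated set is large.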
