Gemini 3 Pro supports multimodal inputs such as text, images, audio, video, and PDFs, all processed within a shared context window. The main limit is not the number of files but the total token budget. Images and videos are internally converted into tokens, and higher-resolution media consumes more of that budget. Developers control this through a media resolution setting, which can be lowered when fine visual detail is not required.
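As a concrete illustration, here is a minimal sketch of lowering media resolution with the google-genai Python SDK. The model id `gemini-3-pro-preview` is an assumption, and the `media_resolution` field on `GenerateContentConfig` may be named or versioned differently in your SDK release; check the current API reference.

```python
# Minimal sketch: send an image at reduced media resolution to save tokens.
# Assumes GOOGLE_API_KEY is set in the environment.
from google import genai
from google.genai import types

client = genai.Client()

with open("dashboard.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model id; substitute your own
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Describe the overall layout of this dashboard.",
    ],
    config=types.GenerateContentConfig(
        # LOW spends fewer tokens per image; reserve HIGH for tasks that
        # depend on small text or fine visual detail.
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,
    ),
)
print(response.text)
```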
For images, higher resolution helps the model read small text, UI elements, and fine patterns, but it costs more tokens and increases latency; lower resolution is sufficient for classification, scene recognition, and high-level descriptions. For video, the model samples frames at a configurable rate, so a longer or higher-resolution video consumes more frames and therefore more tokens. For very long videos, it is best to split them into segments and process each one separately instead of sending everything at once.
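A sketch of that segmenting pattern, assuming the SDK's `VideoMetadata` clipping (`start_offset`/`end_offset`) and `fps` sampling work as documented; the file name, model id, and 10-minute window size are illustrative choices, not requirements.

```python
# Sketch: summarize a long video in 10-minute windows instead of one call.
from google import genai
from google.genai import types

client = genai.Client()
video_file = client.files.upload(file="lecture.mp4")  # hypothetical file
# For large videos, poll client.files.get(name=video_file.name) until the
# file's state is ACTIVE before referencing it in a request.

summaries = []
for start_min in range(0, 60, 10):  # assume a 60-minute recording
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model id
        contents=[
            types.Part(
                file_data=types.FileData(file_uri=video_file.uri),
                video_metadata=types.VideoMetadata(
                    start_offset=f"{start_min * 60}s",
                    end_offset=f"{(start_min + 10) * 60}s",
                    fps=0.5,  # sample one frame every two seconds
                ),
            ),
            "Summarize what happens in this segment.",
        ],
    )
    summaries.append(response.text)
```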
If you’re doing video or image retrieval, you can preprocess frames or images into embeddings and store them in a vector database such as Milvus or Zilliz Cloud. That way, you retrieve only the relevant segments and pass them to Gemini 3 Pro at an appropriate media resolution. In real-world apps, most developers do not feed entire raw videos directly into the model; instead, they use retrieval or selective sampling to stay within the token budget and keep latency predictable.
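A sketch of that retrieval step with pymilvus. The `embed()` and `embed_text()` encoders (e.g. a CLIP-style model) and the `sampled_frames` list of (frame, timestamp) pairs are hypothetical placeholders; the dimension must match whatever embedding model you actually use.

```python
# Sketch: index frame embeddings in Milvus, then retrieve only the
# timestamps relevant to a query before calling Gemini 3 Pro.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # or a Zilliz Cloud URI

client.create_collection(
    collection_name="video_frames",
    dimension=512,  # must match your embedding model's output size
)

# Index sampled frames: one embedding per frame, timestamp as metadata.
# embed() and sampled_frames are placeholders you supply.
client.insert(
    collection_name="video_frames",
    data=[
        {"id": i, "vector": embed(frame), "timestamp_s": ts}
        for i, (frame, ts) in enumerate(sampled_frames)
    ],
)

# At query time, embed the question and fetch only the closest frames;
# then send just those video segments to the model.
hits = client.search(
    collection_name="video_frames",
    data=[embed_text("when does the speaker show the architecture diagram?")],
    limit=5,
    output_fields=["timestamp_s"],
)
for hit in hits[0]:
    print(hit["entity"]["timestamp_s"], hit["distance"])
```

The retrieved timestamps map directly onto the segment offsets shown earlier, so the two sketches compose into a retrieve-then-read pipeline.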
