Vertex AI supports both online and batch predictions, giving developers the flexibility to choose a serving mode based on latency and throughput requirements. Online predictions are served through Vertex AI Endpoints, which expose models via REST or gRPC APIs for real-time inference. This mode is ideal for applications that need low-latency responses, such as chatbots, fraud detection systems, or personalized recommendations. Batch predictions, by contrast, are processed asynchronously over large datasets, typically reading input from and writing output to Cloud Storage, which makes them efficient for large-scale inference jobs that don't require immediate results.
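To make the online path concrete, here is a minimal sketch of a real-time prediction call using the google-cloud-aiplatform Python SDK. It assumes a model has already been deployed to an endpoint; the project ID, endpoint ID, and instance schema below are placeholders, since the actual instance format depends on the deployed model.

```python
# Minimal online prediction sketch with the google-cloud-aiplatform SDK.
# Project, region, endpoint ID, and the instance payload are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Attach to an existing endpoint (assumes the model is already deployed).
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

# Send a real-time request; instances must match the model's input schema.
response = endpoint.predict(
    instances=[{"feature_a": 0.42, "feature_b": "chat message text"}]
)
print(response.predictions)
```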
Online prediction endpoints automatically scale with traffic, load-balancing requests across multiple instances of the deployed model. Vertex AI logs latency, throughput, and error metrics for monitoring and debugging. Developers can also add custom pre- or post-processing logic around online inference using containerized custom prediction routines. For batch mode, Vertex AI launches parallelized jobs to process large input files or tables, writing results to Cloud Storage in JSON Lines or CSV format. These predictions can then be integrated into analytics pipelines or downstream databases.
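The batch path follows a similar pattern through the same SDK. The sketch below submits a batch prediction job that reads instances from Cloud Storage and writes results back to a Cloud Storage prefix; the model resource name, bucket paths, and machine type are illustrative assumptions.

```python
# Hypothetical batch prediction job: read JSONL instances from Cloud Storage
# and write results to a Cloud Storage prefix. Resource names are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/9876543210"
)

batch_job = model.batch_predict(
    job_display_name="nightly-scoring",
    gcs_source="gs://my-bucket/batch-input/*.jsonl",        # one JSON instance per line
    gcs_destination_prefix="gs://my-bucket/batch-output/",  # results written as JSONL shards
    machine_type="n1-standard-4",
    sync=True,  # block until the job finishes
)
print(batch_job.state)
```

Because the job runs asynchronously on managed workers, the calling process only needs to submit it and (optionally) wait; results land in Cloud Storage for downstream pipelines to pick up.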
When using vector embeddings, Vertex AI predictions often pair with Milvus for retrieval tasks. For example, a model can generate embeddings for new documents or user queries in batch mode and insert them into Milvus. During online prediction, embeddings for a query are compared with stored vectors to retrieve semantically similar results. This hybrid approach allows Vertex AI to handle model inference, while Milvus manages similarity search efficiently. It’s a natural fit for systems like semantic search, recommendation, or contextual retrieval, where both online responsiveness and large-scale embedding management are crucial.
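The sketch below illustrates this hybrid pattern under a few assumptions: Vertex AI's text embedding model generates vectors, and Milvus (via pymilvus) stores and searches them. The collection name, embedding model ID, and vector dimension are placeholders to adapt to the models and schema actually in use.

```python
# Hybrid sketch: Vertex AI produces embeddings, Milvus handles similarity search.
# Model ID ("text-embedding-004"), collection name, and dimension are assumptions.
import vertexai
from vertexai.language_models import TextEmbeddingModel
from pymilvus import MilvusClient

vertexai.init(project="my-project", location="us-central1")
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="docs", dimension=768)  # matches embedding size

# Batch side: embed new documents and insert them into Milvus.
docs = ["Refund policy for enterprise plans", "How to rotate API keys"]
doc_vectors = [e.values for e in embedding_model.get_embeddings(docs)]
client.insert(
    collection_name="docs",
    data=[
        {"id": i, "vector": vec, "text": text}
        for i, (vec, text) in enumerate(zip(doc_vectors, docs))
    ],
)

# Online side: embed the user query and retrieve the most similar documents.
query_vector = embedding_model.get_embeddings(["how do I get a refund?"])[0].values
hits = client.search(
    collection_name="docs", data=[query_vector], limit=3, output_fields=["text"]
)
print(hits)
```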
