Vertex AI endpoints are managed HTTP/gRPC services that host one or more models for online prediction. You create an endpoint, deploy a model (or multiple versions) to it, and Vertex AI handles autoscaling, load balancing, health checks, and rolling updates. Each endpoint exposes a stable URL with IAM-controlled access, so clients can send inference requests without needing to know about the underlying infrastructure. Endpoints let you specify machine types, accelerators, minimum/maximum replicas, and per-replica concurrency. You can also define traffic splits to run canary tests or A/B experiments across model versions without redeploying clients.
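As a rough sketch of what that looks like with the google-cloud-aiplatform Python SDK (the project ID, region, model resource name, and display names below are placeholders, and the machine type and replica counts are illustrative assumptions):

```python
from google.cloud import aiplatform

# Placeholder project and region -- replace with your own.
aiplatform.init(project="my-project", location="us-central1")

# Create an endpoint: a stable, IAM-controlled serving target.
endpoint = aiplatform.Endpoint.create(display_name="recs-endpoint")

# Deploy a registered model version with explicit machine type,
# autoscaling bounds, and an initial traffic share (placeholder model ID).
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="recs-v2",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=4,
    traffic_percentage=10,  # canary: route 10% of requests to this version
)
```

Deploying a second model version to the same endpoint with its own traffic share shifts the split server-side, so clients keep calling the same URL while you canary and promote versions.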
Using endpoints is straightforward. After training or importing a model, you register it in the Model Registry and deploy it to an endpoint. Clients send JSON (or protobuf for gRPC) with inputs, and the endpoint returns predictions along with optional custom metadata. Logs and metrics flow into Cloud Logging and Cloud Monitoring, where you track latency, throughput, and error rates. For complex preprocessing or custom business rules, you package a custom prediction container that wraps your model and implements request/response handling. Batch workloads stay off online endpoints and run through Vertex AI Batch Prediction, so they don't starve real-time traffic.
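A minimal request sketch, assuming a hypothetical endpoint ID and an instance schema that matches whatever your deployed model or custom container expects:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder endpoint resource name -- substitute your own endpoint ID.
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/987654321"
)

# Each instance must follow the input format your model expects;
# this feature dict is only an illustrative assumption.
response = endpoint.predict(
    instances=[{"user_id": "u_42", "recent_items": ["sku_1", "sku_9"]}]
)

print(response.predictions)        # one prediction per instance
print(response.deployed_model_id)  # which deployed version served the call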
In retrieval-augmented or search-heavy systems, endpoints often work alongside Milvus. One endpoint generates query embeddings (fast, lightweight), Milvus performs ANN search with metadata filters, and another endpoint handles generation or classification using the retrieved context. This pattern keeps endpoints focused on model inference, while Milvus delivers vector search at low latency. You can cache frequent embeddings, use short timeouts with retries, and add backpressure controls to maintain SLOs. By combining traffic splitting, detailed monitoring, and a clear separation of concerns, Vertex AI endpoints become predictable building blocks for scaling real-time AI services.
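A minimal sketch of that retrieval pipeline, assuming two hypothetical endpoints (one for embeddings, one for generation) whose request/response schemas are placeholders, plus a Milvus collection named `docs` with a `chunk` text field and a `lang` metadata field:

```python
from google.cloud import aiplatform
from pymilvus import MilvusClient

# Placeholder resource names -- replace with your own endpoints and Milvus URI.
aiplatform.init(project="my-project", location="us-central1")
embed_endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/111111"
)
gen_endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/222222"
)
milvus = MilvusClient(uri="http://milvus.internal:19530")


def answer(query: str) -> str:
    # 1. Embed the query with the lightweight embedding endpoint.
    query_vec = embed_endpoint.predict(instances=[{"text": query}]).predictions[0]

    # 2. ANN search in Milvus, filtered on metadata.
    hits = milvus.search(
        collection_name="docs",
        data=[query_vec],
        limit=5,
        filter='lang == "en"',
        output_fields=["chunk"],
    )
    context = "\n".join(hit["entity"]["chunk"] for hit in hits[0])

    # 3. Generate an answer grounded in the retrieved context.
    result = gen_endpoint.predict(
        instances=[{"prompt": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    return result.predictions[0]
```

Keeping the two model calls on separate endpoints lets each scale and fail independently, while Milvus sits between them as the retrieval layer.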
