Vertex AI pricing is consumption-based and depends on which capabilities you use. For training, you pay for the compute and accelerators (CPU/GPU/TPU) allocated to your custom jobs or AutoML runs, plus any managed services (e.g., hyperparameter tuning) that spin up multiple trials. Storage of artifacts in Cloud Storage and metadata in the control plane also contribute to cost, as do any attached data services like BigQuery. Because you specify machine types and counts, you can right-size runs and control spend via time limits and early stopping.
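As an illustration, a minimal sketch of a right-sized custom training job using the google-cloud-aiplatform Python SDK might look like the following. The project, bucket, region, and container image names are placeholders, and the exact machine spec and time limit worth setting depend on your workload; the intent is only to show where machine types, accelerator counts, and a wall-clock cap are declared.

```python
from google.cloud import aiplatform

# Placeholder project, region, and staging bucket for illustration only.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# One worker pool: a single n1-standard-8 VM with one T4 GPU.
# Sizing the pool explicitly is the main lever for right-sizing spend.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            # Placeholder training container image.
            "image_uri": "us-docker.pkg.dev/my-project/train/trainer:latest"
        },
    }
]

job = aiplatform.CustomJob(
    display_name="right-sized-training-run",
    worker_pool_specs=worker_pool_specs,
)

# Cap wall-clock time (in seconds) so a runaway run cannot accumulate cost
# indefinitely; early stopping inside the training code complements this.
job.run(timeout=4 * 3600)
```

For hyperparameter tuning, the analogous cost levers are the maximum and parallel trial counts on the tuning job, since each trial launches its own worker pool.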
For inference, online endpoints are billed for the compute (and accelerators) provisioned on each replica; autoscaling adjusts replica counts with traffic, but you pay for whatever is provisioned over time. Batch prediction is charged per node-hour consumed during the batch run, which can be more economical for large, non-urgent workloads. Additional costs include network egress (where applicable) and logging/monitoring data volume. Prebuilt foundation models may carry separate per-token or per-request fees; custom models incur the compute cost of the nodes that serve them.
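A hedged sketch of both serving modes with the same SDK is shown below; the model resource name, bucket paths, and machine types are assumptions, not recommendations. The point is that replica bounds on the endpoint and node counts on the batch job are where the cost trade-off is expressed.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder model ID; resolves against the project/location set above.
model = aiplatform.Model("1234567890")

# Online serving: provisioned replicas are billed whether or not traffic
# arrives, so keep min_replica_count low and let autoscaling add capacity.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=3,
)

# Batch prediction: billed per node-hour only while the job runs, which is
# usually cheaper for large, non-urgent scoring workloads.
batch_job = model.batch_predict(
    job_display_name="nightly-scoring",
    gcs_source="gs://my-bucket/inputs/*.jsonl",
    gcs_destination_prefix="gs://my-bucket/outputs/",
    machine_type="n1-standard-4",
    starting_replica_count=2,
    max_replica_count=10,
)
```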
In vector-heavy systems, cost splits across embedding generation and retrieval. Generating embeddings online may require small, always-on endpoints tuned for throughput; bulk generation should use batch prediction for better price-performance. If you store vectors in Milvus, factor in its compute and storage footprint; if you use a managed vector service, include that service’s index-hosting and query charges. Optimize by batching embedding requests, caching frequent query embeddings, pruning unused vectors, and choosing index types and compression (e.g., product quantization, PQ) that balance recall against memory and compute costs.
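The sketch below illustrates three of those optimizations together, assuming pymilvus and the vertexai SDK, a reachable Milvus instance, an existing "documents" collection with an "embedding" field, and the "textembedding-gecko@003" model name as placeholders; parameter values are examples to tune, not defaults to copy.

```python
from functools import lru_cache

import vertexai
from pymilvus import Collection, connections
from vertexai.language_models import TextEmbeddingModel

# Placeholder project/region and embedding model name.
vertexai.init(project="my-project", location="us-central1")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

connections.connect(host="localhost", port="19530")
collection = Collection("documents")  # assumes this collection already exists

# IVF_PQ applies product quantization (m sub-quantizers, nbits per code),
# trading some recall for a much smaller memory and compute footprint
# than a flat index.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_PQ",
        "metric_type": "L2",
        "params": {"nlist": 1024, "m": 16, "nbits": 8},
    },
)

def embed_batch(texts: list[str]) -> list[list[float]]:
    # Batch documents into one request instead of embedding them one at a
    # time; per-request batch limits vary by model version.
    return [e.values for e in embedding_model.get_embeddings(texts)]

@lru_cache(maxsize=10_000)
def query_embedding(text: str) -> tuple[float, ...]:
    # Cache embeddings for frequently repeated queries so identical text is
    # not re-billed; a tuple keeps the cached value hashable and immutable.
    return tuple(embed_batch([text])[0])
```

Pruning unused vectors is handled separately, by deleting stale entities from the collection on whatever retention schedule fits your data.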
