Vertex AI supports feature engineering and storage by integrating data services (BigQuery, Cloud Storage) with orchestration (Vertex AI Pipelines) and with patterns for serving features in production. For offline engineering, teams often use BigQuery SQL or Dataflow to build aggregates, joins, tokenized text, and transformation tables. These artifacts are versioned by time (e.g., as date-partitioned tables) and referenced in training jobs so that the exact snapshot used for model fitting is recorded. For online paths, you can containerize lightweight transformation code (e.g., standardization, categorical encodings) and deploy it alongside your prediction container so that the transforms applied at serving time match the ones applied at training time.
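One way to keep training-time and serving-time transforms aligned is to fit the transformation parameters once, serialize them, and have both paths load the same file. The sketch below is a minimal, standard-library illustration of that idea, not a Vertex AI API: the function names (`fit_transform_params`, `apply_transform`), the column names, and the convention of mapping unknown categories to -1 are all assumptions made for the example.

```python
import json
import math

def fit_transform_params(rows, numeric_col, categorical_col):
    """Compute standardization stats and a vocabulary from training rows."""
    values = [r[numeric_col] for r in rows]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant columns
    vocab = sorted({r[categorical_col] for r in rows})
    return {
        "numeric_col": numeric_col,
        "categorical_col": categorical_col,
        "mean": mean,
        "std": std,
        "vocab": vocab,
    }

def apply_transform(row, params):
    """Apply stored parameters; categories not in the vocabulary map to -1."""
    index = {token: i for i, token in enumerate(params["vocab"])}
    return {
        "scaled": (row[params["numeric_col"]] - params["mean"]) / params["std"],
        "category_id": index.get(row[params["categorical_col"]], -1),
    }

# Hypothetical training rows; in practice these would come from the offline tables.
train_rows = [
    {"amount": 10.0, "country": "DE"},
    {"amount": 20.0, "country": "US"},
    {"amount": 30.0, "country": "DE"},
]

params = fit_transform_params(train_rows, "amount", "country")

# The JSON string stands in for an artifact written next to the model.
serialized = json.dumps(params)

# The serving container loads the same artifact instead of recomputing stats.
serving_params = json.loads(serialized)
print(apply_transform({"amount": 20.0, "country": "FR"}, serving_params))
# prints {'scaled': 0.0, 'category_id': -1}
```

Because the serving path only reads the stored parameters, a change to the training data cannot silently alter serving behavior until a new artifact is published.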
In practice, you define a repeatable pipeline: extract raw data into BigQuery, compute features with SQL or Beam (Dataflow), export the training set to Cloud Storage, and trigger a Vertex AI training job. Store schemas (feature names, dtypes, value ranges) and transformation parameters (e.g., scalers, vocabularies) as artifacts, and validate them in CI before training. For text and image workloads, feature engineering often means producing embeddings; you can either compute them inside the training job or run a preprocessing step that generates embeddings and writes them back to Cloud Storage or BigQuery, so that later training runs reuse them instead of recomputing them.
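The schema check mentioned above can be sketched as a small validation function that a CI job runs against a sample of the training set. This is an illustrative, standard-library example: the `SCHEMA` layout, the feature names, and the `validate_rows` helper are hypothetical and do not correspond to a specific Vertex AI or BigQuery interface.

```python
# Hypothetical schema artifact: names, dtypes, and ranges are illustrative.
SCHEMA = {
    "user_id": {"dtype": str},
    "purchases_30d": {"dtype": int, "min": 0},
    "avg_basket_value": {"dtype": float, "min": 0.0, "max": 10000.0},
}

def validate_rows(rows, schema):
    """Return a list of violations; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        for name in sorted(set(schema) - set(row)):
            errors.append(f"row {i}: missing feature '{name}'")
        for name in sorted(set(row) - set(schema)):
            errors.append(f"row {i}: unexpected feature '{name}'")
        for name, spec in schema.items():
            if name not in row:
                continue
            value = row[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, spec["dtype"]):
                errors.append(
                    f"row {i}: '{name}' expected {spec['dtype'].__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if "min" in spec and value < spec["min"]:
                errors.append(f"row {i}: '{name}'={value} below min {spec['min']}")
            if "max" in spec and value > spec["max"]:
                errors.append(f"row {i}: '{name}'={value} above max {spec['max']}")
    return errors

good = [{"user_id": "u1", "purchases_30d": 3, "avg_basket_value": 42.5}]
bad = [{"user_id": "u2", "purchases_30d": -1, "avg_basket_value": "n/a"}]

assert validate_rows(good, SCHEMA) == []
for problem in validate_rows(bad, SCHEMA):
    print(problem)
```

In a CI step, a non-empty result would typically fail the build (for example by exiting with a non-zero status) so that the training job is never triggered on data that violates the recorded schema.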
When your application uses semantic retrieval, embeddings become “features at inference time.” Vertex AI can generate these embeddings online or in batch, while Milvus provides the storage and search layer for the vectors. A common pattern is to maintain a table that maps document IDs to metadata (title, URI, permissions) alongside a Milvus collection that stores the corresponding vectors. Dataflow jobs can keep Milvus in sync as content changes, and BigQuery remains the analytical source for monitoring retrieval quality (e.g., click-through rates, recall@k, latency distributions). This structure keeps feature pipelines traceable and gives production systems consistent, low-latency access to both scalar features and vectors.
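The split between a metadata table and a vector collection, together with the recall@k metric, can be illustrated with a small in-memory sketch. It deliberately does not call Milvus or BigQuery: the `VECTORS` dictionary is a brute-force stand-in for a Milvus collection, `METADATA` stands in for the metadata table, and the document IDs, URIs, group names, and helper functions are all hypothetical.

```python
import math

# Stand-in for the metadata table keyed by document ID (hypothetical values).
METADATA = {
    "doc-1": {"title": "Refund policy", "uri": "gs://example-bucket/refund.txt",
              "allowed_groups": {"support"}},
    "doc-2": {"title": "Pricing FAQ", "uri": "gs://example-bucket/pricing.txt",
              "allowed_groups": {"support", "sales"}},
    "doc-3": {"title": "Sales playbook", "uri": "gs://example-bucket/playbook.txt",
              "allowed_groups": {"sales"}},
}

# Stand-in for the vector collection: document ID -> embedding.
VECTORS = {
    "doc-1": [1.0, 0.0],
    "doc-2": [0.8, 0.6],
    "doc-3": [0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, k, user_group):
    """Rank by similarity, drop documents the group cannot see, return top k."""
    ranked = sorted(VECTORS, key=lambda d: cosine(query_vec, VECTORS[d]), reverse=True)
    visible = [d for d in ranked if user_group in METADATA[d]["allowed_groups"]]
    return [{"id": d, "title": METADATA[d]["title"], "uri": METADATA[d]["uri"]}
            for d in visible[:k]]

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # undefined when there are no relevant documents
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

hits = search([1.0, 0.1], k=2, user_group="sales")
hit_ids = [h["id"] for h in hits]
print(hit_ids)                                      # prints ['doc-2', 'doc-3']
print(recall_at_k(hit_ids, {"doc-1", "doc-2"}, 2))  # prints 0.5
```

In this toy run, the most similar document (`doc-1`) is excluded by the permission filter, which is why recall@2 is 0.5 rather than 1.0. Logging such per-query values to an analytical store is one way to track retrieval quality over time.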
