The Importance of Data Engineering for Successful AI with Airbyte and Zilliz

This article was originally published on DBTA on Oct 17 2024 and reposted with permission.
Collecting and using data effectively is crucial to supporting AI projects at enterprise scale. From data integration and data pipelines to AI performance, data governance, and compliance, adhering to data engineering best practices has never been more important for enabling an AI-powered future.
In DBTA’s latest webinar, Data Engineering Best Practices for AI, Brian Leonard, director of engineering at Airbyte, and Tim Spann, principal developer advocate, Milvus, Zilliz, offered their expertise regarding how data engineering can resolve common challenges associated with deploying and scaling effective AI usage.
As the open source data movement company, Airbyte makes data actionable anywhere, enabling over 20,000 data and AI professionals to manage diverse data across multi-cloud environments, according to Leonard. Regarding Airbyte’s AI use case, many enterprises are leveraging the Airbyte platform to load first-party data into AI apps by extracting records from unstructured sources—such as Google Drive or Salesforce—and moving that data into lakehouses, where users can enable retrieval-augmented generation (RAG) and fine-tuning.
Leonard then took a closer look at the AI data pipeline, examining the journey from extraction to normalization, processing, and usage. Each phase of the pipeline incorporates the following processes:
- Extraction: Data encryption, PII masking, pushdown filters, file transfer, permissions
- Normalization: Schema normalization, data cleaning, deduplication
- Processing: Enrichment, summarization, use-case optimization, document chunking, embeddings calculation
- Usage: Place embeddings into a queryable data store, such as Milvus, a vector database
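The four phases above can be sketched in a few lines of Python. This is a minimal illustration, not Airbyte's or Milvus's implementation: the function names (`mask_pii`, `normalize`, `chunk`, `toy_embedding`, `run_pipeline`) are hypothetical, the email regex is a deliberately simple stand-in for real PII masking, and `toy_embedding` is a hashed bag-of-words placeholder for a real embedding model. The resulting rows are what would be written to a vector store such as Milvus.

```python
import hashlib
import math
import re


def mask_pii(text: str) -> str:
    """Extraction: replace email addresses with a placeholder (simplified PII masking)."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)


def normalize(records: list[str]) -> list[str]:
    """Normalization: collapse whitespace, drop empty and duplicate records, keep order."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join(record.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def chunk(text: str, max_words: int = 8) -> list[str]:
    """Processing: split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embedding(text: str, dim: int = 16) -> list[float]:
    """Processing: hypothetical stand-in for an embedding model (hashed bag of words)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def run_pipeline(records: list[str]) -> list[dict]:
    """Usage: produce rows (text plus vector) ready to insert into a vector store."""
    rows = []
    for doc in normalize([mask_pii(r) for r in records]):
        for piece in chunk(doc):
            rows.append({"text": piece, "vector": toy_embedding(piece)})
    return rows


rows = run_pipeline([
    "Contact  alice@example.com for the report",
    "Contact alice@example.com for the report",
    "",
])
print(len(rows), rows[0]["text"])
```

In a real deployment, each stage would be handled by dedicated tooling, and the embedding step would call an actual model; the sketch only shows how the phases hand data to one another.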
Spann expanded on the advantages of Zilliz’s Milvus, a high-performance, open source vector database built for scale. Vector search, Spann noted, is the new paradigm for AI, as “now, images, text, video, documents—everything is data, and vector search makes it searchable.” In fact, IDC predicts that 90% of newly generated data in 2025 will be unstructured, reflecting a crucial need for vector search.
Vector databases power search across a variety of use cases, from RAG to molecular similarity search, fraud and anomaly detection, multimodal similarity search, and more. At its core, unstructured data, and the ability to extract knowledge from it, is fundamental to AI success.
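To make the idea of vector search concrete, here is a minimal sketch of similarity search over embeddings using cosine similarity and a plain linear scan. The item labels and three-dimensional vectors are made up for illustration, and the helper names (`cosine_similarity`, `top_k`) are hypothetical. Production vector databases typically rely on approximate nearest-neighbor indexes rather than scanning every vector, so treat this as a conceptual illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], items: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, label) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings: two finance documents and one unrelated image.
items = [
    ("invoice", [0.9, 0.1, 0.0]),
    ("receipt", [0.8, 0.2, 0.1]),
    ("cat photo", [0.0, 0.1, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], items, k=2)
print([label for _, label in results])  # ['invoice', 'receipt']
```

The same ranking idea applies whether the vectors come from text, images, or video, which is why a single search mechanism can serve such different use cases.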
Since 2017, Zilliz has been helping organizations make sense of unstructured data. The company was built from the ground up to address the data engineering challenges associated with AI, noted Spann, by a team of algorithm and database engineers with a strong pedigree in developing high-performance, scalable, and highly available distributed systems tailored for vector search.
As a result, Milvus is an easy-to-set-up, feature-rich vector database that offers elastic scaling, reusable code, and expansive integrations, underpinned by a robust, supportive community. Spann then walked webinar viewers through the way Milvus operates, detailing its structure, features, and more.
For the full, in-depth discussion of data engineering for the age of AI, you can view an archived version of the webinar.