The Importance of Data Engineering for Successful AI with Airbyte and Zilliz

Oct 18, 2024 · 2 min read

This article was originally published on DBTA on Oct 17, 2024, and is reposted with permission.

Collecting data and putting it to use is crucial to supporting AI projects at enterprise scale. From data integration and data pipelines to AI performance, data governance, and compliance, adhering to data engineering best practices has never been more important for enabling an AI-powered future.

In DBTA’s latest webinar, Data Engineering Best Practices for AI, Brian Leonard, director of engineering at Airbyte, and Tim Spann, principal developer advocate for Milvus at Zilliz, shared their expertise on how data engineering can resolve common challenges in deploying and scaling AI effectively.

As the open source data movement company, Airbyte makes data actionable anywhere, enabling over 20,000 data and AI professionals to manage diverse data across multi-cloud environments, according to Leonard. Regarding Airbyte’s AI use case, many enterprises are leveraging the Airbyte platform to load first-party data into AI apps by extracting records from unstructured sources—such as Google Drive or Salesforce—and moving that data into lakehouses, where users can enable retrieval-augmented generation (RAG) and fine-tuning.

Leonard then took a closer look at the AI data pipeline, examining the journey from extraction to normalization, processing, and usage. Each phase of the pipeline incorporates the following processes:

Extraction: Data encryption, PII masking, pushdown filters, file transfer, permissions
Normalization: Schema normalization, data cleaning, deduplication
Processing: Enrichment, summarization, use-case optimization, document chunking, embedding calculation
Usage: Place embeddings into a queryable data store, such as Milvus, a vector database
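
To make the first three phases concrete, here is a minimal Python sketch of one step from each: PII masking at extraction, whitespace normalization and deduplication, and document chunking. It is an illustration under stated assumptions, not Airbyte's implementation. The function names, the email-only masking rule, and the word-window chunk sizes are all hypothetical choices made for the example.

```python
import hashlib
import re

# Assumption: email addresses are the only PII handled in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text: str) -> str:
    """Extraction: replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def normalize(text: str) -> str:
    """Normalization: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def deduplicate(records: list[str]) -> list[str]:
    """Normalization: drop case-insensitive exact duplicates, keeping first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        digest = hashlib.sha256(record.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def chunk(text: str, max_words: int = 8, overlap: int = 2) -> list[str]:
    """Processing: split a document into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks


def prepare(records: list[str]) -> list[str]:
    """Run masking, normalization, deduplication, and chunking in order."""
    cleaned = [normalize(mask_pii(r)) for r in records]
    return [piece for doc in deduplicate(cleaned) for piece in chunk(doc)]
```

Calling `prepare(records)` on a list of raw strings returns the chunks that would then be passed to an embedding step. The overlap between chunks is a common way to keep sentences that straddle a boundary retrievable from either side.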

Spann expanded on the advantages of Zilliz’s Milvus, a high-performance, open source vector database built for scale. Vector search, Spann noted, is the new paradigm for AI, as “now, images, text, video, documents—everything is data, and vector search makes it searchable.” In fact, IDC predicts that 90% of newly generated data in 2025 will be unstructured, reflecting a crucial need for vector search.

Vector databases power search across a variety of use cases, from RAG to molecular similarity search, fraud and anomaly detection, multimodal similarity search, and more. Unstructured data, and the ability to extract knowledge from it, is fundamental to AI success.
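
The core idea behind these use cases can be sketched in a few lines of Python: turn each item into a vector, then rank stored vectors by similarity to a query vector. The code below is a toy stand-in, not how Milvus works internally. `embed` is a hypothetical hashed bag-of-words function standing in for a real embedding model, and `InMemoryVectorStore` is a hypothetical brute-force store standing in for a vector database.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use hundreds or thousands


def embed(text: str) -> list[float]:
    """Toy embedding: hashed bag-of-words, L2-normalized (assumption, not a real model)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class InMemoryVectorStore:
    """Hypothetical brute-force store: compares the query against every row."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def insert(self, text: str) -> None:
        self._rows.append((text, embed(text)))

    def search(self, query: str, limit: int = 3) -> list[tuple[str, float]]:
        q = embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (text, sum(a * b for a, b in zip(q, vec)))
            for text, vec in self._rows
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


store = InMemoryVectorStore()
for doc in ["fraud and anomaly detection", "molecular similarity search"]:
    store.insert(doc)
hits = store.search("fraud and anomaly detection", limit=1)
```

A brute-force scan like this costs time proportional to the number of stored vectors on every query. Production vector databases typically rely on approximate nearest-neighbor indexes to keep search fast at scale.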

Since 2017, Zilliz has been helping organizations make sense of unstructured data. Its team of algorithm and database engineers has a strong pedigree in developing high-performance, scalable, and highly available distributed systems tailored for vector search. Zilliz was built from the ground up to address the data engineering challenges associated with AI, Spann noted.

As a result, Milvus is an easy-to-set-up, feature-rich vector database that offers elastic scaling, reusable code, and expansive integrations, underpinned by a robust, supportive community. Spann then walked webinar viewers through how Milvus operates, detailing its structure and features.

For the full, in-depth webinar discussing data engineering for the age of AI, you can view an archived version of the webinar here.

Updated on Jul 27, 2026

Sydney Blanchard
Sydney Blanchard is the Editorial Assistant at Database Trends and Applications, a division of Information Today, Inc.
