Blog
Text as Data, From Anywhere to Anywhere

Text as Data, From Anywhere to Anywhere

Jun 07, 20249 min read

Introduction

In March 2024, we heard from AJ Steers at the SF Unstructured Data Meetup about how we can utilize Airbyte and PyAirbyte to use text as data, from anywhere to anywhere. This means we can integrate and utilize both structured and unstructured data from various sources across different platforms.

Link to the YouTube replay of AJ Steers's talk: Watch on YouTube.

Who is AJ Steers and Why Should You Care About Unified Data Integration?

AJ Steers is an experienced architect, data engineer, software developer, and data ops expert. He has designed end-to-end solutions at Amazon and created a vision for quantified self-data models. As of now, he is a staff software engineer at Airbyte.

Unified data integration merges diverse data types and sources into a single, cohesive system for efficient analysis and processing. This capability is crucial in harnessing the full potential of your data, ensuring seamless access and utilization across various platforms and applications. As AJ puts it, “More than ever before, we are surrounded by usable data” and now it's all about opportunity cost. How much time do we have to get data that we think can be valuable from the source?

With that, let's dive into AJ's talk on “Text as Data, From Anywhere to Anywhere” using Airbyte.

Airbyte’s Expansion of supported sources and destinations

Airbyte’s focus so far has been on offering reliability, flexible deployment options, and a robust library of connectors to ensure seamless data integration for traditional tabular data. But what about unstructured data? AJ talks about what the Airbyte team has been up to for the past six months.

In the past six months, from March 2024 backward, Airbyte has expanded its capabilities to cover unstructured data sources in addition to traditional tabular data sources. This is due to the growing importance and prevalence of unstructured data in today’s data ecosystem. This expansion has seen Airbyte go beyond SQL-type and file-type destinations to cover vector database destinations like Milvus, ensuring we can effectively utilize data across various applications. Let's have a look at some of the connectors supported by Airbyte for both structured and unstructured data sources.

Airbyte supports more than 350 connectors hence we cannot list all of them. But the above slide shared by AJ shows sample connectors for each source and destination category. So far, we have been talking about unstructured data and how Airbyte is expanding to support them. But what are real-world examples of this data?

Think of this blog you are reading, is it arranged in tabular form? it's a prime example of unstructured data, composed of text that doesn't fit neatly into rows and columns. Unstructured data refers to information that doesn’t reside in a traditional row-column database. It includes data like text, images, videos, and social media posts. Let's take a look at some of the unstructured data shared by AJ in the slide below.

With this kind of data and Airbyte's assistance in consuming it, different communities have different expectations from Airbyte.

GenAl developers: Want streamlined integrations with other ecosystem tools and paradigms.
Data scientists: Want libraries that can run in a Jupyter notebook, decoupled from a central data warehouse.
Data engineers: Want a pipelines-as-code approach, the ability to run CI test pipelines, and friction-free development.
Developers: Want to build and iterate locally, then deploy later.

This shows we all want great libraries, good interoperability, and more control. But it all comes down to what you want. Some of us want something we can code for and control while others like to be less technical and use a user interface. Traditionally, this has led us to completely different paths. Airbyte is trying to break these barriers to anybody who can run and write Python code or anybody who can just copy-paste a few lines of code using PyAirbyte. Let's see what PyAibyte is and its pros.

Introducing PyAirbyte

PyAirbyte is a Python library that provides an interface to interact with Airbyte. It allows us to control and manage our Airbyte instances using Python. Think of it as bringing the power of Airbyte to every Python developer.

Advantages of PyAirbyte

Run anywhere: We can run PyAirbyte in a Jupyter notebook, decoupled from any data warehouse. This means we can use it in many environments, not just those tied to a specific data warehouse, giving us the flexibility to work in the environment that suits our needs best.

Reduce time to value: With PyAirbyte, we can build and iterate on data pipelines locally before deploying them. This allows us to test and refine our pipelines to ensure they're working as expected, which helps reduce the time it takes to get valuable insights from the data.

Fast prototyping: PyAirbyte doesn't require a user interface or signups, so we can start defining our data pipelines in code right away. This speeds up the prototyping process.

Flexible: We can use PyAirbyte with the tools and frameworks in our ecosystem. This means we can integrate it with the other tools we're already using, making it easier to incorporate into our existing workflows and processes.

Now, let’s dive into how you can get started with Airbyte and PyAirbyte in just a few steps.

UI and Code Demos

AJ digs into a demo session showing how we can use the hosted version of Airbyte (no code approach) and PyAirbyte (minimal code approach) to integrate our data sources with our data destinations.

UI Approach Demo

For those who love clicking buttons more than typing code, the UI approach is a dream come true. It’s like the ‘easy mode’ in your favorite video game. We will keep this brief as we have already covered how to use Airbyte’s hosted version to connect a data source and destination in detail. The linked article goes beyond the integration part and shows us how to build a real-world app using the established connection.

There are four main steps involved in using the no-code approach to connect a data source with a data destination.

Sign up or log in to your Airbyte account. When you sign up you are given a 14 day trial.
Configure the data source.
Configure the data destination.
Create a connection between the source and destination.

When the connection is established, you are ready to use your data. If you are a code geek like me and prefer the code approach, PyAirbyte has you covered.

Code Approach Demo

Fetching Data Using PyAirbyte in Three Steps

The process of fetching data using PyAirbyte is quite simple.

Step 1:

Create a source using the get_source() function.


import airbyte as ab

# Check for supported sources

print(ab.get_available_connectors())

#  Create the data source we want

source = ab.get_source("source-github")

To have a visual look of all the connectors Airbyte supports, have a look at the supported connectors page. But what if your source’s connector is not supported? Then you need to build your own custom connector. Don’t worry it's not as hard as it sounds. You just have to follow this guide as Airbyte offers a no-code connector builder which lets us build connectors in less than 10 minutes.

Step 2:

Configure our source with set_config().


source.set_config(

    {

        # Replace with your desired repository

    "repositories": ["FINCH285/engineering-education"],

    "credentials": {

        "personal_access_token": ab.get_secret(

        "GITHUB_PERSONAL_ACCESS_TOKEN"

            )

        }

    }

)

#  Validate the config and credentials

source.check()

This is just passing where in the source we want to get the data from and the required authorization credentials.

Step 3:

Read the data using the read() function.


# See what streams are avilable from the source

print(source.get_available_streams())

#Pull data rom the steams we choose

read_result = source.read(

    streams=['branches', 'collaborators']

)

# Use the interop option 

branches_dataframe = read_result["branches"].to_pandas()

branches_sql_table = read_result["branches"].to_sql_table()

branches_lln_docs = read_result["branches"].to_documents()

The interop option allows us to convert the fetched data into different data structures that best fit our use case. This completes the three-step process. However, it's not mandatory to follow the three steps independently to fetch data. Airbyte supports a one-step process to achieve the same.

Fetching Data Using PyAirbyte in a Single Step

This is the speedrun version. It involves combining all the three steps into a single step.


import airbyte as ab

source = ab.get_source(

    "source-github",

    config={

        # Replace with your desired repository

        "repositories": ["FINCH285/engineering-education"],

        "credentials": {

            "personal_access_token": ab.get_secret(

                "GITHUB_PERSONAL_ACCESS_TOKEN"

            ),

        },

    },

    streams=['branches', 'collaborators']

).read()

The code does all three operations in a single step. It creates the source, configures it, and finally reads the data. You can now use the interop option to convert the fetched data into different data structures as we did earlier. But if you want to generate LLM documents from tabular data, the next section shows you how to.

Generating LLM Documents From Tabular Data

If you want your data to be compatible with LangChain you need to convert it to LLM Documents. You can then use these documents for various natural language processing tasks, such as question-answering, summarization, or text generation.


import rich

from itertools import islice

dataset = source_github.get_documents(

   "issues",

   title_property="title",

   content_properties=["body"],

)

# Grab a few documents (skipping first 2)

documents = list(islice(dataset, 2, 5))

# Print as markdown

for document in documents:

   display(rich.markdown.Markdown(document.content))

The code will fetch the GitHub issues and render them as documents. It will then set the title of the document to the title of the issue and the document contents will be set to the body of the issue. It will then use rich to print the documents in markdown format to the console.

With two choices available, let's see which approach best suits you.

Choosing Between Airbyte and PyAirbyte

If you want the zero code to vector stores approach go for the hosted version. If you want full control with minimal code go for PyAirbyte. AJ asserts that in theory anywhere Python runs supported by a virtual environment PyAirbyte can run. I have put together a table to help you decide which approach suits you.


Criteria	Airbyte (UI/Hosted)	PyAirbyte
Approach	No-code	Code/Python
Setup	Simple sign-up process	Requires Python coding skills
User Interface	Provides a user-friendly interface	No user interface, code-based
Data Integration	Suitable for both traditional and modern data integration tasks (unstructured data, vector databases, etc.)	Suitable for both traditional and modern data integration tasks (unstructured data, vector databases, etc.)
Flexibility	Limited to the options provided in the UI	Highly flexible, can be integrated with other Python tools and frameworks
Environment	Hosted environment	Can run in any Python environment (e.g., Jupyter Notebook, local machine)
Deployment	Deploys data pipelines to Airbyte's hosted environment	Code-based deployment can be integrated with CI/CD pipelines
Prototyping	Suitable for non-technical users or quick prototyping	Ideal for technical users and fast prototyping
Learning Curve	Lower learning curve	Requires Python coding skills or basic coding skills

After the comparison, you can continue to choose your own adventure.

PyAirbyte’s Upcoming Features

PyAirbyte is still in the beta stage. Hence as of now, it does not support full destinations meaning it doesn’t have vector stores as destinations but we can pass them along to LangChain. Aj shares a couple of expectations we should have from the PyAirbyte team this year: What's Next for PyAirbyte?

Q2 2024 Plans:

Improve support for Python 3.9 and 3.11
Add Windows support
Improve API for configuring sources
Enable interoperability with Airbyte Cloud, OSS, and Enterprise versions

H2 2024 Plans:

Add support for Reverse ETL
Add support for Vector Store destinations
Add support for Publish-type destinations

If you want to offer feedback and report bugs, consider opening an issue on PyAirbyte’s GitHub page.

Conclusion

Whether you prefer a no-code or minimal-code approach, Airbyte and PyAirbyte offer robust solutions for integrating both structured and unstructured data. AJ Steers' painted a good picture of the potential of these tools in revolutionizing data workflows. For more information, check out the resources below and start your journey in leveraging text as data, from anywhere to anywhere.

Further Resources

https://zilliz.com/blog/use-milvus-and-airbyte-for-similarity-search-on-all-your-data
https://zilliz.com/blog/zilliz-introduces-upsert-kafka-connector-and-airbyte-integration
https://github.com/airbytehq/quickstarts/tree/main/pyairbyte_notebooks

Link to the YouTube replay of AJ Steer’s talk: Watch the talk on YouTube.

Updated on Jul 01, 2025

Denis Kuria
Denis is a machine learning engineer who enjoys writing guides to help other developers. He has a bachelor's in computer science and loves hiking and exploring the world.

Content

Start Free, Scale Easily

Try the fully-managed vector database built for your GenAI applications.

Try Zilliz Cloud for Free

Share this article

Keep Reading

Zilliz Named "Highest Performer" and "Easiest to Use" in G2's Summer 2025 Grid® Report for Vector Databases

This dual recognition shows that Zilliz solved a challenge that has long defined the database industry—delivering enterprise-grade performance without the complexity typically associated with it.

DeepSeek vs. OpenAI: A Battle of Innovation in Modern AI

Compare OpenAI's o1 and o3-mini with DeepSeek R1's open-source alternative. Discover which AI model offers the best balance of reasoning capabilities and cost efficiency.

Introducing Milvus 2.5: Built-in Full-Text Search, Advanced Query Optimization, and More 🚀

We're thrilled to announce the release of Milvus 2.5, a significant step in our journey to build the world's most complete solution for all search workloads.