Weaviate

Overview

This page guides you through the process of setting up the Weaviate destination connector.

There are three parts to this:

  • Processing - split individual records into chunks so they fit the context window, and decide which fields to use as context and which are supplementary metadata.
  • Embedding - convert the text into a vector representation using a pre-trained model. Currently, OpenAI's text-embedding-ada-002 and Cohere's embed-english-light-v2.0 are supported.
  • Indexing - store the vectors in a vector database for similarity search

Prerequisites

To use the Weaviate destination, you'll need:

  • Access to a running Weaviate instance (either self-hosted or via Weaviate Cloud Services), minimum version 1.21.2
  • Either
    • An account with API access for OpenAI or Cohere (depending on which embedding method you want to use)
    • Pre-calculated embeddings stored in a field in your source database

You'll need the following information to configure the destination:

  • Embedding service API Key - The API key for your OpenAI or Cohere account
  • Weaviate cluster URL - The URL of the Weaviate cluster to load data into. Airbyte Cloud only supports connecting to your Weaviate instance with TLS encryption.
  • Weaviate credentials - The credentials for your Weaviate instance (either API token or username/password)

Features

| Feature | Supported? (Yes/No) | Notes |
|---------|---------------------|-------|
| Full Refresh Sync | Yes | |
| Incremental - Append Sync | Yes | |
| Incremental - Append + Deduped | Yes | |
| Namespaces | No | |
| Provide vector | Yes | Either from field or calculated during the load process |

Data type mapping

All fields specified as metadata fields will be stored as properties in the object and can be used for filtering. The following data types are allowed for metadata fields:

  • String
  • Number (integer or floating point, gets converted to a 64-bit floating point)
  • Boolean (true, false)
  • List of strings

All other fields are serialized into their JSON representation.
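The mapping rules above can be sketched in a few lines of Python. This is a minimal illustration, not the connector's actual code: the function name `coerce_metadata_value` is hypothetical, and treating `null` as "any other value" that gets JSON-serialized is an assumption.

```python
import json
from typing import Any


def coerce_metadata_value(value: Any) -> Any:
    """Map a metadata value to one of the allowed property types (illustrative only)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)  # numbers become 64-bit floating point values
    if isinstance(value, str):
        return value
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    # Assumption: everything else (objects, mixed lists, null) is stored as JSON text.
    return json.dumps(value)
```

For example, `coerce_metadata_value(3)` yields the float `3.0`, while a nested object such as `{"a": 1}` is stored as the string `{"a": 1}`.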

Configuration

Processing

Each record is split into text fields and metadata fields as configured in the "Processing" section. All text fields are concatenated into a single string, which is then split into chunks of the configured length. If specified, the metadata fields are stored as-is along with the embedded text chunks. The chunking options are implemented with the LangChain Python library.

When specifying text fields, you can access nested fields in the record by using dot notation, e.g. user.name will access the name field in the user object. It's also possible to use wildcards to access all fields in an object, e.g. users.*.name will access the name fields in all entries of the users array.
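To make the path syntax concrete, here is a small sketch of how dot notation and wildcards could be resolved against a record. The helper `resolve_path` is hypothetical and only illustrates the semantics described above; the connector itself may resolve paths differently.

```python
from typing import Any, List


def resolve_path(record: Any, path: str) -> List[Any]:
    """Return every value in `record` matched by a dot-separated path."""
    current = [record]
    for part in path.split("."):
        matched: List[Any] = []
        for node in current:
            if part == "*":
                # A wildcard fans out over list items or dict values.
                if isinstance(node, list):
                    matched.extend(node)
                elif isinstance(node, dict):
                    matched.extend(node.values())
            elif isinstance(node, dict) and part in node:
                matched.append(node[part])
        current = matched
    return current


record = {"user": {"name": "Ada"}, "users": [{"name": "Bob"}, {"name": "Cy"}]}
print(resolve_path(record, "user.name"))
print(resolve_path(record, "users.*.name"))
```

Paths that match nothing produce an empty list, so a missing field does not raise an error in this sketch.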

The chunk length is measured in tokens produced by the tiktoken library. The maximum is 8191 tokens, which is the maximum length supported by the text-embedding-ada-002 model.
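As a rough illustration of token-based chunking, the sketch below splits text into fixed-size chunks and enforces the 8191-token ceiling. It is deliberately simplified: whitespace-separated words stand in for tiktoken tokens so the example has no dependencies, and the function name `chunk_text` and its parameters are assumptions. The real connector relies on LangChain's splitters, which behave differently from this fixed-size split.

```python
from typing import Callable, List

MAX_CHUNK_TOKENS = 8191  # limit stated above for text-embedding-ada-002


def chunk_text(
    text: str,
    chunk_size: int,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in for a real tokenizer
) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` tokens."""
    if not 1 <= chunk_size <= MAX_CHUNK_TOKENS:
        raise ValueError(f"chunk_size must be between 1 and {MAX_CHUNK_TOKENS}")
    tokens = tokenize(text)
    # Joining with spaces is only valid for the whitespace stand-in tokenizer.
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

With a real tokenizer, the chunks would be decoded from token IDs rather than joined with spaces.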

The stream name gets added as a metadata field _ab_stream to each document. If available, the primary key of the record is used to identify the document to avoid duplicates when updated versions of records are indexed. It is added as the _ab_record_id metadata field.
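The bookkeeping fields can be pictured as follows. This is a sketch under assumptions: the helper name `bookkeeping_metadata` is hypothetical, and joining primary key values with underscores is an illustrative ID format, not necessarily the one the connector uses.

```python
from typing import Any, Dict, List, Optional


def bookkeeping_metadata(
    stream: str,
    record: Dict[str, Any],
    primary_key: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build the _ab_stream and (if possible) _ab_record_id metadata fields."""
    metadata: Dict[str, Any] = {"_ab_stream": stream}
    if primary_key:
        # Hypothetical ID format: primary key values joined with underscores.
        metadata["_ab_record_id"] = "_".join(str(record[field]) for field in primary_key)
    return metadata
```

Because the same record always yields the same `_ab_record_id`, an updated version of a record can replace the earlier document instead of being stored next to it.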

Embedding

The connector can use one of the following embedding methods:

  1. OpenAI - using the OpenAI API, the connector will produce embeddings using the text-embedding-ada-002 model with 1536 dimensions. This integration will be constrained by the speed of the OpenAI embedding API.

  2. Cohere - using the Cohere API, the connector will produce embeddings using the embed-english-light-v2.0 model with 1024 dimensions.

  3. From field - if you have pre-calculated embeddings stored in a field in your source database, you can use the From field integration to load them into Weaviate. The field must be a JSON array of numbers, e.g. [0.1, 0.2, 0.3].

  4. No embedding - if you don't want to use embeddings or have configured a vectorizer for your class, you can use the No embedding integration.

For testing purposes, it's also possible to use the Fake embeddings integration. It will generate random embeddings and is suitable to test a data pipeline without incurring embedding costs.
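The From field and Fake options can be illustrated with a short sketch. Both helpers below are hypothetical: the expected `dimensions` argument and the seeded random generator are assumptions made for the example, not documented connector behavior.

```python
import json
import random
from typing import Any, List


def embedding_from_field(value: Any, dimensions: int) -> List[float]:
    """Validate a pre-calculated embedding taken from a record field."""
    # Accept either an already-parsed list or its JSON text form.
    vector = json.loads(value) if isinstance(value, str) else value
    if not isinstance(vector, list) or not all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector
    ):
        raise ValueError("embedding must be a JSON array of numbers")
    if len(vector) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
    return [float(x) for x in vector]


def fake_embedding(dimensions: int, seed: int = 0) -> List[float]:
    """Generate a random vector, useful for testing a pipeline without API costs."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(dimensions)]
```

A call such as `embedding_from_field("[0.1, 0.2, 0.3]", 3)` returns the parsed vector, while a value containing non-numeric entries raises `ValueError`.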

Indexing

All streams will be indexed into separate classes derived from the stream name. If a class doesn't exist in the schema of the cluster, it will be created using the configured vectorizer configuration. In this case, dynamic schema has to be enabled on the server.

You can also create the class in Weaviate in advance if you need more control over the schema in Weaviate. In this case, the text properties _ab_stream and _ab_record_id need to be created for bookkeeping reasons. In case a sync is run in Overwrite mode, the class will be deleted and recreated.

Because properties in Weaviate have to start with a lowercase letter and can't contain spaces or special characters, field names might be updated during the loading process. The field names id, _id and _additional are reserved keywords in Weaviate, so they will be renamed to raw_id, raw__id and raw_additional respectively.
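A sketch of this renaming is shown below. Only the three reserved-name mappings come from the text above; replacing disallowed characters with underscores and prefixing names that start with a digit are assumptions chosen for illustration, and `normalize_property_name` is a hypothetical helper.

```python
import re

# Reserved names and their replacements, as listed above.
RESERVED = {"id": "raw_id", "_id": "raw__id", "_additional": "raw_additional"}


def normalize_property_name(name: str) -> str:
    """Rename a field so it is usable as a Weaviate property name (illustrative)."""
    if name in RESERVED:
        return RESERVED[name]
    # Assumption: spaces and special characters are replaced with underscores.
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", name)
    # Assumption: a leading digit gets an underscore prefix.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    # Lowercase the first character; the rest of the name is left untouched.
    return cleaned[:1].lower() + cleaned[1:]
```

Under these assumptions, a field called `First Name` becomes `first_Name`, and the bookkeeping field `_ab_stream` is left unchanged.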

When using multi-tenancy, the tenant id can be configured in the connector configuration. If not specified, multi-tenancy will be disabled. If you want to index into an existing class, make sure that class was created with multi-tenancy enabled. If the class doesn't exist, it will be created with multi-tenancy properly configured. If the class already exists but the tenant id is not associated with it, the connector will automatically add the tenant id to the class. This allows you to configure multiple connections for different tenants on the same schema.
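The decision flow for tenants can be modeled with an in-memory stand-in for the Weaviate schema. Everything in this sketch is hypothetical (the `ClassState` type, the `prepare_class` helper, and the choice to raise an error when an existing class lacks multi-tenancy); it does not call the Weaviate client and only mirrors the rules described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ClassState:
    """In-memory stand-in for a class in the Weaviate schema."""
    multi_tenancy: bool
    tenants: Set[str] = field(default_factory=set)


def prepare_class(
    schema: Dict[str, ClassState], class_name: str, tenant_id: Optional[str]
) -> ClassState:
    """Create the class if needed and make sure the tenant is associated with it."""
    use_tenancy = bool(tenant_id)  # no tenant id means multi-tenancy is disabled
    state = schema.get(class_name)
    if state is None:
        # Missing class: create it with multi-tenancy configured as requested.
        state = ClassState(multi_tenancy=use_tenancy)
        schema[class_name] = state
    elif use_tenancy and not state.multi_tenancy:
        # Assumption: a pre-created class without multi-tenancy is an error.
        raise ValueError(f"class {class_name!r} was created without multi-tenancy")
    if tenant_id:
        state.tenants.add(tenant_id)  # no-op if the tenant already exists
    return state
```

Calling `prepare_class` twice for the same class with different tenant ids leaves one class with two tenants, which corresponds to two connections sharing one schema.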

Changelog

| Version | Date | Pull Request | Subject |
|---------|------|--------------|---------|
| 0.2.60 | 2025-09-16 | 61103 | Update dependencies |
| 0.2.59 | 2025-05-17 | 57180 | Update dependencies |
| 0.2.58 | 2025-03-29 | 56089 | Update dependencies |
| 0.2.57 | 2025-03-08 | 55424 | Update dependencies |
| 0.2.56 | 2025-03-01 | 54880 | Update dependencies |
| 0.2.55 | 2025-02-22 | 54278 | Update dependencies |
| 0.2.54 | 2025-02-15 | 53894 | Update dependencies |
| 0.2.53 | 2025-02-08 | 53424 | Update dependencies |
| 0.2.52 | 2025-02-01 | 52944 | Update dependencies |
| 0.2.51 | 2025-01-25 | 52211 | Update dependencies |
| 0.2.50 | 2025-01-18 | 51759 | Update dependencies |
| 0.2.49 | 2025-01-11 | 51259 | Update dependencies |
| 0.2.48 | 2025-01-04 | 50908 | Update dependencies |
| 0.2.47 | 2024-12-28 | 50444 | Update dependencies |
| 0.2.46 | 2024-12-21 | 50182 | Update dependencies |
| 0.2.45 | 2024-12-14 | 49317 | Update dependencies |
| 0.2.44 | 2024-11-25 | 48640 | Update dependencies |
| 0.2.43 | 2024-11-04 | 48244 | Update dependencies |
| 0.2.42 | 2024-10-29 | 47063 | Update dependencies |
| 0.2.41 | 2024-10-12 | 46848 | Update dependencies |
| 0.2.40 | 2024-10-05 | 46465 | Update dependencies |
| 0.2.39 | 2024-09-28 | 46189 | Update dependencies |
| 0.2.38 | 2024-09-21 | 45822 | Update dependencies |
| 0.2.37 | 2024-09-14 | 45560 | Update dependencies |
| 0.2.36 | 2024-09-07 | 45216 | Update dependencies |
| 0.2.35 | 2024-08-31 | 44964 | Update dependencies |
| 0.2.34 | 2024-08-24 | 44668 | Update dependencies |
| 0.2.33 | 2024-08-22 | 44530 | Update test dependencies |
| 0.2.32 | 2024-08-17 | 44216 | Update dependencies |
| 0.2.31 | 2024-08-12 | 43906 | Update dependencies |
| 0.2.30 | 2024-08-10 | 43599 | Update dependencies |
| 0.2.29 | 2024-08-03 | 43084 | Update dependencies |
| 0.2.28 | 2024-07-27 | 42629 | Update dependencies |
| 0.2.27 | 2024-07-20 | 42283 | Update dependencies |
| 0.2.26 | 2024-07-13 | 41935 | Update dependencies |
| 0.2.25 | 2024-07-10 | 41504 | Update dependencies |
| 0.2.24 | 2024-07-09 | 41222 | Update dependencies |
| 0.2.23 | 2024-07-06 | 40943 | Update dependencies |
| 0.2.22 | 2024-06-29 | 40633 | Update dependencies |
| 0.2.21 | 2024-06-25 | 40274 | Update dependencies |
| 0.2.20 | 2024-06-22 | 40109 | Update dependencies |
| 0.2.19 | 2024-06-06 | 39212 | [autopull] Upgrade base image to v1.2.2 |
| 0.2.18 | 2024-05-15 | 38272 | Replace AirbyteLogger with logging.Logger |
| 0.2.17 | 2024-04-15 | #37333 | Update CDK & pytest version to fix security vulnerabilities. |
| 0.2.16 | 2024-03-22 | #35911 | Fix tests and move to Poetry |
| 0.2.15 | 2024-01-25 | #34529 | Fix tests |
| 0.2.14 | 2024-01-15 | #34229 | Allow configuring tenant id |
| 0.2.13 | 2023-12-11 | #33303 | Fix bug with embedding special tokens |
| 0.2.12 | 2023-12-07 | #33218 | Normalize metadata field names |
| 0.2.11 | 2023-12-01 | #32697 | Allow omitting raw text |
| 0.2.10 | 2023-11-16 | #32608 | Support deleting records for CDC sources |
| 0.2.9 | 2023-11-13 | #32357 | Improve spec schema |
| 0.2.8 | 2023-11-03 | #32134 | Improve test coverage |
| 0.2.7 | 2023-11-03 | #32134 | Upgrade weaviate client library |
| 0.2.6 | 2023-11-01 | #32038 | Retry failed object loads |
| 0.2.5 | 2023-10-24 | #31953 | Fix memory leak |
| 0.2.4 | 2023-10-23 | #31563 | Add field mapping option, improve append+dedupe sync performance and remove unnecessary retry logic |
| 0.2.3 | 2023-10-19 | #31599 | Base image migration: remove Dockerfile and use the python-connector-base image |
| 0.2.2 | 2023-10-15 | #31329 | Add OpenAI-compatible embedder option |
| 0.2.1 | 2023-10-04 | #31075 | Fix OpenAI embedder batch size and conflict field name handling |
| 0.2.0 | 2023-09-22 | #30151 | Add embedding capabilities, overwrite and dedup support and API key auth mode, make certified. 🚨 Breaking changes - check migrations guide. |
| 0.1.1 | 2023-02-08 | #22527 | Multiple bug fixes: Support String based IDs, arrays of unknown type and additionalProperties of type object and array of objects |
| 0.1.0 | 2022-12-06 | #20094 | Add Weaviate destination |