## What
Migrate protocol messages to Pydantic V2 to speed up emitting records. This gives us a 2.5x boost over V1.

Close https://github.com/airbytehq/airbyte-internal-issues/issues/8333

## How
- Switch to using protocol models generated for pydantic_v2, in a new (temporary) package, `airbyte-protocol-models-pdv2`.
- Update the pydantic dependency of the CDK to v2 accordingly.
- For minimal impact, still use the compatibility code `pydantic.v1` in all of our pydantic code from airbyte-cdk that does not interact with the protocol models.

## Review guide
1. Check out the code and clear your CDK virtual env (either `rm -rf .venv && python -m venv .venv` or `poetry env list; poetry env remove <env>`). This is necessary to fully clean out the `airbyte_protocol` library, for some reason. Then: `poetry lock --no-update && poetry install --all-extras`. This should install the CDK with the new models.
2. Run unit tests on the CDK.
3. Take your favorite connector, point its `pyproject.toml` to the local CDK (see example in `source-s3`), and try running its tests and its regression tests.

## User Impact
> [!warning]
> This is a major CDK change due to the pydantic dependency change. If connectors use pydantic 1.10, they will break and will need to do similar `from pydantic.v1` updates to get running again. Therefore, we should release this as a major CDK version bump.

## Can this PR be safely reverted and rolled back?
- [x] YES 💚
- [ ] NO ❌

Even if sources migrate to this version, state format should not change, so a revert should be possible.
## Follow up work - Ella to move into issues
<details>

### Source-s3 - turn this into an issue
- [ ] Update source s3 CDK version and any required code changes
- [ ] Fix source-s3 unit tests
- [ ] Run source-s3 regression tests
- [ ] Merge and release source-s3 by June 21st

### Docs
- [ ] Update documentation on how to build with CDK

### CDK pieces
- [ ] Update file-based CDK format validation to use Pydantic V2
  - This is doable, and requires a breaking change to change `OneOfOptionConfig`. There are a few unhandled test cases that present issues we're unsure of how to handle so far.
- [ ] Update low-code component generators to use Pydantic V2
  - This is doable, there are a few issues around custom component generation that are unhandled.

### Further CDK performance work - create issues for these
- [ ] Research if we can replace prints with buffered output (write to byte buffer and then flush to stdout)
- [ ] Replace `json` with `orjson`

...

</details>
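
For the buffered-output item in the follow-up list, here is a rough sketch of what "write to a byte buffer and then flush to stdout" could look like. It uses only the standard library; `emit_records` is a hypothetical helper for illustration, not an existing CDK function, and a real change would need to be measured before adoption.

```python
import io
import json
import sys
from typing import Any, BinaryIO, Iterable, Mapping, Optional


def emit_records(
    records: Iterable[Mapping[str, Any]],
    out: Optional[BinaryIO] = None,
) -> int:
    """Serialize records into an in-memory buffer, then write them out once.

    Hypothetical helper, not a CDK API. Returns the number of bytes written.
    """
    out = out if out is not None else sys.stdout.buffer
    buffer = io.BytesIO()
    for record in records:
        # One JSON document per line, stored as raw bytes.
        buffer.write(json.dumps(record).encode("utf-8"))
        buffer.write(b"\n")
    payload = buffer.getvalue()
    out.write(payload)  # a single write instead of one print() per record
    out.flush()
    return len(payload)
```

Passing an `io.BytesIO()` as `out` makes the helper easy to exercise in a unit test without touching the real stdout.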
CDK Migration Guide
Importing classes
Starting from 1.0.0, CDK classes and functions should be imported directly from airbyte_cdk (example: from airbyte_cdk import HttpStream). Lower-level __init__ files are not considered stable, and will be modified without introducing a major release.
Introducing breaking changes to a class or function exported from the top level __init__.py will require a major version bump and a migration note to help developers upgrade.
Note that the following packages are not part of the top level init because they require extras dependencies, but are still considered stable:
- destination.vector_db_based
- source.file_based
The test package is not included in the top level init either. The test package is still evolving and isn't considered stable.
Upgrading to 2.0.0
Version 2.0.0 of the CDK updates the pydantic dependency from Pydantic v1 to Pydantic v2. It also
updates the airbyte-protocol-models dependency to a version that uses Pydantic V2 models.
The changes to Airbyte CDK itself are backwards-compatible, but some changes are required if the connector:
- uses Pydantic directly, e.g. for its own custom models, or
- uses the airbyte_protocol models directly, or airbyte_cdk.models, which points to airbyte_protocol models, or
- customizes HashableStreamDescriptor, which inherits from a protocol model and has therefore been updated to use Pydantic V2 models.
Some test assertions may also need updating due to changes to default serialization of the protocol models.
Updating direct usage of Pydantic
If the connector uses pydantic, the code will need to be updated to reflect the changed pydantic dependency version.
The Pydantic migration guide is a great resource for any questions that
might arise around upgrade behavior.
Using Pydantic V1 models with Pydantic V2
The easiest way to make the code compatible without major changes is to change the import statements from
`from pydantic` to `from pydantic.v1`, as Pydantic has kept the v1 module for backwards compatibility.
Some potential gotchas:
- ValidationError must be imported from pydantic.v1.error_wrappers instead of pydantic.v1
- ModelMetaclass must be imported from pydantic.v1.main instead of pydantic.v1
- resolve_annotations must be imported from pydantic.v1.typing instead of pydantic.v1
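
As a minimal sketch of this import swap, assuming Pydantic V2 is installed (it ships the `pydantic.v1` compatibility module), a V1-style model can keep its existing API. `ExampleConfig`, `parse_config`, and `is_valid` are hypothetical names used only for illustration.

```python
# Assumes Pydantic V2 is installed, which provides the pydantic.v1 module.
from pydantic.v1 import BaseModel, Field
from pydantic.v1.error_wrappers import ValidationError


class ExampleConfig(BaseModel):
    """Hypothetical connector config; still a Pydantic V1-style model."""

    api_key: str = Field(..., description="API key for a hypothetical service")
    page_size: int = 100


def parse_config(raw: dict) -> ExampleConfig:
    # parse_obj is the V1 API and keeps working through pydantic.v1.
    return ExampleConfig.parse_obj(raw)


def is_valid(raw: dict) -> bool:
    # The error type comes from pydantic.v1.error_wrappers, per the list above.
    try:
        parse_config(raw)
    except ValidationError:
        return False
    return True
```

Only the import lines differ from the Pydantic V1 version of this code; the model definition and the `parse_obj` call stay the same.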
Upgrading to Pydantic V2
To upgrade all the way to V2 proper, Pydantic also offers a migration tool to automatically update the code to be compatible with Pydantic V2.
Updating assertions
It's possible that a connector might make assertions against protocol models without actually
importing them - for example when testing methods which return AirbyteStateBlob or AnyUrl.
To resolve this, either compare directly to a model, or call dict() or str() on the model, depending
on whether you care most about the model or the serialized output (for a method which returns a model, option 1 is
preferred). For example:
# Before
assert stream_read.slices[1].state[0].stream.stream_state == {"a_timestamp": 123}
# After - Option 1
from airbyte_cdk.models import AirbyteStateBlob
assert stream_read.slices[1].state[0].stream.stream_state == AirbyteStateBlob(a_timestamp=123)
# After - Option 2
assert stream_read.slices[1].state[0].stream.stream_state.dict() == {"a_timestamp": 123}
Upgrading to 1.0.0
A few classes were deleted from the Airbyte CDK in version 1.0.0:
- AirbyteLogger
- AirbyteSpec
- Authenticators in the sources.streams.http.auth module
Migrating off AirbyteLogger
No connectors should still be using AirbyteLogger directly, but the class is still used in some interfaces. The only required change is to update the type annotation from AirbyteLogger to logging.Logger. For example:
def check_connection(self, logger: AirbyteLogger, config: Mapping[str, Any]) -> Tuple[bool, Any]:
to
def check_connection(self, logger: logging.Logger, config: Mapping[str, Any]) -> Tuple[bool, Any]:
Don't forget to also update the imports. You can delete from airbyte_cdk import AirbyteLogger and replace it with import logging.
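
A self-contained sketch of the updated annotation and import might look like the following. `ExampleSource` is a hypothetical stand-in for a connector's source class, and the `api_key` check is an assumption made only to give the method a body.

```python
import logging
from typing import Any, Mapping, Tuple


class ExampleSource:
    """Hypothetical stand-in for a connector source class."""

    def check_connection(
        self, logger: logging.Logger, config: Mapping[str, Any]
    ) -> Tuple[bool, Any]:
        # The logger is a standard library logging.Logger, so the usual
        # info/warning/error methods are available.
        if "api_key" not in config:
            logger.warning("Config is missing api_key")
            return False, "api_key is required"
        logger.info("Connection check passed")
        return True, None
```

Any `logging.Logger` can be passed in, for example `logging.getLogger("airbyte")`.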
Migrating off AirbyteSpec
AirbyteSpec isn't used by any connectors in the repository, and I don't expect any custom connectors to use the class either. This should be a no-op.
Migrating off Authenticators
Replace usage of authenticators in the airbyte_cdk.sources.streams.http.auth module with their sister classes in the airbyte_cdk.sources.streams.http.requests_native_auth module.
If any of your streams reference self.authenticator, you'll also need to update these references to self._session.auth as the authenticator is embedded in the session object.
Here is a pull request that can serve as an example.