## What
Migrate the protocol messages to Pydantic V2 to speed up record emission. This gives us a 2.5x speedup over V1.
Close https://github.com/airbytehq/airbyte-internal-issues/issues/8333
## How
- Switch to the protocol models generated for Pydantic V2, published in a new (temporary) package, `airbyte-protocol-models-pdv2`.
- Update the CDK's pydantic dependency to v2 accordingly.
- To minimize impact, keep using the `pydantic.v1` compatibility module in all airbyte-cdk pydantic code that does not interact with the protocol models.
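As a rough illustration of the last point, code that keeps the V1 API only needs its import changed. This is a minimal sketch assuming Pydantic V2 is installed; `ExampleStreamConfig` and its fields are hypothetical and not taken from the CDK.

```python
# Assumes Pydantic V2 is installed; it ships the V1 API under `pydantic.v1`.
from pydantic.v1 import BaseModel, Field, ValidationError


class ExampleStreamConfig(BaseModel):
    """Hypothetical config model used only for illustration."""

    name: str = Field(..., description="Stream name")
    days_to_sync: int = Field(default=30, ge=1)


# V1-style calls such as parse_obj() and dict() keep working unchanged.
config = ExampleStreamConfig.parse_obj({"name": "users"})
config_as_dict = config.dict()

# V1-style validation errors are raised from the same compatibility module.
try:
    ExampleStreamConfig.parse_obj({"name": "users", "days_to_sync": 0})
    rejected = False
except ValidationError:
    rejected = True
```

Only the import line differs from code written against pydantic 1.10, which is why this path keeps the change small for models that never touch the protocol messages.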
## Review guide
1. Check out the code and clear your CDK virtual env (either `rm -rf .venv && python -m venv .venv` or `poetry env list; poetry env remove <env>`). This is necessary to fully clean out the `airbyte_protocol` library, for some reason. Then run `poetry lock --no-update && poetry install --all-extras`. This should install the CDK with the new models.
2. Run the CDK unit tests.
3. Take your favorite connector, point its `pyproject.toml` at the local CDK (see the example in `source-s3`), and try running its tests and its regression tests.
## User Impact
> [!warning]
> This is a major CDK change because of the pydantic dependency change. Connectors that use pydantic 1.10 will break and will need similar `from pydantic.v1` import updates to get running again. Therefore, we should release this as a major CDK version bump.
## Can this PR be safely reverted and rolled back?
- [x] YES 💚
- [ ] NO ❌
Even if sources migrate to this version, state format should not change, so a revert should be possible.
## Follow up work - Ella to move into issues
<details>
### Source-s3 - turn this into an issue
- [ ] Update source s3 CDK version and any required code changes
- [ ] Fix source-s3 unit tests
- [ ] Run source-s3 regression tests
- [ ] Merge and release source-s3 by June 21st
### Docs
- [ ] Update documentation on how to build with CDK
### CDK pieces
- [ ] Update file-based CDK format validation to use Pydantic V2
- This is doable, but it requires a breaking change to `OneOfOptionConfig`. A few unhandled test cases raise issues we are not yet sure how to handle.
- [ ] Update low-code component generators to use Pydantic V2
- This is doable, but a few issues around custom component generation are still unhandled.
### Further CDK performance work - create issues for these
- [ ] Research if we can replace prints with buffered output (write to byte buffer and then flush to stdout)
- [ ] Replace `json` with `orjson`
...
</details>
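The buffered-output idea listed in the follow-up work could look roughly like the sketch below. It is not the CDK's implementation: `emit_buffered` and `flush_every` are hypothetical names, and it uses the standard `json` module rather than `orjson`.

```python
import io
import json
from typing import Any, BinaryIO, Iterable, Mapping


def emit_buffered(
    records: Iterable[Mapping[str, Any]],
    out: BinaryIO,
    flush_every: int = 1000,
) -> int:
    """Serialize records as JSON lines into a byte buffer and write them in batches.

    Hypothetical helper: returns the number of records written.
    """
    buffer = io.BytesIO()
    pending = 0
    total = 0
    for record in records:
        buffer.write(json.dumps(record).encode("utf-8"))
        buffer.write(b"\n")
        pending += 1
        total += 1
        if pending >= flush_every:
            # One write per batch instead of one print per record.
            out.write(buffer.getvalue())
            buffer = io.BytesIO()
            pending = 0
    if pending:
        out.write(buffer.getvalue())
    out.flush()
    return total
```

To write to standard output, pass `sys.stdout.buffer` as `out`. Whether batching actually helps is the open research question, so any gain would need to be measured.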
* remove version, make validation_policy enum, fix input_schema for s3 and abstract file based configs
* remove multiple file format options from stream config
* pr feedback
* fix tests after rebase
* additional spec changes to work with the UI
* fix tests post-rebase
* fix tests post-rebase and cleanup
* formatting
* remove invalid legacy option
* remove unused option
* the tests pass but this is quite messy
* very slight clean up
* Add skip options to csv format
* fix some of the typing issues
* fixme comment
* remove extra log message
* fix typing issues
* skip before header
* skip after header
* format
* add another test
* Automated Commit - Formatting Changes
* auto generate column names
* delete dead code
* update title and description
* true and false values
* Update the tests
* Add comment
* missing test
* rename
* update expected spec
* move to method
* Update comment
* fix typo
* remove unused import
* Add a comment
* None records do not pass the WaitForDiscoverPolicy
* format
* remove second branch to ensure we always go through the same processing
* Raise an exception if the record is None
* reset
* Update tests
* handle unquoted newlines
* Automated Commit - Formatting Changes
* Update test case so the quoting is explicit
* Update comment
* Automated Commit - Formatting Changes
* Fail validation if skipping rows before header and header is autogenerated
* always fail if a record cannot be parsed
* format
* set write line_no in error message
* remove none check
* Automated Commit - Formatting Changes
* enable autogenerate test
* remove duplicate test
* missing unit tests
* Update
* remove branching
* remove unused none check
* Update tests
* remove branching
* format
* extract to function
* comment
* missing type
* type annotation
* use set
* Document that the strings are case-sensitive
* public -> private
* add unit test
* newline
---------
Co-authored-by: girarda <girarda@users.noreply.github.com>
* add avro parser for inferring schema and reading records
* fix mypy check not caught locally
* pr feedback and some additional types
* add decimal_as_float for avro
* formatting + mypy
* tests pass
* everything except parquet config seems to work
* the file format needs a literal
* Add a comment
* Update
* comment
* Ensure only one file type is specified
* Add a test
* add test
* update
* Automated Commit - Formatting Changes
* extract formats
* Automated Commit - Formatting Changes
* fix typo
* Update tests
* Also test jsonl
* Update airbyte-cdk/python/airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py
Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
* Update the spec
* update to new config format
* set decimal_as_float to True on legacy configs for backward compatibility
* comments
* Update airbyte-cdk/python/airbyte_cdk/sources/file_based/config/file_based_stream_config.py
Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
* format
---------
Co-authored-by: girarda <girarda@users.noreply.github.com>
Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
* Try running only on modified files
* make a change
* return something with the wrong type
* Revert "return something with the wrong type"
This reverts commit 23b828371e.
* fix typing in file-based
* format
* Mypy
* fix
* leave as Mapping
* Revert "leave as Mapping"
This reverts commit 908f063f70.
* Use Dict
* update
* move dict()
* Revert "move dict()"
This reverts commit fa347a8236.
* Revert "Revert "move dict()""
This reverts commit c9237df2e4.
* Revert "Revert "Revert "move dict()"""
This reverts commit 5ac1616414.
* use Mapping
* point to config file
* comment
* strict = False
* remove --
* Revert "comment"
This reverts commit 6000814a82.
* install types
* install types in same command as mypy runs
* non-interactive
* freeze version
* pydantic plugin
* plugins
* update
* ignore missing import
* Revert "ignore missing import"
This reverts commit 1da7930fb7.
* Install pydantic instead
* fix
* this passes locally
* strict = true
* format
* explicitly import models
* Update
* remove old mypy.ini config
* temporarily disable mypy
* format
* any
* format
* fix tests
* format
* Automated Commit - Formatting Changes
* Revert "temporarily disable mypy"
This reverts commit eb8470fa3f.
* implicit reexport
* update test
* fix mypy
* Automated Commit - Formatting Changes
* fix some errors in tests
* more type fixes
* more fixes
* more
* .
* done with tests
* fix last files
* format
* Update gradle
* change source-stripe
* only run mypy on cdk
* remove strict
* Add more rules
* update
* ignore missing imports
* cast to string
* Allow untyped decorator
* reset to master
* move to the cdk
* derp
* move explicit imports around
* Automated Commit - Formatting Changes
* Revert "move explicit imports around"
This reverts commit 56e306b72f.
* move explicit imports around
* Upgrade mypy version
* point to config file
* Update readme
* Ignore errors in the models module
* Automated Commit - Formatting Changes
* move check to gradle build
* Any
* try checking out master too
* Revert "try checking out master too"
This reverts commit 8a8f3e373c.
* fetch master
* install mypy
* try without origin
* fetch from the script
* checkout master
* ls the branches
* remotes/origin/master
* remove some cruft
* comment
* remove pydantic types
* unpin mypy
* fetch from the script
* Update connectors base too
* modify a non-cdk file to confirm it doesn't get checked by mypy
* run mypy after generateComponentManifestClassFiles
* run from the venv
* pass files as arguments
* update
* fix when running without args
* with subdir
* path
* try without /
* ./
* remove filter
* try resetting
* Revert "try resetting"
This reverts commit 3a54c424de.
* exclude autogen file
* do not use the github action
* works locally
* remove extra fetch
* run on connectors base
* try bad typing
* Revert "try bad typing"
This reverts commit 33b512a3e4.
* reset stripe
* Revert "reset stripe"
This reverts commit 28f23fc6dd.
* Revert "Revert "reset stripe""
This reverts commit 5bf5dee371.
* missing return type
* do not ignore the autogen file
* remove extra installs
* run from venv
* Only check files modified on current branch
* Revert "Only check files modified on current branch"
This reverts commit b4b728e654.
* use merge-base
* Revert "use merge-base"
This reverts commit 3136670cbf.
* try with updated mypy
* bump
* run other steps after mypy
* reset task ordering
* run mypy though
* looser config
* tests pass
* fix mypy issues
* type: ignore
* optional
* this is always a bool
* ignore
* fix typing issues
* remove ignore
* remove mapping
* Automated Commit - Formatting Changes
* Revert "remove ignore"
This reverts commit 9ffeeb6cb1.
* update config
---------
Co-authored-by: girarda <girarda@users.noreply.github.com>
Co-authored-by: Joe Bell <joseph.bell@airbyte.io>