mirror of synced 2025-12-26 05:05:18 -05:00
Commit Graph

101 Commits

Author SHA1 Message Date
Serhii Lazebnyi
dd5adef0d4 feat(concurrent-cdk): add to concurrent per slice tracking of the most recent cursor (#45180) 2024-10-07 23:26:48 +02:00
Alexandre Girard
f01a43cc4b bug(cdk) Always return a connection status even if an exception was raised (#45205) 2024-09-27 04:26:26 +00:00
Daryna Ishchenko
b3406937c6 feat(airbyte-cdk): add transform_record() to class DefaultFileBasedStream (#45698) 2024-09-24 17:02:21 +03:00
Artem Inzhyyants
df34893b63 feat(airbyte-cdk): replace pydantic BaseModel with dataclasses + serpyco-rs in protocol (#44444)
Signed-off-by: Artem Inzhyyants <artem.inzhyyants@gmail.com>
2024-09-02 17:48:17 +02:00
Erick Corona
fc8cd5a554 fix(python-cdk): add user friendly message for encoding errors (#44438)
Co-authored-by: Alexandre Girard <alexandre@airbyte.io>
2024-08-28 11:13:10 -06:00
Brian Lai
fca0460030 [airbyte-cdk] tech-debt Remove support for parsing legacy state message format (#43459) 2024-08-16 21:06:37 -04:00
Serhii Lazebnyi
aaaf12e055 [file-based cdk] add excel file type support (#43346) 2024-08-14 15:05:15 +02:00
Anton Karpets
6c439a8859 [file-based cdk]: add config option to limit number of files for schema discover (#39317)
Co-authored-by: askarpets <anton.karpets@globallogic.com>
Co-authored-by: Serhii Lazebnyi <serhii.lazebnyi@globallogic.com>
Co-authored-by: Serhii Lazebnyi <53845333+lazebnyi@users.noreply.github.com>
2024-07-11 15:16:09 +02:00
Ella Rohm-Ensing
fc12432305 airbyte-cdk: only update airbyte-protocol-models to pydantic v2 (#39524)
## What

Migrate the protocol messages to Pydantic V2 to speed up emitting records. This gives us a 2.5x boost over V1.

Closes https://github.com/airbytehq/airbyte-internal-issues/issues/8333

## How
- Switch to using protocol models generated for pydantic_v2, in a new (temporary) package, `airbyte-protocol-models-pdv2`.
- Update pydantic dependency of the CDK accordingly to v2.
- For minimal impact, still use the compatibility code `pydantic.v1` in all of our pydantic code from airbyte-cdk that does not interact with the protocol models.

## Review guide
1. Check out the code and clear your CDK virtual env (either `rm -rf .venv && python -m venv .venv` or `poetry env list; poetry env remove <env>`). This is necessary to fully clean out the `airbyte_protocol` library, for some reason. Then run `poetry lock --no-update && poetry install --all-extras`. This should install the CDK with the new models.
2. Run unit tests on the CDK.
3. Take your favorite connector, point its `pyproject.toml` at the local CDK (see the example in `source-s3`), and try running its tests and its regression tests.

## User Impact

> [!warning]
> This is a major CDK change due to the pydantic dependency change - if connectors use pydantic 1.10, they will break and will need to do similar `from pydantic.v1` updates to get running again. Therefore, we should release this as a major CDK version bump.

## Can this PR be safely reverted and rolled back?
- [x] YES 💚
- [ ] NO 

Even if sources migrate to this version, state format should not change, so a revert should be possible.

## Follow up work - Ella to move into issues

<details>

### Source-s3 - turn this into an issue
- [ ] Update source s3 CDK version and any required code changes
- [ ] Fix source-s3 unit tests
- [ ] Run source-s3 regression tests
- [ ] Merge and release source-s3 by June 21st

### Docs
- [ ] Update documentation on how to build with CDK 

### CDK pieces
- [ ] Update file-based CDK format validation to use Pydantic V2
  - This is doable, and requires a breaking change to change `OneOfOptionConfig`. There are a few unhandled test cases that present issues we're unsure of how to handle so far.
- [ ] Update low-code component generators to use Pydantic V2
  - This is doable, there are a few issues around custom component generation that are unhandled.

### Further CDK performance work - create issues for these
- [ ] Research if we can replace prints with buffered output (write to byte buffer and then flush to stdout)
- [ ] Replace `json` with `orjson`
...

</details>
2024-06-21 01:53:44 +02:00
Gergely Imreh
d55995deb5 [cdk]: correctly raise unsupported logical type errors when parsing avro (#36888)
Co-authored-by: Natik Gadzhi <natik@respawn.io>
2024-06-06 04:17:02 +00:00
Bindi Pankhudi
700b1708d7 Fix: Vector-db-based CDK - Updated unstructured file type and removed experimental from file type (#38722) 2024-05-30 10:34:20 -07:00
Brian Lai
040f1415e5 [low-code CDK] Resumable full refresh support for low-code streams (#38300) 2024-05-22 16:23:31 -04:00
Anton Karpets
50f4965324 File-based CDK: avoid error on empty stream when running discover (#38230) 2024-05-21 15:10:34 +03:00
Tobias Macey
18c9ebc64d [airbyte-cdk] Increase the maximum parseable field size for CSV files (#36320) 2024-05-07 20:08:01 -03:00
Anton Karpets
8ec438acf0 File-based CDK: allow to merge schemas with nullable object values (#37773) 2024-05-02 17:43:14 +03:00
Anton Karpets
2cfa6ea2c8 File-based CDK: fix schemas merge for nullable object types (#37619) 2024-05-02 10:40:20 +03:00
Ella Rohm-Ensing
b7819d9f6c python: assert actual == expected ordering (#36980) 2024-04-11 15:16:33 +00:00
Anatolii Yatsuk
157be91cb1 File-based CDK: Add skip_wrong_number_of_fields_error parameter for CSV parser (#36237)
Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
2024-03-20 22:49:49 +02:00
Tobias Macey
f67938993e [airbyte-cdk] Fix tab delimiter configuration in CSV file type (#35901) 2024-03-13 13:46:32 -03:00
Ella Rohm-Ensing
2ac5248387 Emit record counts in state messages for concurrent streams (#35907)
Co-authored-by: brianjlai <brian.lai@airbyte.io>
Co-authored-by: Brian Lai <51336873+brianjlai@users.noreply.github.com>
2024-03-08 19:08:59 -05:00
Ella Rohm-Ensing
a4dca3b45b CDK: assert >0 state messages per read (fix tests) (#35906)
<!--
Thanks for your contribution! 
Before you submit the pull request, 
I'd like to kindly remind you to take a moment and read through our guidelines
to ensure that your contribution aligns with the type of contributions our project accepts.
All the information you need can be found here:
   https://docs.airbyte.com/contributing-to-airbyte/

We truly appreciate your interest in contributing to Airbyte,
and we're excited to see what you have to offer! 

If you have any questions or need any assistance, feel free to reach out in #contributions Slack channel.
-->

## What
* After https://github.com/airbytehq/airbyte/pull/35905, we should be emitting a state message with every successful sync. However, a few tests were too lenient and did not actually exercise _successful_ syncs. This PR fixes those cases and adds validation that we emit at least one state message per successful sync.

## How
* Add an assertion that we get at least 1 state message for a successful sync 
* Fix some tests that previously "output 0 expected records" but actually errored silently - do not run them as read tests
* Fix a test that failed silently due to lack of support for multi-format
* Add a new test for syncs that output 0 records successfully
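
The assertion described in the list above could look roughly like the sketch below. It is illustrative only and is not the CDK's test code: the helper names `count_state_messages` and `check_successful_sync` are hypothetical, and messages are modelled as plain dictionaries whose `type` field follows the Airbyte protocol message types (`RECORD`, `STATE`, `LOG`).

```python
from typing import Any, Iterable, Mapping


def count_state_messages(messages: Iterable[Mapping[str, Any]]) -> int:
    """Count messages whose `type` field is "STATE" (hypothetical helper)."""
    return sum(1 for message in messages if message.get("type") == "STATE")


def check_successful_sync(messages: Iterable[Mapping[str, Any]]) -> int:
    """Raise AssertionError unless at least one state message was emitted."""
    state_count = count_state_messages(messages)
    if state_count < 1:
        raise AssertionError("expected at least 1 state message for a successful sync")
    return state_count


# A sync that outputs 0 records successfully still ends with a state message.
empty_but_successful = [
    {"type": "LOG", "log": {"level": "INFO", "message": "no new files"}},
    {"type": "STATE", "state": {"type": "STREAM"}},
]
print(check_successful_sync(empty_but_successful))
```

A read that produced no state message at all would fail this check, which is how the tests that errored silently were surfaced.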

## 🚨 User Impact 🚨
None - test changes


## Pre-merge Actions
*Expand the relevant checklist and delete the others.*

<details><summary><strong>New Connector</strong></summary>

### Community member or Airbyter

- **Community member?** Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For Java connectors run `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`.
- Connector version is set to `0.0.1`
    - `Dockerfile` has version `0.0.1`
- Documentation updated
    - Connector's `README.md`
    - Connector's `bootstrap.md`. See [description and examples](https://docs.google.com/document/d/1ypdgmwmEHWv-TrO4_YOQ7pAJGVrMp5BOkEVh831N260/edit?usp=sharing)
    - `docs/integrations/<source or destination>/<name>.md` including changelog with an entry for the initial version. See changelog [example](https://docs.airbyte.io/integrations/sources/stripe#changelog)
    - `docs/integrations/README.md`

### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Updating a connector</strong></summary>

### Community member or Airbyter

- Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added


### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Connector Generator</strong></summary>

- Issue acceptance criteria met
- PR name follows [PR naming conventions](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook)
- If adding a new generator, add it to the [list of scaffold modules being tested](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/generator/build.gradle#L41)
- The generator test modules (all connectors with `-scaffold` in their name) have been updated with the latest scaffold by running `./gradlew :airbyte-integrations:connector-templates:generator:generateScaffolds` then checking in your changes
- Documentation which references the generator is updated as needed

</details>

<details><summary><strong>Updating the Python CDK</strong></summary>

### Airbyter

Before merging:
- Pull Request description explains what problem it is solving
- Code change is unit tested
- Build and mypy check pass
- Smoke test the change on at least one affected connector
   - On GitHub: Run [this workflow](https://github.com/airbytehq/airbyte/actions/workflows/connectors_tests.yml), passing `--use-local-cdk --name=source-<connector>` as options
   - Locally: `airbyte-ci connectors --use-local-cdk --name=source-<connector> test`
- PR is reviewed and approved
      
After merging:
- [Publish the CDK](https://github.com/airbytehq/airbyte/actions/workflows/publish-cdk-command-manually.yml)
   - The CDK does not follow proper semantic versioning. Choose minor if the change has significant user impact or is a breaking change. Choose patch otherwise.
   - Write a thoughtful changelog message so we know what was updated.
- Merge the platform PR that was auto-created for updating the Connector Builder's CDK version
   - This step is optional if the change does not affect the connector builder or declarative connectors.

</details>
2024-03-08 14:21:46 -08:00
Ella Rohm-Ensing
acbdc2d6e1 Introduce FinalStateCursor to emit state messages at the end of full refresh syncs (#35905)
Co-authored-by: brianjlai <brian.lai@airbyte.io>
2024-03-08 16:58:26 -05:00
Ella Rohm-Ensing
a090088594 file cdk: handle scalar values that resolve to None (#35688)

## What
* Closes https://github.com/airbytehq/airbyte/issues/34151
* Closes https://github.com/airbytehq/oncall/issues/4386

## How
Handle cases where the Python value of a pyarrow scalar is None. This can be due to null values in the data, as well as null-like values such as `NaT` (similar to `NaN`). We previously handled this for `None` binary types, but now handle this for `None` of any type.
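
A minimal sketch of the idea follows. It is not the CDK's parquet parser: `FakeScalar` is a stand-in that mimics only the `as_py()` method of a pyarrow scalar so the example runs without pyarrow installed, and `scalar_to_python` and its bytes handling are hypothetical.

```python
from typing import Any


class FakeScalar:
    """Stand-in for a pyarrow scalar; it only mimics `as_py()`."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def as_py(self) -> Any:
        return self._value


def scalar_to_python(scalar: Any) -> Any:
    """Convert a scalar to a Python value, checking for None before any type-specific step."""
    value = scalar.as_py()
    # The None check comes first and applies to every type, not only binary.
    if value is None:
        return None
    # Hypothetical type-specific conversion, shown only to mark where it would go.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


for scalar in (FakeScalar(None), FakeScalar(b"abc"), FakeScalar(3)):
    print(repr(scalar_to_python(scalar)))
```

Placing the None check ahead of the per-type branches means a null timestamp, string, or number takes the same path as a null binary value.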

## 🚨 User Impact 🚨
No breaking changes. After this CDK version is released, we should update the CDK dependency in S3 and any other file sources that parse Parquet.


2024-03-05 09:07:02 -08:00
Brian Lai
ef98194673 Emit final state message for full refresh syncs and consolidate read flows (#35622) 2024-03-05 01:05:06 -05:00
Danny Tiesling
e671aa320d 🐛 Source S3: fix exception when setting CSV stream delimiter to \t. (#35246)
Co-authored-by: Marcos Marx <marcosmarxm@users.noreply.github.com>
Co-authored-by: marcosmarxm <marcosmarxm@gmail.com>
2024-02-23 14:34:29 -03:00
Artem Inzhyyants
0954ad3d3a Airbyte CDK: add interpolation for request options (#35485)
Signed-off-by: Artem Inzhyyants <artem.inzhyyants@gmail.com>
Co-authored-by: Alexandre Girard <alexandre@airbyte.io>
2024-02-22 19:40:44 +01:00
Catherine Noll
e8910e427a File-based CDK: make incremental syncs concurrent (#34540) 2024-02-07 20:41:04 -05:00
Catherine Noll
7f97f245bc CDK: fix flaky scenario-based tests by sorting on k & v (#34912) 2024-02-06 18:55:39 -05:00
Maxime Carbonneau-Leclerc
ca8590e2b4 Have StateBuilder return our actual state object and not simply a dict (#34625) 2024-01-30 08:46:03 -05:00
Catherine Noll
eb31e4d2ba File-based CDK: make full refresh concurrent (#34411) 2024-01-29 19:33:50 -05:00
Baz
cf7f700bbb 🎉 Airbyte CDK (File-based CDK): Stop the sync if the record could not be parsed (#32589) 2024-01-11 21:26:23 +02:00
Joe Reuter
9065181e77 Unstructured parser: Support txt (#32929)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-15 11:31:45 +01:00
Joe Reuter
c1e428f35c File CDK: Handle 422 errors separately (#33300) 2023-12-13 11:03:36 +00:00
Maxime Carbonneau-Leclerc
0c2d43fdf9 Issue 32871/extract trace message creation (#33227) 2023-12-11 09:20:45 -05:00
Augustin
0b33caecda Revert "[skip ci] formatting: add missing license headers (#33250)" (#33289) 2023-12-11 11:38:37 +01:00
Augustin
60c1cc01ad [skip ci] formatting: add missing license headers (#33250) 2023-12-11 10:15:18 +01:00
Joe Reuter
aa220fc515 Stop sync on traced exception (#33246)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 18:07:25 +01:00
Joe Reuter
f5ac5cfd80 File CDK: Add file processing via API to document file type parser (#32781)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 15:48:37 +01:00
Joe Reuter
7fd92e2a03 File CDK: Parser defined primary key (#33009)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 15:15:33 +01:00
Joe Reuter
5b682ef74f Unstructured parser: Handle parsing errors better (#32700)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 11:47:05 +01:00
Catherine Noll
7ed47ee7d9 File-based CDK: hide the primary key field from config (#33172) 2023-12-06 11:12:50 -05:00
Maxime Carbonneau-Leclerc
ba83309bb1 [ISSUE #32870] Adding entrypoint wrapper and migrating file based and… (#33103) 2023-12-06 08:46:38 -05:00
Joe Reuter
f8b0b3e99e File CDK: Improve stream config appearance (#32420) 2023-11-14 11:49:19 +01:00
Joe Reuter
f1a11e1927 File CDK: Allow skipping unparseable file types (#32092)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-11-09 16:48:24 +01:00
Joe Reuter
e113ff66c5 CDK: Make consts required in Pydantic generated json schemas (#32251) 2023-11-09 16:12:11 +01:00
Joe Reuter
66dd29f764 File CDK unstructured parser: Improve file type detection (#31997) 2023-11-02 12:19:27 +01:00
Martin Hwasser
bc4b7198a9 Add pptx support in file based cdk (#31912)
Co-authored-by: Joe Reuter <joe@airbyte.io>
2023-10-30 14:42:39 +01:00
Joe Reuter
e3793c1491 Move over unstructured parser (#31390)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-26 17:50:57 +02:00
Anatolii Yatsuk
c719137df3 🐛 Airbyte CDK: Fix flake errors in file-based CDK (#31771) 2023-10-24 16:15:11 +03:00
Anatolii Yatsuk
ce2342dde8 🎉 Airbyte CDK: Add CustomFileBasedException for custom errors in file-based CDK (#31704) 2023-10-24 11:09:50 +00:00