mirror of synced 2026-01-29 13:02:00 -05:00
Commit Graph

67 Commits

Author SHA1 Message Date
Augustin
0b33caecda Revert "[skip ci] formatting: add missing license headers (#33250)" (#33289) 2023-12-11 11:38:37 +01:00
Augustin
60c1cc01ad [skip ci] formatting: add missing license headers (#33250) 2023-12-11 10:15:18 +01:00
Joe Reuter
aa220fc515 Stop sync on traced exception (#33246)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 18:07:25 +01:00
Joe Reuter
f5ac5cfd80 File CDK: Add file processing via API to document file type parser (#32781)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 15:48:37 +01:00
Joe Reuter
7fd92e2a03 File CDK: Parser defined primary key (#33009)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 15:15:33 +01:00
Joe Reuter
5b682ef74f Unstructured parser: Handle parsing errors better (#32700)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 11:47:05 +01:00
Catherine Noll
7ed47ee7d9 File-based CDK: hide the primary key field from config (#33172) 2023-12-06 11:12:50 -05:00
Maxime Carbonneau-Leclerc
ba83309bb1 [ISSUE #32870] Adding entrypoint wrapper and migrating file based and… (#33103) 2023-12-06 08:46:38 -05:00
Joe Reuter
f8b0b3e99e File CDK: Improve stream config appearance (#32420) 2023-11-14 11:49:19 +01:00
Joe Reuter
f1a11e1927 File CDK: Allow skipping unparseable file types (#32092)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-11-09 16:48:24 +01:00
Joe Reuter
e113ff66c5 CDK: Make consts required in Pydantic generated json schemas (#32251) 2023-11-09 16:12:11 +01:00
Joe Reuter
66dd29f764 File CDK unstructured parser: Improve file type detection (#31997) 2023-11-02 12:19:27 +01:00
Martin Hwasser
bc4b7198a9 Add pptx support in file based cdk (#31912)
Co-authored-by: Joe Reuter <joe@airbyte.io>
2023-10-30 14:42:39 +01:00
Joe Reuter
e3793c1491 Move over unstructured parser (#31390)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-26 17:50:57 +02:00
Anatolii Yatsuk
c719137df3 🐛 Airbyte CDK: Fix flake errors in file-based CDK (#31771) 2023-10-24 16:15:11 +03:00
Anatolii Yatsuk
ce2342dde8 🎉 Airbyte CDK: Add CustomFileBasedException for custom errors in file-based CDK (#31704) 2023-10-24 11:09:50 +00:00
Alexandre Girard
7da2822488 Concurrent CDK: catch exceptions from worker thread and add integration test scenarios (#31245)
Co-authored-by: girarda <girarda@users.noreply.github.com>
2023-10-23 08:39:58 -07:00
Joe Reuter
d474827068 File CDK: Don't fetch full file list for availability check (#31651)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-23 16:14:41 +02:00
Joe Reuter
bb07939646 File CDK: Add analytics messages for parser usage (#31498)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-19 15:42:51 +02:00
Alexandre Girard
ef9bd72a7e Parameterize ScenarioBuilder on Source type (#31244)
Co-authored-by: girarda <girarda@users.noreply.github.com>
Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
Co-authored-by: Maxime Carbonneau-Leclerc <maxi297@users.noreply.github.com>
2023-10-16 17:12:18 -07:00
Joe Reuter
e35a1f2cd9 File CDK: Allow configuration of parsed records during check and discover from parser (#31281)
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-13 09:50:22 +02:00
Roman Yermilov [GL]
e561d5d432 Airbyte CDK: fix none type binary error in parquet parser (#31073) 2023-10-05 15:56:02 +04:00
Anton Karpets
767800d2d7 🐛Airbyte CDK: fix parsing of UUID fields in avro files (#31096) 2023-10-05 10:53:18 +03:00
Marius Posta
7ae97175a6 gradle: fix repo wide behaviour (#30607) 2023-09-28 05:01:13 -07:00
Maxime Carbonneau-Leclerc
b6836ad950 [ISSUE #30353] remove file_type from stream config (#30453) 2023-09-18 08:50:00 -04:00
Maxime Carbonneau-Leclerc
48e8816b6b [oncall #2838] migrate parsing errors as config errors (#30209) 2023-09-06 13:38:48 -04:00
Maxime Carbonneau-Leclerc
5b653676aa Update spec and fix autogenerated headers with skip after (#30123) 2023-09-03 09:26:53 -04:00
Maxime Carbonneau-Leclerc
399b4d1fca File-based CDK: ensure no errors in Sentry given empty CSV (#29944) 2023-09-02 09:40:08 -04:00
Maxime Carbonneau-Leclerc
e2fb04f72d File-based CDK: allow user to provided column names (#29868) 2023-08-28 18:00:19 -04:00
Maxime Carbonneau-Leclerc
82a96e0c69 File-based CDK: allow for extension mismatch (#29835) 2023-08-25 11:44:49 -04:00
Maxime Carbonneau-Leclerc
40b76a7813 Source S3: v4 rollout/feature parity (#29753) 2023-08-23 11:30:08 -04:00
Maxime Carbonneau-Leclerc
b801a3d24f Do not stop processing file on parsing error (#29679) 2023-08-21 15:56:01 -04:00
Maxime Carbonneau-Leclerc
e9d99630ed Removing validation on skip rows and autogenerated headers (#29488) 2023-08-17 16:14:19 -04:00
Catherine Noll
7c1d6081de File-based CDK: handle legacy path_prefix + globs (#29389) 2023-08-15 12:18:25 -04:00
Brian Lai
5908b85e69 [file-based cdk] Remove CSV quoting_behavior config option (#29388)
* remove CSV quoting_behavior config option

* cleanup after getting latest master
2023-08-14 20:37:38 -04:00
Alexandre Girard
b512fa4628 file-based CDK: Configurable strings_can_be_null (#29298)
* [ISSUE #28893] infer csv schema

* [ISSUE #28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE #28893] legacy inference and infer only when needed

* [ISSUE #28893] fix scenario tests

* [ISSUE #28893] using discovered schema as part of read

* [ISSUE #28893] self-review + cleanup

* [ISSUE #28893] fix test

* [ISSUE #28893] code review part #1

* [ISSUE #28893] code review part #2

* Fix test

* formatcdk

* first pass

* [ISSUE #28893] code review

* fix mypy issues

* comment

* rename for clarity

* Add a scenario test case

* this isn't optional anymore

* FIX test log level

* Re-adding failing tests

* [ISSUE #28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* [ISSUE #28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES

* Code review

* Automated Commit - Formatting Changes

* fix unit tests

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
2023-08-14 12:51:27 -07:00
Maxime Carbonneau-Leclerc
12f1304a67 Issue 28893/infer schema csv (#29099) 2023-08-14 15:14:46 -04:00
Alexandre Girard
1a120ecd4b File-CDK (Avro) Set double_as_string to false by default (#29339)
* set double_as_string to false by default

* Use default config when irrelevant to the test

* Update description

* Update the description again
2023-08-10 14:31:52 -07:00
Maxime Carbonneau-Leclerc
cfbd0b8219 [ISSUE #26764] support brute force multiline json objects for JSONL (#29331)
* [ISSUE #26764] support brute force multiline json objects for JSONL

* [ISSUE #26764] infer_schema to support multiline json objects as well

* [ISSUE #26764] code review
2023-08-10 15:54:46 -04:00
Alexandre Girard
0aa86cf156 File-based CDK + Source S3 (v4): Pass configured file encoding to stream reader (#29110)
* Add encoding to open_file interface

* pass the encoding set in the config

* cleanup

* cleanup

* Automated Commit - Formatting Changes

* Add missing test

* Automated Commit - Formatting Changes

* Update infer_schema too

* Automated Commit - Formatting Changes

* Update unit test

* add a unit test

* fix

* format

* format

* remove newline

* use a mock

* fix

* format

---------

Co-authored-by: girarda <girarda@users.noreply.github.com>
2023-08-09 09:05:06 -05:00
Brian Lai
b8d5ca77db 🐛 [file based cdk] Fix S3 and abstract spec to be compatible with Airbyte UI and CAT (#29075)
* remove version, make validation_policy enum, fix input_schema for s3 and abstract file based configs

* remove multiple file format options from stream config

* pr feedback

* fix tests after rebase

* additional spec changes to work with the UI

* fix tests post-rebase

* fix tests post-rebase and cleanup

* formatting
2023-08-08 18:10:05 -04:00
Alexandre Girard
78b00e088b Parquet parser return Decimal fields as strings (#29191)
* Update the test so it fails if the type is different

* Update to convert values

* Add columns from file partitions

* update
2023-08-08 11:38:16 -07:00
Alexandre Girard
1b6428877d Avro parser: return Decimal fields as strings (#29182)
* update avro parsing

* rename field

* output as iso strings
2023-08-08 11:34:25 -07:00
Brian Lai
01045d674d Add start_date to all file-based configs (#28845)
* add start_date config to abstract spec and apply it in the cursor

* rollback start date cursor changes

* revert back to filtering in the reader and pr feedback

* fix tests post-rebase and pr feedback
2023-08-07 20:43:07 -04:00
Catherine Noll
53d8450ec2 File-based CDK: allow FileBasedSource to take a cursor_cls (#29027) 2023-08-04 09:49:03 -04:00
Alexandre Girard
641a65a1e3 Add CSV options to the CSV parser (#28491)
* remove invalid legacy option

* remove unused option

* the tests pass but this is quite messy

* very slight clean up

* Add skip options to csv format

* fix some of the typing issues

* fixme comment

* remove extra log message

* fix typing issues

* skip before header

* skip after header

* format

* add another test

* Automated Commit - Formatting Changes

* auto generate column names

* delete dead code

* update title and description

* true and false values

* Update the tests

* Add comment

* missing test

* rename

* update expected spec

* move to method

* Update comment

* fix typo

* remove unused import

* Add a comment

* None records do not pass the WaitForDiscoverPolicy

* format

* remove second branch to ensure we always go through the same processing

* Raise an exception if the record is None

* reset

* Update tests

* handle unquoted newlines

* Automated Commit - Formatting Changes

* Update test case so the quoting is explicit

* Update comment

* Automated Commit - Formatting Changes

* Fail validation if skipping rows before header and header is autogenerated

* always fail if a record cannot be parsed

* format

* set write line_no in error message

* remove none check

* Automated Commit - Formatting Changes

* enable autogenerate test

* remove duplicate test

* missing unit tests

* Update

* remove branching

* remove unused none check

* Update tests

* remove branching

* format

* extract to function

* comment

* missing type

* type annotation

* use set

* Document that the strings are case-sensitive

* public -> private

* add unit test

* newline

---------

Co-authored-by: girarda <girarda@users.noreply.github.com>
2023-08-03 08:59:55 -07:00
Catherine Noll
09ebb47b24 File cdk parser and cursor updates (#28900)
* File-based CDK: update parquet parser to handle partitions

* File-based CDK: make the record output & cursor date time format consistent
2023-08-01 21:47:58 -04:00
Catherine Noll
22ff7e0fae File-based CDK: reorganize FileReadMode to fix circular import (#28885) 2023-07-31 17:55:29 -04:00
Catherine Noll
642e7680b4 File-based CDK: add read mode to stream reader interface & parsers (#28862) 2023-07-31 16:55:00 -04:00
Catherine Noll
73395a187a File-based CDK: allow null values for all inferred columns (#28847) 2023-07-31 15:10:21 -04:00