Augustin
0b33caecda
Revert "[skip ci] formatting: add missing license headers ( #33250 )" ( #33289 )
2023-12-11 11:38:37 +01:00
Augustin
60c1cc01ad
[skip ci] formatting: add missing license headers ( #33250 )
2023-12-11 10:15:18 +01:00
Joe Reuter
aa220fc515
Stop sync on traced exception ( #33246 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 18:07:25 +01:00
Joe Reuter
f5ac5cfd80
File CDK: Add file processing via API to document file type parser ( #32781 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 15:48:37 +01:00
Joe Reuter
7fd92e2a03
File CDK: Parser defined primary key ( #33009 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 15:15:33 +01:00
Joe Reuter
5b682ef74f
Unstructured parser: Handle parsing errors better ( #32700 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-12-08 11:47:05 +01:00
Catherine Noll
7ed47ee7d9
File-based CDK: hide the primary key field from config ( #33172 )
2023-12-06 11:12:50 -05:00
Maxime Carbonneau-Leclerc
ba83309bb1
[ISSUE #32870 ] Adding entrypoint wrapper and migrating file based and… ( #33103 )
2023-12-06 08:46:38 -05:00
Joe Reuter
f8b0b3e99e
File CDK: Improve stream config appearance ( #32420 )
2023-11-14 11:49:19 +01:00
Joe Reuter
f1a11e1927
File CDK: Allow skipping unparseable file types ( #32092 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-11-09 16:48:24 +01:00
Joe Reuter
e113ff66c5
CDK: Make consts required in Pydantic generated json schemas ( #32251 )
2023-11-09 16:12:11 +01:00
Joe Reuter
66dd29f764
File CDK unstructured parser: Improve file type detection ( #31997 )
2023-11-02 12:19:27 +01:00
Martin Hwasser
bc4b7198a9
✨ Add pptx support in file based cdk ( #31912 )
Co-authored-by: Joe Reuter <joe@airbyte.io>
2023-10-30 14:42:39 +01:00
Joe Reuter
e3793c1491
Move over unstructured parser ( #31390 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-26 17:50:57 +02:00
Anatolii Yatsuk
c719137df3
🐛 Airbyte CDK: Fix flake errors in file-based CDK ( #31771 )
2023-10-24 16:15:11 +03:00
Anatolii Yatsuk
ce2342dde8
🎉 Airbyte CDK: Add CustomFileBasedException for custom errors in file-based CDK ( #31704 )
2023-10-24 11:09:50 +00:00
Alexandre Girard
7da2822488
Concurrent CDK: catch exceptions from worker thread and add integration test scenarios ( #31245 )
Co-authored-by: girarda <girarda@users.noreply.github.com>
2023-10-23 08:39:58 -07:00
Joe Reuter
d474827068
File CDK: Don't fetch full file list for availability check ( #31651 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-23 16:14:41 +02:00
Joe Reuter
bb07939646
File CDK: Add analytics messages for parser usage ( #31498 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-19 15:42:51 +02:00
Alexandre Girard
ef9bd72a7e
Parameterize ScenarioBuilder on Source type ( #31244 )
Co-authored-by: girarda <girarda@users.noreply.github.com>
Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
Co-authored-by: Maxime Carbonneau-Leclerc <maxi297@users.noreply.github.com>
2023-10-16 17:12:18 -07:00
Joe Reuter
e35a1f2cd9
File CDK: Allow configuration of parsed records during check and discover from parser ( #31281 )
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
2023-10-13 09:50:22 +02:00
Roman Yermilov [GL]
e561d5d432
Airbyte CDK: fix none type binary error in parquet parser ( #31073 )
2023-10-05 15:56:02 +04:00
Anton Karpets
767800d2d7
🐛 Airbyte CDK: fix parsing of UUID fields in avro files ( #31096 )
2023-10-05 10:53:18 +03:00
Marius Posta
7ae97175a6
gradle: fix repo wide behaviour ( #30607 )
2023-09-28 05:01:13 -07:00
Maxime Carbonneau-Leclerc
b6836ad950
[ISSUE #30353 ] remove file_type from stream config ( #30453 )
2023-09-18 08:50:00 -04:00
Maxime Carbonneau-Leclerc
48e8816b6b
[oncall #2838 ] migrate parsing errors as config errors ( #30209 )
2023-09-06 13:38:48 -04:00
Maxime Carbonneau-Leclerc
5b653676aa
Update spec and fix autogenerated headers with skip after ( #30123 )
2023-09-03 09:26:53 -04:00
Maxime Carbonneau-Leclerc
399b4d1fca
File-based CDK: ensure no errors in Sentry given empty CSV ( #29944 )
2023-09-02 09:40:08 -04:00
Maxime Carbonneau-Leclerc
e2fb04f72d
File-based CDK: allow user to provide column names ( #29868 )
2023-08-28 18:00:19 -04:00
Maxime Carbonneau-Leclerc
82a96e0c69
File-based CDK: allow for extension mismatch ( #29835 )
2023-08-25 11:44:49 -04:00
Maxime Carbonneau-Leclerc
40b76a7813
✨ Source S3: v4 rollout/feature parity ( #29753 )
2023-08-23 11:30:08 -04:00
Maxime Carbonneau-Leclerc
b801a3d24f
Do not stop processing file on parsing error ( #29679 )
2023-08-21 15:56:01 -04:00
Maxime Carbonneau-Leclerc
e9d99630ed
Removing validation on skip rows and autogenerated headers ( #29488 )
2023-08-17 16:14:19 -04:00
Catherine Noll
7c1d6081de
File-based CDK: handle legacy path_prefix + globs ( #29389 )
2023-08-15 12:18:25 -04:00
Brian Lai
5908b85e69
[file-based cdk] Remove CSV quoting_behavior config option ( #29388 )
* remove CSV quoting_behavior config option
* cleanup after getting latest master
2023-08-14 20:37:38 -04:00
Alexandre Girard
b512fa4628
file-based CDK: Configurable strings_can_be_null ( #29298 )
* [ISSUE #28893 ] infer csv schema
* [ISSUE #28893 ] align with pyarrow
* Automated Commit - Formatting Changes
* [ISSUE #28893 ] legacy inference and infer only when needed
* [ISSUE #28893 ] fix scenario tests
* [ISSUE #28893 ] using discovered schema as part of read
* [ISSUE #28893 ] self-review + cleanup
* [ISSUE #28893 ] fix test
* [ISSUE #28893 ] code review part #1
* [ISSUE #28893 ] code review part #2
* Fix test
* formatcdk
* first pass
* [ISSUE #28893 ] code review
* fix mypy issues
* comment
* rename for clarity
* Add a scenario test case
* this isn't optional anymore
* FIX test log level
* Re-adding failing tests
* [ISSUE #28893 ] improve inference to consider multiple types per value
* Automated Commit - Formatting Changes
* [ISSUE #28893 ] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES
* Code review
* Automated Commit - Formatting Changes
* fix unit tests
---------
Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
2023-08-14 12:51:27 -07:00
Maxime Carbonneau-Leclerc
12f1304a67
Issue 28893/infer schema csv ( #29099 )
2023-08-14 15:14:46 -04:00
Alexandre Girard
1a120ecd4b
File-CDK (Avro) Set double_as_string to false by default ( #29339 )
* set double_as_string to false by default
* Use default config when irrelevant to the test
* Update description
* Update the description again
2023-08-10 14:31:52 -07:00
Maxime Carbonneau-Leclerc
cfbd0b8219
[ISSUE #26764 ] support brute force multiline json objects for JSONL ( #29331 )
* [ISSUE #26764 ] support brute force multiline json objects for JSONL
* [ISSUE #26764 ] infer_schema to support multiline json objects as well
* [ISSUE #26764 ] code review
2023-08-10 15:54:46 -04:00
Alexandre Girard
0aa86cf156
File-based CDK + Source S3 (v4): Pass configured file encoding to stream reader ( #29110 )
* Add encoding to open_file interface
* pass the encoding set in the config
* cleanup
* cleanup
* Automated Commit - Formatting Changes
* Add missing test
* Automated Commit - Formatting Changes
* Update infer_schema too
* Automated Commit - Formatting Changes
* Update unit test
* add a unit test
* fix
* format
* format
* remove newline
* use a mock
* fix
* format
---------
Co-authored-by: girarda <girarda@users.noreply.github.com>
2023-08-09 09:05:06 -05:00
Brian Lai
b8d5ca77db
🐛 [file based cdk] Fix S3 and abstract spec to be compatible with Airbyte UI and CAT ( #29075 )
* remove version, make validation_policy enum, fix input_schema for s3 and abstract file based configs
* remove multiple file format options from stream config
* pr feedback
* fix tests after rebase
* additional spec changes to work with the UI
* fix tests post-rebase
* fix tests post-rebase and cleanup
* formatting
2023-08-08 18:10:05 -04:00
Alexandre Girard
78b00e088b
Parquet parser return Decimal fields as strings ( #29191 )
* Update the test so it fails if the type is different
* Update to convert values
* Add columns from file partitions
* update
2023-08-08 11:38:16 -07:00
Alexandre Girard
1b6428877d
Avro parser: return Decimal fields as strings ( #29182 )
* update avro parsing
* rename field
* output as iso strings
2023-08-08 11:34:25 -07:00
Brian Lai
01045d674d
Add start_date to all file-based configs ( #28845 )
* add start_date config to abstract spec and apply it in the cursor
* rollback start date cursor changes
* revert back to filtering in the reader and pr feedback
* fix tests post-rebase and pr feedback
2023-08-07 20:43:07 -04:00
Catherine Noll
53d8450ec2
File-based CDK: allow FileBasedSource to take a cursor_cls ( #29027 )
2023-08-04 09:49:03 -04:00
Alexandre Girard
641a65a1e3
Add CSV options to the CSV parser ( #28491 )
* remove invalid legacy option
* remove unused option
* the tests pass but this is quite messy
* very slight clean up
* Add skip options to csv format
* fix some of the typing issues
* fixme comment
* remove extra log message
* fix typing issues
* skip before header
* skip after header
* format
* add another test
* Automated Commit - Formatting Changes
* auto generate column names
* delete dead code
* update title and description
* true and false values
* Update the tests
* Add comment
* missing test
* rename
* update expected spec
* move to method
* Update comment
* fix typo
* remove unused import
* Add a comment
* None records do not pass the WaitForDiscoverPolicy
* format
* remove second branch to ensure we always go through the same processing
* Raise an exception if the record is None
* reset
* Update tests
* handle unquoted newlines
* Automated Commit - Formatting Changes
* Update test case so the quoting is explicit
* Update comment
* Automated Commit - Formatting Changes
* Fail validation if skipping rows before header and header is autogenerated
* always fail if a record cannot be parsed
* format
* set write line_no in error message
* remove none check
* Automated Commit - Formatting Changes
* enable autogenerate test
* remove duplicate test
* missing unit tests
* Update
* remove branching
* remove unused none check
* Update tests
* remove branching
* format
* extract to function
* comment
* missing type
* type annotation
* use set
* Document that the strings are case-sensitive
* public -> private
* add unit test
* newline
---------
Co-authored-by: girarda <girarda@users.noreply.github.com>
2023-08-03 08:59:55 -07:00
Catherine Noll
09ebb47b24
File cdk parser and cursor updates ( #28900 )
* File-based CDK: update parquet parser to handle partitions
* File-based CDK: make the record output & cursor date time format consistent
2023-08-01 21:47:58 -04:00
Catherine Noll
22ff7e0fae
File-based CDK: reorganize FileReadMode to fix circular import ( #28885 )
2023-07-31 17:55:29 -04:00
Catherine Noll
642e7680b4
File-based CDK: add read mode to stream reader interface & parsers ( #28862 )
2023-07-31 16:55:00 -04:00
Catherine Noll
73395a187a
File-based CDK: allow null values for all inferred columns ( #28847 )
2023-07-31 15:10:21 -04:00