mirror of synced 2025-12-30 03:02:21 -05:00
24 Commits

Author SHA1 Message Date
Cole Snodgrass
2e099acc52 update headers from 2022 -> 2023 (#22594)
* It's 2023!

* 2022 -> 2023

---------

Co-authored-by: evantahler <evan@airbyte.io>
2023-02-08 13:01:16 -08:00
Davin Chia
4bfaebdd33 Performance: Cache instead of recreating Json Validators. (#22060)
Today we recreate JSON validators each time we perform json validation.

Json validation is run twice for each record:

- Validation that the record conforms to the general Airbyte Protocol schema.
- Validation that the record conforms to that Stream's schema.

Looking at the code, we create at least 3 objects that are discarded for every single record that passes through the platform. This is both CPU and Garbage Collection inefficient. In particular, creating the validator object is expensive since it parses the entire json schema each time.

This CPU/GC inefficiency is true for all code that uses the current Json validator class, which can include Sources and Destinations.

- Instead of recreating the schema validators each time, initialise the validators and reuse them.
- The JsonSchemaValidator class should be rewritten to clean up its methods and to cache validators by default. I'm skipping this for now to keep the PR small. I'll revisit in a follow-up PR.
- Improved the ReplicationWorkerPerformanceTest so the source messages are emitted through the various stream factories for a higher fidelity test.
2023-01-30 15:37:54 -08:00
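
The caching idea described in the commit above can be illustrated with a minimal sketch. This is not the actual JsonSchemaValidator code: the class name, the `compile` stand-in, and the toy validation rule are all hypothetical, and a real implementation would build a validator from a JSON schema library instead. It only shows the pattern of building a validator once per schema and reusing it for every record.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical sketch: keeps one compiled validator per schema instead of rebuilding it per record. */
public class CachingValidatorSketch {

  // Counts how often the expensive "compile" step runs, so the effect of caching is observable.
  static final AtomicInteger COMPILE_COUNT = new AtomicInteger();

  // Stand-in for an expensive schema parse (assumption: a real validator library would be called here).
  static Predicate<String> compile(final String requiredField) {
    COMPILE_COUNT.incrementAndGet();
    // Toy rule: a record is "valid" if it mentions the required field name in quotes.
    return record -> record.contains("\"" + requiredField + "\"");
  }

  // Validators keyed by schema; computeIfAbsent compiles each schema at most once.
  private final Map<String, Predicate<String>> cache = new ConcurrentHashMap<>();

  public boolean validate(final String schema, final String record) {
    return cache.computeIfAbsent(schema, CachingValidatorSketch::compile).test(record);
  }

  public static void main(final String[] args) {
    final CachingValidatorSketch validator = new CachingValidatorSketch();
    for (int i = 0; i < 1000; i++) {
      validator.validate("id", "{\"id\": " + i + "}");
    }
    System.out.println("compile calls: " + COMPILE_COUNT.get());
  }
}
```

Keying the cache on the schema means the per-record cost is a map lookup plus the validation itself, which is the saving the commit message describes.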
Jimmy Ma
6660b13ad2 Add Airbyte Protocol V1 support. (#20036)
* Add Airbyte Protocol V1 support.

* Fix VersionedAirbyteStreamFactoryTest

* Remove AirbyteMessageMigrationV0 example

* Add Protocol Version constants

* 🎉Updated normalization to handle new datatypes (#19721)

* Updated normalization simple stream processing to handle new datatypes

* Updated normalization nested stream processing to handle new datatypes

* Updated normalization nested stream processing to handle new datatypes

* Updated normalization drop_scd_catalog processing to handle new datatypes

* Updated normalization ephemeral test processing to handle new datatypes

* fixed more tests for normalization

* fixed more tests for normalization

* fixed more tests for normalization

* fixed more tests for normalization

* fixed more issues

* fixed more issues (clickhouse)

* fixed more issues

* fixed more issues

* fixed more issues

* added binary type processing for some DBs

* cleared commented code and moved some hardcodes to processing as macro

* fixed codestyle and cleared commented code

* minor refactor

* minor refactor

* minor refactor

* fixed bool cast error

* fixed dict->str cast error

* fixed is_combining_node cast py check

* removed commented code

* removed commented code

* committed autogenerated normalization_test_output files

* committed autogenerated normalization_test_output files (new files)

* refactored utils.py

* Updated utils.py to use Callable functions and get rid of property_type in is_number and is_bool functions

* committed autogenerated normalization_test_output files (new files)

* fixed typo in TIMESTAMP_WITH_TIMEZONE_TYPE

* updated stream_processor to handle string type first as a wider type

* fixed arrays normalization by updating is_simple_property method as per new approaches

* format

Co-authored-by: Edward Gao <edward.gao@airbyte.io>

* Update airbyte protocol migration (#20745)

* Extract MigrationContainer from AirbyteMessageMigrator

* Add ConfiguredAirbyteCatalogMigrations

* Add ConfiguredAirbyteCatalog to AirbyteMessageMigrations

* Enable ConfiguredAirbyteCatalog migration

* Fix tests

* Remove extra this.

* Add missing docs

* Typo

Co-authored-by: Edward Gao <edward.gao@airbyte.io>

* Data types update: Implement protocol message migrations (#19240)

* Extract MigrationContainer from AirbyteMessageMigrator

* Add ConfiguredAirbyteCatalogMigrations

* Add ConfiguredAirbyteCatalog to AirbyteMessageMigrations

* Enable ConfiguredAirbyteCatalog migration

* set up scaffolding

* [wip] more scaffolding, basic unit test

* minimal green code

* [wip] add failing test for other primitive types

* correct version number

* handle basic primitive type decls

* add implicit cases

* add recursive schema

* formatting

* comment

* support not

* fix indentation

* handle all nested schema cases

* handle boolean schemas

* verify empty schema handling

* cleanup

* extract map

* code organization

* extract method

* reformat

* [wip] more tests, minor fix type array handling

* corrected test

* cleanup

* reformat

* switch to v1

* add support for multityped fields

* missed test case

* nested test class

* basic record upgrade

* implement record upgrades

* slight refactor

* comments+clarifications

* extract constants

* (partly) correct model classes

* add de/ser

* formatting

* extract constants

* fix json reference

* update docs

* switch to v1 models

* fix compile+test

* add base64 handling

* use vnull

* Data types update: Implement protocol message downgrade path (#19909)

* rough skeleton for passing catalog into migration

* basic test

* more scaffolding

* basic implementation

* add primitives test

* add in other tests (nested fields currently failing)

* add formats

* implement oneOf handling

* formatting

* oneOf handling

* better tests

* comments + organization

* progress

* basic test case

* downgrade objects, ish

* basic array implementation

* handle numeric failure

* test for new type

* handle array items

* empty schema handling

* first pass at oneof handling

* add more tests+handling

* more tests

* comments

* add empty oneof test case

* format + reorganize

* more reorganize

* fix name

* also downgrade binary data

* only import vnull

* move migrations into v1 package

* extract schema mutation code

* comment

* extract schema migration to new class

* extract record downgrade logic for future use

* format

* fix build after rebase

* rename private method for consistency

* also implement configuredcatalog migrations >.>

* quick and dirty tests

* slight cleanup

* fix tests

* pmd

* pmd test

* null check on message objects

* maybe fix acceptance tests?

* fix name

* extract constants

* more fixes

* tmp

* meh

* fix cdc acc tests

* revert to master source-postgres

* remove log messages

* revert other misc hacks

* integers are valid cursors

* remove unrelated change

* fix build

* fix build more?

* [MUST REVERT] use dev normalization

* capture kube logs

* also here?

* no debug logs?

* delete dup from merging

* add final everywhere

* revert test changes

Co-authored-by: Jimmy Ma <jimmy@airbyte.io>

* On-the-fly migrations of persisted catalogs (#21757)

* On the fly catalog migration for normalization activity

* On the fly catalog migration for job persistence

* On the fly migration for standard sync persistence

* On the fly migration for airbyte catalogs

* Refactor code to share JsonSchema traversal

* Add V0 Data type search function

* PMD and Format

* Fix getOrInsertActorCatalog and ConfigRepositoryE2E tests

* Null-proofing CatalogMigrationV1Helper

* More null checks

* Fix test

* Format

* Add data type v1 support to the FE

* Changes AC test check to check exited ps (#21672)

some docker compose changes no longer show exited
processes. this broke our test

this change should fix master

tested in a runner that failed

* Move wellknown types mapping to the utility function

* use protocolv1 normalization

---------

Co-authored-by: Topher Lubaway <asimplechris@gmail.com>
Co-authored-by: Edward Gao <edward.gao@airbyte.io>

* Update protocol support range (#21996)

* bump normalization version to 0.3.0

* Add version check on normalization (#22048)

* Add normalization min version check

* Add visible for testing

---------

Co-authored-by: Edward Gao <edward.gao@airbyte.io>
Co-authored-by: Eugene <etsybaev@gmail.com>
Co-authored-by: Topher Lubaway <asimplechris@gmail.com>
2023-01-30 10:17:49 -08:00
Davin Chia
d95c06d357 Remove unused imports. (#20938) 2022-12-30 14:39:51 -08:00
Davin Chia
18593d91b5 Remove sneaky throws. (#20931)
The Java 19 toolchain doesn't like sneaky throws. Not entirely sure why. However, I think it's better practice to not use sneaky throws as it makes it clearer what is thrown and where.

Example error message when trying to compile the current codebase with Java 19:

error: Error during the transformation of 'io.airbyte.validation.json.JsonSchemaValidatorTest'; post-compiler 'lombok.bytecode.SneakyThrowsRemover' caused an exception: java.lang.IllegalArgumentException: Unsupported class file major version 63
        at org.objectweb.asm.ClassReader.<init>(ClassReader.java:199)
        at org.objectweb.asm.ClassReader.<init>(ClassReader.java:180)
        at org.objectweb.asm.ClassReader.<init>(ClassReader.java:166)
        at lombok.bytecode.AsmUtil.fixJSRInlining(AsmUtil.java:37)
        at lombok.bytecode.SneakyThrowsRemover.applyTransformations(SneakyThrowsRemover.java:46)
        at lombok.core.PostCompiler.applyTransformations(PostCompiler.java:44)
        at lombok.core.PostCompiler$1.close(PostCompiler.java:87)
        at jdk.compiler/com.sun.tools.javac.jvm.ClassWriter.writeClass(ClassWriter.java:1508)
        at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.genCode(JavaCompiler.java:738)
2022-12-30 09:04:26 -08:00
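
As a rough illustration of the alternative to sneaky throws mentioned in the commit above, the sketch below declares the checked exception in the method signature and wraps it explicitly where a caller cannot propagate it. The class and method names are hypothetical and are not taken from the Airbyte codebase.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: explicit exception handling in place of Lombok's @SneakyThrows. */
public class ExplicitThrowsSketch {

  // With @SneakyThrows this method could hide the IOException entirely.
  // Declaring it makes the failure mode part of the signature.
  static String readSchema(final Path path) throws IOException {
    return Files.readString(path, StandardCharsets.UTF_8);
  }

  // A caller that cannot propagate a checked exception wraps it explicitly,
  // so the place where the exception changes type is visible in the code.
  static String readSchemaUnchecked(final Path path) {
    try {
      return readSchema(path);
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```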
Edward Gao
2392acb845 Enable record schema validation using v1 type system; CI uses MSG to start EC2 runners (#20439)
* Revert "Revert "RecordSchemaValidator can resolve $ref schemas (#19625)" (#20113)"

This reverts commit 86f61a53d3.

* just hardcode build?

* sshable instance

* pass arg for release oss only

* also skip octavia + create PR

* update ec2 runner

* revert CI test changes

* whoops

* whoopswhoops
2022-12-14 16:06:21 -08:00
Jimmy Ma
86f61a53d3 Revert "RecordSchemaValidator can resolve $ref schemas (#19625)" (#20113)
This reverts commit 56bfdaab6a.
2022-12-06 11:19:10 -08:00
Edward Gao
56bfdaab6a RecordSchemaValidator can resolve $ref schemas (#19625)
* put WellKnownTypes.json in worker

* misc random experimentation

* finalize

* add test

* copypasta error

* formatting

* pmd

* other pmd >.>

* generate in gradle

* better interface + comment

* formatting

* better dockerfile caching
2022-11-30 07:59:10 -08:00
Evan Tahler
c56ee8d6b3 bump com.networknt:json-schema-validator to latest version (#16619) 2022-09-14 15:41:53 -07:00
Anne
e9afa9bef3 Error Prone PMD rules (#15010)
* Implement ErrorProne PMD rules:
AssignmentInOperand
AvoidAccessibilityAlteration
AvoidBranchingStatementAsLastInLoop
AvoidCatchingNPE
AvoidCatchingThrowable
AvoidDuplicateLiterals rule
2022-08-09 15:30:48 -07:00
Davin Chia
7788594e22 Start publishing proper artifacts. (#13484)
## What
Finale of https://github.com/airbytehq/airbyte/pull/13122.

We've renamed all directories in previous PRs. Here we remove the fat jar configuration and add publishing to all subprojects.

Explanation for what is happening:

Identically named subprojects have the following issues:
* publishing as is leads to classpath confusion when the jars with the same names are placed in the Java distribution. This leads to NoClassDefFound errors at runtime.
* deconflicting the jar names without changing directory names leads to dependency errors as the OSS jar pom files are generated using project dependencies (suggesting a dependency on a sibling subproject in the same repo) that use the subproject's group and name as a reference. This means the generated jars look for jars that do not exist (as their names have been changed) and cannot compile.
* the workaround to changing a subproject's name involves resetting the subproject's name in the settings.gradle and depending on the new name in each build.gradle. This increases configuration burden and decreases the ease of reading, since one will have to check the settings.gradle to know what the right subproject name is. See https://github.com/gradle/gradle/issues/847 for more info.
* given that Gradle itself doesn't have support for identically named subprojects (see the linked issue), the simplest solution is to not allow duplicated directories. I've only renamed conflicting directories here to keep things simple. I will create a follow-up issue to enforce non-identical subproject names in our builds.

## How
* Remove fat jar configuration.
* Add publishing to all subprojects.
2022-06-06 17:15:25 +08:00
Anne
5f0f106cb6 Limit the number of record schema validations performed (#13351)
* Better error messages for record schema validations, and validate a maximum of 10 records per stream
2022-05-31 16:58:09 -07:00
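
The per-stream cap described in the commit above could look roughly like the following. This is a sketch under assumptions, not the code from the PR: the class name, method name, and constant are hypothetical, and only the limit of 10 records per stream comes from the commit message.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: allows schema validation for at most 10 records per stream. */
public class ValidationLimiterSketch {

  // The cap of 10 is taken from the commit message; the constant name is an assumption.
  static final int MAX_VALIDATIONS_PER_STREAM = 10;

  // Number of validations already granted, tracked separately for each stream name.
  private final Map<String, AtomicInteger> validatedCounts = new ConcurrentHashMap<>();

  /** Returns true while the stream is under the cap and counts this call toward it. */
  public boolean shouldValidate(final String streamName) {
    final AtomicInteger count = validatedCounts.computeIfAbsent(streamName, k -> new AtomicInteger());
    // getAndUpdate returns the previous value; the counter stops growing once it reaches the cap.
    return count.getAndUpdate(c -> Math.min(c + 1, MAX_VALIDATIONS_PER_STREAM)) < MAX_VALIDATIONS_PER_STREAM;
  }
}
```

A caller would check `shouldValidate(streamName)` before running the validator on a record and skip validation once it returns false.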
Alexandre Girard
3894134d11 Bump year in license short to 2022 (#13191)
* Bump to 2022

* format
2022-05-25 17:56:49 -07:00
Charles
c1c8675366 Add readmes to all modules (#8893) 2022-03-13 14:45:36 -07:00
Jenny Brown
c77dd7ad66 Improved error handling (#7571)
* Improved error handling

* Comments
2021-11-03 12:17:52 -05:00
lmossman
b94ee00fd8 Revert "Generate seed connector specs on build (#7501)"
This reverts commit a534bb2a8f.
2021-11-03 08:46:43 -07:00
Lake Mossman
a534bb2a8f Generate seed connector specs on build (#7501)
* add specs module with logic to fetch specs on build

* format + build and add gradle dependency for new script

* check seed file for existing specs + refactor

* add tests + a bit more refactoring

* run gw format

* update yaml config persistence to merge specs into definitions

* add comment

* delete secrets migration to be consistent with master

* add dep

* add tests for GcsBucketSpecFetcher

* get rid of static block + format

* DRY up parse call

* add GCS details to comment

* formatting + fix test

* update comment

* do not format seed specs files

* change signature of run to allow cloud to reuse this script

* run gw format

* revert commits that change signature of run

* fix comment typo

Co-authored-by: Davin Chia <davinchia@gmail.com>

* rename enum to be distinct from the enum in cloud

* add missing dependencies between modules

* add readme for seed connector spec generator

* reword

* reference readme in comment

* ignore 'spec' field in newFields logic

Co-authored-by: Davin Chia <davinchia@gmail.com>
2021-11-02 22:03:50 -07:00
Charles
ba44f700b9 add final for params, local variables, and fields (#7084) 2021-10-15 16:41:04 -07:00
Michel Tricot
f25542a145 🎉 Update license for Core (#6479) 2021-09-27 11:17:17 -07:00
Michel Tricot
1773e41e47 Shorten our headers + adds contributors file (#6478) 2021-09-27 10:45:50 -07:00
Charles
23d27bcf34 split replication and normalization into separate temporal activities (#3136) 2021-04-30 15:55:39 -07:00
Yury Koleda
b1061e32d9 🎉 Add MongoDB Source
Signed-off-by: fut <fut.wrk@gmail.com>
2021-03-08 14:27:14 -08:00
Charles
e7edb2c858 Adding incremental to the catalog data model (#998)
* Add ConfiguredAirbyteCatalog and ConfiguredAirbyteStream
2020-11-18 14:15:59 -08:00
Christophe Duong
0fac6a99b0 Move JsonSchemaValidator into its own module airbyte-json-validation (#234) (#647) 2020-10-20 22:45:31 +02:00