Today we recreate JSON validators every time we perform JSON validation.
JSON validation runs twice for each record:
- Validation that the record conforms to the general Airbyte Protocol schema.
- Validation that the record conforms to that Stream's schema.
Looking at the code, we create at least three objects that are discarded for every single record that passes through the platform. This is inefficient in terms of both CPU and garbage collection. In particular, creating the validator object is expensive, since it parses the entire JSON schema each time.
This CPU/GC inefficiency applies to all code that uses the current JSON validator class, which can include Sources and Destinations.
- Instead of recreating the schema validators each time, initialise the validators once and reuse them (see the sketch after this list).
- The JsonSchemaValidator class should be rewritten to clean up its methods and to cache validators by default. I'm skipping this for now to keep the PR small and will revisit it in a follow-up PR.
- Improved the ReplicationWorkerPerformanceTest so the source messages are emitted through the various stream factories for a higher-fidelity test.
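As a rough illustration of the reuse approach, here is a minimal sketch assuming the networknt json-schema-validator library; the class name and the cache-key choice (the schema's string form) are hypothetical, not the actual JsonSchemaValidator implementation:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import com.fasterxml.jackson.databind.JsonNode;
import com.networknt.schema.JsonSchema;
import com.networknt.schema.JsonSchemaFactory;
import com.networknt.schema.SpecVersion;
import com.networknt.schema.ValidationMessage;

// Hypothetical sketch: caches compiled validators so the expensive schema
// parse happens once per schema instead of once per record.
public class CachingJsonValidator {

  private static final JsonSchemaFactory FACTORY =
      JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7);

  // Keyed by the schema's string form; compiled validators are reused across records.
  private final Map<String, JsonSchema> validators = new ConcurrentHashMap<>();

  public Set<ValidationMessage> validate(final JsonNode schema, final JsonNode record) {
    final JsonSchema validator =
        validators.computeIfAbsent(schema.toString(), ignored -> FACTORY.getSchema(schema));
    return validator.validate(record);
  }
}
```

With something like this, both per-record validations (protocol schema and stream schema) hit an already-compiled validator, so no validator objects are created or garbage-collected on the hot path.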
* Add Airbyte Protocol V1 support.
* Fix VersionedAirbyteStreamFactoryTest
* Remove AirbyteMessageMigrationV0 example
* Add Protocol Version constants
* 🎉 Updated normalization to handle new datatypes (#19721)
* Updated normalization simple stream processing to handle new datatypes
* Updated normalization nested stream processing to handle new datatypes
* Updated normalization nested stream processing to handle new datatypes
* Updated normalization drop_scd_catalog processing to handle new datatypes
* Updated normalization ephemeral test processing to handle new datatypes
* fixed more tests for normalization
* fixed more tests for normalization
* fixed more tests for normalization
* fixed more tests for normalization
* fixed more issues
* fixed more issues (clickhouse)
* fixed more issues
* fixed more issues
* fixed more issues
* added binary type processing for some DBs
* cleared commented code and moved some hardcodes to processing as macro
* fixed codestyle and cleared commented code
* minor refactor
* minor refactor
* minor refactor
* fixed bool cast error
* fixed dict->str cast error
* fixed is_combining_node cast py check
* removed commented code
* removed commented code
* committed autogenerated normalization_test_output files
* committed autogenerated normalization_test_output files (new files)
* refactored utils.py
* Updated utils.py to use Callable functions and get rid of property_type in is_number and is_bool functions
* committed autogenerated normalization_test_output files (new files)
* fixed typo in TIMESTAMP_WITH_TIMEZONE_TYPE
* updated stream_processor to handle string type first as a wider type
* fixed arrays normalization by updating is_simple_property method as per new approaches
* format
Co-authored-by: Edward Gao <edward.gao@airbyte.io>
* Update airbyte protocol migration (#20745)
* Extract MigrationContainer from AirbyteMessageMigrator
* Add ConfiguredAirbyteCatalogMigrations
* Add ConfiguredAirbyteCatalog to AirbyteMessageMigrations
* Enable ConfiguredAirbyteCatalog migration
* Fix tests
* Remove extra this.
* Add missing docs
* Typo
Co-authored-by: Edward Gao <edward.gao@airbyte.io>
* Data types update: Implement protocol message migrations (#19240)
* Extract MigrationContainer from AirbyteMessageMigrator
* Add ConfiguredAirbyteCatalogMigrations
* Add ConfiguredAirbyteCatalog to AirbyteMessageMigrations
* Enable ConfiguredAirbyteCatalog migration
* set up scaffolding
* [wip] more scaffolding, basic unit test
* minimal green code
* [wip] add failing test for other primitive types
* correct version number
* handle basic primitive type decls
* add implicit cases
* add recursive schema
* formatting
* comment
* support not
* fix indentation
* handle all nested schema cases
* handle boolean schemas
* verify empty schema handling
* cleanup
* extract map
* code organization
* extract method
* reformat
* [wip] more tests, minor fix type array handling
* corrected test
* cleanup
* reformat
* switch to v1
* add support for multityped fields
* missed test case
* nested test class
* basic record upgrade
* implement record upgrades
* slight refactor
* comments + clarifications
* extract constants
* (partly) correct model classes
* add de/ser
* formatting
* extract constants
* fix json reference
* update docs
* switch to v1 models
* fix compile+test
* add base64 handling
* use vnull
* Data types update: Implement protocol message downgrade path (#19909)
* rough skeleton for passing catalog into migration
* basic test
* more scaffolding
* basic implementation
* add primitives test
* add in other tests (nested fields currently failing)
* add formats
* implement oneOf handling
* formatting
* oneOf handling
* better tests
* comments + organization
* progress
* basic test case
* downgrade objects, ish
* basic array implementation
* handle numeric failure
* test for new type
* handle array items
* empty schema handling
* first pass at oneof handling
* add more tests+handling
* more tests
* comments
* add empty oneof test case
* format + reorganize
* more reorganize
* fix name
* also downgrade binary data
* only import vnull
* move migrations into v1 package
* extract schema mutation code
* comment
* extract schema migration to new class
* extract record downgrade logic for future use
* format
* fix build after rebase
* rename private method for consistency
* also implement configuredcatalog migrations >.>
* quick and dirty tests
* slight cleanup
* fix tests
* pmd
* pmd test
* null check on message objects
* maybe fix acceptance tests?
* fix name
* extract constants
* more fixes
* tmp
* meh
* fix cdc acc tests
* revert to master source-postgres
* remove log messages
* revert other misc hacks
* integers are valid cursors
* remove unrelated change
* fix build
* fix build more?
* [MUST REVERT] use dev normalization
* capture kube logs
* also here?
* no debug logs?
* delete dup from merging
* add final everywhere
* revert test changes
Co-authored-by: Jimmy Ma <jimmy@airbyte.io>
* On-the-fly migrations of persisted catalogs (#21757)
* On the fly catalog migration for normalization activity
* On the fly catalog migration for job persistence
* On the fly migration for standard sync persistence
* On the fly migration for airbyte catalogs
* Refactor code to share JsonSchema traversal
* Add V0 Data type search function
* PMD and Format
* Fix getOrInsertActorCatalog and ConfigRepositoryE2E tests
* Null-proofing CatalogMigrationV1Helper
* More null checks
* Fix test
* Format
* Add data type v1 support to the FE
* Changes AC test check to check exited ps (#21672)
some docker compose changes no longer show exited processes. this broke our test.
this change should fix master.
tested in a runner that failed.
* Move wellknown types mapping to the utility function
* use protocolv1 normalization
---------
Co-authored-by: Topher Lubaway <asimplechris@gmail.com>
Co-authored-by: Edward Gao <edward.gao@airbyte.io>
* Update protocol support range (#21996)
* bump normalization version to 0.3.0
* Add version check on normalization (#22048)
* Add normalization min version check
* Add visible for testing
---------
Co-authored-by: Edward Gao <edward.gao@airbyte.io>
Co-authored-by: Eugene <etsybaev@gmail.com>
Co-authored-by: Topher Lubaway <asimplechris@gmail.com>
The Java 19 toolchain doesn't like sneaky throws. Not entirely sure why. However, I think it's better practice to avoid sneaky throws anyway, as explicit declarations make it clearer what is thrown and where.
Example error message when trying to compile the current codebase with Java 19:
error: Error during the transformation of 'io.airbyte.validation.json.JsonSchemaValidatorTest'; post-compiler 'lombok.bytecode.SneakyThrowsRemover' caused an exception: java.lang.IllegalArgumentException: Unsupported class file major version 63
at org.objectweb.asm.ClassReader.<init>(ClassReader.java:199)
at org.objectweb.asm.ClassReader.<init>(ClassReader.java:180)
at org.objectweb.asm.ClassReader.<init>(ClassReader.java:166)
at lombok.bytecode.AsmUtil.fixJSRInlining(AsmUtil.java:37)
at lombok.bytecode.SneakyThrowsRemover.applyTransformations(SneakyThrowsRemover.java:46)
at lombok.core.PostCompiler.applyTransformations(PostCompiler.java:44)
at lombok.core.PostCompiler$1.close(PostCompiler.java:87)
at jdk.compiler/com.sun.tools.javac.jvm.ClassWriter.writeClass(ClassWriter.java:1508)
at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.genCode(JavaCompiler.java:738)
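For context, a hedged before/after sketch of what removing a sneaky throw looks like (the method is hypothetical, not code from this change):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import lombok.SneakyThrows;

class SneakyThrowsExample {

  // Before: Lombok post-processes the bytecode so the checked IOException
  // escapes without being declared, hiding what is thrown and where.
  @SneakyThrows
  String readSneaky(final Path path) {
    return Files.readString(path);
  }

  // After: the checked exception is declared explicitly, so callers see it
  // and there is no bytecode post-processing step for the toolchain to trip over.
  String readExplicit(final Path path) throws IOException {
    return Files.readString(path);
  }
}
```

The explicit version also avoids the post-compiler transformation step that produced the `Unsupported class file major version 63` error above.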
## What
Finale of https://github.com/airbytehq/airbyte/pull/13122.
We've renamed all directories in previous PRs. Here we remove the fat jar configuration and add publishing to all subprojects.
Explanation for what is happening:
Identically named subprojects have the following issues:
* publishing as is leads to classpath confusion when jars with the same names are placed in the Java distribution. This leads to NoClassDefFound errors at runtime.
* deconflicting the jar names without changing directory names leads to dependency errors, as the OSS jar pom files are generated using project dependencies (i.e. a dependency on a sibling subproject in the same repo) that use the subproject's group and name as a reference. This means the generated jars look for jars that do not exist (as their names have been changed) and cannot compile.
* the workaround to changing a subproject's name involves resetting the subproject's name in settings.gradle and depending on the new name in each build.gradle. This increases the configuration burden and decreases readability, since one has to check settings.gradle to know what the right subproject name is. See https://github.com/gradle/gradle/issues/847 for more info.
* given that Gradle itself doesn't support identically named subprojects (see the linked issue), the simplest solution is to disallow duplicate directory names. I've only renamed conflicting directories here to keep things simple. I will create a follow-up issue to enforce non-identical subproject names in our builds.
## How
* Remove fat jar configuration.
* Add publishing to all subprojects.
* add specs module with logic to fetch specs on build
* format + build and add gradle dependency for new script
* check seed file for existing specs + refactor
* add tests + a bit more refactoring
* run gw format
* update yaml config persistence to merge specs into definitions
* add comment
* delete secrets migration to be consistent with master
* add dep
* add tests for GcsBucketSpecFetcher
* get rid of static block + format
* DRY up parse call
* add GCS details to comment
* formatting + fix test
* update comment
* do not format seed specs files
* change signature of run to allow cloud to reuse this script
* run gw format
* revert commits that change signature of run
* fix comment typo
Co-authored-by: Davin Chia <davinchia@gmail.com>
* rename enum to be distinct from the enum in cloud
* add missing dependencies between modules
* add readme for seed connector spec generator
* reword
* reference readme in comment
* ignore 'spec' field in newFields logic
Co-authored-by: Davin Chia <davinchia@gmail.com>