* switch to temporal cloud client for now
* format
* use client cert/key env secret instead of path to secret
* add TODO comments
* format
* add logging to debug timeout issue
* add more logging
* change workflow task timeout
* PR feedback: consolidate as much as possible, add missing javadoc
* fix acceptance test, needs to specify localhost
* add internal-use only comments
* format
* refactor to clean up TemporalClient and prepare it for future dependency injection framework
* remove extraneous log statements
* PR feedback
* fix test
* return isInitialized true in test
Refactor the acceptance tests for readability & speed by splitting acceptance tests into a basic and advanced test class.
- Basic test class: Contains all tests around functionality that stays constant regardless of deployment, e.g. API changes. Only runs for Docker.
- Advanced test class: Contains all tests around functionality that changes due to deployment. e.g. how we handle processes between Docker and Kubernetes. Runs for both Docker and Kubernetes.
The benefits are:
- Breaks up the huge monolithic tests we have today for better readability. Preps us to run these tests on Cloud.
- Clarifies what tests run on what deployment.
- Gradle parallelises at the level of the test class, so we get some speed up. Anecdotally, this is faster by ~3 mins over the old Kubernetes acceptance tests.
There is some test fixture duplication, but I figured we can tackle that in a follow up PR.
* sweep all scheduler application code and new-scheduler conditional logic
* remove airbyte-scheduler from deployments and docs
* format
* remove 'v2' from github actions
* add back scheduler in delete deployment command
* remove scheduler parameters from helm chart values
* add back job cleaner + test and add comment
* remove now-unused env vars from code and docs
* format
* remove feature flags from web backend connection handler as it is no longer needed
* remove feature flags from config api as it is no longer needed
* remove feature flags input from config api test
* format + shorter url
* remove scheduler parameters from helm chart readme
## What
Part 2 of https://github.com/airbytehq/airbyte/pull/13122.
Follow up to #13476 .
Explanation for what is happening:
Identically named subprojects have the following issues:
* publishing as-is leads to classpath confusion when jars with the same names are placed in the Java distribution. This leads to NoClassDefFoundError at runtime.
* deconflicting the jar names without changing directory names leads to dependency errors, as the OSS jar pom files are generated using project dependencies (a dependency on a sibling subproject in the same repo) that use the subproject's group and name as a reference. This means the generated jars look for jars that no longer exist (their names have changed) and cannot compile.
* the workaround for changing a subproject's name involves resetting the subproject's name in settings.gradle and depending on the new name in each build.gradle. This increases configuration burden and hurts readability, since one has to check settings.gradle to know the right subproject name. See https://github.com/gradle/gradle/issues/847 for more info.
* given that Gradle itself doesn't support identically named subprojects (see the linked issue), the simplest solution is to not allow duplicated directories. I've only renamed conflicting directories here to keep things simple. I will create a follow-up issue to enforce non-identical subproject names in our builds.
* Rename airbyte-config:models to airbyte-config:config-models.
* Rename airbyte-config:persistence to airbyte-config:config-persistence.
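The settings.gradle workaround rejected above would have looked roughly like the following sketch (illustrative only, not taken from the actual Airbyte build):

```groovy
// settings.gradle — hypothetical workaround (not adopted): keep the directory
// name but reset the subproject's name so published jar names don't collide.
include ':airbyte-config:models'
project(':airbyte-config:models').name = 'config-models'

// Every consuming build.gradle must then reference the *new* name, which is
// only discoverable by reading settings.gradle:
// implementation project(':airbyte-config:config-models')
```

Renaming the directories instead keeps the project path and the jar name in sync without this extra indirection.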
Part 1 of #13122.
Rename airbyte-db:lib to airbyte-db:db-lib.
Rename airbyte-metrics:lib to airbyte-metrics:metrics-lib.
Rename airbyte-protocol:models to airbyte-protocol:protocol-models.
Explanation for what is happening:
Identically named subprojects have the following issues:
- publishing as-is leads to classpath confusion when jars with the same names are placed in the Java distribution. This leads to NoClassDefFoundError at runtime.
- deconflicting the jar names without changing directory names leads to dependency errors, as the OSS jar pom files are generated using project dependencies (a dependency on a sibling subproject in the same repo) that use the subproject's group and name as a reference. This means the generated jars look for jars that no longer exist (their names have changed) and cannot compile.
- the workaround for changing a subproject's name involves resetting the subproject's name in settings.gradle and depending on the new name in each build.gradle. This increases configuration burden and hurts readability, since one has to check settings.gradle to know the right subproject name. See gradle/gradle#847 ("Projects with same name lead to unintended conflict resolution") for more info.
- given that Gradle itself doesn't support identically named subprojects (see the linked issue), the simplest solution is to not allow duplicated directories. I've only renamed conflicting directories here to keep things simple. I will create a follow-up issue to enforce non-identical subproject names in our builds.
* Migrate OSS to temporal scheduler
* add comment about migration being performed in server
* add comments about removing migration logic
* formatting and add tests for migration logic
* rm duplicated test
* remove more duplicated build task
* remove retry
* disable acceptance tests that call temporal directly when on kube
* set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent
* set default value of container orchestrator enabled to true
* Revert "set default value of container orchestrator enabled to true"
This reverts commit 21b36703a9.
* Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent"
This reverts commit 6dd2ec04a2.
* Revert "Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent""
This reverts commit 2f40f9da50.
* Revert "Revert "set default value of container orchestrator enabled to true""
This reverts commit 26068d5b31.
* fix sync workflow test
* remove defunct cancellation tests due to internal temporal error
* format - remove unused imports
* revert changes that set container orchestrator enabled to true everywhere
* remove NEW_SCHEDULER feature flag from .env files, and set CONTAINER_ORCHESTRATOR_ENABLED flag to true for kube .env files
Co-authored-by: Benoit Moriceau <benoit@airbyte.io>
* Repair temporal state when performing manual actions
* refactor temporal client and fix tests
* add unreachable workflow exception
* format
* test repeated deletion
* add acceptance tests for automatic workflow repair
* rename and DRY up manual operation methods in SchedulerHandler
* refactor temporal client to batch signal and start requests together in repair case
* add comment
* remove main method
* fix job id fetching
* only overwrite workflowState if reset flags are true on input
* fix test
* fix cancel endpoint
* Clean job state before creating new jobs in connection manager workflow (#12589)
* first working iteration of cleaning job state on first workflow run
* second iteration, with tests
* undo local testing changes
* move method
* add comment explaining placement of clean job state logic
* change connection_workflow failure origin value to platform
* remove cast from new query
* create static var for non terminal job statuses
* change failure origin value to airbyte_platform
* tweak external message wording
* remove unused variable
* reword external message
* fix merge conflict
* remove log lines
* move cleaning job state to beginning of workflow
* do not clean job state if there is already a job id for this workflow, and add test
* see if sleeping fixes test on CI
* add repeated test annotation to protect from flakiness
* fail jobs before creating new ones to protect from quarantined state
* update external message for cleaning job state error
* workspaceId should be part of spec request
* address review comment
* fix test
* format
* update octavia according to API changes
* create integration test for definition generation
* fix test
* fix test
Co-authored-by: alafanechere <augustin.lafanechere@gmail.com>
* acceptance tests with retries
* kube acceptance tests with retries
* javadoc to say we prefer not retrying
* no retries for tests that don't run on k8s
* start new workflow if not running during update
* add unit tests for update method
* remove comment
* just test update when temporal workflow does not exist
* put test inside usesNewScheduler conditional
* remove unused import
* add comment explaining acceptance test
* Add acceptance test for deleting connection in bad temporal state
* disable new test on kube
* try using a different temporalHost
* add log info line to give time for temporal client to spin up
* try avoiding missing row in temporal db
* check if temporal workflow is reachable
* check if temporal workflow is reachable
* check if temporal workflow is reachable
* try waiting for connection state
* try using airbyte-temporal hostname
* Revert "try using airbyte-temporal hostname"
This reverts commit 0e53a27622.
* Revert back to using localhost
* Add 5 second wait
* only enable test for new scheduler
* only enable test for new scheduler 2
* refactor test to cover normal and unexpected temporal state
* Make SchedulerHandler store schema after fetching it
* Add `disable_cache` parameter to discover_schema API
* Return cached catalog if it already exists
* Address code review comments
* Add tests for caching of catalog in SchedulerHandler
* Format fixes
* Fix Acceptance tests
* New code review fixes
- Use upper case for global variable
- Inline definition and assignment of variable
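The caching behavior described above (return the stored catalog when one exists, unless the caller disables the cache) can be sketched as follows; the class and method names are illustrative stand-ins, not the actual SchedulerHandler API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of the discover_schema caching behavior: store the catalog after
// fetching it, return the cached copy on later calls, and bypass the cache
// when disableCache is set. Names here are illustrative assumptions.
public class CatalogCache {

  private final Map<String, String> catalogsBySourceId = new HashMap<>();

  public String discoverSchema(String sourceId, boolean disableCache, Supplier<String> fetch) {
    if (!disableCache && catalogsBySourceId.containsKey(sourceId)) {
      return catalogsBySourceId.get(sourceId); // cached catalog, no discovery run
    }
    String catalog = fetch.get();              // run actual schema discovery
    catalogsBySourceId.put(sourceId, catalog); // store the result for next time
    return catalog;
  }

}
```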
* Feat: first cut to allow naming for connections
* fix
* fix: migration
* fix: migration
* fix: formatting
* fix: formatting
* fix: tests
* fix: -> is a bit outside of what we do generally
* fix: tests are failing
* fix: tests are failing
* fix: tests are failing
* fix: tests are failing
* fix: tests are failing
* auto-upgrade connectors that are in use with patch version changes only
* update check version docstring
* remove try/catch from hasNewPatchVersion
* refactor write std defs function
* run format
* add unit test and change exception
* update airbyte version function name to be more clear
* correct unit test in migration tests
* run format
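The patch-only upgrade rule above can be sketched as follows; `hasNewPatchVersion` echoes the commit message, but the semver parsing shown is an assumption rather than the actual implementation:

```java
// Sketch of the patch-only auto-upgrade check: a connector definition is
// eligible for automatic upgrade only when the candidate version differs
// from the current one solely in its patch component.
public class VersionCheck {

  /** Returns true when candidate is a patch-only bump over current. */
  public static boolean hasNewPatchVersion(String current, String candidate) {
    String[] cur = current.split("\\.");
    String[] cand = candidate.split("\\.");
    if (cur.length != 3 || cand.length != 3) {
      throw new IllegalArgumentException("expected major.minor.patch versions");
    }
    return cur[0].equals(cand[0])
        && cur[1].equals(cand[1])
        && Integer.parseInt(cand[2]) > Integer.parseInt(cur[2]);
  }

  public static void main(String[] args) {
    System.out.println(hasNewPatchVersion("0.2.1", "0.2.3")); // true: patch-only bump
    System.out.println(hasNewPatchVersion("0.2.1", "0.3.0")); // false: minor changed
  }

}
```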
* Revert "Rm flaky test (#9628)"
This reverts commit 16133cf5e7.
* Restore the acceptance test running with the new scheduler
* Add timeout
* Isolate the new acceptance test
* Update github action name
* Attempt to fix checkpointing
* Check the state retrieval instead of the existence of the workflow
* fix build
* Add concurrent list for test
* Do not wait for the workflow to be potentially destroyed
* Silently ignore the cancellation exception
* Format
* Trigger build
* format
* Remove unrelated changes
* Update acceptance
* Try to fix race condition
* Try to slow down the connection
* Disable test
* Move the sleep
* Rm useless sleep
* Fix missing return
* add repeated
* try using infinite feed source for cancellation test
* set limits on infinite feed source
* misunderstood waitForJob, now correctly waiting for job to be in RUNNING
* clean up PR, DRY create definition methods, clearer method name for waiting on job
* fix acceptance tests action name
* fix imports
* more cleanup
* revert temporalClient do-while change
* fix workflow step names
Co-authored-by: Benoit Moriceau <benoit@airbyte.io>
* upgrade temporal from mostly 1.6.0 to 1.8.1
* try bumping GSM to get newer grpc dep
* Revert "try bumping GSM to get newer grpc dep"
This reverts commit d837650284.
* upgrade temporal-testing as well
* don't change version for temporal-testing-junit5
* test time ranges for cancellations
* try with wait
* fix cancellation on worker restart
* revert for CI testing that the test fails without the retry policy
* revert testing change
* matrix test the different possible cases
* re-enable new retry policy
* switch to no_retry
* switch back to new retry
* parameterize correctly
* revert to no-retry
* re-enable new retry policy
* speed up test + fixes
* significantly speed up test
* fix ordering
* use multiple task queues in connection manager test
* use versioning for task queue change
* remove sync workflow registration for the connection manager queue
* use more specific example
* respond to parker's comments
This changes the check for whether a connection exists in order to make it more performant and more accurate: it makes sure the workflow is reachable by trying to query it.
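The reachability check described here can be sketched as the following control flow; the `Supplier` stands in for a Temporal workflow-stub query, whose real API differs:

```java
import java.util.function.Supplier;

// Sketch of the workflow reachability check: rather than assuming a workflow
// exists, attempt a query against it and treat any failure as "unreachable".
// The Supplier is a stand-in for a Temporal workflow-stub query call.
public class WorkflowReachability {

  public static boolean isWorkflowReachable(Supplier<String> stateQuery) {
    try {
      stateQuery.get(); // a successful query proves the workflow is reachable
      return true;
    } catch (Exception e) {
      return false;     // any failure (not found, terminated, ...) means unreachable
    }
  }

  public static void main(String[] args) {
    System.out.println(isWorkflowReachable(() -> "RUNNING"));                          // true
    System.out.println(isWorkflowReachable(() -> { throw new RuntimeException(); }));  // false
  }

}
```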
* fix normalization output processing in container orchestrator
* add full scheduler v2 acceptance tests
* speed up tests
* fixes
* clean up
* wip handle worker restarts
* only downtime during sync test not passing
* commit temp
* mostly cleaned up
* add attempt count check
* remove todo
* switch all pending checks to running checks
* use ++
* Update airbyte-container-orchestrator/src/main/java/io/airbyte/container_orchestrator/ContainerOrchestratorApp.java
Co-authored-by: Charles <giardina.charles@gmail.com>
* Update airbyte-workers/src/main/java/io/airbyte/workers/temporal/sync/LauncherWorker.java
Co-authored-by: Charles <giardina.charles@gmail.com>
* add more context
* remove unused arg
* test on CI that no_retry is insufficient
* revert back to orchestrator retry
* test for retry logic
* remove failing test and switch activity config back to just no_retry
Co-authored-by: Charles <giardina.charles@gmail.com>
* Add a job notification
The new scheduler was missing a notification step after the job is done.
This is needed in order to report the number of records synced.
* Acceptance test with the new scheduler
Add a new github action task to run the acceptance tests with the new scheduler
* Retry on failure
* PR comments
* fix migration test again
* disable acceptance tests
* re-enable acceptance tests
* bring snowflake version back
* fix
* fix how we compare versions for migration tests
* Implement destination null
* Update existing testing destinations
* Merge in logging consumer
* Remove old destination null
* Add documentation
* Add destination to build and summary
* Fix test
* Update acceptance test
* Log state message
* Remove unused variable
* Remove extra statement
* Remove old null doc
* Add dev null destination
* Update doc to include changelog for dev null
* Format code
* Fix doc
* Register e2e test destination in seed
* add test connectors and bump versions
* add failure timeout acceptance test
* run gw format
* include exception in runtime exception
* mark as disabled and add comment
- Add the CONFIGS_DATABASE_MINIMUM_FLYWAY_MIGRATION_VERSION and JOBS_DATABASE_MINIMUM_FLYWAY_MIGRATION_VERSION. These are env vars that will determine if the database is ready for an application to start.
- Add the CONFIGS_DATABASE_INITIALIZATION_TIMEOUT_MS and the JOBS_DATABASE_INITIALIZATION_TIMEOUT_MS env vars to determine how long an application should wait for the DB before giving up.
- Create the MinimumFlywayMigrationVersionCheck class. This class contains all the assertions to check that 1) a database is initialised and 2) a database meets the minimum migration version.
- Remove all set up operations from the ServerApp. Use MinimumFlywayMigrationVersionCheck operations instead.
- I also had to modify the Databases and BaseDatabaseInstance classes to support connecting to a database with timeouts. We would previously try forever.
- Add Bootloader to the relevant docker files and Kube files.
- Clean up the migration acceptance tests so it's clear what is happening.
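The timeout-bounded check performed by `MinimumFlywayMigrationVersionCheck` can be sketched as a polling loop like the one below; the method signature and the lexicographic version comparison are simplifying assumptions, not the actual implementation:

```java
import java.util.function.Supplier;

// Sketch of the minimum-migration-version check: poll the database's current
// Flyway version until it reaches the configured minimum, giving up after the
// configured timeout instead of waiting forever.
public class MinimumVersionWait {

  public static void waitForMinimumVersion(Supplier<String> currentVersion,
                                           String minimumVersion,
                                           long timeoutMs,
                                           long pollMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      // Lexicographic comparison is a simplification; real Flyway versions
      // need component-wise numeric comparison.
      if (currentVersion.get().compareTo(minimumVersion) >= 0) {
        return; // database is initialised and migrated far enough
      }
      Thread.sleep(pollMs); // not ready yet; poll again
    }
    throw new IllegalStateException("database did not reach " + minimumVersion + " in time");
  }

}
```

An application's startup path would call this once per database (configs and jobs), using the corresponding `*_INITIALIZATION_TIMEOUT_MS` value as the timeout.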
* Rename GcsStorageBucket to GcsLogBucket.
* Update all references to GCP_STORAGE_BUCKET to GCS_LOG_BUCKET.
* Undo this for configuration files for older Airbyte versions.
* Clean up Job env vars. (#8462)
* Rename MAX_SYNC_JOB_ATTEMPTS to SYNC_JOB_MAX_ATTEMPTS.
* Rename MAX_SYNC_TIMEOUT_DAYS to SYNC_JOB_MAX_TIMEOUT_DAYS.
* Rename WORKER_POD_TOLERATIONS to JOB_POD_TOLERATIONS.
* Rename WORKER_POD_NODE_SELECTORS to JOB_POD_NODE_SELECTORS.
* Rename JOB_IMAGE_PULL_POLICY to JOB_POD_MAIN_CONTAINER_IMAGE_PULL_POLICY.
* Rename JOBS_IMAGE_PULL_SECRET to JOB_POD_MAIN_CONTAINER_IMAGE_PULL_SECRET.
* Rename JOB_SOCAT_IMAGE to JOB_POD_SOCAT_IMAGE.
* Rename JOB_BUSYBOX_IMAGE to JOB_POD_BUSYBOX_IMAGE.
* Rename JOB_CURL_IMAGE to JOB_POD_CURL_IMAGE.
* Rename KUBE_NAMESPACE to JOB_POD_KUBE_NAMESPACE.
* Rename RESOURCE_CPU_REQUEST to JOB_POD_MAIN_CONTAINER_CPU_REQUEST.
* Rename RESOURCE_CPU_LIMIT to JOB_POD_MAIN_CONTAINER_CPU_LIMIT.
* Rename RESOURCE_MEMORY_REQUEST to JOB_POD_MAIN_CONTAINER_MEMORY_REQUEST.
* Rename RESOURCE_MEMORY_LIMIT to JOB_POD_MAIN_CONTAINER_MEMORY_LIMIT.
* Remove worker suffix from created pods to reduce confusion with actual worker pods.
* Use sync instead of worker to name job pods.