Basic Normalization

High-Level Overview

{% hint style="info" %} The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. Information past that can be read for advanced or educational purposes. {% endhint %}

When you run your first Airbyte sync without the basic normalization, you'll notice that your data gets written to your destination as one data column with a JSON blob that contains all of your data. This is the _airbyte_raw_ table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible. If an unmodified version of the data exists in the destination, it can be retransformed without needing to sync data again.

If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to create a schema and tables with your data in mind, converting it to the format of your destination. This runs after your sync and may take a long time if you have a large amount of data synced. If you don't enable Basic Normalization, you'll have to transform the JSON data from that column yourself.
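
If you leave Basic Normalization off, a hand-written query can still pull individual fields out of the JSON blob. The sketch below assumes a Postgres destination and uses the cars stream from the example that follows; table and field names are illustrative:

-- Ad-hoc query against the raw table when Basic Normalization is disabled (Postgres syntax).
SELECT
    _airbyte_data ->> 'make'                   AS make,
    _airbyte_data ->> 'model'                  AS model,
    (_airbyte_data ->> 'horsepower')::INTEGER  AS horsepower
FROM _airbyte_raw_cars;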

Example

Basic Normalization uses a fixed set of rules to map a JSON object from a source to the types and formats that are native to the destination. For example, if a source emits data that looks like this:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "horsepower": "247"
}

The destination connectors produce the following raw table in the destination database:

CREATE TABLE "_airbyte_raw_cars" (
    -- metadata added by airbyte
    "_airbyte_ab_id" VARCHAR, -- uuid value assigned by connectors to each row of the data written in the destination.
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE, -- time at which the record was emitted.
    "_airbyte_data" JSONB -- data stored as a Json Blob.
);

Then, basic normalization would create the following table:

CREATE TABLE "cars" (
    "_airbyte_ab_id" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    -- data from source
    "make" VARCHAR,
    "model" VARCHAR,
    "horsepower" INTEGER
);

Normalization metadata columns

You'll notice that some metadata columns are added to keep track of important information about each record.

  • Some are introduced at the destination connector level; these are propagated by the normalization process from the raw table to the final table:
    • _airbyte_ab_id: uuid value assigned by connectors to each row of the data written in the destination.
    • _airbyte_emitted_at: time at which the record was emitted and recorded by the destination connector.
  • Other metadata columns are created at the normalization step:
    • _airbyte_<table_name>_hashid: hash value assigned by airbyte normalization, derived from a hash function of the record data.
    • _airbyte_normalized_at: time at which the record was last normalized (useful to track when incremental transformations are performed)

Additional metadata columns can be added on some tables depending on the usage:

  • On the Slowly Changing Dimension (SCD) tables:
  • _airbyte_start_at: equivalent to the cursor column defined on the table, denotes when the row was first seen
  • _airbyte_end_at: denotes until when the row was seen with these particular values. If this column is not NULL, then the record has been updated and is no longer the most up to date one. If NULL, then the row is the latest version for the record.
  • _airbyte_active_row: denotes if the row for the record is the latest version or not.
  • _airbyte_unique_key_scd: hash of primary keys + cursors used to de-duplicate the scd table.
  • On de-duplicated (and SCD) tables:
  • _airbyte_unique_key: hash of primary keys used to de-duplicate the final table.
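
To make these columns concrete, here is a rough sketch of what an SCD table for the cars stream above might contain when it is synced in incremental deduped history mode. The exact DDL generated by normalization depends on the destination and on the configured cursor and primary key, so treat this as illustrative only:

CREATE TABLE "cars_scd" (
    -- metadata propagated from the raw table / added by normalization
    "_airbyte_ab_id" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    -- SCD bookkeeping columns described above
    "_airbyte_unique_key_scd" VARCHAR,  -- hash of primary keys + cursor
    "_airbyte_start_at" VARCHAR,        -- mirrors the cursor column; its type follows the cursor's type
    "_airbyte_end_at" VARCHAR,          -- NULL while this row is the latest version of the record
    "_airbyte_active_row" BOOLEAN,      -- whether this row is the latest version

    -- data from source
    "make" VARCHAR,
    "model" VARCHAR,
    "horsepower" INTEGER
);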

The normalization rules are not configurable. They are designed to pick a reasonable set of defaults to hit the 80/20 rule of data normalization. We recognize that normalization is a detail-oriented problem and that, with a fixed set of rules, we cannot normalize your data in a way that covers all use cases. If this feature does not meet your normalization needs, we always put the full JSON blob in the destination as well, so that you can parse that object however best suits your use case. We will be adding more advanced normalization functionality shortly. Airbyte is focused on the EL of ELT. If you need a more fully featured tool for transformations, we suggest trying out dbt.

Airbyte places the json blob version of your data in a table called _airbyte_raw_<stream name>. If basic normalization is turned on, it will place a separate copy of the data in a table called <stream name>. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the datastore. This implementation avoids extra network time and costs.
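
Put differently, the flattened <stream name> table is derived from _airbyte_raw_<stream name> entirely inside the warehouse. A greatly simplified, Postgres-flavored sketch of that derivation is shown below; the SQL that dbt actually generates is more involved, and the hashid expression here is only illustrative:

-- Simplified illustration: the normalized table is built from the raw table in-warehouse.
CREATE TABLE "cars" AS
SELECT
    "_airbyte_ab_id",
    "_airbyte_emitted_at",
    MD5("_airbyte_data"::TEXT)                    AS "_airbyte_cars_hashid",  -- illustrative hash only
    NOW()                                         AS "_airbyte_normalized_at",
    "_airbyte_data" ->> 'make'                    AS "make",
    "_airbyte_data" ->> 'model'                   AS "model",
    ("_airbyte_data" ->> 'horsepower')::INTEGER   AS "horsepower"
FROM "_airbyte_raw_cars";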

Why does Airbyte have Basic Normalization?

At its core, Airbyte is geared to handle the EL (Extract Load) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".

However, this by itself produces a table in the destination with a JSON blob column. For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.

So, after EL, comes the T (transformation) and the first T step that Airbyte actually applies on top of the extracted data is called "Normalization".

Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.

To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementations underneath:

In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:

  • Airbyte base-normalization python package to generate dbt SQL model files
  • dbt to compile and execute the models on top of the data in the destinations that support it.

Destinations that Support Basic Normalization

Basic Normalization can be used in each of these destinations by setting the "basic normalization" field to true when configuring the destination in the UI.

Rules

Typing

Airbyte tracks types using JsonSchema's primitive types. Here is how these types will map onto standard SQL types. Note: The names of the types may differ slightly across different destinations.

Airbyte uses the types described in the catalog to determine the correct type for each column. It does not try to use the values themselves to infer the type.

| JsonSchema Type | Resulting Type | Notes |
| :--- | :--- | :--- |
| number | float | |
| integer | integer | |
| string | string | |
| bit | boolean | |
| boolean | boolean | |
| string with format label date-time | timestamp with timezone | |
| array | new table | see nesting |
| object | new table | see nesting |
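
As an illustration of this mapping, a stream whose catalog declares a string field with format date-time and an integer field would end up with columns roughly like the following. The stream and field names here (events, occurred_at, attendee_count) are hypothetical, and the exact type names vary by destination:

CREATE TABLE "events" (
    -- declared in the catalog as {"type": "string", "format": "date-time"}
    "occurred_at" TIMESTAMP_WITH_TIMEZONE,
    -- declared in the catalog as {"type": "integer"}
    "attendee_count" INTEGER
);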

Nesting

Basic Normalization attempts to expand any nested arrays or objects it receives into separate tables in order to allow more ergonomic querying of your data.

Arrays

Basic Normalization expands arrays into separate tables. For example, if the source provides the following data:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "limited_editions": [
    { "name": "4C spider", "release_year": 2013 },
    { "name" : "4C spider italia" , "release_year":  2018 }
  ]
}

The resulting normalized schema would be:

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "limited_editions" (
    "_airbyte_limited_editions_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "name" VARCHAR,
    "release_year" VARCHAR
);

If the nested items in the array are not objects, they are expanded into a string field of comma-separated values, e.g.:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "limited_editions": [ "4C spider", "4C spider italia"]
}

The resulting normalized schema would be:

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "limited_editions" (
    "_airbyte_limited_editions_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "data" VARCHAR
);

Objects

In the case of a nested object e.g.:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "powertrain_specs": { "horsepower": 247, "transmission": "6-speed" }
}

The normalized schema would be:

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "powertrain_specs" (
    "_airbyte_powertrain_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "horsepower" INTEGER,
    "transmission" VARCHAR
);

Naming Collisions for un-nested objects

When extracting nested objects or arrays, the Basic Normalization process needs to figure out new names for the expanded tables.

For example, suppose we had a cars table with a nested column cars containing objects whose schema is identical to the parent table:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "cars": [
    { "make": "audi", "model": "A7" },
    { "make" : "lotus" , "model":  "elise" }
    { "make" : "chevrolet" , "model":  "mustang" }
  ]
}

The expanded table would have a conflict in terms of naming since both are named cars. To avoid name collisions and ensure a more consistent naming scheme, Basic Normalization chooses the expanded name as follows:

  • cars for the original parent table
  • cars_da3_cars for the expanded nested columns following this naming scheme in 3 parts: <Json path>_<Hash>_<nested column name>
  • Json path: The entire json path string with '_' characters used as delimiters to reach the table that contains the nested column name.
  • Hash: Hash of the entire json path to reach the nested column reduced to 3 characters. This is to make sure we have a unique name (in case part of the name gets truncated, see below)
  • Nested column name: name of the column being expanded into its own table.

By following this strategy, nested columns should "never" collide with other table names. If they do, an exception will most likely be thrown either by the normalization process or by dbt, which runs afterward.

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "cars_da3_cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);
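
To illustrate how such a 3-character hash can be derived from the json path, here is a purely illustrative query. It assumes an MD5-based hash for the sake of the example; the actual hash function and encoding used by base-normalization may differ, so this is not guaranteed to reproduce the da3 value shown above:

-- Illustrative only: derive a 3-character hash from the json path of the nested column.
SELECT LEFT(MD5('cars/cars'), 3) AS json_path_hash;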

Naming limitations & truncation

Note that different destinations have various naming limitations, most commonly on how long names can be. For instance, the Postgres documentation states:

The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes

Most modern data warehouses have name length limits on the longer side, so this should not affect you that often. Basic Normalization will fall back to the following rules:

  1. No truncation if the name fits within the destination's character limit

However, in the rare cases where these limits are reached:

  1. Truncate only the Json path to fit within the destination's character limit
  2. Truncate the Json path to at least its first 10 characters, then truncate the nested column name starting in the middle to keep its prefix and suffix intact (whenever a truncation is made in the middle, two '__' characters are also inserted to denote where it happened), so that the name fits within the destination's character limit

As an example from the hubspot source, we could have the following tables with nested columns:

| Description | Example 1 | Example 2 |
| :--- | :--- | :--- |
| Original Stream Name | companies | deals |
| Json path to the nested column | companies/property_engagements_last_meeting_booked_campaign | deals/properties/engagements_last_meeting_booked_medium |
| Final table name of expanded nested column on BigQuery | companies_2e8_property_engagements_last_meeting_booked_campaign | deals_properties_6e6_engagements_last_meeting_booked_medium |
| Final table name of expanded nested column on Postgres | companies_2e8_property_engag__oked_campaign | deals_prop_6e6_engagements_l__booked_medium |

As mentioned in the overview:

  • Airbyte places the json blob version of your data in a table called _airbyte_raw_<stream name>.
  • If basic normalization is turned on, it will place a separate copy of the data in a table called <stream name>.
  • In certain pathological cases, basic normalization is required to generate large models with many columns and multiple intermediate transformation steps for a stream. This may break down the "ephemeral" materialization strategy and require the use of additional intermediate views or tables instead. As a result, you may notice additional temporary tables being generated in the destination to handle these checkpoints.

UI Configurations

Basic normalization is optional. You can enable or disable it in the "Normalization and Transformation" section when setting up your connection:

Incremental runs

When the source is configured with sync modes compatible with incremental transformations (using append on the destination), such as full_refresh_append, incremental append, or incremental deduped history, only rows that have changed in the source are transferred over the network and written by the destination connector. Normalization then tries to build the normalized tables incrementally, processing only the rows in the raw tables that have been created or updated since the last time dbt ran. As such, on each dbt run, the models are built incrementally. This limits the amount of data that needs to be transformed, vastly reducing the runtime of the transformations, which improves warehouse performance and reduces compute costs. Because normalization can run either incrementally or in full refresh, a technical column _airbyte_normalized_at tracks the last time a record was transformed and written by normalization. This may diverge greatly from the _airbyte_emitted_at value, as the normalized tables could be rebuilt entirely at a later time from the data stored in the _airbyte_raw tables.
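
To give a flavor of what building the normalized tables incrementally means in dbt terms, here is a simplified sketch of an incremental dbt model that only picks up raw rows emitted since the previous run. The source name airbyte_raw is hypothetical, and the models that base-normalization actually generates are considerably more elaborate:

-- Simplified incremental dbt model over the raw cars table.
{{ config(materialized='incremental') }}

select
    _airbyte_ab_id,
    _airbyte_emitted_at,
    _airbyte_data
from {{ source('airbyte_raw', '_airbyte_raw_cars') }}
{% if is_incremental() %}
  -- on incremental runs, only transform rows emitted after the newest already-normalized row
  where _airbyte_emitted_at > (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}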

Partitioning, clustering, sorting, indexing

Normalization produces tables that are partitioned, clustered, sorted or indexed depending on the destination engine and on the type of tables being built. The goal of these optimizations is to make reads more performant, especially when running incremental updates.

In general, normalization needs to look up the latest emitted_at column to know whether a record is freshly produced and needs to be incrementally processed. But certain models, such as SCD tables, also need to retrieve older data to update their type 2 SCD end_date and active_row flags, so a different partitioning scheme is used to optimize that use case.
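
For example, on BigQuery an incremental model can declare partitioning through its dbt configuration, so that incremental lookups on the emission timestamp only scan recent partitions. The settings below are illustrative rather than the exact configuration normalization emits, and cars_ab3 is a hypothetical upstream staging model:

-- Illustrative dbt config for a BigQuery incremental model partitioned on the emission timestamp.
{{ config(
    materialized = 'incremental',
    partition_by = {"field": "_airbyte_emitted_at", "data_type": "timestamp", "granularity": "day"}
) }}

select * from {{ ref('cars_ab3') }}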

On the Postgres destination, because of a limitation, an additional table suffixed with _stg needs to be persisted (in a separate staging schema) for every stream replicated in incremental deduped history mode in order for incremental transformations to work.

Extending Basic Normalization

Note that all the choices made by Normalization described on this documentation page, in terms of naming and more, can be overridden by your own custom choices. To do so, you can follow these tutorials:

CHANGELOG

airbyte-integrations/bases/base-normalization

Note that Basic Normalization is packaged in a docker image airbyte/normalization. This image is tied to and released along with a specific Airbyte version. It is not configurable independently, as is possible with connectors (sources & destinations).

Therefore, in order to "upgrade" to the desired normalization version, you need to use the corresponding Airbyte version that it was released in:

| Airbyte Version | Normalization Version | Date | Pull Request | Subject |
| :--- | :--- | :--- | :--- | :--- |
| 0.35.59-alpha | 0.1.72 | 2022-03-24 | #11093 | Added Snowflake OAuth2.0 support |
| 0.35.53-alpha | 0.1.71 | 2022-03-14 | #11077 | Enable BigQuery to handle project ID embedded inside dataset ID |
| 0.35.49-alpha | 0.1.70 | 2022-03-11 | #11051 | Upgrade dbt to 1.0.0 (except for MySQL and Oracle) |
| 0.35.45-alpha | 0.1.69 | 2022-03-04 | #10754 | Enable Clickhouse normalization over SSL |
| 0.35.32-alpha | 0.1.68 | 2022-02-20 | #10485 | Fix row size too large for table with numerous string fields |
| | 0.1.66 | 2022-02-04 | #9341 | Fix normalization for bigquery datasetId and tables |
| 0.35.13-alpha | 0.1.65 | 2021-01-28 | #9846 | Tweak dbt multi-thread parameter down |
| 0.35.12-alpha | 0.1.64 | 2021-01-28 | #9793 | Support PEM format for ssh-tunnel keys |
| 0.35.4-alpha | 0.1.63 | 2021-01-07 | #9301 | Fix Snowflake prefix tables starting with numbers |
| | 0.1.62 | 2021-01-07 | #9340 | Use TCP-port support for clickhouse |
| | 0.1.62 | 2021-01-07 | #9063 | Change Snowflake-specific materialization settings |
| | 0.1.62 | 2021-01-07 | #9317 | Fix issue with quoted & case sensitive columns |
| | 0.1.62 | 2021-01-07 | #9281 | Fix SCD partition by float columns in BigQuery |
| 0.32.11-alpha | 0.1.61 | 2021-12-02 | #8394 | Fix incremental queries not updating empty tables |
| | 0.1.61 | 2021-12-01 | #8378 | Fix un-nesting queries and add proper ref hints |
| 0.32.5-alpha | 0.1.60 | 2021-11-22 | #8088 | Speed-up incremental queries for SCD table on Snowflake |
| 0.30.32-alpha | 0.1.59 | 2021-11-08 | #7669 | Fix nested incremental dbt |
| 0.30.24-alpha | 0.1.57 | 2021-10-26 | #7162 | Implement incremental dbt updates |
| 0.30.16-alpha | 0.1.52 | 2021-10-07 | #6379 | Handle empty string for date and date-time format |
| | 0.1.51 | 2021-10-08 | #6799 | Added support for ad_cdc_log_pos while normalization |
| | 0.1.50 | 2021-10-07 | #6079 | Added support for MS SQL Server normalization |
| | 0.1.49 | 2021-10-06 | #6709 | Forward destination dataset location to dbt profiles |
| 0.29.17-alpha | 0.1.47 | 2021-09-20 | #6317 | MySQL: updated MySQL normalization with using SSH tunnel |
| | 0.1.45 | 2021-09-18 | #6052 | Snowflake: accept any date-time format |
| 0.29.8-alpha | 0.1.40 | 2021-08-18 | #5433 | Allow optional credentials_json for BigQuery |
| 0.29.5-alpha | 0.1.39 | 2021-08-11 | #4557 | Handle date times and solve conflict name btw stream/field |
| 0.28.2-alpha | 0.1.38 | 2021-07-28 | #5027 | Handle quotes in column names when parsing JSON blob |
| 0.27.5-alpha | 0.1.37 | 2021-07-22 | #3947 | Handle NULL cursor field values when deduping |
| 0.27.2-alpha | 0.1.36 | 2021-07-09 | #3947 | Enable normalization for MySQL destination |