Remove normalization docs (#38779)

@@ -39,7 +39,6 @@ _Screenshot taken from [Airbyte Cloud](https://cloud.airbyte.com/signup)_.

- Create connectors in minutes with our [no-code Connector Builder](https://docs.airbyte.com/connector-development/connector-builder-ui/overview) or [low-code CDK](https://docs.airbyte.com/connector-development/config-based/low-code-cdk-overview).
- Explore popular use cases in our [tutorials](https://airbyte.com/tutorials).
- Orchestrate Airbyte syncs with [Airflow](https://docs.airbyte.com/operator-guides/using-the-airflow-airbyte-operator), [Prefect](https://docs.airbyte.com/operator-guides/using-prefect-task), [Dagster](https://docs.airbyte.com/operator-guides/using-dagster-integration), [Kestra](https://docs.airbyte.com/operator-guides/using-kestra-plugin) or the [Airbyte API](https://reference.airbyte.com/reference/start).
- Easily transform loaded data with [SQL](https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-sql) or [dbt](https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-dbt).

Try it out yourself with our [demo app](https://demo.airbyte.io/), visit our [full documentation](https://docs.airbyte.com/) and learn more about [recent announcements](https://airbyte.com/blog-categories/company-updates). See our [registry](https://connectors.airbyte.com/files/generated_reports/connector_registry_report.html) for a full list of connectors already available in Airbyte or Airbyte Cloud.

@@ -42,8 +42,6 @@ following[ sync modes](https://docs.airbyte.com/cloud/core-concepts#connection-s

| Incremental - Append + Deduped | No  |     |
| Namespaces                     | Yes |     |

The Teradata destination connector supports the [dbt custom transformation](https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-airbyte/) operation type. A Teradata dbt Docker image is available at https://hub.docker.com/r/teradata/dbt-teradata.

### Performance considerations

## Getting started

@@ -1 +0,0 @@

# Transformations and Normalization

@@ -1,109 +0,0 @@

---
products: oss-*
---

# Transformations with Airbyte (Part 3/3)

:::warning
Normalization and Custom Transformation are deprecated features.
Destinations using Normalization will be replaced by [Typing and Deduping](/using-airbyte/core-concepts/typing-deduping.md).
Custom Transformation will be removed on March 31. For more information, visit [here](https://github.com/airbytehq/airbyte/discussions/34860).
:::

This tutorial will describe how to push a custom dbt transformation project back to Airbyte to use during syncs.

This guide is the last part of the tutorial series on transformations, following [Transformations with SQL](transformations-with-sql.md) and [connecting EL with T using dbt](transformations-with-dbt.md).

\(Example outputs are updated with Airbyte version 0.23.0-alpha from May 2021\)

## Transformations with Airbyte

After replication of data from a source connector \(Extract\) to a destination connector \(Load\), multiple optional transformation steps can now be applied as part of an Airbyte sync. Possible workflows are:

1. Basic normalization transformations, as automatically generated by the Airbyte dbt code generator.
2. Customized normalization transformations, as edited by the user \(the default generated normalization should therefore be disabled\).
3. Customized business transformations, as specified by the user.

## Public Git repository

In the connection settings page, I can add new Transformation steps to apply after [normalization](../../using-airbyte/core-concepts/basic-normalization.md). For example, I want to run my custom dbt project jaffle_shop whenever my sync is done replicating and normalizing my data.

You can find the jaffle shop test repository by clicking [here](https://github.com/dbt-labs/jaffle_shop).





## Private Git repository

Now, let's connect my mono-repo Business Intelligence project stored in a private git repository to update the related tables and dashboards when my Airbyte syncs complete.

Note that if you need to connect to a private git repository, the recommended way to do so is to generate a `Personal Access Token` that can be used instead of a password. Then, you'll be able to include the credentials in the git repository URL:

- [GitHub - Personal Access Tokens](https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token)
- [Gitlab - Personal Access Tokens](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html)
- [Azure DevOps - Personal Access Tokens](https://docs.microsoft.com/en-us/azure/devops/organizations/accounts/use-personal-access-tokens-to-authenticate)

And then use it for cloning:

```text
git clone https://username:token@github.com/user/repo
```

Where `https://username:token@github.com/user/repo` is the git repository URL.

### Example of a private git repo used as transformations

As an example, I go through my GitHub account to generate a Personal Access Token to use in Airbyte with permissions to clone my private repositories:



This provides me with a token to use:



In Airbyte, I can use the git URL as: `https://airbyteuser:ghp_***********ShLrG2yXGYF@github.com/airbyteuser/private-datawarehouse.git`



## Tips for using custom dbt

### Allow "chained" dbt transformations

Since every transformation runs in its own Docker container, I can't currently rely on packages installed by `dbt deps` being available to the next transformation.
According to the dbt documentation, I can configure the [packages folder](https://docs.getdbt.com/reference/project-configs/packages-install-path) outside of the container:

```yaml
# dbt_project.yml
packages-install-path: "../dbt_packages"
```

> If I want to chain **dbt deps** and **dbt run**, I may use **[dbt build](https://docs.getdbt.com/reference/commands/build)** instead. It is not equivalent to running the two previous commands, but it removes the need to alter the dbt configuration.
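
For instance, the operation's dbt CLI arguments could then be reduced to a single command \(a hypothetical invocation; the model selection shown here is only an example and would need to match your own project\):

```text
build --select tag:covid_api
```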

### Refresh models partially

Since I am using a mono-repo from my organization, other team members or departments may also contribute their dbt models to this centralized location. This will give us many dbt models and sources to build our complete data warehouse...

The whole warehouse is scheduled for full refresh on a different orchestration tool, or as part of the git repository CI. However, here, I want to partially refresh some small relevant tables when attaching this operation to a specific Airbyte sync, in this case, the Covid dataset.

Therefore, I can restrict the execution of models to a particular tag or folder by specifying them in the dbt CLI arguments, in this case whatever is related to "covid_api":

```text
run --models tag:covid_api opendata.base.*
```

Now, when replication syncs are triggered by Airbyte, my custom transformations from my private git repository are also run at the end!

### Using a custom run with variables

If you want to use a custom run and pass variables, you need to use the following syntax:

```bash
run --vars '{"table_name":"sample","schema_name":"other_value"}'
```

This string must not contain any spaces. There is a [GitHub issue](https://github.com/airbytehq/airbyte/issues/4348) to improve this. If you want to contribute to Airbyte, this is a good opportunity!
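
The variables passed with `--vars` can then be referenced in your models with dbt's `var()` function. A minimal sketch, using a hypothetical model file:

```sql
-- models/sample_model.sql (hypothetical): reads the variables passed via --vars
select *
from {{ var('schema_name') }}.{{ var('table_name') }}
```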

### DBT Profile

There is no need to specify `--profiles-dir`. By default, Airbyte generates the profile based on the destination type. For example, if you're using Postgres as your destination, Airbyte will create a profile configuration based on that destination, which means you don't need to specify the credentials yourself. If you specify a custom `profiles` file, you are responsible for securely managing the credentials. Currently, we don't have a way to manage and pass secrets, so it's recommended that you let Airbyte pass the profile to dbt.
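
As an illustration, the profile Airbyte generates for a Postgres destination looks roughly like the following. This is a hedged sketch only: the profile name, exact keys, and values are managed by Airbyte and may differ; the connection values here simply echo the example destination used later in this tutorial series.

```yaml
# profiles.yml (illustrative sketch of what Airbyte generates for a Postgres destination)
normalize:
  target: prod
  outputs:
    prod:
      type: postgres
      host: localhost
      port: 3000
      user: postgres
      pass: "<set by Airbyte from the destination settings>"
      dbname: postgres
      schema: quarantine
      threads: 32
```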

@@ -1,225 +0,0 @@

---
products: oss-*
---

# Transformations with dbt (Part 2/3)

:::warning
Normalization and Custom Transformation are deprecated features.
Destinations using Normalization will be replaced by [Typing and Deduping](/using-airbyte/core-concepts/typing-deduping.md).
Custom Transformation will be removed on March 31. For more information, visit [here](https://github.com/airbytehq/airbyte/discussions/34860).
:::

This tutorial will describe how to integrate SQL based transformations with Airbyte syncs using a specialized transformation tool: dbt.

This tutorial is the second part of the previous tutorial [Transformations with SQL](transformations-with-sql.md). Next, we'll wrap up with a third part on submitting transformations back in Airbyte: [Transformations with Airbyte](transformations-with-airbyte.md).

\(Example outputs are updated with Airbyte version 0.23.0-alpha from May 2021\)

## Transformations with dbt

The tool in charge of transformation behind the scenes is actually called [dbt](https://blog.getdbt.com/what--exactly--is-dbt-/) \(Data Build Tool\).

Before generating the SQL files as we've seen in the previous tutorial, Airbyte sets up a dbt Docker instance and automatically generates a dbt project for us. This is created as specified in the [dbt project documentation page](https://docs.getdbt.com/docs/building-a-dbt-project/projects) with the right credentials for the target destination. The dbt models are then run afterward, thanks to the [dbt CLI](https://docs.getdbt.com/dbt-cli/cli-overview). However, for now, let's run through working with the dbt tool.

### Validate dbt project settings

Let's say we identified our workspace \(as shown in the previous tutorial [Transformations with SQL](transformations-with-sql.md)\), and we have a workspace ID of:

```bash
NORMALIZE_WORKSPACE="5/0/"
```

We can verify that the dbt project is properly configured for that workspace:

```bash
#!/usr/bin/env bash
docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization debug --profiles-dir=. --project-dir=.
```

Example Output:

```text
Running with dbt=0.19.1
dbt version: 0.19.1
python version: 3.8.8
python path: /usr/local/bin/python
os info: Linux-5.10.25-linuxkit-x86_64-with-glibc2.2.5
Using profiles.yml file at ./profiles.yml
Using dbt_project.yml file at /data/5/0/normalize/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: localhost
  port: 3000
  user: postgres
  database: postgres
  schema: quarantine
  search_path: None
  keepalives_idle: 0
  sslmode: None
  Connection test: OK connection ok
```

### Compile and build dbt normalization models

If the previous command does not show any errors or discrepancies, it is now possible to invoke the CLI from within the docker image to trigger transformation processing:

```bash
#!/usr/bin/env bash
docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization run --profiles-dir=. --project-dir=.
```

Example Output:

```text
Running with dbt=0.19.1
Found 4 models, 0 tests, 0 snapshots, 0 analyses, 364 macros, 0 operations, 0 seed files, 1 source, 0 exposures

Concurrency: 32 threads (target='prod')

1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]
1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 35822 in 0.47s]

Finished running 1 table model in 0.74s.

Completed successfully

Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```

### Exporting dbt normalization project outside Airbyte

As seen in the tutorial on [exploring workspace folder](../browsing-output-logs.md), it is possible to browse the `normalize` folder and examine further logs if an error occurs.

In particular, we can also take a look at the dbt models generated by Airbyte and export them to the local host filesystem:

```bash
#!/usr/bin/env bash

TUTORIAL_DIR="$(pwd)/tutorial/"
rm -rf $TUTORIAL_DIR/normalization-files
mkdir -p $TUTORIAL_DIR/normalization-files

docker cp airbyte-server:/tmp/workspace/$NORMALIZE_WORKSPACE/normalize/ $TUTORIAL_DIR/normalization-files

NORMALIZE_DIR=$TUTORIAL_DIR/normalization-files/normalize
cd $NORMALIZE_DIR
cat $NORMALIZE_DIR/models/generated/**/*.sql
```

Example Output:

```text
{{ config(alias="covid_epidemiology_ab1", schema="_airbyte_quarantine", tags=["top-level-intermediate"]) }}
-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
select
    {{ json_extract_scalar('_airbyte_data', ['key']) }} as {{ adapter.quote('key') }},
    {{ json_extract_scalar('_airbyte_data', ['date']) }} as {{ adapter.quote('date') }},
    {{ json_extract_scalar('_airbyte_data', ['new_tested']) }} as new_tested,
    {{ json_extract_scalar('_airbyte_data', ['new_deceased']) }} as new_deceased,
    {{ json_extract_scalar('_airbyte_data', ['total_tested']) }} as total_tested,
    {{ json_extract_scalar('_airbyte_data', ['new_confirmed']) }} as new_confirmed,
    {{ json_extract_scalar('_airbyte_data', ['new_recovered']) }} as new_recovered,
    {{ json_extract_scalar('_airbyte_data', ['total_deceased']) }} as total_deceased,
    {{ json_extract_scalar('_airbyte_data', ['total_confirmed']) }} as total_confirmed,
    {{ json_extract_scalar('_airbyte_data', ['total_recovered']) }} as total_recovered,
    _airbyte_emitted_at
from {{ source('quarantine', '_airbyte_raw_covid_epidemiology') }}
-- covid_epidemiology

{{ config(alias="covid_epidemiology_ab2", schema="_airbyte_quarantine", tags=["top-level-intermediate"]) }}
-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
select
    cast({{ adapter.quote('key') }} as {{ dbt_utils.type_string() }}) as {{ adapter.quote('key') }},
    cast({{ adapter.quote('date') }} as {{ dbt_utils.type_string() }}) as {{ adapter.quote('date') }},
    cast(new_tested as {{ dbt_utils.type_float() }}) as new_tested,
    cast(new_deceased as {{ dbt_utils.type_float() }}) as new_deceased,
    cast(total_tested as {{ dbt_utils.type_float() }}) as total_tested,
    cast(new_confirmed as {{ dbt_utils.type_float() }}) as new_confirmed,
    cast(new_recovered as {{ dbt_utils.type_float() }}) as new_recovered,
    cast(total_deceased as {{ dbt_utils.type_float() }}) as total_deceased,
    cast(total_confirmed as {{ dbt_utils.type_float() }}) as total_confirmed,
    cast(total_recovered as {{ dbt_utils.type_float() }}) as total_recovered,
    _airbyte_emitted_at
from {{ ref('covid_epidemiology_ab1_558') }}
-- covid_epidemiology

{{ config(alias="covid_epidemiology_ab3", schema="_airbyte_quarantine", tags=["top-level-intermediate"]) }}
-- SQL model to build a hash column based on the values of this record
select
    *,
    {{ dbt_utils.surrogate_key([
        adapter.quote('key'),
        adapter.quote('date'),
        'new_tested',
        'new_deceased',
        'total_tested',
        'new_confirmed',
        'new_recovered',
        'total_deceased',
        'total_confirmed',
        'total_recovered',
    ]) }} as _airbyte_covid_epidemiology_hashid
from {{ ref('covid_epidemiology_ab2_558') }}
-- covid_epidemiology

{{ config(alias="covid_epidemiology", schema="quarantine", tags=["top-level"]) }}
-- Final base SQL model
select
    {{ adapter.quote('key') }},
    {{ adapter.quote('date') }},
    new_tested,
    new_deceased,
    total_tested,
    new_confirmed,
    new_recovered,
    total_deceased,
    total_confirmed,
    total_recovered,
    _airbyte_emitted_at,
    _airbyte_covid_epidemiology_hashid
from {{ ref('covid_epidemiology_ab3_558') }}
-- covid_epidemiology from {{ source('quarantine', '_airbyte_raw_covid_epidemiology') }}
```

If you have [dbt installed](https://docs.getdbt.com/dbt-cli/installation/) locally on your machine, you can then view, edit, version, customize, and run the dbt models in your project outside Airbyte syncs.

```bash
#!/usr/bin/env bash

dbt deps --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR
dbt run --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR --full-refresh
```

Example Output:

```text
Running with dbt=0.19.1
Installing https://github.com/fishtown-analytics/dbt-utils.git@0.6.4
Installed from revision 0.6.4

Running with dbt=0.19.1
Found 4 models, 0 tests, 0 snapshots, 0 analyses, 364 macros, 0 operations, 0 seed files, 1 source, 0 exposures

Concurrency: 32 threads (target='prod')

1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]
1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 35822 in 0.44s]

Finished running 1 table model in 0.63s.

Completed successfully

Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```

Now that you've exported the generated normalization models, you can edit and tweak them as necessary.
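
For instance, a small tweak might pin a column to an integer type instead of the generated float. This is a hypothetical edit to one line of the exported model shown above, not part of what Airbyte generates:

```sql
-- hypothetical edit inside the exported covid_epidemiology model:
-- cast the case counter to an integer instead of the generated float type
cast(new_confirmed as integer) as new_confirmed,
```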

If you want to know how to push your modifications back to Airbyte and use your updated dbt project during Airbyte syncs, you can continue with the following [tutorial on importing transformations into Airbyte](transformations-with-airbyte.md)...

@@ -1,326 +0,0 @@

---
products: oss-*
---

# Transformations with SQL (Part 1/3)

:::warning
Normalization and Custom Transformation are deprecated features.
Destinations using Normalization will be replaced by [Typing and Deduping](/using-airbyte/core-concepts/typing-deduping.md).
Custom Transformation will be removed on March 31. For more information, visit [here](https://github.com/airbytehq/airbyte/discussions/34860).
:::

This tutorial will describe how to integrate SQL based transformations with Airbyte syncs using plain SQL queries.

This is the first part of the ELT tutorial series. The second part goes deeper with [Transformations with dbt](transformations-with-dbt.md), and we then wrap up with a third part on [Transformations with Airbyte](transformations-with-airbyte.md).

\(Example outputs are updated with Airbyte version 0.23.0-alpha from May 2021\)

### First transformation step: Normalization

At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".

However, this is actually producing a table in the destination with a JSON blob column... For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.

So, after EL, comes the T \(transformation\), and the first T step that Airbyte actually applies on top of the extracted data is called "Normalization". You can find more information about it [here](../../using-airbyte/core-concepts/basic-normalization.md).

Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.

To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementation underneath:



Anyway, it is possible to short-circuit this process \(no vendor lock-in\) and handle it yourself by turning this option off in the destination settings page.

This could be useful if:

1. You have a use-case not related to analytics that could be handled with data in its raw JSON format.
2. You can implement your own transformer. For example, you could write them in a different language, create them in an analytics engine like Spark, or use a transformation tool such as dbt or Dataform.
3. You want to customize and change how the data is normalized with your own queries.

In order to do so, we will now describe how you can leverage the basic normalization outputs that Airbyte generates to build your own transformations if you don't want to start from scratch.

Note: We will rely on docker commands that we've gone over as part of another [Tutorial on Exploring Docker Volumes](../browsing-output-logs.md).

### \(Optional\) Configure some Covid \(data\) source and Postgres destinations

If you have sources and destinations already set up on your deployment, you can skip to the next section.

For the sake of this tutorial, let's create a source and a destination as an example that we can refer to afterward. We'll be using a file accessible from a public API, so you can easily reproduce this setup:

```text
Here are some examples of public API CSV:
https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv
```



And a local Postgres Database, making sure that "Basic normalization" is enabled:



After setting up the connectors, we can trigger the sync and study the logs:



Notice that the process ran in the `/tmp/workspace/5/0` folder.

### Identify Workspace ID with Normalize steps

If you went through the source/destination setup in the previous section and ran a sync, you were able to identify which workspace was used. Let's define an environment variable to remember this:

```bash
NORMALIZE_WORKSPACE="5/0/"
```

Or, if you want to find any folder where the normalize step was run:

```bash
# find automatically latest workspace where normalization was run
NORMALIZE_WORKSPACE=`docker run --rm -i -v airbyte_workspace:/data busybox find /data -path "*normalize/models*" | sed -E "s;/data/([0-9]+/[0-9]+/)normalize/.*;\1;g" | sort | uniq | tail -n 1`
```

### Export Plain SQL files

Airbyte is internally using a specialized tool for handling transformations called dbt.

The Airbyte Python module reads the `destination_catalog.json` file and generates dbt code responsible for interpreting and transforming the raw data.

The final output of dbt is a set of SQL files that can be run on top of the destination you selected.

Therefore, it is possible to extract these SQL files, modify them, and run them yourself manually outside Airbyte!

You can find these at the following location inside the server's Docker container:

```text
/tmp/workspace/${NORMALIZE_WORKSPACE}/build/run/airbyte_utils/models/generated/airbyte_tables/<schema>/<your_table_name>.sql
```

In order to extract them, you can run:

```bash
#!/usr/bin/env bash
docker cp airbyte-server:/tmp/workspace/${NORMALIZE_WORKSPACE}/build/run/airbyte_utils/models/generated/ models/

find models
```

Example Output:

```text
models/airbyte_tables/quarantine/covid_epidemiology_f11.sql
```

Let's inspect the generated SQL file by running:

```bash
cat models/**/covid_epidemiology*.sql
```

Example Output:

```sql
create table "postgres".quarantine."covid_epidemiology_f11__dbt_tmp"
as (

with __dbt__CTE__covid_epidemiology_ab1_558 as (

-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
select
    jsonb_extract_path_text(_airbyte_data, 'key') as "key",
    jsonb_extract_path_text(_airbyte_data, 'date') as "date",
    jsonb_extract_path_text(_airbyte_data, 'new_tested') as new_tested,
    jsonb_extract_path_text(_airbyte_data, 'new_deceased') as new_deceased,
    jsonb_extract_path_text(_airbyte_data, 'total_tested') as total_tested,
    jsonb_extract_path_text(_airbyte_data, 'new_confirmed') as new_confirmed,
    jsonb_extract_path_text(_airbyte_data, 'new_recovered') as new_recovered,
    jsonb_extract_path_text(_airbyte_data, 'total_deceased') as total_deceased,
    jsonb_extract_path_text(_airbyte_data, 'total_confirmed') as total_confirmed,
    jsonb_extract_path_text(_airbyte_data, 'total_recovered') as total_recovered,
    _airbyte_emitted_at
from "postgres".quarantine._airbyte_raw_covid_epidemiology
-- covid_epidemiology
), __dbt__CTE__covid_epidemiology_ab2_558 as (

-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
select
    cast("key" as varchar) as "key",
    cast("date" as varchar) as "date",
    cast(new_tested as float) as new_tested,
    cast(new_deceased as float) as new_deceased,
    cast(total_tested as float) as total_tested,
    cast(new_confirmed as float) as new_confirmed,
    cast(new_recovered as float) as new_recovered,
    cast(total_deceased as float) as total_deceased,
    cast(total_confirmed as float) as total_confirmed,
    cast(total_recovered as float) as total_recovered,
    _airbyte_emitted_at
from __dbt__CTE__covid_epidemiology_ab1_558
-- covid_epidemiology
), __dbt__CTE__covid_epidemiology_ab3_558 as (

-- SQL model to build a hash column based on the values of this record
select
    *,
    md5(cast(
        coalesce(cast("key" as varchar), '') || '-' ||
        coalesce(cast("date" as varchar), '') || '-' ||
        coalesce(cast(new_tested as varchar), '') || '-' ||
        coalesce(cast(new_deceased as varchar), '') || '-' ||
        coalesce(cast(total_tested as varchar), '') || '-' ||
        coalesce(cast(new_confirmed as varchar), '') || '-' ||
        coalesce(cast(new_recovered as varchar), '') || '-' ||
        coalesce(cast(total_deceased as varchar), '') || '-' ||
        coalesce(cast(total_confirmed as varchar), '') || '-' ||
        coalesce(cast(total_recovered as varchar), '')
    as varchar)) as _airbyte_covid_epidemiology_hashid
from __dbt__CTE__covid_epidemiology_ab2_558
-- covid_epidemiology
)
-- Final base SQL model
select
    "key",
    "date",
    new_tested,
    new_deceased,
    total_tested,
    new_confirmed,
    new_recovered,
    total_deceased,
    total_confirmed,
    total_recovered,
    _airbyte_emitted_at,
    _airbyte_covid_epidemiology_hashid
from __dbt__CTE__covid_epidemiology_ab3_558
-- covid_epidemiology from "postgres".quarantine._airbyte_raw_covid_epidemiology
);
```

#### Simple SQL Query

We could simplify the SQL query by removing some parts that may be unnecessary for your current usage \(such as generating an md5 column; [Why exactly would I want to use that?!](https://blog.getdbt.com/the-most-underutilized-function-in-sql/)\).

It would turn into a simpler query:

```sql
create table "postgres"."public"."covid_epidemiology"
as (
  select
    _airbyte_emitted_at,
    (current_timestamp at time zone 'utc')::timestamp as _airbyte_normalized_at,

    cast(jsonb_extract_path_text("_airbyte_data",'key') as varchar) as "key",
    cast(jsonb_extract_path_text("_airbyte_data",'date') as varchar) as "date",
    cast(jsonb_extract_path_text("_airbyte_data",'new_tested') as float) as new_tested,
    cast(jsonb_extract_path_text("_airbyte_data",'new_deceased') as float) as new_deceased,
    cast(jsonb_extract_path_text("_airbyte_data",'total_tested') as float) as total_tested,
    cast(jsonb_extract_path_text("_airbyte_data",'new_confirmed') as float) as new_confirmed,
    cast(jsonb_extract_path_text("_airbyte_data",'new_recovered') as float) as new_recovered,
    cast(jsonb_extract_path_text("_airbyte_data",'total_deceased') as float) as total_deceased,
    cast(jsonb_extract_path_text("_airbyte_data",'total_confirmed') as float) as total_confirmed,
    cast(jsonb_extract_path_text("_airbyte_data",'total_recovered') as float) as total_recovered
  from "postgres".public._airbyte_raw_covid_epidemiology
);
```

#### Customize SQL Query

Feel free to:

- Rename the columns as you desire
  - avoiding reserved keywords such as `"key"` or `"date"`
- Tweak the column data types if the ones generated by Airbyte are not the ones you favor
  - For example, let's use `Integer` instead of `Float` for the number of Covid cases...
- Add deduplicating logic
  - if you can identify which columns to use as primary keys \(since Airbyte isn't able to detect those automatically yet...\)
  - \(Note: actually I am not even sure if I can tell the proper primary key in this dataset...\)
- Create a View \(or materialized view\) instead of a Table
- etc.

```sql
create view "postgres"."public"."covid_epidemiology" as (
  with parse_json_cte as (
    select
      _airbyte_emitted_at,

      cast(jsonb_extract_path_text("_airbyte_data",'key') as varchar) as id,
      cast(jsonb_extract_path_text("_airbyte_data",'date') as varchar) as updated_at,
      cast(jsonb_extract_path_text("_airbyte_data",'new_tested') as float) as new_tested,
      cast(jsonb_extract_path_text("_airbyte_data",'new_deceased') as float) as new_deceased,
      cast(jsonb_extract_path_text("_airbyte_data",'total_tested') as float) as total_tested,
      cast(jsonb_extract_path_text("_airbyte_data",'new_confirmed') as float) as new_confirmed,
      cast(jsonb_extract_path_text("_airbyte_data",'new_recovered') as float) as new_recovered,
      cast(jsonb_extract_path_text("_airbyte_data",'total_deceased') as float) as total_deceased,
      cast(jsonb_extract_path_text("_airbyte_data",'total_confirmed') as float) as total_confirmed,
      cast(jsonb_extract_path_text("_airbyte_data",'total_recovered') as float) as total_recovered
    from "postgres".public._airbyte_raw_covid_epidemiology
  ),
  cte as (
    select
      *,
      row_number() over (
        partition by id
        order by updated_at desc
      ) as row_num
    from parse_json_cte
  )
  select
    substring(id, 1, 2) as id, -- Probably not the right way to identify the primary key in this dataset...
    updated_at,
    _airbyte_emitted_at,

    case when new_tested = 'NaN' then 0 else cast(new_tested as integer) end as new_tested,
    case when new_deceased = 'NaN' then 0 else cast(new_deceased as integer) end as new_deceased,
    case when total_tested = 'NaN' then 0 else cast(total_tested as integer) end as total_tested,
    case when new_confirmed = 'NaN' then 0 else cast(new_confirmed as integer) end as new_confirmed,
    case when new_recovered = 'NaN' then 0 else cast(new_recovered as integer) end as new_recovered,
    case when total_deceased = 'NaN' then 0 else cast(total_deceased as integer) end as total_deceased,
    case when total_confirmed = 'NaN' then 0 else cast(total_confirmed as integer) end as total_confirmed,
    case when total_recovered = 'NaN' then 0 else cast(total_recovered as integer) end as total_recovered
  from cte
  where row_num = 1
);
```

Then you can run it in your preferred SQL editor or tool!
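
For example, with the local Postgres destination used in this tutorial, the statement could be executed with `psql`. This is a sketch only: the file name is hypothetical, and the host, port, and credentials must match your own destination settings.

```bash
# hypothetical invocation against the tutorial's local Postgres destination
psql "host=localhost port=3000 dbname=postgres user=postgres" -f create_covid_epidemiology_view.sql
```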

If you are familiar with dbt or want to learn more about it, you can continue with the following [tutorial using dbt](transformations-with-dbt.md)...

@@ -1,447 +0,0 @@

---
products: all
---

# Basic Normalization

:::danger

Basic normalization is being removed in favor of [Typing and Deduping](typing-deduping.md), as part of [Destinations V2](/release_notes/upgrading_to_destinations_v2). This page remains as a guide for legacy connectors.

:::

## High-Level Overview

:::info

The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. Information past that can be read for advanced or educational purposes.

:::

For every connection, you can choose between two options:

- Basic Normalization: Airbyte converts the raw JSON blob version of your data to the format of your destination. _Note: Not all destinations support normalization._
- Raw data (no normalization): Airbyte places the JSON blob version of your data in a table called `_airbyte_raw_<stream name>`

When basic normalization is enabled, Airbyte transforms data after the sync in a step called `Basic Normalization`, which structures data from the source into a format appropriate for consumption in the destination. For example, when writing data from a nested, dynamically typed source like a JSON API to a relational destination like Postgres, normalization is the process which un-nests JSON from the source into a relational table format which uses the appropriate column types in the destination.

Without basic normalization, your data will be written to your destination as one data column with a JSON blob that contains all of your data. This is the `_airbyte_raw_` table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible. If an unmodified version of the data exists in the destination, it can be retransformed without needing to sync data again.

If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to create a schema and tables for your data, converting it to the format of your destination. This runs after your sync and may take a long time if you have a large amount of data synced. If you don't enable Basic Normalization, you'll have to transform the JSON data from that column yourself.
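
For instance, on a Postgres destination the raw column can be queried directly. A hedged sketch, using the `cars` stream from the example in the next section:

```sql
-- querying the raw JSON blob yourself (Postgres syntax) when normalization is disabled
SELECT
  _airbyte_data ->> 'make'  AS make,
  _airbyte_data ->> 'model' AS model
FROM _airbyte_raw_cars;
```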

:::note

Typing and Deduping may cause an increase in your destination's compute cost. This cost will vary depending on the amount of data that is transformed and is not related to Airbyte credit usage.

:::

## Example

Basic Normalization uses a fixed set of rules to map a json object from a source to the types and format that are native to the destination. For example, if a source emits data that looks like this:

```javascript
{
  "make": "alfa romeo",
  "model": "4C coupe",
  "horsepower": "247"
}
```

The destination connectors produce the following raw table in the destination database:

```sql
CREATE TABLE "_airbyte_raw_cars" (
    -- metadata added by airbyte
    "_airbyte_ab_id" VARCHAR, -- uuid value assigned by connectors to each row of the data written in the destination.
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE, -- time at which the record was emitted.
    "_airbyte_data" JSONB -- data stored as a Json Blob.
);
```

Then, basic normalization would create the following table:

```sql
CREATE TABLE "cars" (
    "_airbyte_ab_id" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    -- data from source
    "make" VARCHAR,
    "model" VARCHAR,
    "horsepower" INTEGER
);
```

## Normalization metadata columns

You'll notice that some metadata are added to keep track of important information about each record.

- Some are introduced at the destination connector level: these are propagated by the normalization process from the raw table to the final table.
  - `_airbyte_ab_id`: uuid value assigned by connectors to each row of the data written in the destination.
  - `_airbyte_emitted_at`: time at which the record was emitted and recorded by the destination connector.
- Other metadata columns are created at the normalization step:
  - `_airbyte_<table_name>_hashid`: hash value assigned by Airbyte normalization, derived from a hash function of the record data.
  - `_airbyte_normalized_at`: time at which the record was last normalized (useful to track when incremental transformations are performed).

Additional metadata columns can be added on some tables depending on the usage:

- On the Slowly Changing Dimension (SCD) tables:
  - `_airbyte_start_at`: equivalent to the cursor column defined on the table, denotes when the row was first seen
  - `_airbyte_end_at`: denotes until when the row was seen with these particular values. If this column is not NULL, then the record has been updated and is no longer the most up to date one. If NULL, then the row is the latest version for the record.
  - `_airbyte_active_row`: denotes if the row for the record is the latest version or not.
  - `_airbyte_unique_key_scd`: hash of primary keys + cursors used to de-duplicate the scd table.
- On de-duplicated (and SCD) tables:
  - `_airbyte_unique_key`: hash of primary keys used to de-duplicate the final table.
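
As an illustration, these columns let you query only the latest version of each record in an SCD table. A hedged sketch for a hypothetical `users` stream:

```sql
-- hypothetical query: keep only the latest version of each record in the users SCD table
SELECT *
FROM users_scd
WHERE _airbyte_active_row = 1; -- or TRUE, depending on the destination's boolean type
```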

The [normalization rules](#Rules) are _not_ configurable. They are designed to pick a reasonable set of defaults to hit the 80/20 rule of data normalization. We respect that normalization is a detail-oriented problem and that with a fixed set of rules, we cannot normalize your data in such a way that covers all use cases. If this feature does not meet your normalization needs, we always put the full json blob in the destination as well, so that you can parse that object however best meets your use case. We will be adding more advanced normalization functionality shortly. Airbyte is focused on the EL of ELT. If you need a more fully featured tool for transformations, we suggest trying out dbt.

Airbyte places the json blob version of your data in a table called `_airbyte_raw_<stream name>`. If basic normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the datastore. This implementation avoids extra network time and costs.

## Why does Airbyte have Basic Normalization?

At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".

However, this is actually producing a table in the destination with a JSON blob column... For the typical analytics use case, you probably want this json blob normalized so that each field is its own column.

So, after EL, comes the T \(transformation\), and the first T step that Airbyte actually applies on top of the extracted data is called "Normalization".

Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.

To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementation underneath:



In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:

- the Airbyte base-normalization python package to generate dbt SQL model files
- dbt, to compile and execute the models on top of the data in the destinations that support it.

## Destinations that Support Basic Normalization

- [BigQuery](../../integrations/destinations/bigquery.md)
- [MS SQL Server](../../integrations/destinations/mssql.md)
- [MySQL](../../integrations/destinations/mysql.md)
  - The server must support the `WITH` keyword.
  - Requires MySQL >= 8.0 or MariaDB >= 10.2.1.
- [Postgres](../../integrations/destinations/postgres.md)
- [Redshift](../../integrations/destinations/redshift.md)
- [Snowflake](../../integrations/destinations/snowflake.md)

Basic Normalization can be configured while you're setting up your connection, and afterward in the connection's Transformation tab.
Select the option: **Normalized tabular data**.

## Rules

### Typing

Airbyte tracks types using JsonSchema's primitive types. Here is how these types will map onto standard SQL types. Note: The names of the types may differ slightly across different destinations.

Airbyte uses the types described in the catalog to determine the correct type for each column. It does not try to use the values themselves to infer the type.

| JsonSchema Type                        | Resulting Type          | Notes                   |
| :------------------------------------- | :---------------------- | :---------------------- |
| `number`                               | float                   |                         |
| `integer`                              | integer                 |                         |
| `string`                               | string                  |                         |
| `bit`                                  | boolean                 |                         |
| `boolean`                              | boolean                 |                         |
| `string` with format label `date-time` | timestamp with timezone |                         |
| `array`                                | new table               | see [nesting](#Nesting) |
| `object`                               | new table               | see [nesting](#Nesting) |

### Nesting

Basic Normalization attempts to expand any nested arrays or objects it receives into separate tables in order to allow more ergonomic querying of your data.

#### Arrays

Basic Normalization expands arrays into separate tables. For example, if the source provides the following data:

```javascript
{
  "make": "alfa romeo",
  "model": "4C coupe",
  "limited_editions": [
    { "name": "4C spider", "release_year": 2013 },
    { "name": "4C spider italia", "release_year": 2018 }
  ]
}
```

The resulting normalized schema would be:

```sql
CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "limited_editions" (
    "_airbyte_limited_editions_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "name" VARCHAR,
    "release_year" VARCHAR
);
```

If the nested items in the array are not objects, then they are expanded into a string field of comma-separated values, e.g.:

```javascript
{
  "make": "alfa romeo",
  "model": "4C coupe",
  "limited_editions": [ "4C spider", "4C spider italia"]
}
```

The resulting normalized schema would be:

```sql
CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "limited_editions" (
    "_airbyte_limited_editions_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "data" VARCHAR
);
```

#### Objects

In the case of a nested object, e.g.:

```javascript
{
  "make": "alfa romeo",
  "model": "4C coupe",
  "powertrain_specs": { "horsepower": 247, "transmission": "6-speed" }
}
```

The normalized schema would be:

```sql
CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "powertrain_specs" (
    "_airbyte_powertrain_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "horsepower" INTEGER,
    "transmission" VARCHAR
);
```

### Naming Collisions for un-nested objects

When extracting nested objects or arrays, the Basic Normalization process needs to figure out new names for the expanded tables.

For example, consider a `cars` table with a nested column `cars` containing an object whose schema is identical to the parent table:

```javascript
{
  "make": "alfa romeo",
  "model": "4C coupe",
  "cars": [
    { "make": "audi", "model": "A7" },
    { "make" : "lotus" , "model": "elise" },
    { "make" : "chevrolet" , "model": "mustang" }
  ]
}
```

The expanded table would have a conflict in terms of naming since both are named `cars`. To avoid name collisions and ensure a more consistent naming scheme, Basic Normalization chooses the expanded name as follows:

- `cars` for the original parent table
- `cars_da3_cars` for the expanded nested columns, following this naming scheme in 3 parts: `<Json path>_<Hash>_<nested column name>`
  - Json path: The entire json path string with '\_' characters used as delimiters to reach the table that contains the nested column name.
  - Hash: Hash of the entire json path to reach the nested column, reduced to 3 characters. This is to make sure we have a unique name \(in case part of the name gets truncated, see below\).
  - Nested column name: name of the column being expanded into its own table.

By following this strategy, nested columns should "never" collide with other table names. If they do, an exception will probably be thrown, either by the normalization process or by dbt that runs afterward.

```sql
CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);

CREATE TABLE "cars_da3_cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);
```

### Naming limitations & truncation

Note that different destinations have various naming limitations, most commonly on how long names can be. For instance, the Postgres documentation states:

> The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes

Most modern data warehouses have name length limits on the longer side, so this should not affect us that often. Basic Normalization will fall back to the following rules:

1. No truncation if under the destination's character limits

However, in the rare cases where these limits are reached:

1. Truncate only the `Json path` to fit into the destination's character limits
2. Truncate the `Json path` to at least the first 10 characters, then truncate the nested column name starting in the middle to preserve prefix/suffix substrings intact \(whenever a truncate in the middle is made, two '\_\_' characters are also inserted to denote where it happened\) to fit into the destination's character limits

As an example from the hubspot source, we could have the following tables with nested columns:

| Description                                             | Example 1                                                            | Example 2                                                              |
| :------------------------------------------------------ | :------------------------------------------------------------------- | :--------------------------------------------------------------------- |
| Original Stream Name                                    | companies                                                            | deals                                                                  |
| Json path to the nested column                          | `companies/property_engagements_last_meeting_booked_campaign`        | `deals/properties/engagements_last_meeting_booked_medium`              |
| Final table name of expanded nested column on BigQuery  | companies_2e8_property_engag**ements_last_meeting_bo**oked_campaign  | deals_prop**erties**\_6e6_engagements_l**ast_meeting\_**booked_medium  |
| Final table name of expanded nested column on Postgres  | companies_2e8_property_engag**\_\_**oked_campaign                    | deals_prop_6e6_engagements_l**\_\_**booked_medium                      |

As mentioned in the overview:

- Airbyte places the json blob version of your data in a table called `_airbyte_raw_<stream name>`.
- If basic normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`.
- In certain pathological cases, basic normalization is required to generate large models with many columns and multiple intermediate transformation steps for a stream. This may break down the "ephemeral" materialization strategy and require the use of additional intermediate views or tables instead. As a result, you may notice additional temporary tables being generated in the destination to handle these checkpoints.

## UI Configurations

To enable basic normalization \(which is optional\), you can toggle it on or disable it in the "Normalization and Transformation" section when setting up your connection:



## Incremental runs

When the source is configured with sync modes compatible with incremental transformations \(using append on the destination\), such as [full_refresh_append](./sync-modes/full-refresh-append.md), [incremental append](./sync-modes/incremental-append.md) or [incremental deduped history](./sync-modes/incremental-append-deduped.md), only rows that have changed in the source are transferred over the network and written by the destination connector.
Normalization will then try to build the normalized tables incrementally, using the rows in the raw tables that have been created or updated since the last time dbt ran. As such, on each dbt run, the models get built incrementally. This limits the amount of data that needs to be transformed, vastly reducing the runtime of the transformations. This improves warehouse performance and reduces compute costs.
Because normalization can be run either incrementally or in full refresh, a technical column `_airbyte_normalized_at` can serve to track the last time a record was transformed and written by normalization.
This may greatly diverge from the `_airbyte_emitted_at` value, as the normalized tables could be totally re-built at a later time from the data stored in the `_airbyte_raw` tables.
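
Conceptually, the generated incremental models filter the raw table on the emitted timestamp, along the lines of this hedged dbt sketch \(simplified; the real generated models are more involved and the source name here reuses the tutorial's example\):

```sql
-- simplified sketch of an incremental normalization model
{{ config(materialized='incremental') }}
select *
from {{ source('quarantine', '_airbyte_raw_covid_epidemiology') }}
{% if is_incremental() %}
  -- only pick up raw rows emitted since the last normalization run
  where _airbyte_emitted_at >= (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}
```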

## Partitioning, clustering, sorting, indexing

Normalization produces tables that are partitioned, clustered, sorted or indexed depending on the destination engine and on the type of tables being built. The goal of these is to make reads more performant, especially when running incremental updates.

In general, normalization needs to look up the last `emitted_at` column to know whether a record was freshly produced and needs to be incrementally processed or not. But in certain models, such as SCD tables, we also need to retrieve older data to update their type 2 SCD `end_date` and `active_row` flags, so a different partitioning scheme is used to optimize that use case.

On the Postgres destination, an additional table suffixed with `_stg` for every stream replicated in [incremental deduped history](./sync-modes/incremental-append-deduped.md) needs to be persisted \(in a different staging schema\) for incremental transformations to work, because of a [limitation](https://github.com/dbt-labs/docs.getdbt.com/issues/335#issuecomment-694199569).

## Extending Basic Normalization

Note that all the choices made by Normalization as described in this documentation page in terms of naming \(and more\) can be overridden by your own custom choices. To do so, you can follow these tutorials:

- to build a [custom SQL view](../../operator-guides/transformation-and-normalization/transformations-with-sql.md) with your own naming conventions
- to export, edit and run [custom dbt normalization](../../operator-guides/transformation-and-normalization/transformations-with-dbt.md) yourself
- or further, you can configure the use of a custom dbt project within Airbyte by following [this guide](../../operator-guides/transformation-and-normalization/transformations-with-airbyte.md).
## CHANGELOG
|
||||
|
||||
### airbyte-integration/bases/base-normalization
|
||||
|
||||
Note that Basic Normalization is packaged in a docker image `airbyte/normalization`. This image is tied to and released along with a specific Airbyte version. It is not configurable independently like it is possible to do with connectors \(source & destinations\)
|
||||
|
||||
Therefore, in order to "upgrade" to the desired normalization version, you need to use the corresponding Airbyte version that it's being released in:
|
||||
|
||||
| Airbyte Version | Normalization Version | Date | Pull Request | Subject |
| :--- | :--- | :--- | :--- | :--- |
| | 0.4.3 | 2023-05-11 | [\#25993](https://github.com/airbytehq/airbyte/pull/25993) | Fix bug in source-postgres CDC for multiple updates on a single PK in a single transaction (destinations MySQL, MSSQL, TiDB may still be affected in certain cases) |
| | 0.4.2 | 2023-05-03 | [\#25771](https://github.com/airbytehq/airbyte/pull/25771) | Remove old VARCHAR to SUPER migration functionality for destination Redshift |
| | 0.4.1 | 2023-04-26 | [\#25591](https://github.com/airbytehq/airbyte/pull/25591) | Pin MarkupSafe library for Oracle normalization to fix build. |
| | 0.4.0 | 2023-03-23 | [\#22381](https://github.com/airbytehq/airbyte/pull/22381) | Prevent normalization from creating unnecessary duplicates in nested tables. |
| | 0.2.27 | 2023-03-15 | [\#24077](https://github.com/airbytehq/airbyte/pull/24077) | Add more bigquery reserved words |
| | 0.2.26 | 2023-02-15 | [\#19573](https://github.com/airbytehq/airbyte/pull/19573) | Update Clickhouse dbt version to 1.4.0 |
| | 0.3.2 (broken, do not use) | 2023-01-31 | [\#22165](https://github.com/airbytehq/airbyte/pull/22165) | Fix support for non-object top-level schemas |
| | 0.3.1 (broken, do not use) | 2023-01-31 | [\#22161](https://github.com/airbytehq/airbyte/pull/22161) | Fix handling for combined primitive types |
| | 0.3.0 (broken, do not use) | 2023-01-30 | [\#19721](https://github.com/airbytehq/airbyte/pull/19721) | Update normalization to airbyte-protocol v1.0.0 |
| | 0.2.25 | 2022-12-05 | [\#19573](https://github.com/airbytehq/airbyte/pull/19573) | Update Clickhouse dbt version |
| | 0.2.24 | 2022-11-01 | [\#18015](https://github.com/airbytehq/airbyte/pull/18015) | Add a drop table hook that drops \*\_scd tables after overwrite/reset |
| | 0.2.23 | 2022-10-12 | [\#17483](https://github.com/airbytehq/airbyte/pull/17483) (published in [\#17896](https://github.com/airbytehq/airbyte/pull/17896)) | Remove unnecessary `Native Port` config option |
| | 0.2.22 | 2022-09-05 | [\#16339](https://github.com/airbytehq/airbyte/pull/16339) | Update Clickhouse DBT to 1.1.8 |
| | 0.2.21 | 2022-09-09 | [\#15833](https://github.com/airbytehq/airbyte/pull/15833/) | SSH Tunnel: allow using OPENSSH key format (published in [\#16545](https://github.com/airbytehq/airbyte/pull/16545)) |
| | 0.2.20 | 2022-08-30 | [\#15592](https://github.com/airbytehq/airbyte/pull/15592) | Add TiDB support |
| | 0.2.19 | 2022-08-21 | [\#14897](https://github.com/airbytehq/airbyte/pull/14897) | Update Clickhouse DBT to 1.1.7 |
| | 0.2.16 | 2022-08-04 | [\#14295](https://github.com/airbytehq/airbyte/pull/14295) | Fixed SSH tunnel port usage |
| | 0.2.14 | 2022-08-01 | [\#14790](https://github.com/airbytehq/airbyte/pull/14790) | Add and persist job failures for Normalization |
| | 0.2.13 | 2022-07-27 | [\#14683](https://github.com/airbytehq/airbyte/pull/14683) | Quote schema name to allow reserved keywords |
| | 0.2.12 | 2022-07-26 | [\#14362](https://github.com/airbytehq/airbyte/pull/14362) | Handle timezone in date-time format. Parse dates correctly in Clickhouse. |
| | 0.2.11 | 2022-07-26 | [\#13591](https://github.com/airbytehq/airbyte/pull/13591) | Updated support for integer columns. |
| | 0.2.10 | 2022-07-18 | [\#14792](https://github.com/airbytehq/airbyte/pull/14792) | Add support for key pair auth for snowflake |
| | 0.2.9 | 2022-07-06 | [\#14485](https://github.com/airbytehq/airbyte/pull/14485) | BigQuery partition pruning optimization |
| | 0.2.8 | 2022-07-13 | [\#14522](https://github.com/airbytehq/airbyte/pull/14522) | BigQuery replaces `NULL` array entries with the string value `"NULL"` |
| | 0.2.7 | 2022-07-05 | [\#11694](https://github.com/airbytehq/airbyte/pull/11694) | Do not return NULL for MySQL column values > 512 chars |
| | 0.2.6 | 2022-06-16 | [\#13894](https://github.com/airbytehq/airbyte/pull/13894) | Fix incorrect jinja2 macro `json_extract_array` call |
| | 0.2.5 | 2022-06-15 | [\#11470](https://github.com/airbytehq/airbyte/pull/11470) | Upgrade MySQL to dbt 1.0.0 |
| | 0.2.4 | 2022-06-14 | [\#12846](https://github.com/airbytehq/airbyte/pull/12846) | CDC correctly propagates deletions to final tables |
| | 0.2.3 | 2022-06-10 | [\#11204](https://github.com/airbytehq/airbyte/pull/11204) | MySQL: add support for SSH tunneling |
| | 0.2.2 | 2022-06-02 | [\#13289](https://github.com/airbytehq/airbyte/pull/13289) | BigQuery use `json_extract_string_array` for array of simple type elements |
| | 0.2.1 | 2022-05-17 | [\#12924](https://github.com/airbytehq/airbyte/pull/12924) | Fixed `--event-buffer-size` check on old dbt versions that crashed entrypoint.sh |
| | 0.2.0 | 2022-05-15 | [\#12745](https://github.com/airbytehq/airbyte/pull/12745) | Snowflake: add datetime without timezone |
| | 0.1.78 | 2022-05-06 | [\#12305](https://github.com/airbytehq/airbyte/pull/12305) | Mssql: use NVARCHAR and datetime2 by default |
| 0.36.2-alpha | 0.1.77 | 2022-04-19 | [\#12064](https://github.com/airbytehq/airbyte/pull/12064) | Add support for redshift SUPER type |
| 0.35.65-alpha | 0.1.75 | 2022-04-09 | [\#11511](https://github.com/airbytehq/airbyte/pull/11511) | Move DBT modules from `/tmp/dbt_modules` to `/dbt` |
| 0.35.61-alpha | 0.1.74 | 2022-03-24 | [\#10905](https://github.com/airbytehq/airbyte/pull/10905) | Update clickhouse dbt version |
| 0.35.60-alpha | 0.1.73 | 2022-03-25 | [\#11267](https://github.com/airbytehq/airbyte/pull/11267) | Set `--event-buffer-size` to reduce memory usage |
| 0.35.59-alpha | 0.1.72 | 2022-03-24 | [\#11093](https://github.com/airbytehq/airbyte/pull/11093) | Added Snowflake OAuth2.0 support |
| 0.35.53-alpha | 0.1.71 | 2022-03-14 | [\#11077](https://github.com/airbytehq/airbyte/pull/11077) | Enable BigQuery to handle project ID embedded inside dataset ID |
| 0.35.49-alpha | 0.1.70 | 2022-03-11 | [\#11051](https://github.com/airbytehq/airbyte/pull/11051) | Upgrade dbt to 1.0.0 (except for MySQL and Oracle) |
| 0.35.45-alpha | 0.1.69 | 2022-03-04 | [\#10754](https://github.com/airbytehq/airbyte/pull/10754) | Enable Clickhouse normalization over SSL |
| 0.35.32-alpha | 0.1.68 | 2022-02-20 | [\#10485](https://github.com/airbytehq/airbyte/pull/10485) | Fix row size too large for table with numerous `string` fields |
| | 0.1.66 | 2022-02-04 | [\#9341](https://github.com/airbytehq/airbyte/pull/9341) | Fix normalization for bigquery datasetId and tables |
| 0.35.13-alpha | 0.1.65 | 2021-01-28 | [\#9846](https://github.com/airbytehq/airbyte/pull/9846) | Tweak dbt multi-thread parameter down |
| 0.35.12-alpha | 0.1.64 | 2021-01-28 | [\#9793](https://github.com/airbytehq/airbyte/pull/9793) | Support PEM format for ssh-tunnel keys |
| 0.35.4-alpha | 0.1.63 | 2021-01-07 | [\#9301](https://github.com/airbytehq/airbyte/pull/9301) | Fix Snowflake prefix tables starting with numbers |
| | 0.1.62 | 2021-01-07 | [\#9340](https://github.com/airbytehq/airbyte/pull/9340) | Use TCP-port support for clickhouse |
| | 0.1.62 | 2021-01-07 | [\#9063](https://github.com/airbytehq/airbyte/pull/9063) | Change Snowflake-specific materialization settings |
| | 0.1.62 | 2021-01-07 | [\#9317](https://github.com/airbytehq/airbyte/pull/9317) | Fix issue with quoted & case sensitive columns |
| | 0.1.62 | 2021-01-07 | [\#9281](https://github.com/airbytehq/airbyte/pull/9281) | Fix SCD partition by float columns in BigQuery |
| 0.32.11-alpha | 0.1.61 | 2021-12-02 | [\#8394](https://github.com/airbytehq/airbyte/pull/8394) | Fix incremental queries not updating empty tables |
| | 0.1.61 | 2021-12-01 | [\#8378](https://github.com/airbytehq/airbyte/pull/8378) | Fix un-nesting queries and add proper ref hints |
| 0.32.5-alpha | 0.1.60 | 2021-11-22 | [\#8088](https://github.com/airbytehq/airbyte/pull/8088) | Speed-up incremental queries for SCD table on Snowflake |
| 0.30.32-alpha | 0.1.59 | 2021-11-08 | [\#7669](https://github.com/airbytehq/airbyte/pull/7169) | Fix nested incremental dbt |
| 0.30.24-alpha | 0.1.57 | 2021-10-26 | [\#7162](https://github.com/airbytehq/airbyte/pull/7162) | Implement incremental dbt updates |
| 0.30.16-alpha | 0.1.52 | 2021-10-07 | [\#6379](https://github.com/airbytehq/airbyte/pull/6379) | Handle empty string for date and date-time format |
| | 0.1.51 | 2021-10-08 | [\#6799](https://github.com/airbytehq/airbyte/pull/6799) | Added support for ad_cdc_log_pos during normalization |
| | 0.1.50 | 2021-10-07 | [\#6079](https://github.com/airbytehq/airbyte/pull/6079) | Added support for MS SQL Server normalization |
| | 0.1.49 | 2021-10-06 | [\#6709](https://github.com/airbytehq/airbyte/pull/6709) | Forward destination dataset location to dbt profiles |
| 0.29.17-alpha | 0.1.47 | 2021-09-20 | [\#6317](https://github.com/airbytehq/airbyte/pull/6317) | MySQL: updated MySQL normalization to use SSH tunnel |
| | 0.1.45 | 2021-09-18 | [\#6052](https://github.com/airbytehq/airbyte/pull/6052) | Snowflake: accept any date-time format |
| 0.29.8-alpha | 0.1.40 | 2021-08-18 | [\#5433](https://github.com/airbytehq/airbyte/pull/5433) | Allow optional credentials_json for BigQuery |
| 0.29.5-alpha | 0.1.39 | 2021-08-11 | [\#4557](https://github.com/airbytehq/airbyte/pull/4557) | Handle date times and resolve name conflicts between stream/field |
| 0.28.2-alpha | 0.1.38 | 2021-07-28 | [\#5027](https://github.com/airbytehq/airbyte/pull/5027) | Handle quotes in column names when parsing JSON blob |
| 0.27.5-alpha | 0.1.37 | 2021-07-22 | [\#3947](https://github.com/airbytehq/airbyte/pull/4881/) | Handle `NULL` cursor field values when deduping |
| 0.27.2-alpha | 0.1.36 | 2021-07-09 | [\#3947](https://github.com/airbytehq/airbyte/pull/4163/) | Enable normalization for MySQL destination |

@@ -30,7 +30,7 @@ A connection is an automated data pipeline that replicates data from a source to
| [Sync Mode](/using-airbyte/core-concepts/sync-modes/README.md) | How should the streams be replicated (read and written)? |
| [Sync Schedule](/using-airbyte/core-concepts/sync-schedules.md) | When should a data sync be triggered? |
| [Destination Namespace and Stream Prefix](/using-airbyte/core-concepts/namespaces.md) | Where should the replicated data be written? |
| [Schema Propagation](using-airbyte/schema-change-management.md) | How should Airbyte handle schema drift in sources? |

## Stream

@@ -87,22 +87,7 @@ Read more about each [sync mode](/using-airbyte/core-concepts/sync-modes/README.
## Typing and Deduping

Typing and deduping ensures the data emitted from sources is written into the correct type-cast relational columns and only contains unique records. Typing and deduping is only relevant for the following relational database & warehouse destinations:

- Snowflake
- BigQuery

:::info
Typing and Deduping is the default method of transforming datasets within data warehouse and database destinations after they've been replicated. We are retaining documentation about normalization to support legacy destinations.
:::

For more details, see our [Typing & Deduping documentation](/using-airbyte/core-concepts/typing-deduping).

## Basic Normalization

Basic Normalization transforms data after a sync to denest columns into their own tables. Note that normalization is only available for relational database & warehouse destinations that have not yet migrated to Destinations V2, and will eventually be fully deprecated.

For more details, see our [Basic Normalization documentation](/using-airbyte/core-concepts/basic-normalization.md).

Typing and deduping ensures the data emitted from sources is written into the correct type-cast relational columns, and if deduplication is selected, only contains unique records. Typing and deduping is only relevant for relational database & warehouse destinations. For more details, see our [Typing & Deduping documentation](/using-airbyte/core-concepts/typing-deduping).
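
To make the idea concrete, here is a rough sketch of the kind of deduplication this implies, assuming a BigQuery destination; the table and column names are illustrative and this is not the exact SQL Airbyte generates:

```sql
-- Illustrative only: keep the most recently extracted record per primary key.
select * except (row_num)
from (
  select
      *,
      row_number() over (
        partition by id                      -- primary key of the stream
        order by _airbyte_extracted_at desc  -- most recent extraction wins
      ) as row_num
  from my_dataset.users
) as ranked
where row_num = 1;
```
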
## Custom Transformations
@@ -88,8 +88,6 @@
    to: /using-airbyte/core-concepts/sync-modes/incremental-append
  - from: /understanding-airbyte/connections/incremental-append-deduped
    to: /using-airbyte/core-concepts/sync-modes/incremental-append-deduped
  - from: /understanding-airbyte/basic-normalization
    to: /using-airbyte/core-concepts/basic-normalization
  - from: /understanding-airbyte/typing-deduping
    to: /using-airbyte/core-concepts/typing-deduping
  - from:

@@ -516,25 +516,12 @@ module.exports = {
        "using-airbyte/core-concepts/sync-modes/full-refresh-overwrite",
      ],
    },
    {
      type: "category",
      label: "Typing and Deduping",
      link: {
        type: "doc",
        id: "using-airbyte/core-concepts/typing-deduping",
      },
      items: ["using-airbyte/core-concepts/basic-normalization"],
    },
    "using-airbyte/core-concepts/typing-deduping",
    "using-airbyte/schema-change-management",
    {
      type: "category",
      label: "Transformations",
      items: [
        "cloud/managing-airbyte-cloud/dbt-cloud-integration",
        "operator-guides/transformation-and-normalization/transformations-with-sql",
        "operator-guides/transformation-and-normalization/transformations-with-dbt",
        "operator-guides/transformation-and-normalization/transformations-with-airbyte",
      ],
      items: ["cloud/managing-airbyte-cloud/dbt-cloud-integration"],
    },
  ],
},