postgres cdc (#2548)
* spike
* more
* debezium wip
* use oneof for configuration
* iterator wrapping structure
* push current
* working loop
* move capability into source
* hack it into a sharable state
* debezium test runner (#2617)
* CDC Wait for Values (#2618)
* output actual AirbyteMessages for cdc (#2631)
* message conversion
* fmt
* add lsn extraction and comparison (#2613)
* postgres cdc catalog (#2673)
* update cdc catalog
* A
* table selection for cdc (#2690)
* table selection for cdc
* fix broken merge
* also test double quote in name
* Add state management to CDC (#2718)
* CDC: Fix Producer/Consumer State Machine (#2721)
* CDC Postgres Tests (#2777)
* fix postgres cdc image name and run check before reading data (#2785)
* minor postgres cdc fixes
* add test and fix check behavior
* fix
* improve comment
* remove unused props, remove todos, add some more sanity tests (#2791)
* cdc: add offset store tests (#2793)
* clean (#2798)
* postgres cdc docs (#2784)
* cdc docs
* Update docs/integrations/sources/postgres.md

Co-authored-by: Charles <giardina.charles@gmail.com>

* address gcp
* learn too english
* add link
* add more disk space warnings
* add additional cdc use case
* add information on how to find postgresql.conf
* add how to find the file

Co-authored-by: Charles <giardina.charles@gmail.com>

* various merge conflict fixes (#2799)
* cdc standard tests (#2813)
* require cdc users to create publications & update docs (#2818)
* postgres cdc race condition
* working? but different process
* add additional logging to help debug in the future
* everything done except working config
* remove unintended change
* Use oneOf in PG CDC spec (#2827)
* add oneOf configuration for postgres cdc (#2831)
* add oneof configuration for cdc postgres
* fmt

Co-authored-by: Charles <giardina.charles@gmail.com>

* fix test (#2834)
* fix test
* bump version
* add docs on creating replica identities (#2838)
* add docs on creating replica identities
* emphasize danger
* grammar
* bump pg version in source catalog
* generate seed files

Co-authored-by: cgardens <giardina.charles@gmail.com>

@@ -51,8 +51,8 @@ Postgres data types are mapped to the following data types when synchronizing da
| :--- | :--- |
| Full Refresh Sync | Yes |
| Incremental - Append Sync | Yes |
| Replicate Incremental Deletes | Coming soon |
| Logical Replication \(WAL\) | Coming soon |
| Replicate Incremental Deletes | Yes |
| Logical Replication \(WAL\) | Yes |
| SSL Support | Yes |
| SSH Tunnel Connection | Coming soon |

@@ -97,5 +97,106 @@ GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO airbyte;
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO airbyte;
```

#### 3. Set up CDC \(Optional\)

Please read [the section on CDC below](#setting-up-cdc-for-postgres) for more information.

#### 4. That's it!

Your database user should now be ready for use with Airbyte.

## Change Data Capture (CDC) / Logical Replication / WAL Replication

We use [logical replication](https://www.postgresql.org/docs/10/logical-replication.html) of the Postgres write-ahead log (WAL) to incrementally capture deletes using the `pgoutput` plugin.

We do not require installing custom plugins like `wal2json` or `test_decoding`. We use `pgoutput`, which is included in Postgres 10+ by default.

Please read the [CDC docs](../../architecture/cdc.md) for an overview of how Airbyte approaches CDC.

### Should I use CDC for Postgres?

* If you need a record of deletions and can accept the limitations posted below, you should use CDC for Postgres.
* If your data set is small and you just want a snapshot of your table in the destination, consider using Full Refresh replication for your table instead of CDC.
* If the limitations prevent you from using CDC and your goal is to maintain a snapshot of your table in the destination, consider using non-CDC incremental replication and occasionally resetting the data and re-syncing.
* If your table has a primary key but doesn't have a reasonable cursor field for incremental syncing (e.g. `updated_at`), CDC allows you to sync your table incrementally.

### CDC Limitations

* Make sure to read our [CDC docs](../../architecture/cdc.md) to see limitations that impact all databases using CDC replication.
* CDC is only available for Postgres 10+.
* Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot. Instructions on how to set up a replication slot can be found below.
* Log-based replication only works for primary (master) instances of Postgres.
* Using logical replication increases disk space used on the database server. The additional data is stored until it is consumed.
  * We recommend setting frequent syncs for CDC in order to ensure that this data doesn't fill up your disk space.
  * If you stop syncing a CDC-configured Postgres instance to Airbyte, you should delete the replication slot. Otherwise, it may fill up your disk space.
* Our CDC implementation uses at-least-once delivery for all change records, so a change record may be delivered more than once.

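The disk-space warnings above can be checked directly from SQL. The following is a minimal sketch, assuming Postgres 10+ and the example slot name `airbyte_slot` used in the setup steps; substitute the name of your own slot.

```sql
-- Show how much WAL each replication slot is currently holding back.
-- Run this on the primary; pg_current_wal_lsn() is not available on standbys.
SELECT slot_name,
       plugin,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Drop the slot once Airbyte no longer reads from it, so the retained WAL can be recycled.
-- 'airbyte_slot' is the example name; the slot must not be in active use.
SELECT pg_drop_replication_slot('airbyte_slot');
```

A steadily growing `retained_wal` value for an inactive slot is a sign that syncs are not running often enough or that the slot has been abandoned.
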
### Setting up CDC for Postgres

#### Enable logical replication

Follow one of these guides to enable logical replication:

* [Bare Metal, VMs (EC2/GCE/etc), Docker, etc.](#setting-up-cdc-on-bare-metal-vms-ec2gceetc-docker-etc)
* [AWS Postgres RDS or Aurora](#setting-up-cdc-on-aws-postgres-rds-or-aurora)
* [Azure Database for Postgres](#setting-up-cdc-on-azure-database-for-postgres)

#### Add user-level permissions

We recommend using a user specifically for Airbyte's replication so you can minimize access. This Airbyte user needs the `REPLICATION` and `LOGIN` permissions. You can create such a user with `CREATE ROLE <name> REPLICATION LOGIN;` or add the attributes to an existing user with `ALTER ROLE <name> REPLICATION LOGIN;`. Postgres role attributes are not inherited through role membership, so they must be set on the connecting user itself. You still need to make sure the user can connect to the database, use the schema, and `SELECT` from the tables (the same permissions are required for non-CDC incremental syncs and all full refreshes).

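As a sketch of the permission step, assuming the Airbyte user is named `airbyte` as in the `GRANT` examples earlier on this page (the name is an assumption; use your own), the attributes can be set and then verified like this:

```sql
-- Set the attributes directly on the connecting user. Role attributes such as
-- REPLICATION are not inherited through role membership.
-- "airbyte" is an assumed user name.
ALTER USER airbyte REPLICATION LOGIN;

-- Confirm that both attributes are set for that user.
SELECT rolname, rolreplication, rolcanlogin
FROM pg_roles
WHERE rolname = 'airbyte';
```

Both `rolreplication` and `rolcanlogin` should come back as true before you continue.
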
#### Create replication slot

Next, you will need to create a replication slot. Here is the query used to create a replication slot called `airbyte_slot`:

```
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
```

This slot **must** use `pgoutput`.

#### Create publications and replication identities for tables

Create a publication that covers every table you want to replicate with CDC by running `CREATE PUBLICATION airbyte_publication FOR TABLE <tbl1, tbl2, tbl3>;`. The publication name is customizable. For each of these tables, you will also need to run `ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;`. **You cannot run `ALTER`/`UPDATE`/`DELETE` commands on a table between the creation of a publication and adding the replica identity**, so we recommend running `CREATE PUBLICATION` and all relevant `REPLICA IDENTITY` alterations back to back. Please refer to the [Postgres docs](https://www.postgresql.org/docs/10/sql-alterpublication.html) if you need to add or remove tables from your publication in the future.

The UI currently allows selecting any tables for CDC. If a table is selected that is not part of the publication, it will not replicate even though it is selected. If a table is part of the publication but does not have a replica identity, that replica identity will be created automatically on the first run if the Airbyte user has the necessary permissions.

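Put together, the publication step might look like the sketch below. The table names `tbl1` and `tbl2` are placeholders, and `airbyte_publication` is the example name from the text above; the two verification queries are an optional addition, not something Airbyte requires.

```sql
-- Create the publication and set the replica identities back to back.
CREATE PUBLICATION airbyte_publication FOR TABLE tbl1, tbl2;
ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;
ALTER TABLE tbl2 REPLICA IDENTITY DEFAULT;

-- List the tables that are actually part of the publication.
SELECT pubname, schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'airbyte_publication';

-- Check each table's replica identity:
-- 'd' = default (primary key), 'f' = full, 'i' = index, 'n' = nothing.
SELECT relname, relreplident
FROM pg_class
WHERE relname IN ('tbl1', 'tbl2');
```

Any table you plan to select in the UI should appear in the `pg_publication_tables` result, since tables outside the publication will not replicate.
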
#### Start syncing

When configuring the source, select CDC and provide the replication slot and publication you just created. You should be ready to sync data with CDC!

### Setting up CDC on Bare Metal, VMs (EC2/GCE/etc), Docker, etc.

Some settings must be configured in the `postgresql.conf` file for your database. You can find the location of this file using `psql -U postgres -c 'SHOW config_file'` with the correct `psql` credentials specified. Alternatively, a custom file can be specified when running postgres with the `-c` flag. For example, `postgres -c config_file=/etc/postgresql/postgresql.conf` runs Postgres with the config file at `/etc/postgresql/postgresql.conf`.

If you are syncing data from a server using the `postgres` Docker image, you will need to mount a file and change the command to run Postgres with the set config file. If you're just testing CDC behavior, you may want to use a modified version of a [sample `postgresql.conf`](https://github.com/postgres/postgres/blob/master/src/backend/utils/misc/postgresql.conf.sample).

* `wal_level` is the type of coding used within the Postgres write-ahead log. This must be set to `logical` for Airbyte CDC.
* `max_wal_senders` is the maximum number of processes used for handling WAL changes. This must be at least one.
* `max_replication_slots` is the maximum number of replication slots that are allowed to stream WAL changes. This must be at least one if Airbyte is the only service subscribing to WAL changes, and higher if other services are also reading from the WAL.

Here is what these settings would look like in `postgresql.conf`:

```
wal_level = logical
max_wal_senders = 1
max_replication_slots = 1
```

After setting these values you will need to restart your instance.

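After the restart, you can confirm that the new values are in effect. This is an optional check, sketched here with the standard `SHOW` command, run from any `psql` session:

```sql
-- Each statement reports the value the running server is actually using.
SHOW wal_level;              -- should report "logical" for Airbyte CDC
SHOW max_wal_senders;        -- should be at least 1
SHOW max_replication_slots;  -- should be at least 1
```

If `wal_level` still reports a different value, the server was most likely started with a different config file than the one you edited, or it has not been restarted yet.
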
Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres).

### Setting up CDC on AWS Postgres RDS or Aurora

* Go to the `Configuration` tab for your DB cluster.
* Find your cluster parameter group. You will either edit the parameters for this group or create a copy of this parameter group to edit. If you create a copy, you will need to change your cluster's parameter group before restarting.
* Within the parameter group page, search for `rds.logical_replication`. Select this row and click on the `Edit parameters` button. Set this value to `1`.
* Wait for a maintenance window to automatically restart the instance or restart it manually.
* Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres).

### Setting up CDC on Azure Database for Postgres

Use the Azure CLI to enable logical replication and restart the server:

```
az postgres server configuration set --resource-group group --server-name server --name azure.replication_support --value logical
az postgres server restart --resource-group group --name server
```

Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres).

### Setting up CDC on Google CloudSQL

Unfortunately, logical replication is not configurable for Google CloudSQL. You can indicate your support for this feature on the [Google Issue Tracker](https://issuetracker.google.com/issues/120274585).

### Setting up CDC on other platforms

If you use a platform that is not listed above, please consider [contributing to our docs](https://github.com/airbytehq/airbyte/tree/master/docs) and providing setup instructions.