GitBook: [master] 186 pages and 77 assets modified
@@ -1,6 +1,6 @@
|
||||
# Introduction
|
||||
|
||||
[](https://github.com/airbytehq/airbyte/actions/workflows/gradle.yml) [](./LICENSE) [](./LICENSE)
|
||||
[](https://github.com/airbytehq/airbyte/actions/workflows/gradle.yml) [](https://github.com/airbytehq/airbyte/tree/a9b1c6c0420550ad5069aca66c295223e0d05e27/LICENSE/README.md) [](https://github.com/airbytehq/airbyte/tree/a9b1c6c0420550ad5069aca66c295223e0d05e27/LICENSE/README.md)
|
||||
|
||||

|
||||
|
||||
@@ -20,7 +20,7 @@ Airbyte is on a mission to make data integration pipelines a commodity.
|
||||
* **No more security compliance process** to go through as Airbyte is self-hosted.
|
||||
* **No more pricing indexed on volume**, as cloud-based solutions offer.
|
||||
|
||||
Here's a list of our [connectors with their health status](docs/integrations).
|
||||
Here's a list of our [connectors with their health status](docs/integrations/).
|
||||
|
||||
## Quick start
|
||||
|
||||
@@ -52,7 +52,7 @@ If you want to schedule a 20-min call with our team to help you get set up, plea
|
||||
|
||||
We love contributions to Airbyte, big or small.
|
||||
|
||||
See our [Contributing guide](docs/contributing-to-airbyte/) on how to get started. Not sure where to start? We’ve listed some [good first issues](https://github.com/airbytehq/airbyte/labels/good%20first%20issue) to start with. If you have any questions, please open a draft PR or visit our [slack channel](slack.airbyte.io) where the core team can help answer your questions.
|
||||
See our [Contributing guide](docs/contributing-to-airbyte/) on how to get started. Not sure where to start? We’ve listed some [good first issues](https://github.com/airbytehq/airbyte/labels/good%20first%20issue) to start with. If you have any questions, please open a draft PR or visit our [slack channel](https://github.com/airbytehq/airbyte/tree/a9b1c6c0420550ad5069aca66c295223e0d05e27/slack.airbyte.io) where the core team can help answer your questions.
|
||||
|
||||
**Note that you are able to create connectors in any language you want, as Airbyte connectors run as Docker containers.**
|
||||
|
||||
@@ -73,5 +73,5 @@ Check out our [roadmap](docs/project-overview/roadmap.md) to get informed on wha
|
||||
|
||||
## License
|
||||
|
||||
See the [LICENSE](docs/project-overview/licenses/README.md) file for licensing information.
|
||||
See the [LICENSE](docs/project-overview/licenses/) file for licensing information.
|
||||
|
||||
|
||||
|
BIN docs/.gitbook/assets/change-to-per-week (3) (3) (4).png (new file, 681 KiB)
BIN docs/.gitbook/assets/launch (3) (3) (4).png (new file, 243 KiB)
BIN docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4).png (new file, 667 KiB)
BIN docs/.gitbook/assets/postgres_credentials (3) (3) (4).png (new file, 434 KiB)
BIN docs/.gitbook/assets/schema (3) (3) (4).png (new file, 612 KiB)
BIN docs/.gitbook/assets/tableau-dashboard (3) (3) (4).png (new file, 1.1 MiB)
(other image assets under docs/.gitbook/assets/ added or duplicated with unchanged dimensions and sizes)
@@ -8,7 +8,7 @@
|
||||
* [Set up a Connection](quickstart/set-up-a-connection.md)
|
||||
* [Deploying Airbyte](deploying-airbyte/README.md)
|
||||
* [Local Deployment](deploying-airbyte/local-deployment.md)
|
||||
* [On Airbyte Cloud](deploying-airbyte/on-cloud.md)
|
||||
* [On Airbyte Cloud](deploying-airbyte/on-cloud.md)
|
||||
* [On AWS \(EC2\)](deploying-airbyte/on-aws-ec2.md)
|
||||
* [On AWS ECS \(Coming Soon\)](deploying-airbyte/on-aws-ecs.md)
|
||||
* [On Azure\(VM\)](deploying-airbyte/on-azure-vm-cloud-shell.md)
|
||||
@@ -139,14 +139,14 @@
|
||||
* [Databricks](integrations/destinations/databricks.md)
|
||||
* [DynamoDB](integrations/destinations/dynamodb.md)
|
||||
* [Chargify](integrations/destinations/keen.md)
|
||||
* [Google Cloud Storage (GCS)](integrations/destinations/gcs.md)
|
||||
* [Google Cloud Storage \(GCS\)](integrations/destinations/gcs.md)
|
||||
* [Google PubSub](integrations/destinations/pubsub.md)
|
||||
* [Kafka](integrations/destinations/kafka.md)
|
||||
* [Keen](integrations/destinations/keen.md)
|
||||
* [Keen](integrations/destinations/keen-1.md)
|
||||
* [Local CSV](integrations/destinations/local-csv.md)
|
||||
* [Local JSON](integrations/destinations/local-json.md)
|
||||
* [MeiliSearch](integrations/destinations/meilisearch.md)
|
||||
* [MongoDB](integrations/destinations/mongodb.md)
|
||||
* [MongoDB](integrations/destinations/mongodb.md)
|
||||
* [MSSQL](integrations/destinations/mssql.md)
|
||||
* [MySQL](integrations/destinations/mysql.md)
|
||||
* [Oracle DB](integrations/destinations/oracle.md)
|
||||
@@ -179,7 +179,7 @@
|
||||
* [HTTP-API-based Connectors](connector-development/cdk-python/http-streams.md)
|
||||
* [Python Concepts](connector-development/cdk-python/python-concepts.md)
|
||||
* [Stream Slices](connector-development/cdk-python/stream-slices.md)
|
||||
* [Connector Development Kit \(Javascript\)](connector-development/cdk-faros-js/README.md)
|
||||
* [Connector Development Kit \(Javascript\)](connector-development/cdk-faros-js.md)
|
||||
* [Airbyte 101 for Connector Development](connector-development/airbyte101.md)
|
||||
* [Testing Connectors](connector-development/testing-connectors/README.md)
|
||||
* [Source Acceptance Tests Reference](connector-development/testing-connectors/source-acceptance-tests-reference.md)
|
||||
@@ -227,3 +227,4 @@
|
||||
* [On Setting up a New Connection](troubleshooting/new-connection.md)
|
||||
* [On Running a Sync](troubleshooting/running-sync.md)
|
||||
* [On Upgrading](troubleshooting/on-upgrading.md)
|
||||
|
||||
|
||||
@@ -6,13 +6,13 @@ To build a new connector in Java or Python, we provide templates so you don't ne
|
||||
|
||||
**Note: you are not required to maintain the connectors you create.** The goal is that the Airbyte core team and the community help maintain the connector.
|
||||
|
||||
## Python Connector-Development Kit (CDK)
|
||||
## Python Connector-Development Kit \(CDK\)
|
||||
|
||||
You can build a connector very quickly in Python with the [Airbyte CDK](cdk-python/README.md), which generates 75% of the code required for you.
|
||||
You can build a connector very quickly in Python with the [Airbyte CDK](cdk-python/), which generates 75% of the code required for you.
|
||||
|
||||
## TS/JS Connector-Development Kit (Faros AI Airbyte CDK)
|
||||
## TS/JS Connector-Development Kit \(Faros AI Airbyte CDK\)
|
||||
|
||||
You can build a connector in TypeScript/JavaScript with the [Faros AI CDK](./cdk-faros-js/README.md), which generates and bootstraps most of the code required for HTTP Airbyte sources.
|
||||
You can build a connector in TypeScript/JavaScript with the [Faros AI CDK](https://github.com/airbytehq/airbyte/tree/01b905a38385ca514c2d9c07cc44a8f9a48ce762/docs/connector-development/cdk-faros-js/README.md), which generates and bootstraps most of the code required for HTTP Airbyte sources.
|
||||
|
||||
## The Airbyte specification
|
||||
|
||||
@@ -25,7 +25,7 @@ Before building a new connector, review [Airbyte's data protocol specification](
|
||||
To add a new connector you need to:
|
||||
|
||||
1. Implement & Package your connector in an Airbyte Protocol compliant Docker image
|
||||
2. Add integration tests for your connector. At a minimum, all connectors must pass [Airbyte's standard test suite](testing-connectors/README.md), but you can also add your own tests.
|
||||
2. Add integration tests for your connector. At a minimum, all connectors must pass [Airbyte's standard test suite](testing-connectors/), but you can also add your own tests.
|
||||
3. Document how to build & test your connector
|
||||
4. Publish the Docker image containing the connector
|
||||
|
||||
@@ -36,11 +36,13 @@ Each requirement has a subsection below.
|
||||
If you are building a connector in any of the following languages/frameworks, then you're in luck! We provide autogenerated templates to get you started quickly:
|
||||
|
||||
#### Sources
|
||||
|
||||
* **Python Source Connector**
|
||||
* [**Singer**](https://singer.io)**-based Python Source Connector**. [Singer.io](https://singer.io/) is an open source framework with a large community and many available connectors \(known as taps & targets\). To build an Airbyte connector from a Singer tap, wrap the tap in a thin Python package to make it Airbyte Protocol-compatible. See the [Github Connector](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-github-singer) for an example of an Airbyte Connector implemented on top of a Singer tap.
|
||||
* **Generic Connector**: This template provides a basic starting point for any language.
|
||||
|
||||
#### Destinations
|
||||
|
||||
* **Java Destination Connector**
|
||||
* **Python Destination Connector**
|
||||
|
||||
@@ -58,7 +60,7 @@ and choose the relevant template by using the arrow keys. This will generate a n
|
||||
Search the generated directory for "TODO"s and follow them to implement your connector. For more detailed walkthroughs and instructions, follow the relevant tutorial:
|
||||
|
||||
* [Speedrun: Building a HTTP source with the CDK](tutorials/cdk-speedrun.md)
|
||||
* [Building a HTTP source with the CDK](tutorials/cdk-tutorial-python-http)
|
||||
* [Building a HTTP source with the CDK](tutorials/cdk-tutorial-python-http/)
|
||||
* [Building a Python source](tutorials/building-a-python-source.md)
|
||||
* [Building a Python destination](tutorials/building-a-python-destination.md)
|
||||
* [Building a Java destination](tutorials/building-a-java-destination.md)
|
||||
@@ -67,9 +69,9 @@ As you implement your connector, make sure to review the [Best Practices for Con
|
||||
|
||||
### 2. Integration tests
|
||||
|
||||
At a minimum, your connector must implement the acceptance tests described in [Testing Connectors](testing-connectors/README.md)
|
||||
At a minimum, your connector must implement the acceptance tests described in [Testing Connectors](testing-connectors/)
|
||||
|
||||
**Note: Acceptance tests are not yet available for Python destination connectors. Coming [soon](https://github.com/airbytehq/airbyte/issues/4698)!**
|
||||
**Note: Acceptance tests are not yet available for Python destination connectors. Coming** [**soon**](https://github.com/airbytehq/airbyte/issues/4698)**!**
|
||||
|
||||
### 3. Document building & testing your connector
|
||||
|
||||
@@ -88,10 +90,12 @@ When you submit a PR to Airbyte with your connector, the reviewer will use the c
|
||||
2. `:airbyte-integrations:connectors:source-<name>:integrationTest` should run integration tests including Airbyte's Standard test suite.
|
||||
|
||||
### 4. Publish the connector
|
||||
Typically this will be handled as part of code review by an Airbyter. The section below describes the steps needed to publish a connector and will mostly be used by Airbyte employees publishing connectors.
|
||||
|
||||
Typically this will be handled as part of code review by an Airbyter. The section below describes the steps needed to publish a connector and will mostly be used by Airbyte employees publishing connectors.
|
||||
|
||||
## Updating an existing connector
|
||||
The steps for updating an existing connector are the same as for building a new connector minus the need to use the autogenerator to create a new connector. Therefore the steps are:
|
||||
|
||||
The steps for updating an existing connector are the same as for building a new connector minus the need to use the autogenerator to create a new connector. Therefore the steps are:
|
||||
|
||||
1. Iterate on the connector to make the needed changes
|
||||
2. Run tests
|
||||
@@ -100,7 +104,7 @@ The steps for updating an existing connector are the same as for building a new
|
||||
|
||||
## Publishing a connector
|
||||
|
||||
Once you've finished iterating on the changes to a connector as specified in its `README.md`, follow these instructions to ship the new version of the connector with Airbyte out of the box.
|
||||
Once you've finished iterating on the changes to a connector as specified in its `README.md`, follow these instructions to ship the new version of the connector with Airbyte out of the box.
|
||||
|
||||
1. Bump the version in the `Dockerfile` of the connector \(`LABEL io.airbyte.version=X.X.X`\).
|
||||
2. Update the connector definition in the Airbyte connector index to use the new version:
|
||||
@@ -125,6 +129,7 @@ Once you've finished iterating on the changes to a connector as specified in its
|
||||
6. The new version of the connector is now available for everyone who uses it. Thank you!
|
||||
|
||||
## Using credentials in CI
|
||||
|
||||
In order to run integration tests in CI, you'll often need to inject credentials into CI. There are a few steps for doing this:
|
||||
|
||||
1. **Place the credentials into Lastpass**: Airbyte uses a shared Lastpass account as the source of truth for all secrets. Place the credentials **exactly as they should be used by the connector** into a secure note i.e: it should basically be a copy paste of the `config.json` passed into a connector via the `--config` flag. We use the following naming pattern: `<source OR destination> <name> creds` e.g: `source google adwords creds` or `destination snowflake creds`.
|
||||
@@ -132,3 +137,4 @@ In order to run integration tests in CI, you'll often need to inject credentials
|
||||
3. **Inject the credentials into test and publish CI workflows**: edit the files `.github/workflows/publish-command.yml` and `.github/workflows/test-command.yml` to inject the secret into the CI run. This will make these secrets available to the `/test` and `/publish` commands.
|
||||
4. **During CI, write the secret from env variables to the connector directory**: edit `tools/bin/ci_credentials.sh` to write the secret into the `secrets/` directory of the relevant connector.
|
||||
5. That should be it.
|
||||
|
||||
|
||||
@@ -2,5 +2,5 @@
|
||||
|
||||
## The Airbyte Catalog
|
||||
|
||||
The Airbyte catalog defines the relationship between your incoming data's schema and the schema of your output stream. This
|
||||
is an incredibly important concept to understand as a connector dev, so check out the AirbyteCatalog [here](../understanding-airbyte/beginners-guide-to-catalog.md).
|
||||
The Airbyte catalog defines the relationship between your incoming data's schema and the schema of your output stream. This is an incredibly important concept to understand as a connector dev, so check out the AirbyteCatalog [here](../understanding-airbyte/beginners-guide-to-catalog.md).
|
||||
|
||||
|
||||
@@ -48,3 +48,4 @@ When reviewing connectors, we'll use the following "checklist" to verify whether
|
||||
### Rate Limiting
|
||||
|
||||
Most APIs enforce rate limits. Your connector should gracefully handle those \(i.e: without failing the connector process\). The most common way to handle rate limits is to implement backoff.
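
As an illustration, here is a minimal sketch of backoff handling with the Python CDK's `HttpStream`; the API URL, the `items` endpoint, and the `Retry-After` header are assumptions for this example, not part of any real connector:

```python
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class RateLimitAwareStream(HttpStream):
    url_base = "https://api.example.com/"  # hypothetical API
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "items"  # hypothetical endpoint

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # pagination omitted for brevity

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json().get("items", [])

    def should_retry(self, response: requests.Response) -> bool:
        # Treat 429 (rate limited) and transient 5xx responses as retryable
        # instead of failing the connector process.
        return response.status_code == 429 or 500 <= response.status_code < 600

    def backoff_time(self, response: requests.Response) -> Optional[float]:
        # Honor the server's Retry-After header when present; returning None
        # falls back to the CDK's default exponential backoff.
        retry_after = response.headers.get("Retry-After")
        return float(retry_after) if retry_after else None
```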
|
||||
|
||||
|
||||
@@ -1,11 +1,12 @@
|
||||
# Connector Development Kit (TypeScript/JavaScript)
|
||||
# Connector Development Kit \(Javascript\)
|
||||
|
||||
The [Faros AI TypeScript/JavaScript CDK](https://github.com/faros-ai/airbyte-connectors/tree/main/faros-airbyte-cdk) allows you to build Airbyte connectors quickly, similar to how our [Python CDK](../cdk-python) does. This CDK currently offers support for creating Airbyte source connectors for:
|
||||
The [Faros AI TypeScript/JavaScript CDK](https://github.com/faros-ai/airbyte-connectors/tree/main/faros-airbyte-cdk) allows you to build Airbyte connectors quickly, similar to how our [Python CDK](cdk-python/) does. This CDK currently offers support for creating Airbyte source connectors for:
|
||||
|
||||
- HTTP APIs
|
||||
* HTTP APIs
|
||||
|
||||
## Resources
|
||||
|
||||
[This document](https://github.com/faros-ai/airbyte-connectors/blob/main/sources/README.md) is the main guide for developing an Airbyte source with the Faros CDK.
|
||||
|
||||
An example of a source built with the Faros AI CDK can be found [here](https://github.com/faros-ai/airbyte-connectors/tree/main/sources/example-source). It's recommended that you follow along with the example source while building for the first time.
|
||||
An example of a source built with the Faros AI CDK can be found [here](https://github.com/faros-ai/airbyte-connectors/tree/main/sources/example-source). It's recommended that you follow along with the example source while building for the first time.
|
||||
|
||||
@@ -10,7 +10,7 @@ The CDK provides an improved developer experience by providing basic implementat
|
||||
|
||||
This document is a general introduction to the CDK. Readers should have basic familiarity with the [Airbyte Specification](https://docs.airbyte.io/architecture/airbyte-specification) before proceeding.
|
||||
|
||||
If you have any issues with troubleshooting or want to learn more about the CDK from the Airbyte team, head to the #connector-development channel in [our Slack](https://airbytehq.slack.com/ssb/redirect) to inquire further!
|
||||
If you have any issues with troubleshooting or want to learn more about the CDK from the Airbyte team, head to the \#connector-development channel in [our Slack](https://airbytehq.slack.com/ssb/redirect) to inquire further!
|
||||
|
||||
## Getting Started
|
||||
|
||||
@@ -29,23 +29,23 @@ Additionally, you can follow [this tutorial](https://docs.airbyte.io/connector-d
|
||||
|
||||
#### Basic Concepts
|
||||
|
||||
If you want to learn more about the classes required to implement an Airbyte Source, head to our [basic concepts doc](./basic-concepts.md).
|
||||
If you want to learn more about the classes required to implement an Airbyte Source, head to our [basic concepts doc](basic-concepts.md).
|
||||
|
||||
#### Full Refresh Streams
|
||||
|
||||
If you have questions or are running into issues creating your first full refresh stream, head over to our [full refresh stream doc](./full-refresh-stream.md). If you have questions about implementing a `path` or `parse_response` function, this doc is for you.
|
||||
If you have questions or are running into issues creating your first full refresh stream, head over to our [full refresh stream doc](full-refresh-stream.md). If you have questions about implementing a `path` or `parse_response` function, this doc is for you.
|
||||
|
||||
#### Incremental Streams
|
||||
|
||||
Having trouble figuring out how to write a `stream_slices` function or aren't sure what a `cursor_field` is? Head to our [incremental stream doc](./incremental-stream.md).
|
||||
Having trouble figuring out how to write a `stream_slices` function or aren't sure what a `cursor_field` is? Head to our [incremental stream doc](incremental-stream.md).
|
||||
|
||||
#### Practical Tips
|
||||
|
||||
Airbyte recommends using the CDK template generator to develop with the CDK. The template generator creates all the required scaffolding, with convenient TODOs, allowing developers to focus on implementing the API.
|
||||
|
||||
For tips on useful Python knowledge, see the [Python Concepts](./python-concepts.md) page.
|
||||
For tips on useful Python knowledge, see the [Python Concepts](python-concepts.md) page.
|
||||
|
||||
You can find a complete tutorial for implementing an HTTP source connector in [this tutorial](../tutorials/cdk-tutorial-python-http)
|
||||
You can find a complete tutorial for implementing an HTTP source connector in [this tutorial](../tutorials/cdk-tutorial-python-http/)
|
||||
|
||||
### Example Connectors
|
||||
|
||||
|
||||
@@ -46,7 +46,7 @@ Note that while this is the most flexible way to implement a source connector, i
|
||||
|
||||
An `AbstractSource` also owns a set of `Stream`s. This is populated via the `AbstractSource`'s `streams` [function](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L63). `Discover` and `Read` rely on this populated set.
|
||||
|
||||
`Discover` returns an `AirbyteCatalog` representing all the distinct resources the underlying API supports. Here is the [entrypoint](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L74) for those interested in reading the code. See [schemas](schemas.md) for more information on how to declare the schema of a stream.
|
||||
`Discover` returns an `AirbyteCatalog` representing all the distinct resources the underlying API supports. Here is the [entrypoint](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L74) for those interested in reading the code. See [schemas](https://github.com/airbytehq/airbyte/tree/21116cad97f744f936e503f9af5a59ed3ac59c38/docs/contributing-to-airbyte/python/concepts/schemas.md) for more information on how to declare the schema of a stream.
|
||||
|
||||
`Read` creates an in-memory stream reading from each of the `AbstractSource`'s streams. Here is the [entrypoint](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L90) for those interested.
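
To make this flow concrete, here is a minimal, hypothetical sketch of an `AbstractSource` with a single stream; the class names, the `api_key` config key, and the placeholder `read_records` are illustrative assumptions rather than a real connector:

```python
from typing import Any, Iterable, List, Mapping, Optional, Tuple

from airbyte_cdk.sources import AbstractSource
from airbyte_cdk.sources.streams import Stream


class InvoicesStream(Stream):
    primary_key = "id"

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
        # A real stream would call the underlying API here.
        return iter([])


class SourceExampleApi(AbstractSource):
    def check_connection(self, logger, config: Mapping[str, Any]) -> Tuple[bool, Optional[Any]]:
        # Backs the Check operation: verify the config before any sync runs.
        if not config.get("api_key"):
            return False, "api_key is required"
        return True, None

    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
        # Discover and Read both operate on the streams returned here.
        return [InvoicesStream()]
```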
|
||||
|
||||
|
||||
@@ -8,7 +8,7 @@ Several new pieces are essential to understand how incrementality works with the
|
||||
* cursor fields
|
||||
* `Stream.get_updated_state`
|
||||
|
||||
as well as a few other optional concepts.
|
||||
as well as a few other optional concepts.
|
||||
|
||||
### `AirbyteStateMessage`
|
||||
|
||||
@@ -26,23 +26,22 @@ In the context of the CDK, setting the `Stream.cursor_field` property to any tru
|
||||
|
||||
This function helps the stream keep track of the latest state by inspecting every record output by the stream \(as returned by the `Stream.read_records` method\) and comparing it against the most recent state object. This allows sync to resume from where the previous sync last stopped, regardless of success or failure. This function typically compares the state object's and the latest record's cursor field, picking the latest one.
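
As a rough sketch of how `cursor_field` and `get_updated_state` fit together, assuming an `updated_at` field whose ISO-8601 timestamps can be compared as strings (both assumptions for illustration), and with `read_records` left empty:

```python
from typing import Any, Iterable, Mapping, MutableMapping

from airbyte_cdk.sources.streams import Stream


class ProjectsStream(Stream):
    primary_key = "id"
    cursor_field = "updated_at"  # a truthy value marks the stream as incremental

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
        # A real implementation would request records newer than stream_state.
        return iter([])

    def get_updated_state(
        self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]
    ) -> Mapping[str, Any]:
        # Compare the saved cursor with the latest record's cursor and keep the
        # most recent one, so an interrupted sync can resume where it stopped.
        previous = current_stream_state.get(self.cursor_field, "")
        latest = latest_record.get(self.cursor_field, "")
        return {self.cursor_field: max(previous, latest)}
```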
|
||||
|
||||
|
||||
## Checkpointing state
|
||||
|
||||
There are two ways of checkpointing state (i.e: controlling when state is saved) while reading data from a connector:
|
||||
There are two ways of checkpointing state \(i.e: controlling when state is saved\) while reading data from a connector:
|
||||
|
||||
1. Interval-based checkpointing
|
||||
2. Stream Slices
|
||||
|
||||
|
||||
### Interval based checkpointing
|
||||
This is the simplest method for checkpointing. When the interval is set to a truthy value e.g: 100, then state is persisted after every 100 records output by the connector e.g: state is saved after reading 100 records, then 200, 300, etc..
|
||||
|
||||
While this is very simple, **it requires that records are output in ascending order with regard to the cursor field**. For example, if your stream outputs records in ascending order of the `updated_at` field, then this is a good fit for your use case. But if the stream outputs records in a random order, then you cannot use this method because we can only be certain that we read records after a particular `updated_at` timestamp once all records have been fully read.
|
||||
This is the simplest method for checkpointing. When the interval is set to a truthy value e.g: 100, then state is persisted after every 100 records output by the connector e.g: state is saved after reading 100 records, then 200, 300, etc..
|
||||
|
||||
Interval based checkpointing can be implemented by setting the `Stream.state_checkpoint_interval` property e.g:
|
||||
While this is very simple, **it requires that records are output in ascending order with regard to the cursor field**. For example, if your stream outputs records in ascending order of the `updated_at` field, then this is a good fit for your use case. But if the stream outputs records in a random order, then you cannot use this method because we can only be certain that we read records after a particular `updated_at` timestamp once all records have been fully read.
|
||||
|
||||
```
|
||||
Interval based checkpointing can be implemented by setting the `Stream.state_checkpoint_interval` property e.g:
|
||||
|
||||
```text
|
||||
class MyAmazingStream(Stream):
|
||||
# Save the state every 100 records
|
||||
state_checkpoint_interval = 100
|
||||
@@ -58,7 +57,7 @@ A Slice object is not typed, and the developer is free to include any informatio
|
||||
|
||||
As an example, suppose an API is able to dispense data hourly. If the last sync was exactly 24 hours ago, we can either make an API call retrieving all data at once, or make 24 calls each retrieving an hour's worth of data. In the latter case, the `stream_slices` function sees that the previous state contains yesterday's timestamp and returns a list of 24 Slices, each with a different hourly timestamp to be used when creating requests. If the stream fails halfway through \(at the 12th slice\), then the next time it starts reading, it will read from the beginning of the 12th slice.
|
||||
|
||||
For a more in-depth description of stream slicing, see the [Stream Slices guide](stream-slices.md).
|
||||
For a more in-depth description of stream slicing, see the [Stream Slices guide](https://github.com/airbytehq/airbyte/tree/8500fef4133d3d06e16e8b600d65ebf2c58afefd/docs/connector-development/cdk-python/stream-slices.md).
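
A hedged sketch of the hourly slicing described above is shown below; the `start_time` slice key and the 24-hour default window are assumptions, and in a real connector this logic would live in the stream's `stream_slices` method:

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, Mapping, Optional


def hourly_slices(stream_state: Optional[Mapping[str, Any]]) -> Iterable[Mapping[str, Any]]:
    """Yield one slice per hour between the saved cursor and now."""
    now = datetime.now(timezone.utc)
    saved = stream_state.get("start_time") if stream_state else None
    start = datetime.fromisoformat(saved) if saved else now - timedelta(hours=24)
    while start < now:
        # Each emitted slice is later handed to read_records as stream_slice;
        # state is checkpointed after every slice, so a failure on the 12th
        # slice resumes from the beginning of that slice.
        yield {"start_time": start.isoformat()}
        start += timedelta(hours=1)
```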
|
||||
|
||||
## Conclusion
|
||||
|
||||
|
||||
@@ -1,24 +1,30 @@
|
||||
# Defining your stream schemas
|
||||
Your connector must describe the schema of each stream it can output using [JSONSchema](https://json-schema.org).
|
||||
# Defining Stream Schemas
|
||||
|
||||
Your connector must describe the schema of each stream it can output using [JSONSchema](https://json-schema.org).
|
||||
|
||||
The simplest way to do this is to describe the schema of your streams using one `.json` file per stream. You can also dynamically generate the schema of your stream in code, or you can combine both approaches: start with a `.json` file and dynamically add properties to it.
|
||||
|
||||
The simplest way to do this is to describe the schema of your streams using one `.json` file per stream. You can also dynamically generate the schema of your stream in code, or you can combine both approaches: start with a `.json` file and dynamically add properties to it.
|
||||
|
||||
The schema of a stream is the return value of `Stream.get_json_schema`.
|
||||
|
||||
|
||||
## Static schemas
|
||||
|
||||
By default, `Stream.get_json_schema` reads a `.json` file in the `schemas/` directory whose name is equal to the value of the `Stream.name` property. In turn `Stream.name` by default returns the name of the class in snake case. Therefore, if you have a class `class EmployeeBenefits(HttpStream)` the default behavior will look for a file called `schemas/employee_benefits.json`. You can override any of these behaviors as you need.
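
For example, under this default mapping (the stream name and file path below are hypothetical, and `read_records` is left as a placeholder):

```python
from airbyte_cdk.sources.streams import Stream


class EmployeeBenefits(Stream):
    # With no overrides, self.name == "employee_benefits", so the default
    # get_json_schema() reads schemas/employee_benefits.json from the connector
    # package. Override `name` or `get_json_schema` to change either side.
    primary_key = "id"

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None):
        return iter([])
```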
|
||||
|
||||
Important note: any objects referenced via `$ref` should be placed in the `shared/` directory in their own `.json` files.
|
||||
|
||||
### Generating schemas from OpenAPI definitions
|
||||
|
||||
If you are implementing a connector to pull data from an API which publishes an [OpenAPI/Swagger spec](https://swagger.io/specification/), you can use a tool we've provided for generating JSON schemas from the OpenAPI definition file. Detailed information can be found [here](https://github.com/airbytehq/airbyte/tree/master/tools/openapi2jsonschema/).
|
||||
|
||||
|
||||
## Dynamic schemas
|
||||
|
||||
If you'd rather define your schema in code, override `Stream.get_json_schema` in your stream class to return a `dict` describing the schema using [JSONSchema](https://json-schema.org).
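
For instance, a hypothetical stream could return its schema directly (the field names below are invented for illustration):

```python
def get_json_schema(self):
    # Hand-written JSONSchema, used instead of a schemas/<stream name>.json file.
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "updated_at": {"type": "string", "format": "date-time"},
        },
    }
```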
|
||||
|
||||
## Dynamically modifying static schemas
|
||||
Override `Stream.get_json_schema` to run the default behavior, edit the returned value, then return the edited value:
|
||||
```
|
||||
## Dynamically modifying static schemas
|
||||
|
||||
Override `Stream.get_json_schema` to run the default behavior, edit the returned value, then return the edited value:
|
||||
|
||||
```text
|
||||
def get_json_schema(self):
|
||||
schema = super().get_json_schema()
|
||||
schema['dynamically_determined_property'] = "property"
|
||||
@@ -27,11 +33,12 @@ def get_json_schema(self):
|
||||
|
||||
## Type transformation
|
||||
|
||||
It is important to ensure output data conforms to the declared json schema. This is because the destination receiving this data to load into tables may strictly enforce the schema (e.g. when data is stored in a SQL database, you can't put a CHAR value into an INTEGER column). In the case of changes to API output (which is almost guaranteed to happen over time) or a minor mistake in the jsonschema definition, data syncs could thus break because of mismatched datatype schemas.
|
||||
It is important to ensure output data conforms to the declared json schema. This is because the destination receiving this data to load into tables may strictly enforce the schema \(e.g. when data is stored in a SQL database, you can't put a CHAR value into an INTEGER column\). In the case of changes to API output \(which is almost guaranteed to happen over time\) or a minor mistake in the jsonschema definition, data syncs could thus break because of mismatched datatype schemas.
|
||||
|
||||
To remain robust in operation, the CDK provides a transformation ability to perform automatic object mutation to align with the desired schema before outputting to the destination. All streams that inherit from the airbyte_cdk.sources.streams.core.Stream class have this transform configuration available. It is _disabled_ by default and can be configured per stream within a source connector.
|
||||
To remain robust in operation, the CDK provides a transformation ability to perform automatic object mutation to align with the desired schema before outputting to the destination. All streams that inherit from the airbyte_cdk.sources.streams.core.Stream class have this transform configuration available. It is _disabled_ by default and can be configured per stream within a source connector.
|
||||
|
||||
### Default type transformation
|
||||
|
||||
Here's how you can configure the TypeTransformer:
|
||||
|
||||
```python
|
||||
@@ -43,26 +50,35 @@ class MyStream(Stream):
|
||||
transformer = Transformer(TransformConfig.DefaultSchemaNormalization)
|
||||
...
|
||||
```
|
||||
|
||||
In this case default transformation will be applied. For example if you have schema like this
|
||||
```json
|
||||
|
||||
```javascript
|
||||
{"type": "object", "properties": {"value": {"type": "string"}}}
|
||||
```
|
||||
|
||||
and the source API returned an object with a non-string type, it would be cast to a string automatically:
|
||||
```json
|
||||
|
||||
```javascript
|
||||
{"value": 12} -> {"value": "12"}
|
||||
```
|
||||
|
||||
Also it works on complex types:
|
||||
```json
|
||||
|
||||
```javascript
|
||||
{"value": {"unexpected_object": "value"}} -> {"value": "{'unexpected_object': 'value'}"}
|
||||
```
|
||||
|
||||
It also works on objects inside arrays or referenced by the $ref attribute.
|
||||
|
||||
If the value cannot be cast (e.g. the string "asdf" cannot be cast to an integer), the field retains its original value. Schema type transformation supports all jsonschema types, nested objects/arrays and reference types. Types described as an array of more than one type (except "null") and types under the oneOf/anyOf keyword won't be transformed.
|
||||
If the value cannot be cast \(e.g. the string "asdf" cannot be cast to an integer\), the field retains its original value. Schema type transformation supports all jsonschema types, nested objects/arrays and reference types. Types described as an array of more than one type \(except "null"\) and types under the oneOf/anyOf keyword won't be transformed.
|
||||
|
||||
*Note:* This transformation is done by the source, not the stream itself. I.e. if you have overridden the "read_records" method in your stream, it won't affect object transformation. All transformations are done in place by modifying the output object before passing it to the "get_updated_state" method, so "get_updated_state" receives the transformed object.
|
||||
_Note:_ This transformation is done by the source, not the stream itself. I.e. if you have overridden the "read\_records" method in your stream, it won't affect object transformation. All transformations are done in place by modifying the output object before passing it to the "get\_updated\_state" method, so "get\_updated\_state" receives the transformed object.
|
||||
|
||||
### Custom schema type transformation
|
||||
|
||||
Default schema type transformation performs simple type casting. Sometimes you want to perform a more sophisticated transform, like making a "date-time" field compliant with the RFC 3339 standard. In this case you can use a custom schema type transformation:
|
||||
|
||||
```python
|
||||
class MyStream(Stream):
|
||||
...
|
||||
@@ -74,27 +90,34 @@ class MyStream(Stream):
|
||||
# transformed_value = ...
|
||||
return transformed_value
|
||||
```
|
||||
Where original_value is initial field value and field_schema is part of jsonschema describing field type. For schema
|
||||
```json
|
||||
|
||||
Where original\_value is initial field value and field\_schema is part of jsonschema describing field type. For schema
|
||||
|
||||
```javascript
|
||||
{"type": "object", "properties": {"value": {"type": "string", "format": "date-time"}}}
|
||||
```
|
||||
field_schema variable would be equal to
|
||||
```json
|
||||
|
||||
field\_schema variable would be equal to
|
||||
|
||||
```javascript
|
||||
{"type": "string", "format": "date-time"}
|
||||
```
|
||||
|
||||
In this case the default transformation is skipped and only the custom transformation applies. If you want to run both the default and the custom transformation, you can configure the transformer object by combining config flags:
|
||||
|
||||
```python
|
||||
transformer = Transformer(TransformConfig.DefaultSchemaNormalization | TransformConfig.CustomSchemaNormalization)
|
||||
```
|
||||
|
||||
In this case the custom transformation will be applied after the default type transformation. Note that the order of the flags doesn't matter: the default transformation will always run before the custom one.
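
Putting the pieces together, a complete version of this pattern might look like the sketch below. It assumes the transformer class is importable as `TypeTransformer` from `airbyte_cdk.sources.utils.transform` and that its `registerCustomTransform` decorator is available, and it uses a deliberately naive date-time cleanup as the custom callback:

```python
from typing import Any, Dict

from airbyte_cdk.sources.streams import Stream
from airbyte_cdk.sources.utils.transform import TransformConfig, TypeTransformer


class MyStream(Stream):
    primary_key = "id"
    # Run the default casting first, then the custom callback registered below.
    transformer = TypeTransformer(TransformConfig.DefaultSchemaNormalization | TransformConfig.CustomSchemaNormalization)

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None):
        return iter([])

    @transformer.registerCustomTransform
    def transform_function(original_value: Any, field_schema: Dict[str, Any]) -> Any:
        # Only touch date-time fields; everything else keeps the default behavior.
        if field_schema.get("format") == "date-time" and isinstance(original_value, str):
            return original_value.replace(" ", "T")  # naive rfc3339-style cleanup
        return original_value
```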
|
||||
|
||||
### Performance consideration
|
||||
|
||||
Transforming each object on the fly adds some processing time per object. This time depends on the object/schema complexity and the hardware configuration.
|
||||
Transforming each object on the fly adds some processing time per object. This time depends on the object/schema complexity and the hardware configuration.
|
||||
|
||||
Here is a performance benchmark we ran with the ads_insights facebook schema (a complex schema with objects nested inside arrays of objects and a lot of references) and an example object.
|
||||
Here is the average transform time per single object, in seconds:
|
||||
```
|
||||
Here is a performance benchmark we ran with the ads\_insights facebook schema \(a complex schema with objects nested inside arrays of objects and a lot of references\) and an example object. Here is the average transform time per single object, in seconds:
|
||||
|
||||
```text
|
||||
regular transform:
|
||||
0.0008423403530008121
|
||||
|
||||
@@ -107,4 +130,6 @@ transform without actual value setting (but iterating through object properties
|
||||
just traverse/validate through json schema and object fields:
|
||||
0.0006139181846665452
|
||||
```
|
||||
On my PC (AMD Ryzen 7 5800X) it took 0.8 milliseconds per object. As you can see, most of the time (~75%) is taken by the jsonschema traverse/validation routine and very little (less than 10%) by the actual conversion. Processing time can be reduced by skipping jsonschema type checking, but then there would be no warnings about possible object/jsonschema inconsistencies.
|
||||
|
||||
On my PC \(AMD Ryzen 7 5800X\) it took 0.8 milliseconds per object. As you can see, most of the time \(~75%\) is taken by the jsonschema traverse/validation routine and very little \(less than 10%\) by the actual conversion. Processing time can be reduced by skipping jsonschema type checking, but then there would be no warnings about possible object/jsonschema inconsistencies.
|
||||
|
||||
|
||||
@@ -1,14 +1,16 @@
|
||||
# Connector Specification Reference
|
||||
The [connector specification](../understanding-airbyte/airbyte-specification.md#spec) describes what inputs can be used to configure a connector. Like the rest of the Airbyte Protocol, it uses [JsonSchema](https://json-schema.org), but with some slight modifications.
|
||||
|
||||
The [connector specification](../understanding-airbyte/airbyte-specification.md#spec) describes what inputs can be used to configure a connector. Like the rest of the Airbyte Protocol, it uses [JsonSchema](https://json-schema.org), but with some slight modifications.
|
||||
|
||||
## Demoing your specification
|
||||
|
||||
While iterating on your specification, you can preview what it will look like in the UI in realtime by following the instructions [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-webapp/docs/HowTo-ConnectionSpecification.md).
|
||||
|
||||
|
||||
### Secret obfuscation
|
||||
By default, any fields in a connector's specification are visible and can be read in the UI. However, if you want to obfuscate fields in the UI and API (for example when working with a password), add the `airbyte_secret` annotation to your connector's `spec.json` e.g:
|
||||
|
||||
```
|
||||
By default, any fields in a connector's specification are visible and can be read in the UI. However, if you want to obfuscate fields in the UI and API \(for example when working with a password\), add the `airbyte_secret` annotation to your connector's `spec.json` e.g:
|
||||
|
||||
```text
|
||||
"password": {
|
||||
"type": "string",
|
||||
"examples": ["hunter2"],
|
||||
@@ -16,14 +18,13 @@ By default, any fields in a connector's specification are visible can be read in
|
||||
},
|
||||
```
|
||||
|
||||
Here is an example of what the password field would look like:
|
||||
<img width="806" alt="Screen Shot 2021-08-04 at 11 15 04 PM" src="https://user-images.githubusercontent.com/6246757/128300633-7f379b05-5f4a-46e8-ad88-88155e7f4260.png">
|
||||
|
||||
Here is an example of what the password field would look like: 
|
||||
|
||||
### Multi-line String inputs
|
||||
Sometimes when a user is inputting a string field into a connector, newlines need to be preserved. For example, if we want a connector to use an RSA key which looks like this:
|
||||
|
||||
```
|
||||
Sometimes when a user is inputting a string field into a connector, newlines need to be preserved. For example, if we want a connector to use an RSA key which looks like this:
|
||||
|
||||
```text
|
||||
---- BEGIN PRIVATE KEY ----
|
||||
123
|
||||
456
|
||||
@@ -31,11 +32,11 @@ Sometimes when a user is inputting a string field into a connector, newlines nee
|
||||
---- END PRIVATE KEY ----
|
||||
```
|
||||
|
||||
we need to preserve the line-breaks. In other words, the string `---- BEGIN PRIVATE KEY ----123456789---- END PRIVATE KEY ----` is not equivalent to the one above since it loses linebreaks.
|
||||
we need to preserve the line-breaks. In other words, the string `---- BEGIN PRIVATE KEY ----123456789---- END PRIVATE KEY ----` is not equivalent to the one above since it loses linebreaks.
|
||||
|
||||
By default, string inputs in the UI can lose their linebreaks. In order to accept multi-line strings in the UI, annotate your string field with `multiline: true` e.g:
|
||||
By default, string inputs in the UI can lose their linebreaks. In order to accept multi-line strings in the UI, annotate your string field with `multiline: true` e.g:
|
||||
|
||||
```
|
||||
```text
|
||||
"private_key": {
|
||||
"type": "string",
|
||||
"description": "RSA private key to use for SSH connection",
|
||||
@@ -44,30 +45,27 @@ By default, string inputs in the UI can lose their linebreaks. In order to accep
|
||||
},
|
||||
```
|
||||
|
||||
this will display a multi-line textbox in the UI like the following screenshot:
|
||||
<img width="796" alt="Screen Shot 2021-08-04 at 11 13 09 PM" src="https://user-images.githubusercontent.com/6246757/128300404-1dc35323-bceb-4f93-9b81-b23cc4beb670.png">
|
||||
this will display a multi-line textbox in the UI like the following screenshot: 
|
||||
|
||||
### Using `oneOf`s
|
||||
|
||||
### Using `oneOf`s
|
||||
In some cases, a connector needs to accept one out of many options. For example, a connector might need to know the compression codec of the file it will read, which will render in the Airbyte UI as a list of the available codecs. In JSONSchema, this can be expressed using the [oneOf](https://json-schema.org/understanding-json-schema/reference/combining.html#oneof) keyword.
|
||||
|
||||
{% hint style="info" %}
|
||||
Some connectors may follow an older format for dropdown lists, we are currently migrating away from that to this standard.
|
||||
{% endhint %}
|
||||
|
||||
In order for the Airbyte UI to correctly render a specification, however, a few extra rules must be followed:
|
||||
In order for the Airbyte UI to correctly render a specification, however, a few extra rules must be followed:
|
||||
|
||||
1. The top-level item containing the `oneOf` must have `type: object`.
|
||||
2. Each item in the `oneOf` array must be a property with `type: object`.
|
||||
3. One `string` field with the same property name must be consistently present throughout each object inside the `oneOf` array. It is required to add a [`const`](https://json-schema.org/understanding-json-schema/reference/generic.html#constant-values) value unique to that `oneOf` option.
|
||||
|
||||
Let's look at the [source-file](../integrations/sources/file.md) implementation as an example. In this example, we have `provider` as a dropdown
|
||||
list option, which allows the user to select what provider their file is being hosted on. We note that the `oneOf` keyword lives under the `provider` object as follows:
|
||||
Let's look at the [source-file](../integrations/sources/file.md) implementation as an example. In this example, we have `provider` as a dropdown list option, which allows the user to select what provider their file is being hosted on. We note that the `oneOf` keyword lives under the `provider` object as follows:
|
||||
|
||||
In each item in the `oneOf` array, the `option_title` string field exists with the aforementioned `const`, `default` and `enum` value unique to that item. There is a [Github issue](https://github.com/airbytehq/airbyte/issues/6384) to improve it and use only `const` in the specification. This helps the UI and the connector distinguish between the option that was chosen by the user. This can
|
||||
be illustrated by adapting the file source spec to this example:
|
||||
In each item in the `oneOf` array, the `option_title` string field exists with the aforementioned `const`, `default` and `enum` value unique to that item. There is a [Github issue](https://github.com/airbytehq/airbyte/issues/6384) to improve it and use only `const` in the specification. This helps the UI and the connector distinguish between the option that was chosen by the user. This can be illustrated by adapting the file source spec to this example:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"connection_specification": {
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
@@ -126,5 +124,6 @@ be displayed with adapting the file source spec to this example:
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
@@ -65,10 +65,10 @@ def connector_setup():
|
||||
container.stop()
|
||||
```
|
||||
|
||||
These tests are configurable via `acceptance-test-config.yml`. Each test has a number of inputs,
|
||||
you can provide multiple sets of inputs, which will cause the same test to run multiple times - one for each set of inputs.
|
||||
These tests are configurable via `acceptance-test-config.yml`. Each test has a number of inputs; you can provide multiple sets of inputs, which will cause the same test to run multiple times - one for each set of inputs.
|
||||
|
||||
Example of `acceptance-test-config.yml`:
|
||||
|
||||
```yaml
|
||||
connector_image: string # Docker image to test, for example 'airbyte/source-hubspot:0.1.0'
|
||||
base_path: string # Base path for all relative paths, optional, default - ./
|
||||
@@ -84,97 +84,111 @@ tests: # Tests configuration
|
||||
```
|
||||
|
||||
## Test Spec
|
||||
|
||||
Verify that a spec operation issued to the connector returns a valid spec.
|
||||
| Input | Type| Default | Note |
|
||||
|--|--|--|--|
|
||||
| `spec_path` | string | `secrets/spec.json` |Path to a JSON object representing the spec expected to be output by this connector |
|
||||
| `timeout_seconds` | int | 10 |Test execution timeout in seconds|
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `spec_path` | string | `secrets/spec.json` | Path to a JSON object representing the spec expected to be output by this connector |
|
||||
| `timeout_seconds` | int | 10 | Test execution timeout in seconds |
|
||||
|
||||
## Test Connection
|
||||
|
||||
Verify that a check operation issued to the connector with the input config file returns a successful response.
|
||||
| Input | Type| Default | Note |
|
||||
|--|--|--|--|
|
||||
| `config_path` | string | `secrets/config.json` |Path to a JSON object representing a valid connector configuration|
|
||||
| `status` | `succeed` `failed` `exception`| |Indicate if connection check should succeed with provided config|
|
||||
| `timeout_seconds` | int | 30 |Test execution timeout in seconds|
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `status` | `succeed` `failed` `exception` | | Indicate if connection check should succeed with provided config |
|
||||
| `timeout_seconds` | int | 30 | Test execution timeout in seconds |
|
||||
|
||||
## Test Discovery
|
||||
|
||||
Verifies when a discover operation is run on the connector using the given config file, a valid catalog is produced by the connector.
|
||||
| Input | Type| Default | Note |
|
||||
|--|--|--|--|
|
||||
| `config_path` | string | `secrets/config.json` |Path to a JSON object representing a valid connector configuration|
|
||||
| `configured_catalog_path` | string| `integration_tests/configured_catalog.json` |Path to configured catalog|
|
||||
| `timeout_seconds` | int | 30 |Test execution timeout in seconds|
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `timeout_seconds` | int | 30 | Test execution timeout in seconds |
|
||||
|
||||
## Test Basic Read
|
||||
|
||||
Configuring all streams in the input catalog to full refresh mode verifies that a read operation produces some RECORD messages.
|
||||
Each stream should have some data, if you can't guarantee this for particular streams - add them to the `empty_streams` list.
|
||||
| Input | Type| Default | Note |
|
||||
|--|--|--|--|
|
||||
| `config_path` | string | `secrets/config.json` |Path to a JSON object representing a valid connector configuration|
|
||||
| `configured_catalog_path` | string| `integration_tests/configured_catalog.json` |Path to configured catalog|
|
||||
| `empty_streams` | array | [] |List of streams that might be empty|
|
||||
| `validate_schema` | boolean | True |Verify that structure and types of records matches the schema from discovery command|
|
||||
| `timeout_seconds` | int | 5*60 |Test execution timeout in seconds|
|
||||
| `expect_records` | object |None| Compare produced records with expected records, see details below|
|
||||
| `expect_records.path` | string | | File with expected records|
|
||||
Configuring all streams in the input catalog to full refresh mode verifies that a read operation produces some RECORD messages. Each stream should have some data, if you can't guarantee this for particular streams - add them to the `empty_streams` list.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `empty_streams` | array | \[\] | List of streams that might be empty |
|
||||
| `validate_schema` | boolean | True | Verify that structure and types of records matches the schema from discovery command |
|
||||
| `timeout_seconds` | int | 5\*60 | Test execution timeout in seconds |
|
||||
| `expect_records` | object | None | Compare produced records with expected records, see details below |
|
||||
| `expect_records.path` | string | | File with expected records |
|
||||
| `expect_records.extra_fields` | boolean | False | Allow output records to have other fields i.e: expected records are a subset |
|
||||
| `expect_records.exact_order` | boolean | False | Ensure that records produced in exact same order|
|
||||
| `expect_records.extra_records` | boolean | True | Allow connector to produce extra records, but still enforce all records from the expected file to be produced|
|
||||
| `expect_records.exact_order` | boolean | False | Ensure that records produced in exact same order |
|
||||
| `expect_records.extra_records` | boolean | True | Allow connector to produce extra records, but still enforce all records from the expected file to be produced |
|
||||
|
||||
`expect_records` is a nested configuration, if omitted - the part of the test responsible for record matching will be skipped. Due to the fact that we can't identify records without primary keys, only the following flag combinations are supported:
|
||||
| extra_fields | exact_order| extra_records |
|
||||
|--|--|--|
|
||||
|x|x||
|
||||
||x|x|
|
||||
||x||
|
||||
|||x|
|
||||
||||
|
||||
|
||||
| extra\_fields | exact\_order | extra\_records |
|
||||
| :--- | :--- | :--- |
|
||||
| x | x | |
|
||||
| | x | x |
|
||||
| | x | |
|
||||
| | | x |
|
||||
| | | |
|
||||
|
||||
### Schema format checking
|
||||
|
||||
If a field has a [format](https://json-schema.org/understanding-json-schema/reference/string.html#format) attribute specified in its catalog json schema, the Source Acceptance Testing framework checks values against that format. It supports all [builtin](https://json-schema.org/understanding-json-schema/reference/string.html#built-in-formats) jsonschema formats of the draft 7 specification: email, hostnames, ip addresses, time, date and date-time formats.
|
||||
|
||||
Note: For date-time we are not checking compliance with ISO 8601 (and RFC 3339 as a subset of it). Since we use the specified format to set the database column type during the db normalization stage, the value should be compliant with the BigQuery [timestamp](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp_type) and SQL "timestamp with timezone" formats.
|
||||
### Example of `expected_records.txt`:
|
||||
In general, the expected_records.json should contain a subset of the output records of the particular stream you need to test.
|
||||
The required fields are: `stream, data, emitted_at`
|
||||
Note: For date-time we are not checking compliance with ISO 8601 \(and RFC 3339 as a subset of it\). Since we use the specified format to set the database column type during the db normalization stage, the value should be compliant with the BigQuery [timestamp](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp_type) and SQL "timestamp with timezone" formats.
|
||||
|
||||
```JSON
|
||||
### Example of `expected_records.txt`:
|
||||
|
||||
In general, the expected\_records.json should contain a subset of the output records of the particular stream you need to test. The required fields are: `stream, data, emitted_at`
|
||||
|
||||
```javascript
|
||||
{"stream": "my_stream", "data": {"field_1": "value0", "field_2": "value0", "field_3": null, "field_4": {"is_true": true}, "field_5": 123}, "emitted_at": 1626172757000}
|
||||
{"stream": "my_stream", "data": {"field_1": "value1", "field_2": "value1", "field_3": null, "field_4": {"is_true": false}, "field_5": 456}, "emitted_at": 1626172757000}
|
||||
{"stream": "my_stream", "data": {"field_1": "value2", "field_2": "value2", "field_3": null, "field_4": {"is_true": true}, "field_5": 678}, "emitted_at": 1626172757000}
|
||||
{"stream": "my_stream", "data": {"field_1": "value3", "field_2": "value3", "field_3": null, "field_4": {"is_true": false}, "field_5": 91011}, "emitted_at": 1626172757000}
|
||||
|
||||
```
|
||||
|
||||
## Test Full Refresh sync
### TestSequentialReads

This test performs two read operations on all streams which support full refresh syncs. It then verifies that the RECORD messages output by the two reads are identical, or that the records from the first read are a strict subset of those from the second (a conceptual sketch follows the inputs table below).

| Input | Type | Default | Note |
| :--- | :--- | :--- | :--- |
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
| `timeout_seconds` | int | 20*60 | Test execution timeout in seconds |
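Conceptually, the comparison this test performs looks something like the sketch below. It is simplified and illustrative only; the real test also handles record normalization and ordering.

```python
import json
from typing import Dict, List


def records_consistent(first_read: List[Dict], second_read: List[Dict]) -> bool:
    """True if both full-refresh reads are identical, or the first read's
    records are a strict subset of the second read's records."""
    first = {json.dumps(r, sort_keys=True) for r in first_read}
    second = {json.dumps(r, sort_keys=True) for r in second_read}
    return first == second or first < second
```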
## Test Incremental sync

### TestTwoSequentialReads

This test verifies that all streams in the input catalog which support incremental sync can do so correctly. It does this by running two read operations: the first takes the configured catalog and config provided to this test as input. It then verifies that the sync produced a non-zero number of `RECORD` and `STATE` messages. The second read takes the same catalog and config used in the first read, plus the last `STATE` message output by the first read as the input state file. It verifies that either no records are produced (since we read all records in the first sync) or that all records produced have a cursor value greater than or equal to the cursor value from the `STATE` message. This test is performed only for streams that support incremental sync; streams that do not are ignored. If no streams in the input catalog support incremental sync, this test is skipped.
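The record check of the second read boils down to a comparison like the sketch below. This is illustrative only; in the real test, cursor extraction is driven by the configured `cursor_paths`.

```python
from typing import Any, Iterable, Mapping


def records_respect_state(records: Iterable[Mapping[str, Any]],
                          cursor_field: str,
                          state_cursor_value: Any) -> bool:
    """True if every record from the second read has a cursor value greater
    than or equal to the cursor value carried by the input STATE message."""
    return all(record[cursor_field] >= state_cursor_value for record in records)
```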
| Input | Type | Default | Note |
| :--- | :--- | :--- | :--- |
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
| `cursor_paths` | dict | {} | For each stream, the path of its cursor field in the output state messages. If omitted, the path is taken from the last piece of the path in the stream's `cursor_field`. |
| `timeout_seconds` | int | 20*60 | Test execution timeout in seconds |
### TestStateWithAbnormallyLargeValues

This test verifies that the sync produces no records when run with a STATE that contains abnormally large cursor values (a sketch of such a state file follows the inputs table below).
| Input | Type | Default | Note |
| :--- | :--- | :--- | :--- |
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
| `future_state_path` | string | None | Path to the state file with abnormally large cursor values |
| `timeout_seconds` | int | 20*60 | Test execution timeout in seconds |
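As an illustration of what the file behind `future_state_path` might contain, here is a hedged sketch that writes a hypothetical `abnormal_state.json` with a cursor far in the future. The stream name, cursor field and overall state shape are assumptions; they depend entirely on your connector.

```python
import json

# Hypothetical stream name and cursor field; adjust to your connector's streams.
abnormal_state = {"my_stream": {"updated_at": "2121-01-01T00:00:00Z"}}

with open("integration_tests/abnormal_state.json", "w") as f:
    json.dump(abnormal_state, f, indent=2)
```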
@@ -40,7 +40,7 @@ $ cd airbyte-integrations/connector-templates/generator # assumes you are starti
$ ./generate.sh
```

Select the `Java Destination` template and then input the name of your connector. We'll refer to the destination as `<name>-destination` in this tutorial, but you should replace `<name>` with the actual name you used for your connector, e.g. `BigQueryDestination` or `bigquery-destination`.

### Step 2: Build the newly generated destination
@@ -51,43 +51,45 @@ You can build the destination by running:
./gradlew :airbyte-integrations:connectors:destination-<name>:build
```

On Mac M1 (Apple Silicon) machines, until the OpenJDK images natively support ARM64, set the platform variable as shown below and build:

```bash
export DOCKER_BUILD_PLATFORM=linux/amd64
# Must be run from the Airbyte project root
./gradlew :airbyte-integrations:connectors:destination-<name>:build
```
This compiles the Java code for your destination and builds a Docker image with the connector. At this point, we haven't implemented anything of value yet, but once we do, you'll use this command to compile your code and Docker image.

{% hint style="info" %}
Airbyte uses Gradle to manage Java dependencies. To add dependencies for your connector, manage them in the `build.gradle` file inside your connector's directory.
{% endhint %}
#### Iterating on your implementation

We recommend the following ways of iterating on your connector as you're making changes:

* Test-driven development (TDD) in Java
* Test-driven development (TDD) using Airbyte's Acceptance Tests
* Directly running the Docker image

#### Test-driven development in Java
This should feel like a standard flow for a Java developer: you make some code changes, then run Java tests against them. You can do this directly in your IDE, but you can also run all unit tests via Gradle by running the command to build the connector:

```text
./gradlew :airbyte-integrations:connectors:destination-<name>:build
```

This will build the code and run any unit tests. This approach is great when you are testing local behaviors and writing unit tests.
#### TDD using acceptance tests & integration tests

Airbyte provides a standard test suite (dubbed "Acceptance Tests") that runs against every destination connector. They are "free" baseline tests that ensure the basic functionality of the destination. When developing a connector, you can simply run the tests between each change and use the feedback to guide your development.

If you want to try out this approach, check out Step 6, which describes what you need to do to set up the Acceptance Tests for your destination.

The nice thing about this approach is that you are running your destination exactly as Airbyte will run it in the CI. The downside is that the tests do not run very quickly. As such, we recommend this iteration approach only once you've implemented most of your connector and are in the finishing stages of implementation. Note that Acceptance Tests are required for every connector supported by Airbyte, so you should make sure to run them a couple of times while iterating to make sure your connector is compatible with Airbyte.

#### Directly running the destination using Docker
@@ -116,11 +118,12 @@ The nice thing about this approach is that you are running your destination exac
Each destination contains a specification written in JsonSchema that describes its inputs. Defining the specification is a good place to start when developing your destination. Check out the documentation [here](https://json-schema.org/) to learn the syntax. Here's [an example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-postgres/src/main/resources/spec.json) of what the `spec.json` looks like for the Postgres destination.

Your generated template should have the spec file in `airbyte-integrations/connectors/destination-<name>/src/main/resources/spec.json`. The generated connector will take care of reading this file and converting it to the correct output. Edit it and you should be done with this step.

For more details on what the spec is, you can read about the Airbyte Protocol [here](../../understanding-airbyte/airbyte-specification.md).

See the `spec` operation in action:

```bash
# First build the connector
./gradlew :airbyte-integrations:connectors:destination-<name>:build
@@ -131,15 +134,15 @@ docker run --rm airbyte/destination-<name>:dev spec
### Step 4: Implement `check`

The check operation accepts a JSON object conforming to the `spec.json`. In other words, if the `spec.json` says that the destination requires a `username` and `password`, the config object might be `{ "username": "airbyte", "password": "password123" }`. It returns a JSON object that reports, given the credentials in the config, whether we were able to connect to the destination.

While developing, we recommend storing any credentials in `secrets/config.json`. Any `secrets` directory in the Airbyte repo is gitignored by default.

Implement the `check` method in the generated file `<Name>Destination.java`. Here's an [example implementation](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java#L94) from the BigQuery destination.

Verify that the method is working by placing your config in `secrets/config.json` then running:

```text
# First build the connector
./gradlew :airbyte-integrations:connectors:destination-<name>:build
@@ -148,26 +151,25 @@ docker run -v $(pwd)/secrets:/secrets --rm airbyte/destination-<name>:dev check
```

### Step 5: Implement `write`

The `write` operation is the main workhorse of a destination connector: it reads input data from the source and writes it to the underlying destination. It takes as input the config file used to run the connector as well as the configured catalog: the file used to describe the schema of the incoming data and how it should be written to the destination. Its "output" is two things:

1. Data written to the underlying destination
2. `AirbyteMessage`s of type `AirbyteStateMessage`, written to stdout to indicate which records have been written so far during a sync. It's important to output these messages when possible in order to avoid re-extracting messages from the source. See the [write operation protocol reference](https://docs.airbyte.io/understanding-airbyte/airbyte-specification#write) for more information.

To implement the `write` Airbyte operation, implement the `getConsumer` method in your generated `<Name>Destination.java` file. Here are some example implementations from different destination connectors:

* [BigQuery](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java#L188)
* [Google Pubsub](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-pubsub/src/main/java/io/airbyte/integrations/destination/pubsub/PubsubDestination.java#L98)
* [Local CSV](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/java/io/airbyte/integrations/destination/csv/CsvDestination.java#L90)
* [Postgres](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-postgres/src/main/java/io/airbyte/integrations/destination/postgres/PostgresDestination.java)

{% hint style="info" %}
The Postgres destination leverages the `AbstractJdbcDestination` superclass, which makes it extremely easy to create a destination for a database or data warehouse if it has a compatible JDBC driver. If the destination you are implementing has a JDBC driver, be sure to check out `AbstractJdbcDestination`.
{% endhint %}

For a brief overview of the Airbyte catalog, check out [the Beginner's Guide to the Airbyte Catalog](../../understanding-airbyte/beginners-guide-to-catalog.md).

### Step 6: Set up Acceptance Tests

The Acceptance Tests are a set of tests that run against all destinations. These tests are run in the Airbyte CI to prevent regressions and verify a baseline of functionality. The test cases are contained and documented in the [following file](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/bases/standard-destination-test/src/main/java/io/airbyte/integrations/standardtest/destination/DestinationAcceptanceTest.java).
@@ -175,6 +177,7 @@ The Acceptance Tests are a set of tests that run against all destinations. These
To set up Acceptance Tests for your connector, follow the `TODO`s in the generated file `<name>DestinationAcceptanceTest.java`. Once set up, you can run the tests using `./gradlew :airbyte-integrations:connectors:destination-<name>:integrationTest`. Make sure to run this command from the Airbyte repository root.

### Step 7: Write unit tests and/or integration tests

The Acceptance Tests are meant to cover the basic functionality of a destination. Think of them as the bare minimum required for us to add a destination to Airbyte. You should probably add some unit testing or custom integration testing if you need to test additional functionality of your destination.

#### Step 8: Update the docs
@@ -182,4 +185,6 @@ The Acceptance Tests are meant to cover the basic functionality of a destination
Each connector has its own documentation page. By convention, that page should have the following path: `docs/integrations/destinations/<destination-name>.md`. For the documentation to get packaged with the docs, make sure to add a link to it in `docs/SUMMARY.md`. You can pattern match doing that from existing connectors.

## Wrapping up

Well done on making it this far! If you'd like your connector to ship with Airbyte by default, create a PR against the Airbyte repo and we'll work with you to get it across the finish line.
@@ -6,7 +6,7 @@ This article provides a checklist for how to create a Python destination. Each s
## Requirements

Docker and Python with the versions listed in the [tech stack section](../../understanding-airbyte/tech-stack.md). You can use any Python version between 3.7 and 3.9, but this tutorial was tested with 3.7.

## Checklist
@@ -22,7 +22,7 @@ Docker and Python with the versions listed in the [tech stack section](../../und
* Step 8: Update the docs (in `docs/integrations/destinations/<destination-name>.md`)

{% hint style="info" %}
If you need help with any step of the process, feel free to submit a PR with your progress and any questions you have, or ask us on [slack](https://slack.airbyte.io). Also reference the KvDB Python destination implementation if you want to see an example of a working destination.
{% endhint %}

## Explaining Each Step
@@ -36,11 +36,11 @@ $ cd airbyte-integrations/connector-templates/generator # assumes you are starti
$ ./generate.sh
```

Select the `Python Destination` template and then input the name of your connector. We'll refer to the destination as `destination-<name>` in this tutorial, but you should replace `<name>` with the actual name you used for your connector, e.g. `redis` or `google-sheets`.

### Step 2: Set up the dev environment

Set up your Python virtual environment:

```bash
cd airbyte-integrations/connectors/destination-<name>
@@ -54,6 +54,7 @@ source .venv/bin/activate
# Install with the "tests" extra which provides test requirements
pip install '.[tests]'
```

This step sets up the initial Python environment. **All** subsequent `python` or `pip` commands assume you have activated your virtual environment.

If you want your IDE to auto-complete and resolve dependencies properly, point it at the Python binary in `airbyte-integrations/connectors/destination-<name>/.venv/bin/python`. Also, anytime you change the dependencies in `setup.py`, make sure to re-run the build command. The build system will handle installing all dependencies in the `setup.py` into the virtual environment.
@@ -62,14 +63,14 @@ Let's quickly get a few housekeeping items out of the way.
#### Dependencies

Python dependencies for your destination should be declared in `airbyte-integrations/connectors/destination-<name>/setup.py` in the `install_requires` field. You might notice that a couple of Airbyte dependencies are already declared there (mainly the Airbyte CDK and potentially some testing libraries or helpers). Keep those, as they will be useful during development.

You may notice that there is a `requirements.txt` in your destination's directory as well. Do not touch this. It is autogenerated and used to install local Airbyte dependencies which are not published to PyPI. All your dependencies should be declared in `setup.py`.
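For illustration, declaring an extra dependency in `setup.py` might look like the hedged sketch below. Everything other than `airbyte-cdk` is a placeholder; your generated template's `setup.py` is the authoritative starting point.

```python
from setuptools import find_packages, setup

setup(
    name="destination_example",            # placeholder package name
    packages=find_packages(),
    install_requires=[
        "airbyte-cdk",                     # already declared by the template
        "requests",                        # example: add your own dependencies here
    ],
    extras_require={"tests": ["pytest"]},  # test-only requirements
)
```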
#### Iterating on your implementation

Pretty much all it takes to create a destination is to implement the `Destination` interface. Let's briefly recap the three methods implemented by a Destination:

1. `spec`: declares the user-provided credentials or configuration needed to run the connector
2. `check`: tests if the user-provided configuration can be used to connect to the underlying data destination, and with the correct write permissions
3. `write`: writes data to the underlying destination by reading a configuration, a stream of records from stdin, and a configured catalog describing the schema of the data and how it should be written to the destination
@@ -98,8 +99,7 @@ cat messages.jsonl | python main.py write --config secrets/config.json --catalog
The nice thing about this approach is that you can iterate completely within Python. The downside is that you are not quite running your destination as it will actually be run by Airbyte. Specifically, you're not running it from within the Docker container that will house it.

**Run using Docker**

If you want to run your destination exactly as it will be run by Airbyte (i.e. within a Docker container), you can use the following commands from the connector module directory (`airbyte-integrations/connectors/destination-<name>`):

```bash
# First build the container
@@ -117,7 +117,7 @@ The nice thing about this approach is that you are running your source exactly a
**TDD using standard tests**

_Note: these tests aren't yet available for Python connectors but will be very soon. Until then, you should use custom unit or integration tests for TDD._

Airbyte provides a standard test suite that is run against every destination. The objective of these tests is to provide some "free" tests that sanity-check the basic functionality of the destination. One approach to developing your connector is to simply run the tests between each change and use the feedback from them to guide your development.
@@ -127,26 +127,25 @@ The nice thing about this approach is that you are running your destination exac
### Step 3: Implement `spec`

Each destination contains a specification written in JsonSchema that describes the inputs it requires and accepts. Defining the specification is a good place to start development. To do this, find the spec file generated in `airbyte-integrations/connectors/destination-<name>/src/main/resources/spec.json`. Edit it and you should be done with this step. The generated connector will take care of reading this file and converting it to the correct output.

Some notes about fields in the output spec (a short illustrative sketch follows the list):

* `supportsNormalization` is a boolean which indicates if this connector supports [basic normalization via DBT](https://docs.airbyte.io/understanding-airbyte/basic-normalization). If true, `supportsDBT` must also be true.
* `supportsDBT` is a boolean which indicates whether this destination is compatible with DBT. If set to true, the user can define custom DBT transformations that run on this destination after each successful sync. This must be true if `supportsNormalization` is set to true.
* `supported_destination_sync_modes`: An array of strings declaring the sync modes supported by this connector. The available options are:
  * `overwrite`: The connector can be configured to wipe any existing data in a stream before writing new data
  * `append`: The connector can be configured to append new data to existing data
  * `append_dedupe`: The connector can be configured to deduplicate (i.e. UPSERT) data in the destination based on the new data and primary keys
* `supportsIncremental`: Whether the connector supports any `append` sync mode. Must be set to true if `append` or `append_dedupe` are included in the `supported_destination_sync_modes`.
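Purely for illustration, the relevant keys might look like the excerpt below. It is shown here as a Python dict so the shape is easy to read, but in practice these values live in your `spec.json`, and the specific values are placeholders.

```python
# Hypothetical excerpt of a destination specification, expressed as a Python dict.
spec_excerpt = {
    "supportsIncremental": True,
    "supportsNormalization": False,
    "supportsDBT": False,
    "supported_destination_sync_modes": ["overwrite", "append"],
}
```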
Some helpful resources:

* [**JSONSchema website**](https://json-schema.org/)
* [**Definition of Airbyte Protocol data models**](https://github.com/airbytehq/airbyte/blob/master/airbyte-protocol/models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml). The output of `spec` is described by the `ConnectorSpecification` model (which is wrapped in an `AirbyteMessage`).
* [**Postgres Destination's spec.json file**](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-postgres/src/main/resources/spec.json) as an example `spec.json`.

Once you've edited the file, see the `spec` operation in action:

```bash
python main.py spec
@@ -154,20 +153,21 @@ python main.py spec
### Step 4: Implement `check`

The check operation accepts a JSON object conforming to the `spec.json`. In other words, if the `spec.json` says that the destination requires a `username` and `password`, the config object might be `{ "username": "airbyte", "password": "password123" }`. It returns a JSON object that reports, given the credentials in the config, whether we were able to connect to the destination.

While developing, we recommend storing any credentials in `secrets/config.json`. Any `secrets` directory in the Airbyte repo is gitignored by default.

Implement the `check` method in the generated file `destination_<name>/destination.py`. Here's an [example implementation](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-kvdb/destination_kvdb/destination.py) from the KvDB destination.
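As a rough, hedged sketch of the shape this method takes: the class name and the config keys below are placeholders, and the connection attempt is a stand-in for whatever client your destination actually uses.

```python
from typing import Any, Mapping

from airbyte_cdk import AirbyteLogger
from airbyte_cdk.destinations import Destination
from airbyte_cdk.models import AirbyteConnectionStatus, Status


class DestinationExample(Destination):  # your generated class; the name is a placeholder
    def check(self, logger: AirbyteLogger, config: Mapping[str, Any]) -> AirbyteConnectionStatus:
        try:
            # Placeholder: replace with a real connection attempt using your
            # destination's client and the credentials from `config`.
            _ = config["username"], config["password"]
            return AirbyteConnectionStatus(status=Status.SUCCEEDED)
        except Exception as e:
            return AirbyteConnectionStatus(status=Status.FAILED, message=f"An exception occurred: {e}")
```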
Verify that the method is working by placing your config in `secrets/config.json` then running:

```bash
python main.py check --config secrets/config.json
```

### Step 5: Implement `write`

The `write` operation is the main workhorse of a destination connector: it reads input data from the source and writes it to the underlying destination. It takes as input the config file used to run the connector as well as the configured catalog: the file used to describe the schema of the incoming data and how it should be written to the destination. Its "output" is two things (a rough sketch of the method follows the list below):

1. Data written to the underlying destination
2. `AirbyteMessage`s of type `AirbyteStateMessage`, written to stdout to indicate which records have been written so far during a sync. It's important to output these messages when possible in order to avoid re-extracting messages from the source. See the [write operation protocol reference](https://docs.airbyte.io/understanding-airbyte/airbyte-specification#write) for more information.
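Here is a hedged sketch of a `write` implementation, continuing the hypothetical class from the previous sketch. The `_flush` helper is a placeholder for your destination's actual client calls, and the buffering strategy will vary by destination.

```python
from typing import Any, Iterable, List, Mapping

from airbyte_cdk.destinations import Destination
from airbyte_cdk.models import AirbyteMessage, ConfiguredAirbyteCatalog, Type


class DestinationExample(Destination):  # placeholder class name
    def write(
        self,
        config: Mapping[str, Any],
        configured_catalog: ConfiguredAirbyteCatalog,
        input_messages: Iterable[AirbyteMessage],
    ) -> Iterable[AirbyteMessage]:
        buffer: List[Mapping[str, Any]] = []
        for message in input_messages:
            if message.type == Type.RECORD:
                # Accumulate the record; a real connector writes via its client.
                buffer.append(message.record.data)
            elif message.type == Type.STATE:
                # Flush what we have, then echo the state message so Airbyte
                # knows these records were durably written.
                self._flush(config, buffer)
                buffer.clear()
                yield message
        self._flush(config, buffer)  # flush any trailing records

    def _flush(self, config: Mapping[str, Any], records: List[Mapping[str, Any]]) -> None:
        # Placeholder: write `records` to the destination using `config`.
        pass
```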
@@ -176,22 +176,25 @@ To implement the `write` Airbyte operation, implement the `write` method in your
### Step 6: Set up Acceptance Tests

_Coming soon. These tests are not yet available for Python destinations but will be very soon. For now, please skip this step and rely on copious amounts of integration and unit testing._

### Step 7: Write unit tests and/or integration tests

The Acceptance Tests are meant to cover the basic functionality of a destination. Think of them as the bare minimum required for us to add a destination to Airbyte. You should probably add some unit testing or custom integration testing if you need to test additional functionality of your destination.

Add unit tests in the `unit_tests/` directory and integration tests in the `integration_tests/` directory. Run them via:

```bash
python -m pytest -s -vv integration_tests/
```

See the [KvDB integration tests](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-kvdb/integration_tests/integration_test.py) for an example of tests you can implement.
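For instance, a very small custom integration test might look like the hedged sketch below. The module and class names are placeholders, and the KvDB tests linked above show a more complete approach.

```python
# integration_tests/integration_test.py -- a minimal sketch, not a full test suite
import json
from pathlib import Path

from airbyte_cdk import AirbyteLogger
from airbyte_cdk.models import Status

from destination_example.destination import DestinationExample  # placeholder names


def test_check_succeeds_with_valid_config():
    config = json.loads(Path("secrets/config.json").read_text())
    status = DestinationExample().check(AirbyteLogger(), config)
    assert status.status == Status.SUCCEEDED
```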
#### Step 8: Update the docs

Each connector has its own documentation page. By convention, that page should have the following path: `docs/integrations/destinations/<destination-name>.md`. For the documentation to get packaged with the docs, make sure to add a link to it in `docs/SUMMARY.md`. You can pattern match doing that from existing connectors.

## Wrapping up

Well done on making it this far! If you'd like your connector to ship with Airbyte by default, create a PR against the Airbyte repo and we'll work with you to get it across the finish line.
@@ -6,6 +6,8 @@ This is a blazing fast guide to building an HTTP source connector. Think of it a
If you are a visual learner and want to see a video version of this guide going over each part in detail, check it out below.

{% embed url="https://www.youtube.com/watch?v=kJ3hLoNfz_E&t=3s" caption="A speedy CDK overview." %}

## Dependencies

1. Python >= 3.7
@@ -38,7 +40,7 @@ cd source_python_http_example
We're working with the PokeAPI, so we need to define our input schema to reflect that. Open the `spec.json` file here and replace it with:

```json
{
  "documentationUrl": "https://docs.airbyte.io/integrations/sources/pokeapi",
  "connectionSpecification": {
@@ -58,10 +60,10 @@ We're working with the PokeAPI, so we need to define our input schema to reflect
  }
}
```

As you can see, our input schema has a single required field, `pokemon_name`. Normally, input schemas will contain information such as API keys and client secrets that need to be passed down to all endpoints or streams.

Ok, let's write a function that checks the inputs we just defined. Nuke the `source.py` file. Now add this code to it. For a crucial time skip, we're going to define all the imports we need in the future here. Also note that your `AbstractSource` class name must be a camel-cased version of the name you gave in the generation phase. In our case, this is `SourcePythonHttpExample`.
```python
from typing import Any, Iterable, List, Mapping, MutableMapping, Optional, Tuple
@@ -152,11 +154,9 @@ class Pokemon(HttpStream):
        return None # TODO
```

Now download [this file](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/docs/tutorials/http_api_source_assets/pokemon.json). Name it `pokemon.json` and place it in `/source_python_http_example/schemas`.

This file defines your output schema for every endpoint that you want to implement. Normally, this will likely be the most time-consuming section of the connector development process, as it requires defining the output of the endpoint exactly. This is really important, as Airbyte needs to have clear expectations for what the stream will output. Note that the stream name is kept consistent between the JSON schema file and the `HttpStream` class: `pokemon.json` and `Pokemon`, respectively, in this case. Learn more about schema creation [here](https://docs.airbyte.io/connector-development/cdk-python/full-refresh-stream#defining-the-streams-schema).

Test your discover function. You should receive a fairly large JSON object in return.
@@ -213,8 +213,7 @@ class Pokemon(HttpStream):
        return None
```

We now need a catalog that defines all of our streams. We only have one stream: `Pokemon`. Download that file [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/docs/tutorials/http_api_source_assets/configured_catalog_pokeapi.json). Place it in `/sample_files`, named `configured_catalog.json`. More clearly, this is where we tell Airbyte all the streams/endpoints we support for the connector and which sync modes Airbyte can run the connector in. Learn more about the AirbyteCatalog [here](https://docs.airbyte.io/understanding-airbyte/beginners-guide-to-catalog) and learn more about sync modes [here](https://docs.airbyte.io/understanding-airbyte/connections#sync-modes).

Let's read some data.
@@ -8,7 +8,7 @@ $ cd airbyte-integrations/connector-templates/generator # assumes you are starti
$ ./generate.sh
```

This will bring up an interactive helper application. Use the arrow keys to pick a template from the list. Select the `Python HTTP API Source` template and then input the name of your connector. The application will create a new directory in `airbyte/airbyte-integrations/connectors/` with the name of your new connector.

For this walk-through, we will refer to our source as `python-http-example`. The finalized source code for this tutorial can be found [here](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-python-http-tutorial).
@@ -24,9 +24,9 @@ Optionally, we can provide additional inputs to customize requests:
Backoff policy options:

* `retry_factor` Specifies the factor for the exponential backoff policy (default: 5)
* `max_retries` Specifies the maximum number of retries for the backoff policy (default: 5)
* `raise_on_http_errors` If set to False, allows opting out of raising HTTP code exceptions (default: True)

There are many other customizable options - you can find them in the [`airbyte_cdk.sources.streams.http.HttpStream`](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py) class.
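For example, a stream could override the backoff options above roughly as in the sketch below. The stream name, base URL and endpoint are placeholders, and the sketch assumes the options are exposed as properties, so double-check against the `HttpStream` class linked above.

```python
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class EmployeesStream(HttpStream):          # placeholder stream
    url_base = "https://api.example.com/"   # placeholder base URL
    primary_key = None

    # Back off more aggressively and give up sooner than the defaults above.
    @property
    def retry_factor(self) -> float:
        return 10

    @property
    def max_retries(self) -> Optional[int]:
        return 3

    def path(self, **kwargs) -> str:
        return "employees"  # placeholder endpoint

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # no pagination in this sketch

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
        yield from response.json()
```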
@@ -12,7 +12,7 @@ Place any integration tests in the `integration_tests` directory such that they
## Standard Tests

Standard tests are a fixed set of tests Airbyte provides that every Airbyte source connector must pass. While they're only required if you intend to submit your connector to Airbyte, you might find them helpful in any case. See [Testing your connectors](../../testing-connectors/).

If you want to submit this connector to become a default connector within Airbyte, follow steps 8 onwards from the [Python source checklist](../building-a-python-source.md#step-8-set-up-standard-tests).
@@ -1,2 +1,2 @@
# Python CDK: Creating an HTTP API Source
@@ -28,13 +28,13 @@ Here is a list of easy [good first issues](https://github.com/airbytehq/airbyte/
It's easy to add your own connector to Airbyte! **Since Airbyte connectors are encapsulated within Docker containers, you can use any language you like.** Here are some links on how to add sources and destinations. We haven't built the documentation for all languages yet, so don't hesitate to reach out to us if you'd like help developing connectors in other languages.

For sources, simply head over to our [Python CDK](../connector-development/cdk-python/).

{% hint style="info" %}
The CDK currently does not support creating destinations, but it will very soon.
{% endhint %}

* See [Building new connectors](../connector-development/) to get started.
* Since we frequently build connectors in Python, on top of Singer or in Java, we've created generator libraries to get you started quickly: [Build Python Source Connectors](../connector-development/tutorials/building-a-python-source.md) and [Build Java Destination Connectors](../connector-development/tutorials/building-a-java-destination.md)
* Integration tests (tests that run a connector's image against an external resource) can be run one of three ways, as detailed [here](../connector-development/testing-connectors/source-acceptance-tests-reference.md)
@@ -72,7 +72,7 @@ First, a big thank you! A few things to keep in mind when contributing code:
* If you're working on an issue, please comment that you are doing so to prevent duplicate work by others.
* Rebase your branch on master before submitting a pull request.

Here are some details about [our review process](./#review-process).

### **Upvoting issues, feature and connector requests**
@@ -16,6 +16,6 @@ Install it in IntelliJ:
2. Select the file we just downloaded
3. Select `GoogleStyle` in the drop down
4. Change default `Hard wrap at` in `Wrapping and Braces` tab to **150**.
5. We prefer `import foo.bar.ClassName` over `import foo.bar.*`, even in cases where we import multiple classes from the same package. This can be set by going to `Preferences > Code Style > Java > Imports` and changing `Class count to use import with '*'` to 9999 and `Names count to use static import with '*'` to 9999.
6. You're done!
@@ -28,7 +28,7 @@ To start contributing:
## Build with `gradle`

To compile and build just the platform (not all the connectors):

```bash
SUB_BUILD=PLATFORM ./gradlew build
@@ -38,7 +38,6 @@ This will build all the code and run all the unit tests.
`SUB_BUILD=PLATFORM ./gradlew build` creates all the necessary artifacts (Webapp, Jars and Docker images) so that you can run Airbyte locally. Since this builds everything, it can take some time.

{% hint style="info" %}
Gradle will use all CPU cores by default. If Gradle uses too much/too little CPU, tuning the number of CPU cores it uses to better suit a dev's needs can help.
@@ -1,14 +1,15 @@
# Developing on Kubernetes

Make sure to read [our docs for developing locally](developing-locally.md) first.

## Architecture

## Iteration Cycle (Locally)

If you're developing locally using Minikube/Docker Desktop/Kind, you can iterate with the following series of commands:

```bash
./gradlew composeBuild # build dev images
kubectl delete -k kube/overlays/dev # optional (allows you to recreate resources from scratch)
@@ -18,18 +19,16 @@ kubectl port-forward svc/airbyte-webapp-svc 8000:80 # port forward the api/ui
## Iteration Cycle (on GKE)

The process is similar to developing on a local cluster, except you will need to build the local version and push it to your own container registry with names such as `your-registry/scheduler`. Then you will need to configure an overlay to override the name of images and apply your overlay with `kubectl apply -k <path to your overlay>`.

We are [working to improve this process](https://github.com/airbytehq/airbyte/issues/4225).

## Completely resetting a local cluster

In most cases, running `kubectl delete -k kube/overlays/dev` is sufficient to remove the core Airbyte-related components. However, if you are in a dev environment on a local cluster only running Airbyte and want to start **completely from scratch** (removing all PVCs, pods, completed pods, etc.), you can use the following command to destroy everything on the cluster:

```bash
# BE CAREFUL, THIS COMMAND DELETES ALL RESOURCES, EVEN NON-AIRBYTE ONES!
kubectl delete "$(kubectl api-resources --namespaced=true --verbs=delete -o name | tr "\n" "," | sed -e 's/,$//')" --all
```
@@ -1,7 +1,7 @@
# Gradle Cheatsheet

## Overview

We have 3 ways of slicing our builds:

1. **Build Everything**: Including every single connector.
@@ -12,39 +12,43 @@ We have 3 ways of slicing our builds:
In our CI we run **Build Platform** and **Build Connectors Base**. Then separately, on a regular cadence, we build each connector and run its integration tests.

We split Build Platform and Build Connectors Base from each other for a few reasons:

1. The tech stacks are very different. The Platform is almost entirely Java. Because of differing needs around separating environments, the Platform build can be optimized separately from the Connectors one.
2. We want the iteration cycles of people working on connectors or the platform to be faster _and_ independent. For example, before this change someone working on a Platform feature needed to run formatting on the entire codebase (including connectors), which led to a lot of cosmetic build failures that obfuscated actual problems. Ideally a failure on the connectors side should not block progress on the platform side.
3. The lifecycles are different. One can safely release the Platform even if parts of Connectors Base are failing (and vice versa).

Future Work: The next step here is to figure out how to more formally split connectors and platform. Right now we exploit behavior in `settings.gradle` to separate them. This is not a best practice. Ultimately, we want these two builds to be totally separate. We do not know what that will look like yet.
## Cheatsheet

Here is a cheatsheet for common gradle commands.

### Basic Build Syntax

Here is the syntax for running gradle commands on the different parts of the code base that we called out above.

#### Build Everything

```text
./gradlew <gradle command>
```
#### Build Platform

```text
SUB_BUILD=PLATFORM ./gradlew <gradle command>
```

#### Build Connectors Base

```text
SUB_BUILD=CONNECTORS_BASE ./gradlew <gradle command>
```

### Build

In order to "build" the project, use the `build` task. This task includes producing all artifacts and running unit tests (anything called in the `:test` task). It does _not_ include integration tests (anything called in the `:integrationTest` task).

For example, all of the following are valid:

```text
./gradlew build
SUB_BUILD=PLATFORM ./gradlew build
SUB_BUILD=CONNECTORS_BASE ./gradlew build
@@ -52,10 +56,11 @@ SUB_BUILD=CONNECTORS_BASE ./gradlew build
### Formatting

The build system has a custom task called `format`. It is not called as part of `build`. If the command is called on a subset of the project, it will (mostly) target just the included modules. The exception is that `spotless` (a gradle formatter) will always format any file types that it is configured to manage, regardless of which sub build is run. `spotless` is relatively fast, so this should not be too much of an annoyance, but it can lead to formatting changes in unexpected parts of the code base.

For example, all of the following are valid:

```text
./gradlew format
SUB_BUILD=PLATFORM ./gradlew format
SUB_BUILD=CONNECTORS_BASE ./gradlew format
@@ -64,44 +69,53 @@ SUB_BUILD=CONNECTORS_BASE ./gradlew format
### Platform-Specific Commands

#### Build Artifacts

This command just builds the Docker images that are used as artifacts in the platform. It bypasses running tests.

```text
SUB_BUILD=PLATFORM ./gradlew composeBuild
```
|
||||
|
||||
#### Running Tests

The Platform has three different levels of tests: Unit Tests, Acceptance Tests, and Frontend Acceptance Tests.

##### Unit Tests

Unit Tests can be run using the `:test` task on any submodule. These test class-level behavior. They should avoid using external resources (e.g. calling staging services or pulling resources from the internet). We do allow these tests to spin up local resources, usually in docker containers. For example, we use test containers frequently to spin up test postgres databases.
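For example, to run just the unit tests of a single Platform submodule, the invocation looks like the following (the module name here is only an illustration):

```bash
# Run only the unit tests of one submodule (module name is illustrative).
SUB_BUILD=PLATFORM ./gradlew :airbyte-workers:test
```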
##### Acceptance Tests

We split Acceptance Tests into two different test suites:

* Platform Acceptance Tests: These tests are a coarse sanity check that each major feature in the platform works. They are run with the following command: `SUB_BUILD=PLATFORM ./gradlew :airbyte-tests:acceptanceTests`. These tests expect to find a local version of Airbyte running. For testing the docker version, start Airbyte locally. For an example, see the [script](https://github.com/airbytehq/airbyte/blob/master/tools/bin/acceptance_test.sh) that is used by the CI. For Kubernetes, see the [script](https://github.com/airbytehq/airbyte/blob/master/tools/bin/acceptance_test_kube.sh) that is used by the CI.
* Migration Acceptance Tests: These tests make sure the end-to-end process of migrating from one version of Airbyte to the next works. These tests are run with the following command: `SUB_BUILD=PLATFORM ./gradlew :airbyte-tests:automaticMigrationAcceptanceTest --scan`. These tests do not expect there to be a separate deployment of Airbyte running.

These tests currently all live in `airbyte-tests`.
##### Frontend Acceptance Tests

These are acceptance tests for the frontend. They are run with `SUB_BUILD=PLATFORM ./gradlew --no-daemon :airbyte-e2e-testing:e2etest`. Like the Platform Acceptance Tests, they expect Airbyte to be running locally. See the [script](https://github.com/airbytehq/airbyte/blob/master/tools/bin/e2e_test.sh) that is used by the CI.

These tests currently all live in `airbyte-e2e-testing`.
##### Future Work

Our story around "integration testing" or "E2E testing" is a little ambiguous. Our Platform Acceptance Test suite is getting somewhat unwieldy. It was meant to be just some coarse sanity checks, but over time we have found more need to test interactions between systems at a more granular level. Whether we start supporting a separate class of tests (e.g. integration tests) or figure out how to allow for more granular tests in the existing Acceptance Test framework is TBD.

### Connectors-Specific Commands (Connector Development)
#### Commands used in CI

All connectors, regardless of implementation language, implement the following interface to allow uniformity in the build system when run from CI:

**Build connector, run unit tests, and build Docker image**: `./gradlew :airbyte-integrations:connectors:<name>:build`

**Run integration tests**: `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`
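For example, for a connector named `source-postgres` (used here purely as an illustration), those two commands look like:

```bash
# Build the connector, run its unit tests, and build its Docker image.
./gradlew :airbyte-integrations:connectors:source-postgres:build

# Run the connector's integration tests.
./gradlew :airbyte-integrations:connectors:source-postgres:integrationTest
```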
#### Python

The ideal end state for a Python connector developer is that they shouldn't have to know Gradle exists.

We're almost there, but today there is only one Gradle command that's needed when developing in Python, used for formatting code.

**Formatting python module**: `./gradlew :airbyte-integrations:connectors:<name>:airbytePythonFormat`
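For instance, for a Python connector living in `airbyte-integrations/connectors/source-example` (an illustrative name), the call would be:

```bash
# Format the Python code of one connector module (connector name is illustrative).
./gradlew :airbyte-integrations:connectors:source-example:airbytePythonFormat
```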
This guide contains instructions on how to set up Python with Gradle within the Airbyte monorepo.

## Python Connector Development

Before working with connectors written in Python, we recommend running

```bash
./gradlew :airbyte-integrations:connectors:<connector directory name>:build
```

e.g.

```bash
./gradlew :airbyte-integrations:connectors:source-postgres:build
```

from the root project directory. This will create a `virtualenv` and install dependencies for the connector you want to work on as well as any internal Airbyte python packages it depends on.
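If you prefer working from a terminal rather than an IDE, you can activate the virtualenv created by this build directly (a sketch, assuming the `.venv` layout described in the IDE section below):

```bash
# Activate the connector's virtualenv created by the Gradle build.
source airbyte-integrations/connectors/<connector directory name>/.venv/bin/activate
```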
When iterating on a single connector, you will often iterate by running a single Gradle command for that connector. This command will run:

1. [Black](https://pypi.org/project/black/) to format the code
2. [isort](https://pypi.org/project/isort/) to sort imports
3. [Flake8](https://pypi.org/project/flake8/) to check formatting
4. [MyPy](https://pypi.org/project/mypy/) to check type usage
## Formatting/linting

To format and lint your code before committing you can use the Gradle command above, but for convenience we also support the [pre-commit](https://pre-commit.com/) tool. To use it you need to install it first:

```bash
pip install pre-commit
```

then, to install `pre-commit` as a git hook, run

```text
pre-commit install
```

That's it, `pre-commit` will format/lint the code every time you commit something. You can find more information about pre-commit [here](https://pre-commit.com/).
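You can also run the configured hooks on demand, without making a commit, for example:

```bash
# Run every configured pre-commit hook against the entire repository.
pre-commit run --all-files
```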
## IDE

At Airbyte, we use IntelliJ IDEA for development. Although it is possible to develop connectors with any IDE, we typically recommend IntelliJ IDEA or PyCharm, since we actively work towards compatibility.

### Autocompletion

Install the [Pydantic](https://plugins.jetbrains.com/plugin/12861-pydantic) plugin. This will help autocompletion with some of our internal types.

### PyCharm (IntelliJ IDEA)

The following setup steps are written for PyCharm but should have similar equivalents for IntelliJ IDEA:

1. Go to `File -> New -> Project...`
2. Select `Pure Python`.
3. Select a project name like `airbyte` and a directory **outside of** the `airbyte` code root.
4. Go to `Preferences -> Project -> Python Interpreter`
5. Find the gear ⚙️ button next to the `Python interpreter` dropdown list, click it and select `Add`
6. Select `Virtual Environment -> Existing`
7. Set the interpreter path to the one that was created by the Gradle command, i.e. `airbyte-integrations/connectors/your-connector-dir/.venv/bin/python`.
8. Wait for PyCharm to finish indexing and loading skeletons from the selected virtual environment.

You should now have access to code completion and proper syntax highlighting for python projects.

If you need to work on another connector you can quickly change the current virtual environment in the bottom toolbar.
## Workflow for updating docs

1. Modify docs using Git or the Github UI (all docs live in the `docs/` folder in the [Airbyte repository](https://github.com/airbytehq/airbyte))
2. If you're adding new files, update `docs/SUMMARY.md`.
3. If you're moving existing pages, add redirects in the [`.gitbook.yaml` file](https://github.com/airbytehq/airbyte/blob/master/.gitbook.yaml) in the Airbyte repository root directory
4. Create a Pull Request

### Modify in the Github UI

1. Directly edit the docs you want to edit [in the Github UI](https://docs.github.com/en/github/managing-files-in-a-repository/managing-files-on-github/editing-files-in-your-repository)
2. Create a Pull Request
```bash
git clone git@github.com:{YOUR_USERNAME}/airbyte.git
cd airbyte
```

Or

```bash
git clone https://github.com/{YOUR_USERNAME}/airbyte.git
cd airbyte
```

{% hint style="info" %}
While cloning on Windows, you might encounter errors about long filenames. Refer to the instructions [here](../deploying-airbyte/local-deployment.md#handling-long-filename-error) to correct it.
{% endhint %}

3. Modify the documentation.
4. Create a pull request
## Documentation Best Practices

Connectors typically have the following documentation elements:

* READMEs
* Changelogs
* Source code comments
* How-to guides

Below are some best practices related to each of these.
### READMEs

Every module should have a README containing:

* A brief description of the module
* development pre-requisites (like which language or binaries are required for development)
* how to install dependencies
* how to build and run the code locally & via Docker
* any other information needed for local iteration

### Changelogs

#### Core

Core changelogs should be updated in the `docs/project-overview/platform.md` file.
#### Connectors

Each connector should have a CHANGELOG.md section in its public facing docs in `docs/integrations/<sources OR destinations>/<name>`, at the bottom of the page. Inside, each new connector version should have a section whose title is the connector's version number. The body of this section should describe the changes added in the new version. For example:

```text
| Version | Date | Pull Request | Subject |
| :------ | :-------- | :----- | :------ |
| 0.2.0 | 20XX-05-XX | [PR2#](https://github.com/airbytehq/airbyte/pull/PR2#) | Fixed bug with schema generation <br><br> Added a better description for the `password` input parameter |
| 0.1.0 | 20XX-04-XX | [PR#](https://github.com/airbytehq/airbyte/pull/PR#) | Added incremental sync |
```
### Source code comments

It's hard to pin down exactly what to do around source code comments, but there are two (very subjective and rough) guidelines:

**If something is not obvious, write it down**. Examples include:

* non-trivial class definitions should have docstrings
* magic variables should have comments explaining why those values are used (e.g: if using a page size of 10 in a connector, describe why if possible. If there is no reason, that's also fine, just mention it in a comment).
* complicated subroutines/logic which cannot be refactored should have comments explaining what they are doing and why

**If something is obvious, don't write it down** since it's probably more likely to go out of date. For example, a comment like `x = 42; // sets x to 42` is not adding any new information and is therefore better omitted.

### Issues & Pull Requests
#### Titles

**Describe outputs, not implementation**: An issue or PR title should describe the desired end result, not the implementation. The exception is child issues/subissues of an epic.

**Be specific about the domain**. Airbyte operates a monorepo, so being specific about what is being changed in the PR or issue title is important.

Some examples:

_subpar issue title_: `Remove airbyteCdk.dependsOn("unrelatedPackage")`. This describes a solution, not a problem.

_good issue title_: `Building the Airbyte Python CDK should not build unrelated packages`. Describes the desired end state, and the intent is understandable without reading the full issue.

_subpar PR title_: `Update tests`. Which tests? What was the update?

_good PR title_: `Source MySQL: update acceptance tests to connect to SSL-enabled database`. Specific about the domain and the change that was made.

**PR title conventions**

When creating a PR, follow the naming conventions depending on the change being made:

* Notable updates to Airbyte Core: "🎉<description of feature>"
  * e.g: `🎉 enable configuring un-nesting in normalization`
* New connectors: "🎉 New source or destination: <name>" e.g: `🎉 New Source: Okta`
* New connector features: "🎉 <Source or Destination> <name>: <feature description>" e.g:
  * `🎉 Destination Redshift: write JSONs as SUPER type instead of VARCHAR`
  * `🎉 Source MySQL: enable logical replication`
* Bugfixes should start with the 🐛 emoji
  * `🐛 Source Facebook Marketing: fix incorrect parsing of lookback window`
* Documentation improvements should start with any of the book/paper emojis: 📚 📝 etc…
* Any refactors, cleanups, etc. that are not visible improvements to the user should not have emojis

The emojis help us identify which commits should be included in the product release notes.
#### Descriptions

**Context**: Provide enough information (or a link to enough information) in the description so team members with no context can understand what the issue or PR is trying to accomplish. This usually means you should include:

1. Some background information motivating the problem
2. A description of the problem itself
3. Good places to start reading and file changes that can be skipped

Some examples:

_insufficient context_: `Create an OpenAPI to JSON schema generator`. Unclear what the value or problem being solved here is.

_good context_:

```text
When creating or updating connectors, we spend a lot of time manually transcribing JSON Schema files based on OpenAPI docs. This is necessary because OpenAPI and JSON schema are very similar but not perfectly compatible. This process is automatable. Therefore we should create a program which converts from OpenAPI to JSONSchema format.
```
**I have a Mac with the M1 chip. Is it possible to run Airbyte?**

Some users with M1-chip Macs are facing problems running Airbyte. The problem is related to the chip and Docker. [Issue #2017](https://github.com/airbytehq/airbyte/issues/2017) was created to follow up on the problem; you can subscribe to it and get updates about the resolution. If you can successfully run Airbyte using a MacBook with the M1 chip, let us know so that we can share the process with the community!

**Other issues**

If you encounter any issues, just connect to our [Slack](https://slack.airbyte.io). Our community will help! We also have a [troubleshooting](../troubleshooting/on-deploying.md) section in our docs for common problems.
Airbyte Cloud requires no setup and can be run immediately from your web browser.

If you don't have an invite, sign up [here!](https://airbyte.io/cloud-waitlist)

**2. Click on the default workspace.**

You will be provided 1000 credits to get your first few syncs going!

**4. You're done!**
# On Kubernetes (Beta)

## Overview

Airbyte allows scaling sync workloads horizontally using Kubernetes. The core components (api server, scheduler, etc.) run as deployments while the scheduler launches connector-related pods on different nodes.

## Getting Started

### Cluster Setup

For local testing we recommend following one of the following setup guides:

* [Docker Desktop (Mac)](https://docs.docker.com/desktop/kubernetes/)
* [Minikube](https://minikube.sigs.k8s.io/docs/start/)
  * NOTE: Start Minikube with at least 4gb RAM with `minikube start --memory=4000`
* [Kind](https://kind.sigs.k8s.io/docs/user/quick-start/)
For testing on GKE you can create a cluster with the command line or the Cloud Console.

For testing on EKS you can [install eksctl](https://eksctl.io/introduction/) and run `eksctl create cluster` to create an EKS cluster/VPC/subnets/etc. This process should take 10-15 minutes.

For production, Airbyte should function on most clusters v1.19 and above. We have tested support on GKE and EKS. If you run into a problem starting Airbyte, please reach out on the `#troubleshooting` channel on our [Slack](https://slack.airbyte.io/) or [create an issue on GitHub](https://github.com/airbytehq/airbyte/issues/new?assignees=&labels=type%2Fbug&template=bug-report.md&title=).

### Install `kubectl`
Configure `kubectl` to connect to your cluster by using `kubectl config use-context <my-cluster-name>`.

* For GKE
  * Configure `gcloud` with `gcloud auth login`.
  * On the Google Cloud Console, the cluster page will have a `Connect` button, which will give a command to run locally that looks like `gcloud container clusters get-credentials CLUSTER_NAME --zone ZONE_NAME --project PROJECT_NAME`.
  * Use `kubectl config get-contexts` to show the contexts available.
  * Run `kubectl config use-context <gke context>` to access the cluster from `kubectl`.
* For EKS
### Configure Logs

Both `dev` and `stable` versions of Airbyte include a stand-alone `Minio` deployment. Airbyte publishes logs to this `Minio` deployment by default. This means Airbyte comes as a **self-contained Kubernetes deployment - no other configuration is required**.

Airbyte currently supports logging to `Minio`, `S3` or `GCS`. The following instructions are for users wishing to log to their own `Minio` layer, `S3` bucket or `GCS` bucket.

The provided credentials require both read and write permissions. The logger attempts to create the log bucket if it does not exist.
#### Configuring Custom Minio Log Location

Replace the following variables in the `.env` file in the `kube/overlays/stable` directory:

```text
# The Minio bucket to write logs in.
S3_LOG_BUCKET=
AWS_SECRET_ACCESS_KEY=
# Endpoint where Minio is deployed at.
S3_MINIO_ENDPOINT=
```

The `S3_PATH_STYLE_ACCESS` variable should remain `true`. The `S3_LOG_BUCKET_REGION` variable should remain empty.
#### Configuring Custom S3 Log Location

Replace the following variables in the `.env` file in the `kube/overlays/stable` directory:

```text
# The S3 bucket to write logs in.
S3_LOG_BUCKET=
S3_MINIO_ENDPOINT=
S3_PATH_STYLE_ACCESS=
```

See [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) for instructions on creating an S3 bucket and [here](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) for instructions on creating AWS credentials.
#### Configuring Custom GCS Log Location

Create the GCP service account with read/write permission to the GCS log bucket.

1) Base64 encode the GCP json secret.

```text
# The output of this command will be a Base64 string.
$ cat gcp.json | base64
```

2) Populate the gcs-log-creds secret with the Base64-encoded credential. This is as simple as taking the encoded credential from the previous step and adding it to the `secret-gcs-log-creds.yaml` file.

```text
apiVersion: v1
kind: Secret
metadata:
data:
  gcp.json: <base64-encoded-string>
```

3) Replace the following variables in the `.env` file in the `kube/overlays/stable` directory:

```text
# The GCS bucket to write logs in.
GCP_STORAGE_BUCKET=
# The path the GCS creds are written to. Unless you know what you are doing, use the below default value.
GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcs-log-creds/gcp.json
```

See [here](https://cloud.google.com/storage/docs/creating-buckets) for instructions on creating a GCS bucket and [here](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#iam-service-account-keys-create-console) for instructions on creating GCP credentials.
### Launch Airbyte

Run the following commands to launch Airbyte:

```text
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
kubectl apply -k kube/overlays/stable
```

After 2-5 minutes, `kubectl get pods | grep airbyte` should show `Running` as the status for all the core Airbyte pods. This may take longer on Kubernetes clusters with slow internet connections.

Run `kubectl port-forward svc/airbyte-webapp-svc 8000:80` to allow access to the UI/API.
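Putting those two checks together, a typical first-launch verification looks like this:

```bash
# Confirm all core Airbyte pods are Running, then expose the UI/API on localhost:8000.
kubectl get pods | grep airbyte
kubectl port-forward svc/airbyte-webapp-svc 8000:80
```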
Now visit [http://localhost:8000](http://localhost:8000) in your browser.

### Increasing job parallelism

The number of simultaneous jobs (getting specs, checking connections, discovering schemas, and performing syncs) is limited by a few factors. First of all, `SUBMITTER_NUM_THREADS` (set in the `.env` file for your Kustomization overlay) provides a global limit on the number of simultaneous jobs that can run across all worker pods.

The number of worker pods can be changed by increasing the number of replicas for the `airbyte-worker` deployment. An example of a Kustomization patch that increases this number can be seen in `airbyte/kube/overlays/dev-integration-test/kustomization.yaml` and `airbyte/kube/overlays/dev-integration-test/parallelize-worker.yaml`. The number of simultaneous jobs on a specific worker pod is also limited by the number of ports exposed by the worker deployment and set by `TEMPORAL_WORKER_PORTS` in your `.env` file. Without additional ports used to communicate to connector pods, jobs will start to run but will hang until ports become available.

You can also tune environment variables for the maximum number of simultaneous jobs of each type that can run on the worker pod by setting `MAX_SPEC_WORKERS`, `MAX_CHECK_WORKERS`, `MAX_DISCOVER_WORKERS`, and `MAX_SYNC_WORKERS` for the worker pod deployment (not in the `.env` file). These values can be used if you want to create separate worker deployments for separate types of workers with different resource allocations.
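As a minimal sketch of what such an override looks like (the value below is purely illustrative; in practice you would bake it into your Kustomize overlay for the worker deployment):

```bash
# Example only: raise the sync-worker limit on the airbyte-worker deployment.
kubectl set env deployment/airbyte-worker MAX_SYNC_WORKERS=10
```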
### Cloud logging

Airbyte writes logs to two directories. App logs, including server and scheduler logs, are written to the `app-logging` directory. Job logs are written to the `job-logging` directory. Both directories live at the top level, e.g. the `app-logging` directory lives at `s3://log-bucket/app-logging`. These paths can change, so we recommend having a dedicated log bucket and not using this bucket for other purposes.

Airbyte publishes logs every minute. This means it is normal to see minute-long log delays. Each publish creates its own log file, since cloud storages do not support append operations. This also means it is normal to see hundreds of files in your log bucket.

Each log file is named `{yyyyMMddHH24mmss}_{podname}_{UUID}` and is not compressed. Users can view logs simply by navigating to the relevant folder and downloading the file for the time period in question.
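For example, if you log to S3 you can browse the published files with the AWS CLI (the bucket name below is the placeholder used above):

```bash
# List the job log files most recently written to the log bucket.
aws s3 ls s3://log-bucket/job-logging/ --recursive | tail
```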
See the [Known Issues](on-kubernetes.md#known-issues) section for planned logging improvements.
### Using an external DB

After [Issue #3605](https://github.com/airbytehq/airbyte/issues/3605) is completed, users will be able to configure custom DBs instead of a simple `postgres` container running directly in Kubernetes. This separate instance (preferably on a managed system like AWS RDS or Google Cloud SQL) should be easier and safer to maintain than Postgres on your cluster.
## Known Issues

As we improve our Kubernetes offering, we would like to point out some common pain points. We are working on improving these. Please let us know if there are any other issues blocking your adoption of Airbyte or if you would like to contribute fixes to address any of these issues.

* Some UI operations have higher latency on Kubernetes than Docker-Compose. ([#4233](https://github.com/airbytehq/airbyte/issues/4233))
* Logging to Azure Storage is not supported. ([#4200](https://github.com/airbytehq/airbyte/issues/4200))
* Large log files might take a while to load. ([#4201](https://github.com/airbytehq/airbyte/issues/4201))
* UI does not include configured buckets in the displayed log path. ([#4204](https://github.com/airbytehq/airbyte/issues/4204))
* Logs are not reset when Airbyte is re-deployed. ([#4235](https://github.com/airbytehq/airbyte/issues/4235))
* File sources reading from and file destinations writing to local mounts are not supported on Kubernetes.
## Customizing Airbyte Manifests

We use [Kustomize](https://kustomize.io/) to allow overrides for different environments. Our shared resources are in the `kube/resources` directory, and we define overlays for each environment. We recommend creating your own overlay if you want to customize your deployments. This overlay can live in your own VCS.

For an example `kustomization.yaml` file, see the overlays under `kube/overlays` (e.g. `kube/overlays/stable`) in the repository.

### View Raw Manifests

For a specific overlay, you can run `kubectl kustomize kube/overlays/stable` to view the manifests that Kustomize will apply to your Kubernetes cluster. This is useful for debugging because it will show the exact resources you are defining.

### Helm Charts

Check out the Helm Chart Readme in the Airbyte repository.
## Operator Guide

### View API Server Logs

`kubectl logs deployments/airbyte-server` to view real-time logs. Logs can also be downloaded as a text file via the Admin tab in the UI.

### View Scheduler or Job Logs

`kubectl logs deployments/airbyte-scheduler` to view real-time logs. Logs can also be downloaded as a text file via the Admin tab in the UI.

### Connector Container Logs

Although all logs can be accessed by viewing the scheduler logs, connector container logs may be easier to understand when isolated by accessing them from the Airbyte UI or the [Airbyte API](../api-documentation.md) for a specific job attempt. Connector pods launched by Airbyte will not relay logs directly to Kubernetes logging. You must access these logs through Airbyte.
### Upgrading Airbyte Kube

See [Upgrading K8s](../operator-guides/upgrading-airbyte.md).

### Resizing Volumes

To resize a volume, change the `.spec.resources.requests.storage` value. After re-applying, the mount should be extended if that operation is supported for your type of mount. For a production deployment, it's useful to track the usage of volumes to ensure they don't run out of space.
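As a rough sketch of what that looks like from the command line (the claim name and size below are purely illustrative):

```bash
# Example only: request more storage on an existing PersistentVolumeClaim.
kubectl patch pvc airbyte-volume-workspace -p '{"spec":{"resources":{"requests":{"storage":"1Gi"}}}}'
```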
### Copy Files To/From Volumes

See the documentation for [`kubectl cp`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cp).
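For example, to pull the job log file shown in the "Reading Files" example below onto your local machine (your pod name will differ):

```bash
# Copy a job log out of the scheduler pod into the current directory.
kubectl cp airbyte-scheduler-6b5747df5c-bj4fx:/tmp/workspace/8/0/logs.log ./logs.log
```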
### Listing Files

```bash
kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx ls /tmp/workspace/8
```

### Reading Files

```bash
kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx cat /tmp/workspace/8/0/logs.log
```

### Persistent storage on GKE regional cluster

Running Airbyte on a GKE regional cluster requires enabling persistent regional storage. To do so, enable the [CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/gce-pd-csi-driver) on GKE. After enabling, add `storageClassName: standard-rwo` to the [volume-configs](https://github.com/airbytehq/airbyte/tree/86ee2ad05bccb4aca91df2fb07c412efde5ba71c/kube/resources/volume-configs.yaml) yaml.
`volume-configs.yaml` example:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
spec:
  storageClassName: standard-rwo
```
## Troubleshooting

If you run into any problems operating Airbyte on Kubernetes, please reach out on the `#issues` channel on our [Slack](https://slack.airbyte.io/) or [create an issue on GitHub](https://github.com/airbytehq/airbyte/issues/new?assignees=&labels=type%2Fbug&template=bug-report.md&title=).

## Developing Airbyte on Kubernetes

[Read about the Kubernetes dev cycle!](https://docs.airbyte.io/contributing-to-airbyte/developing-on-kubernetes)
# On Oracle Cloud Infrastructure VM

Install Airbyte on an Oracle Cloud Infrastructure VM running Oracle Linux 7.

## Create OCI Instance

Go to OCI Console > Compute > Instances > Create Instance

## Whitelist Port 8000 for a CIDR range in Security List of OCI VM Subnet

Go to OCI Console > Networking > Virtual Cloud Network

Select the Subnet > Security List > Add Ingress Rules
## Login to the Instance/VM with the SSH key and 'opc' user

```text
chmod 600 private-key-file

ssh -i private-key-file opc@oci-private-instance-ip -p 2200
```

### Install Docker

```text
sudo service docker start

sudo usermod -a -G docker $USER
```
### Install Docker Compose

```text
sudo wget https://github.com/docker/compose/releases/download/1.26.2/docker-compose-$(uname -s)-$(uname -m) -O /usr/local/bin/docker-compose

sudo chmod +x /usr/local/bin/docker-compose

docker-compose --version
```

### Install Airbyte

```text
mkdir airbyte && cd airbyte

wget https://raw.githubusercontent.com/airbytehq/airbyte/master/{.env,docker-compose.yaml}

which docker-compose

sudo /usr/local/bin/docker-compose up -d
```
## Create SSH Tunnel to Login to the Instance

It is highly recommended not to have a Public IP for the instance where you are running Airbyte.

### SSH Local Port Forward to Airbyte VM

From your local workstation:

```text
ssh opc@bastion-host-public-ip -i <private-key-file.key> -L 2200:oci-private-instance-ip:22
ssh opc@localhost -i <private-key-file.key> -p 2200
```

### Airbyte GUI Local Port Forward to Airbyte VM

```text
ssh opc@bastion-host-public-ip -i <private-key-file.key> -L 8000:oci-private-instance-ip:8000
```

## Access Airbyte

Open the URL in a browser: [http://localhost:8000/](http://localhost:8000/)

_Please note: Airbyte currently does not support SSL/TLS certificates._
# Connector Catalog

## Connector grades

Airbyte uses a grading system for connectors to help users understand what to expect from a connector. There are three grades, explained below:

**Certified**: This connector has been proven to be robust via usage by a large number of users and extensive testing.

**Beta**: While this connector is well tested and is expected to work a majority of the time, it was released recently. There may be some unhandled edge cases, but Airbyte will provide very quick turnaround for support on any issues (we'll publish our target KPIs for support turnaround very soon). All beta connectors will make their way to certified status after enough field testing.

**Alpha**: This connector is either not sufficiently tested, has extremely limited functionality (e.g: created as an example connector), or for any other reason may not be very mature.
### Sources

| Connector | Grade |
| :--- | :--- |
| [Amazon Seller Partner](sources/amazon-seller-partner.md) | Alpha |
| [Amplitude](sources/amplitude.md) | Beta |
| [Apify Dataset](sources/apify-dataset.md) | Alpha |
| [Appstore](sources/appstore.md) | Alpha |
| [Asana](sources/asana.md) | Beta |
| [AWS CloudTrail](sources/aws-cloudtrail.md) | Beta |
| [BambooHR](sources/bamboo-hr.md) | Alpha |
| [Braintree](sources/braintree.md) | Alpha |
| [BigCommerce](sources/bigcommerce.md) | Alpha |
| [BigQuery](sources/bigquery.md) | Beta |
| [Bing Ads](sources/bing-ads.md) | Beta |
| [Cart](sources/cart.md) | Beta |
| [Chargebee](sources/chargebee.md) | Alpha |
| [ClickHouse](sources/clickhouse.md) | Beta |
| [Close.com](sources/close-com.md) | Beta |
| [CockroachDB](sources/cockroachdb.md) | Beta |
| [Db2](sources/db2.md) | Beta |
| [Dixa](sources/dixa.md) | Alpha |
| [Drift](sources/drift.md) | Beta |
| [Drupal](sources/drupal.md) | Beta |
| [Exchange Rates API](sources/exchangeratesapi.md) | Certified |
| [Facebook Marketing](sources/facebook-marketing.md) | Beta |
| [Facebook Pages](sources/facebook-pages.md) | Alpha |
| [Files](sources/file.md) | Certified |
| [Freshdesk](sources/freshdesk.md) | Certified |
| [GitHub](sources/github.md) | Beta |
| [GitLab](sources/gitlab.md) | Beta |
| [Google Ads](sources/google-ads.md) | Beta |
| [Google Adwords](sources/google-adwords.md) | Beta |
| [Google Analytics v4](sources/google-analytics-v4.md) | Beta |
| [Google Directory](sources/google-directory.md) | Certified |
| [Google Search Console](sources/google-search-console.md) | Beta |
| [Google Sheets](sources/google-sheets.md) | Certified |
| [Google Workspace Admin Reports](sources/google-workspace-admin-reports.md) | Certified |
| [Greenhouse](sources/greenhouse.md) | Beta |
| [Hubspot](sources/hubspot.md) | Certified |
| [Instagram](sources/instagram.md) | Certified |
| [Intercom](sources/intercom.md) | Beta |
| [Iterable](sources/iterable.md) | Beta |
| [Jira](sources/jira.md) | Certified |
| [Klaviyo](sources/klaviyo.md) | Beta |
| [LinkedIn Ads](sources/linkedin-ads.md) | Beta |
| [Kustomer](sources/kustomer.md) | Alpha |
| [Lever Hiring](sources/lever-hiring.md) | Beta |
| [Looker](sources/looker.md) | Beta |
| [Magento](sources/magento.md) | Beta |
| [Mailchimp](sources/mailchimp.md) | Certified |
| [Marketo](sources/marketo.md) | Beta |
| [Microsoft SQL Server (MSSQL)](sources/mssql.md) | Certified |
| [Microsoft Dynamics AX](sources/microsoft-dynamics-ax.md) | Beta |
| [Microsoft Dynamics Customer Engagement](sources/microsoft-dynamics-customer-engagement.md) | Beta |
| [Microsoft Dynamics GP](sources/microsoft-dynamics-gp.md) | Beta |
| [Microsoft Dynamics NAV](sources/microsoft-dynamics-nav.md) | Beta |
| [Microsoft Teams](sources/microsoft-teams.md) | Certified |
| [Mixpanel](sources/mixpanel.md) | Beta |
| [Mongo DB](sources/mongodb-v2.md) | Beta |
| [MySQL](sources/mysql.md) | Certified |
| [Okta](sources/okta.md) | Beta |
| [Oracle DB](sources/oracle.md) | Certified |
| [Oracle PeopleSoft](sources/oracle-peoplesoft.md) | Beta |
| [Oracle Siebel CRM](sources/oracle-siebel-crm.md) | Beta |
| [PayPal Transaction](sources/paypal-transaction.md) | Beta |
| [Pipedrive](sources/pipedrive.md) | Alpha |
| [Plaid](sources/plaid.md) | Alpha |
| [PokéAPI](sources/pokeapi.md) | Beta |
| [Postgres](sources/postgres.md) | Certified |
| [PostHog](sources/posthog.md) | Beta |
| [PrestaShop](sources/presta-shop.md) | Beta |
| [Quickbooks](sources/quickbooks.md) | Beta |
| [Recharge](sources/recharge.md) | Beta |
| [Recurly](sources/recurly.md) | Beta |
| [Redshift](sources/redshift.md) | Certified |
| [S3](sources/s3.md) | Alpha |
| [Salesforce](sources/salesforce.md) | Certified |
| [SAP Business One](sources/sap-business-one.md) | Beta |
| [Sendgrid](sources/sendgrid.md) | Certified |
| [Shopify](sources/shopify.md) | Certified |
| [Short.io](sources/shortio.md) | Beta |
| [Slack](sources/slack.md) | Beta |
| [Spree Commerce](sources/spree-commerce.md) | Beta |
| [Smartsheets](sources/smartsheets.md) | Beta |
| [Snowflake](sources/snowflake.md) | Beta |
| [Square](sources/square.md) | Beta |
| [Stripe](sources/stripe.md) | Certified |
| [Sugar CRM](sources/sugar-crm.md) | Beta |
| [SurveyMonkey](sources/surveymonkey.md) | Beta |
| [Tempo](sources/tempo.md) | Beta |
| [Trello](sources/trello.md) | Beta |
| [Twilio](sources/twilio.md) | Beta |
| [US Census](sources/us-census.md) | Alpha |
| [WooCommerce](https://github.com/airbytehq/airbyte/tree/8d599c86a84726235c765c78db1ddd85c558bf7f/docs/integrations/sources/woo-commerce.md) | Beta |
| [Wordpress](sources/wordpress.md) | Beta |
| [Zencart](sources/zencart.md) | Beta |
| [Zendesk Chat](sources/zendesk-chat.md) | Certified |
| [Zendesk Sunshine](sources/zendesk-sunshine.md) | Beta |
| [Zendesk Support](sources/zendesk-support.md) | Certified |
| [Zendesk Talk](sources/zendesk-talk.md) | Certified |
| [Zoom](sources/zoom.md) | Beta |
| [Zuora](sources/zuora.md) | Beta |
### Destinations
|
||||
|
||||
| Connector | Grade |
|
||||
|----|----|
|
||||
|[AzureBlobStorage](./destinations/azureblobstorage.md)| Alpha |
|
||||
|[BigQuery](./destinations/bigquery.md)| Certified |
|
||||
|[Chargify (Keen)](./destinations/keen.md)| Alpha |
|
||||
|[Databricks](./destinations/databricks.md) | Beta |
|
||||
|[Google Cloud Storage (GCS)](./destinations/gcs.md)| Alpha |
|
||||
|[Google Pubsub](./destinations/pubsub.md)| Alpha |
|
||||
|[Kafka](./destinations/kafka.md)| Alpha |
|
||||
|[Keen](./destinations/keen.md)| Alpha |
|
||||
|[Local CSV](./destinations/local-csv.md)| Certified |
|
||||
|[Local JSON](./destinations/local-json.md)| Certified |
|
||||
|[MeiliSearch](./destinations/meilisearch.md)| Beta |
|
||||
|[MongoDB](./destinations/mongodb.md)| Alpha |
|
||||
|[MySQL](./destinations/mysql.md)| Beta |
|
||||
|[Oracle](./destinations/oracle.md)| Alpha |
|
||||
|[Postgres](./destinations/postgres.md)| Certified |
|
||||
|[Redshift](./destinations/redshift.md)| Certified |
|
||||
|[S3](./destinations/s3.md)| Certified |
|
||||
|[SQL Server (MSSQL)](./destinations/mssql.md)| Alpha |
|
||||
|[Snowflake](./destinations/snowflake.md)| Certified |
|
||||
| :--- | :--- |
|
||||
| [AzureBlobStorage](destinations/azureblobstorage.md) | Alpha |
|
||||
| [BigQuery](destinations/bigquery.md) | Certified |
|
||||
| [Chargify \(Keen\)](destinations/keen.md) | Alpha |
|
||||
| [Databricks](destinations/databricks.md) | Beta |
|
||||
| [Google Cloud Storage \(GCS\)](destinations/gcs.md) | Alpha |
|
||||
| [Google Pubsub](destinations/pubsub.md) | Alpha |
|
||||
| [Kafka](destinations/kafka.md) | Alpha |
|
||||
| [Keen](destinations/keen.md) | Alpha |
|
||||
| [Local CSV](destinations/local-csv.md) | Certified |
|
||||
| [Local JSON](destinations/local-json.md) | Certified |
|
||||
| [MeiliSearch](destinations/meilisearch.md) | Beta |
|
||||
| [MongoDB](destinations/mongodb.md) | Alpha |
|
||||
| [MySQL](destinations/mysql.md) | Beta |
|
||||
| [Oracle](destinations/oracle.md) | Alpha |
|
||||
| [Postgres](destinations/postgres.md) | Certified |
|
||||
| [Redshift](destinations/redshift.md) | Certified |
|
||||
| [S3](destinations/s3.md) | Certified |
|
||||
| [SQL Server \(MSSQL\)](destinations/mssql.md) | Alpha |
|
||||
| [Snowflake](destinations/snowflake.md) | Certified |
|
||||
|
||||
|
||||
@@ -6,13 +6,13 @@ description: Missing a connector?
|
||||
|
||||
If you'd like to **ask for a new connector,** you can request it directly [here](https://github.com/airbytehq/airbyte/issues/new?assignees=&labels=area%2Fintegration%2C+new-integration&template=new-integration-request.md&title=).
|
||||
|
||||
If you'd like to build new connectors and **make them part of the pool of pre-built connectors on Airbyte,** first a big thank you. We invite you to check our [contributing guide on building connectors](../contributing-to-airbyte/README.md).
|
||||
If you'd like to build new connectors and **make them part of the pool of pre-built connectors on Airbyte,** first a big thank you. We invite you to check our [contributing guide on building connectors](../contributing-to-airbyte/).
|
||||
|
||||
If you'd like to build new connectors, or update existing ones, **for your own usage,** without contributing to the Airbyte codebase, read along.
|
||||
|
||||
## Developing your own connector
|
||||
|
||||
It's easy to code your own connectors on Airbyte. Here is a link to instruct on how to code new sources and destinations: [building new connectors](../contributing-to-airbyte/README.md)
|
||||
It's easy to code your own connectors on Airbyte. Here are the instructions on how to code new sources and destinations: [building new connectors](../contributing-to-airbyte/)
|
||||
|
||||
While the guides in the link above are specific to the languages used most frequently to write integrations, **Airbyte connectors can be written in any language**. Please reach out to us if you'd like help developing connectors in other languages.
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Azure Blob Storage
|
||||
# AzureBlobStorage
|
||||
|
||||
## Overview
|
||||
|
||||
@@ -11,43 +11,42 @@ The Airbyte Azure Blob Storage destination allows you to sync data to Azure Blob
|
||||
| Feature | Support | Notes |
|
||||
| :--- | :---: | :--- |
|
||||
| Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured blob. |
|
||||
| Incremental - Append Sync | ✅ | The append mode would only work for "Append blobs" blobs as per Azure limitations, more details https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction#blobs |
|
||||
| Incremental - Append Sync | ✅ | Append mode only works with "Append blob" blob types, per Azure limitations; for more details see [https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction\#blobs](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction#blobs). A minimal append-blob sketch follows this table. |
|
||||
| Incremental - Deduped History | ❌ | As this connector does not support dbt, we don't support this sync mode on this destination. |
|
||||
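The following is a minimal illustrative sketch (not the connector's own code) of how an Azure append blob accumulates data, using the `azure-storage-blob` Python SDK. The account name, key, container, and blob path are placeholders.

```python
# Illustrative only: Azure "append blobs" accumulate data block by block, which is
# what Incremental - Append relies on. All names and credentials below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account-name>.blob.core.windows.net",
    credential="<account-key>",
)
blob = service.get_blob_client(container="airbytecontainer", blob="users/part_0.jsonl")

blob.create_append_blob()                  # create the blob as an append blob
blob.append_block(b'{"user_id": 123}\n')   # each sync appends new blocks of records
blob.append_block(b'{"user_id": 456}\n')
```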
|
||||
## Configuration
|
||||
|
||||
| Parameter | Type | Notes |
|
||||
| :--- | :---: | :--- |
|
||||
| Endpoint Domain Name | string | This is Azure Blob Storage endpoint domain name. Leave default value (or leave it empty if run container from command line) to use Microsoft native one. |
|
||||
| Azure blob storage container (Bucket) Name | string | A name of the Azure blob storage container. If not exists - will be created automatically. If leave empty, then will be created automatically airbytecontainer+timestamp. |
|
||||
| Endpoint Domain Name | string | This is the Azure Blob Storage endpoint domain name. Leave the default value \(or leave it empty if running the container from the command line\) to use the Microsoft native endpoint. |
|
||||
| Azure blob storage container \(Bucket\) Name | string | The name of the Azure Blob Storage container. If it does not exist, it will be created automatically. If left empty, a container named `airbytecontainer+timestamp` will be created automatically. |
|
||||
| Azure Blob Storage account name | string | The account's name of the Azure Blob Storage. |
|
||||
| The Azure blob storage account key | string | Azure blob storage account key. Example: `abcdefghijklmnopqrstuvwxyz/0123456789+ABCDEFGHIJKLMNOPQRSTUVWXYZ/0123456789%++sampleKey==`. |
|
||||
| Format | object | Format specific configuration. See below for details. |
|
||||
|
||||
⚠️ Please note that under "Full Refresh Sync" mode, data in the configured blob will be wiped out before each sync. We recommend you to provision a dedicated Azure Blob Storage Container resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️
|
||||
|
||||
|
||||
## Output Schema
|
||||
|
||||
Each stream will be outputted to its dedicated Blob according to the configuration. The complete datastore of each stream includes all the output files under that Blob. You can think of the Blob as equivalent of a Table in the database world.
|
||||
|
||||
- Under Full Refresh Sync mode, old output files will be purged before new files are created.
|
||||
- Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
|
||||
* Under Full Refresh Sync mode, old output files will be purged before new files are created.
|
||||
* Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
|
||||
|
||||
### CSV
|
||||
|
||||
Like most of the other Airbyte destination connectors, usually the output has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob to multiple columns.
|
||||
Like most of the other Airbyte destination connectors, usually the output has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize \(flatten\) the data blob to multiple columns.
|
||||
|
||||
| Column | Condition | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `_airbyte_ab_id` | Always exists | A uuid assigned by Airbyte to each processed record. |
|
||||
| `_airbyte_emitted_at` | Always exists. | A timestamp representing when the event was pulled from the data source. |
|
||||
| `_airbyte_data` | When no normalization (flattening) is needed, all data reside under this column as a json blob. |
|
||||
| root level fields | When root level normalization (flattening) is selected, the root level fields are expanded. |
|
||||
| `_airbyte_data` | When no normalization \(flattening\) is needed. | All data reside under this column as a json blob. |
|
||||
| root level fields | When root level normalization \(flattening\) is selected. | The root level fields are expanded. |
|
||||
|
||||
For example, given the following json object from a source:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"user_id": 123,
|
||||
"name": {
|
||||
@@ -69,11 +68,11 @@ With root level normalization, the output CSV is:
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `26d73cde-7eb1-4e1e-b7db-a4c03b4cf206` | 1622135805000 | 123 | `{ "first": "John", "last": "Doe" }` |
|
||||
|
||||
### JSON Lines (JSONL)
|
||||
### JSON Lines \(JSONL\)
|
||||
|
||||
[Json Lines](https://jsonlines.org/) is a text format with one JSON per line. Each line has a structure as follows:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"_airbyte_ab_id": "<uuid>",
|
||||
"_airbyte_emitted_at": "<timestamp-in-millis>",
|
||||
@@ -83,7 +82,7 @@ With root level normalization, the output CSV is:
|
||||
|
||||
For example, given the following two json objects from a source:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
[
|
||||
{
|
||||
"user_id": 123,
|
||||
@@ -104,7 +103,7 @@ For example, given the following two json objects from a source:
|
||||
|
||||
They will be like this in the output file:
|
||||
|
||||
```jsonl
|
||||
```text
|
||||
{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
|
||||
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
|
||||
```
|
||||
@@ -114,17 +113,17 @@ They will be like this in the output file:
|
||||
### Requirements
|
||||
|
||||
1. Create an AzureBlobStorage account.
|
||||
2. Check if it works under https://portal.azure.com/ -> "Storage explorer (preview)".
|
||||
2. Check if it works under [https://portal.azure.com/](https://portal.azure.com/) -> "Storage explorer \(preview\)".
|
||||
|
||||
### Setup guide
|
||||
|
||||
* Fill up AzureBlobStorage info
|
||||
* **Endpoint Domain Name**
|
||||
* Leave default value (or leave it empty if run container from command line) to use Microsoft native one or use your own.
|
||||
* Leave default value \(or leave it empty if run container from command line\) to use Microsoft native one or use your own.
|
||||
* **Azure blob storage container**
|
||||
* If not exists - will be created automatically. If leave empty, then will be created automatically airbytecontainer+timestamp..
|
||||
* **Azure Blob Storage account name**
|
||||
* See [this](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal) on how to create an account.
|
||||
* See [this](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal) on how to create an account.
|
||||
* **The Azure blob storage account key**
|
||||
* Corresponding key to the above user.
|
||||
* **Format**
|
||||
@@ -133,10 +132,9 @@ They will be like this in the output file:
|
||||
* This depends on your networking setup.
|
||||
* The easiest way to verify if Airbyte is able to connect to your Azure blob storage container is via the check connection tool in the UI.
|
||||
|
||||
|
||||
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.0 | 2021-08-30 | [#5332](https://github.com/airbytehq/airbyte/pull/5332) | Initial release with JSONL and CSV output. |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.0 | 2021-08-30 | [\#5332](https://github.com/airbytehq/airbyte/pull/5332) | Initial release with JSONL and CSV output. |
|
||||
|
||||
|
||||
@@ -13,7 +13,7 @@ description: >-
|
||||
| Full Refresh Sync | Yes | |
|
||||
| Incremental - Append Sync | Yes | |
|
||||
| Incremental - Deduped History | Yes | |
|
||||
| Bulk loading | Yes | |
|
||||
| Bulk loading | Yes | |
|
||||
| Namespaces | Yes | |
|
||||
|
||||
There are two flavors of connectors for this destination:
|
||||
@@ -30,10 +30,10 @@ Check out common troubleshooting issues for the BigQuery destination connector o
|
||||
Each stream will be output into its own table in BigQuery. Each table will contain 3 columns:
|
||||
|
||||
* `_airbyte_ab_id`: a uuid assigned by Airbyte to each event that is processed. The column type in BigQuery is `String`.
|
||||
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in BigQuery is `String`. Due to a Google [limitations](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#data_types) for data migration from GCs to BigQuery by its native job - the timestamp (seconds from 1970' can't be used). Only date format, so only String is accepted for us in this case.
|
||||
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in BigQuery is `String`. Due to a Google [limitation](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#data_types) on data migration from GCS to BigQuery by its native job, a timestamp \(seconds since 1970\) can't be used; only a date format is supported, so only String is accepted in this case.
|
||||
* `_airbyte_data`: a json blob representing the event data. The column type in BigQuery is `String`. \(See the example query below.\)
|
||||
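For illustration, assuming a raw table named `my_dataset._airbyte_raw_users` (the project, dataset, and table names here are hypothetical), these columns can be queried back with the BigQuery Python client and `JSON_EXTRACT_SCALAR`:

```python
# Hypothetical example: read the three raw columns and pull one field out of the
# `_airbyte_data` JSON blob. Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      _airbyte_ab_id,
      _airbyte_emitted_at,
      JSON_EXTRACT_SCALAR(_airbyte_data, '$.user_id') AS user_id
    FROM `my_project.my_dataset._airbyte_raw_users`
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row["_airbyte_ab_id"], row["user_id"])
```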
|
||||
## Getting Started (Airbyte Open-Source / Airbyte Cloud)
|
||||
## Getting Started \(Airbyte Open-Source / Airbyte Cloud\)
|
||||
|
||||
#### Requirements
|
||||
|
||||
@@ -45,6 +45,7 @@ To use the BigQuery destination, you'll need:
|
||||
* A Service Account Key to authenticate into your Service Account
|
||||
|
||||
For GCS Staging upload mode:
|
||||
|
||||
* GCS role enabled for the same user as used for BigQuery
|
||||
* HMAC key obtained for user. Currently, only the [HMAC key](https://cloud.google.com/storage/docs/authentication/hmackeys) is supported. More credential types will be added in the future.
|
||||
|
||||
@@ -88,39 +89,42 @@ You should now have all the requirements needed to configure BigQuery as a desti
|
||||
* **Dataset Location**
|
||||
* **Dataset ID**: the name of the schema where the tables will be created.
|
||||
* **Service Account Key**: the contents of your Service Account Key JSON file
|
||||
* **Google BigQuery client chunk size**: Google BigQuery client's chunk(buffer) size (MIN=1, MAX = 15) for each table. The default 15MiB value is used if not set explicitly. It's recommended to decrease value for big data sets migration for less HEAP memory consumption and avoiding crashes. For more details refer to https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html
|
||||
* **Google BigQuery client chunk size**: Google BigQuery client's chunk \(buffer\) size \(MIN=1, MAX=15\) for each table. The default 15MiB value is used if not set explicitly. It's recommended to decrease this value when migrating big data sets, to reduce heap memory consumption and avoid crashes. For more details refer to [https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html)
|
||||
|
||||
Once you've configured BigQuery as a destination, delete the Service Account Key from your computer.
|
||||
|
||||
#### Uploading Options
|
||||
|
||||
There are 2 available options to upload data to BigQuery `Standard` and `GCS Staging`.
|
||||
- `Standard` is option to upload data directly from your source to BigQuery storage. This way is faster and requires less resources than GCS one.
|
||||
|
||||
* `Standard` is option to upload data directly from your source to BigQuery storage. This way is faster and requires less resources than GCS one.
|
||||
|
||||
  Please be aware that you may see some failures for big datasets and slow sources, e.g. if reading from the source takes more than 10-12 hours.
|
||||
This is caused by the Google BigQuery SDK client limitations. For more details please check https://github.com/airbytehq/airbyte/issues/3549
|
||||
- `GCS Uploading (CSV format)`: This approach has been implemented in order to avoid the issue for big datasets mentioned above.
|
||||
|
||||
This is caused by the Google BigQuery SDK client limitations. For more details please check [https://github.com/airbytehq/airbyte/issues/3549](https://github.com/airbytehq/airbyte/issues/3549)
|
||||
|
||||
* `GCS Uploading (CSV format)`: This approach has been implemented in order to avoid the issue for big datasets mentioned above.
|
||||
|
||||
  In the first step, all data is uploaded to a GCS bucket, and then it is all moved to BigQuery in one shot, stream by stream.
|
||||
The [destination-gcs connector](./gcs.md) is partially used under the hood here, so you may check its documentation for more details.
|
||||
|
||||
The [destination-gcs connector](gcs.md) is partially used under the hood here, so you may check its documentation for more details.
|
||||
|
||||
For the GCS Staging upload type additional params must be configured:
|
||||
|
||||
* **GCS Bucket Name**
|
||||
* **GCS Bucket Path**
|
||||
* **GCS Bucket Keep files after migration**
|
||||
* See [this](https://cloud.google.com/storage/docs/creating-buckets) to create an S3 bucket.
|
||||
* **HMAC Key Access ID**
|
||||
* See [this](https://cloud.google.com/storage/docs/authentication/hmackeys) on how to generate an access key.
|
||||
* We recommend creating an Airbyte-specific user or service account. This user or account will require read and write permissions to objects in the bucket.
|
||||
* **Secret Access Key**
|
||||
* Corresponding key to the above access ID.
|
||||
* Make sure your GCS bucket is accessible from the machine running Airbyte.
|
||||
* This depends on your networking setup.
|
||||
* The easiest way to verify if Airbyte is able to connect to your GCS bucket is via the check connection tool in the UI.
|
||||
* **GCS Bucket Name**
|
||||
* **GCS Bucket Path**
|
||||
* **GCS Bucket Keep files after migration**
|
||||
  * See [this](https://cloud.google.com/storage/docs/creating-buckets) to create a GCS bucket.
|
||||
* **HMAC Key Access ID**
|
||||
* See [this](https://cloud.google.com/storage/docs/authentication/hmackeys) on how to generate an access key.
|
||||
* We recommend creating an Airbyte-specific user or service account. This user or account will require read and write permissions to objects in the bucket.
|
||||
* **Secret Access Key**
|
||||
* Corresponding key to the above access ID.
|
||||
* Make sure your GCS bucket is accessible from the machine running Airbyte.
|
||||
* This depends on your networking setup.
|
||||
* The easiest way to verify if Airbyte is able to connect to your GCS bucket is via the check connection tool in the UI.
|
||||
|
||||
|
||||
Note:
|
||||
It partially re-uses the destination-gcs connector under the hood. So you may also refer to its guide for additional clarifications.
|
||||
**GCS Region** for GCS would be used the same as set for BigQuery
|
||||
**Format** - Gcs format is set to CSV
|
||||
Note: It partially re-uses the destination-gcs connector under the hood, so you may also refer to its guide for additional clarifications. **GCS Region**: the same region as set for BigQuery is used for GCS. **Format**: the GCS format is set to CSV.
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
@@ -143,24 +147,25 @@ Therefore, Airbyte BigQuery destination will convert any invalid characters into
|
||||
### bigquery
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.4.0 | 2021-10-04 | [#6733](https://github.com/airbytehq/airbyte/issues/6733) | Support dataset starting with numbers |
|
||||
| 0.4.0 | 2021-08-26 | [#5296](https://github.com/airbytehq/airbyte/issues/5296) | Added GCS Staging uploading option |
|
||||
| 0.3.12 | 2021-08-03 | [#3549](https://github.com/airbytehq/airbyte/issues/3549) | Add optional arg to make a possibility to change the BigQuery client's chunk\buffer size |
|
||||
| 0.3.11 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.3.10 | 2021-07-28 | [#3549](https://github.com/airbytehq/airbyte/issues/3549) | Add extended logs and made JobId filled with region and projectId |
|
||||
| 0.3.9 | 2021-07-28 | [#5026](https://github.com/airbytehq/airbyte/pull/5026) | Add sanitized json fields in raw tables to handle quotes in column names |
|
||||
| 0.3.6 | 2021-06-18 | [#3947](https://github.com/airbytehq/airbyte/issues/3947) | Service account credentials are now optional. |
|
||||
| 0.3.4 | 2021-06-07 | [#3277](https://github.com/airbytehq/airbyte/issues/3277) | Add dataset location option |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.4.0 | 2021-10-04 | [\#6733](https://github.com/airbytehq/airbyte/issues/6733) | Support dataset starting with numbers |
|
||||
| 0.4.0 | 2021-08-26 | [\#5296](https://github.com/airbytehq/airbyte/issues/5296) | Added GCS Staging uploading option |
|
||||
| 0.3.12 | 2021-08-03 | [\#3549](https://github.com/airbytehq/airbyte/issues/3549) | Add optional arg to make a possibility to change the BigQuery client's chunk\buffer size |
|
||||
| 0.3.11 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.3.10 | 2021-07-28 | [\#3549](https://github.com/airbytehq/airbyte/issues/3549) | Add extended logs and made JobId filled with region and projectId |
|
||||
| 0.3.9 | 2021-07-28 | [\#5026](https://github.com/airbytehq/airbyte/pull/5026) | Add sanitized json fields in raw tables to handle quotes in column names |
|
||||
| 0.3.6 | 2021-06-18 | [\#3947](https://github.com/airbytehq/airbyte/issues/3947) | Service account credentials are now optional. |
|
||||
| 0.3.4 | 2021-06-07 | [\#3277](https://github.com/airbytehq/airbyte/issues/3277) | Add dataset location option |
|
||||
|
||||
### bigquery-denormalized
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.6 | 2021-09-16 | [#6145](https://github.com/airbytehq/airbyte/pull/6145) | BigQuery Denormalized support for date, datetime & timestamp types through the json "format" key
|
||||
| 0.1.5 | 2021-09-07 | [#5881](https://github.com/airbytehq/airbyte/pull/5881) | BigQuery Denormalized NPE fix
|
||||
| 0.1.4 | 2021-09-04 | [#5813](https://github.com/airbytehq/airbyte/pull/5813) | fix Stackoverflow error when receive a schema from source where "Array" type doesn't contain a required "items" element |
|
||||
| 0.1.3 | 2021-08-07 | [#5261](https://github.com/airbytehq/airbyte/pull/5261) | 🐛 Destination BigQuery(Denormalized): Fix processing arrays of records |
|
||||
| 0.1.2 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.1 | 2021-06-21 | [#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.1.0 | 2021-06-21 | [#4176](https://github.com/airbytehq/airbyte/pull/4176) | Destination using Typed Struct and Repeated fields |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.6 | 2021-09-16 | [\#6145](https://github.com/airbytehq/airbyte/pull/6145) | BigQuery Denormalized support for date, datetime & timestamp types through the json "format" key |
|
||||
| 0.1.5 | 2021-09-07 | [\#5881](https://github.com/airbytehq/airbyte/pull/5881) | BigQuery Denormalized NPE fix |
|
||||
| 0.1.4 | 2021-09-04 | [\#5813](https://github.com/airbytehq/airbyte/pull/5813) | fix Stackoverflow error when receive a schema from source where "Array" type doesn't contain a required "items" element |
|
||||
| 0.1.3 | 2021-08-07 | [\#5261](https://github.com/airbytehq/airbyte/pull/5261) | 🐛 Destination BigQuery\(Denormalized\): Fix processing arrays of records |
|
||||
| 0.1.2 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.1 | 2021-06-21 | [\#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.1.0 | 2021-06-21 | [\#4176](https://github.com/airbytehq/airbyte/pull/4176) | Destination using Typed Struct and Repeated fields |
|
||||
|
||||
|
||||
@@ -13,11 +13,12 @@ Due to legal reasons, this is currently a private connector that is only availab
|
||||
| Feature | Support | Notes |
|
||||
| :--- | :---: | :--- |
|
||||
| Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured bucket path. |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Deduped History | ❌ | |
|
||||
| Namespaces | ✅ | |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Deduped History | ❌ | |
|
||||
| Namespaces | ✅ | |
|
||||
|
||||
## Data Source
|
||||
|
||||
Databricks supports various cloud storage as the [data source](https://docs.databricks.com/data/data-sources/index.html). Currently, only Amazon S3 is supported.
|
||||
|
||||
## Configuration
|
||||
@@ -25,16 +26,16 @@ Databricks supports various cloud storage as the [data source](https://docs.data
|
||||
| Category | Parameter | Type | Notes |
|
||||
| :--- | :--- | :---: | :--- |
|
||||
| Databricks | Server Hostname | string | Required. See [documentation](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url). |
|
||||
| | HTTP Path | string | Required. See [documentation](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url). |
|
||||
| | Port | string | Optional. Default to "443". See [documentation](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url). |
|
||||
| | Personal Access Token | string | Required. See [documentation](https://docs.databricks.com/sql/user/security/personal-access-tokens.html). |
|
||||
| | HTTP Path | string | Required. See [documentation](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url). |
|
||||
| | Port | string | Optional. Default to "443". See [documentation](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url). |
|
||||
| | Personal Access Token | string | Required. See [documentation](https://docs.databricks.com/sql/user/security/personal-access-tokens.html). |
|
||||
| General | Database schema | string | Optional. Default to "public". Each data stream will be written to a table under this database schema. |
|
||||
| | Purge Staging Data | boolean | The connector creates staging files and tables on S3. By default they will be purged when the data sync is complete. Set it to `false` for debugging purpose. |
|
||||
| | Purge Staging Data | boolean | The connector creates staging files and tables on S3. By default they will be purged when the data sync is complete. Set it to `false` for debugging purpose. |
|
||||
| Data Source - S3 | Bucket Name | string | Name of the bucket to sync data into. |
|
||||
| | Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
|
||||
| | Region | string | See [documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions) for all region codes. |
|
||||
| | Access Key ID | string | AWS/Minio credential. |
|
||||
| | Secret Access Key | string | AWS/Minio credential. |
|
||||
| | Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
|
||||
| | Region | string | See [documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions) for all region codes. |
|
||||
| | Access Key ID | string | AWS/Minio credential. |
|
||||
| | Secret Access Key | string | AWS/Minio credential. |
|
||||
|
||||
⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend you to provision a dedicated S3 resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️
|
||||
|
||||
@@ -42,13 +43,13 @@ Databricks supports various cloud storage as the [data source](https://docs.data
|
||||
|
||||
Data streams are first written as staging Parquet files on S3, and then loaded into Databricks tables. All the staging files will be deleted after the sync is done. For debugging purposes, here is the full path for a staging file:
|
||||
|
||||
```
|
||||
```text
|
||||
s3://<bucket-name>/<bucket-path>/<uuid>/<stream-name>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```
|
||||
```text
|
||||
s3://testing_bucket/data_output_path/98c450be-5b1c-422d-b8b5-6ca9903727d9/users
|
||||
↑ ↑ ↑ ↑
|
||||
| | | stream name
|
||||
@@ -57,18 +58,17 @@ s3://testing_bucket/data_output_path/98c450be-5b1c-422d-b8b5-6ca9903727d9/users
|
||||
bucket name
|
||||
```
|
||||
|
||||
|
||||
## Unmanaged Spark SQL Table
|
||||
|
||||
Currently, all streams are synced into unmanaged Spark SQL tables. See [documentation](https://docs.databricks.com/data/tables.html#managed-and-unmanaged-tables) for details. In summary, you have full control of the location of the data underlying an unmanaged table. The full path of each data stream is:
|
||||
|
||||
```
|
||||
```text
|
||||
s3://<bucket-name>/<bucket-path>/<database-schema>/<stream-name>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```
|
||||
```text
|
||||
s3://testing_bucket/data_output_path/public/users
|
||||
↑ ↑ ↑ ↑
|
||||
| | | stream name
|
||||
@@ -97,11 +97,12 @@ Learn how source data is converted to Parquet and the current limitations [here]
|
||||
|
||||
1. Credentials for a Databricks cluster. See [documentation](https://docs.databricks.com/clusters/create.html).
|
||||
2. Credentials for an S3 bucket. See [documentation](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys).
|
||||
3. Grant the Databricks cluster full access to the S3 bucket. Or mount it as Databricks File System (DBFS). See [documentation](https://docs.databricks.com/data/data-sources/aws/amazon-s3.html).
|
||||
3. Grant the Databricks cluster full access to the S3 bucket. Or mount it as Databricks File System \(DBFS\). See [documentation](https://docs.databricks.com/data/data-sources/aws/amazon-s3.html).
|
||||
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.1 | 2021-10-05 | [#6792](https://github.com/airbytehq/airbyte/pull/6792) | Require users to accept Databricks JDBC Driver [Terms & Conditions](https://databricks.com/jdbc-odbc-driver-license). |
|
||||
| 0.1.0 | 2021-09-14 | [#5998](https://github.com/airbytehq/airbyte/pull/5998) | Initial private release. |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.1 | 2021-10-05 | [\#6792](https://github.com/airbytehq/airbyte/pull/6792) | Require users to accept Databricks JDBC Driver [Terms & Conditions](https://databricks.com/jdbc-odbc-driver-license). |
|
||||
| 0.1.0 | 2021-09-14 | [\#5998](https://github.com/airbytehq/airbyte/pull/5998) | Initial private release. |
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Dynamodb
|
||||
# DynamoDB
|
||||
|
||||
This destination writes data to AWS DynamoDB.
|
||||
|
||||
@@ -20,9 +20,9 @@ Each stream will be output into its own DynamoDB table. Each table will a collec
|
||||
| Feature | Support | Notes |
|
||||
| :--- | :---: | :--- |
|
||||
| Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured DynamoDB table. |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Deduped History | ❌ | As this connector does not support dbt, we don't support this sync mode on this destination. |
|
||||
| Namespaces | ✅ | Namespace will be used as part of the table name. |
|
||||
| Namespaces | ✅ | Namespace will be used as part of the table name. |
|
||||
|
||||
### Performance considerations
|
||||
|
||||
@@ -38,24 +38,25 @@ This connector by default uses 10 capacity units for both Read and Write in Dyna
|
||||
### Setup guide
|
||||
|
||||
* Fill up DynamoDB info
|
||||
* **DynamoDB Endpoint**
|
||||
* Leave empty if using AWS DynamoDB, fill in endpoint URL if using customized endpoint.
|
||||
* **DynamoDB Table Name**
|
||||
* The name prefix of the DynamoDB table to store the extracted data. The table name is \<tablename\>\_\<namespace\>\_\<stream\>.
|
||||
* **DynamoDB Region**
|
||||
* The region of the DynamoDB.
|
||||
* **Access Key Id**
|
||||
* See [this](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) on how to generate an access key.
|
||||
* We recommend creating an Airbyte-specific user. This user will require [read and write permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_dynamodb_specific-table.html) to the DynamoDB table.
|
||||
* **Secret Access Key**
|
||||
* Corresponding key to the above key id.
|
||||
* **DynamoDB Endpoint**
|
||||
* Leave empty if using AWS DynamoDB, fill in endpoint URL if using customized endpoint.
|
||||
* **DynamoDB Table Name**
|
||||
  * The name prefix of the DynamoDB table to store the extracted data. The table name is \<tablename\>\_\<namespace\>\_\<stream\>.
|
||||
* **DynamoDB Region**
|
||||
* The region of the DynamoDB.
|
||||
* **Access Key Id**
|
||||
* See [this](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) on how to generate an access key.
|
||||
* We recommend creating an Airbyte-specific user. This user will require [read and write permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_dynamodb_specific-table.html) to the DynamoDB table.
|
||||
* **Secret Access Key**
|
||||
* Corresponding key to the above key id.
|
||||
* Make sure your DynamoDB tables are accessible from the machine running Airbyte.
|
||||
* This depends on your networking setup.
|
||||
* You can check AWS DynamoDB documentation with a tutorial on how to properly configure your DynamoDB's access [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/access-control-overview.html).
|
||||
* The easiest way to verify if Airbyte is able to connect to your DynamoDB tables is via the check connection tool in the UI.
|
||||
* This depends on your networking setup.
|
||||
* You can check AWS DynamoDB documentation with a tutorial on how to properly configure your DynamoDB's access [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/access-control-overview.html).
|
||||
* The easiest way to verify if Airbyte is able to connect to your DynamoDB tables is via the check connection tool in the UI.
|
||||
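As an optional sanity check outside Airbyte, the same credentials can be tested with a short boto3 sketch; the table name below is hypothetical and just follows the `<tablename>_<namespace>_<stream>` pattern described above.

```python
# Optional, illustrative check that the Access Key Id / Secret Access Key can read
# one of the destination tables. The table name is a hypothetical example.
import boto3

dynamodb = boto3.resource(
    "dynamodb",
    region_name="us-east-1",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
)
table = dynamodb.Table("airbyte_public_users")
for item in table.scan(Limit=5)["Items"]:
    print(item)
```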
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.0 | 2021-08-20 | [#5561](https://github.com/airbytehq/airbyte/pull/5561) | Initial release. |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.0 | 2021-08-20 | [\#5561](https://github.com/airbytehq/airbyte/pull/5561) | Initial release. |
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Google Cloud Storage
|
||||
# Google Cloud Storage \(GCS\)
|
||||
|
||||
## Overview
|
||||
|
||||
@@ -11,7 +11,7 @@ The Airbyte GCS destination allows you to sync data to cloud storage buckets. Ea
|
||||
| Feature | Support | Notes |
|
||||
| :--- | :---: | :--- |
|
||||
| Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured bucket path. |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Deduped History | ❌ | As this connector does not support dbt, we don't support this sync mode on this destination. |
|
||||
| Namespaces | ❌ | Setting a specific bucket path is equivalent to having separate namespaces. |
|
||||
|
||||
@@ -25,7 +25,7 @@ The Airbyte GCS destination allows you to sync data to cloud storage buckets. Ea
|
||||
| HMAC Key Access ID | string | HMAC key access ID. The access ID for the GCS bucket. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. See [HMAC key](https://cloud.google.com/storage/docs/authentication/hmackeys) for details. |
|
||||
| HMAC Key Secret | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. |
|
||||
| Format | object | Format specific configuration. See below for details. |
|
||||
| Part Size | integer | Arg to configure a block size. Max allowed blocks by GCS = 10,000, i.e. max stream size = blockSize * 10,000 blocks. |
|
||||
| Part Size | integer | Arg to configure a block size. Max allowed blocks by GCS = 10,000, i.e. max stream size = blockSize \* 10,000 blocks. |
|
||||
|
||||
Currently, only the [HMAC key](https://cloud.google.com/storage/docs/authentication/hmackeys) is supported. More credential types will be added in the future.
|
||||
|
||||
@@ -33,13 +33,13 @@ Currently, only the [HMAC key](https://cloud.google.com/storage/docs/authenticat
|
||||
|
||||
The full path of the output data is:
|
||||
|
||||
```
|
||||
```text
|
||||
<bucket-name>/<source-namespace-if-exists>/<stream-name>/<upload-date>-<upload-mills>-<partition-id>.<format-extension>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```
|
||||
```text
|
||||
testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv
|
||||
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
|
||||
| | | | | | | format extension
|
||||
@@ -54,10 +54,7 @@ bucket name
|
||||
|
||||
Please note that the stream name may contain a prefix, if it is configured on the connection.
|
||||
|
||||
The rationales behind this naming pattern are:
|
||||
1. Each stream has its own directory.
|
||||
2. The data output files can be sorted by upload time.
|
||||
3. The upload time composes of a date part and millis part so that it is both readable and unique.
|
||||
The rationales behind this naming pattern are: 1. Each stream has its own directory. 2. The data output files can be sorted by upload time. 3. The upload time is composed of a date part and a millis part so that it is both readable and unique. A tiny naming sketch follows.
|
||||
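A tiny sketch of that naming idea (illustrative only; the connector's exact formatting may differ):

```python
# Illustrative only: build an object name from a readable date part, an epoch-millis
# part, and a partition id, in the spirit of the pattern above.
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
partition_id = 0
object_name = f"{now.strftime('%Y_%m_%d')}_{int(now.timestamp() * 1000)}_{partition_id}.csv"
print(object_name)  # e.g. 2021_01_01_1609541171643_0.csv
```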
|
||||
Currently, each data sync will only create one file per stream. In the future, the output file can be partitioned by size. Each partition is identifiable by the partition ID, which is always 0 for now.
|
||||
|
||||
@@ -65,8 +62,8 @@ Currently, each data sync will only create one file per stream. In the future, t
|
||||
|
||||
Each stream will be outputted to its dedicated directory according to the configuration. The complete datastore of each stream includes all the output files under that directory. You can think of the directory as equivalent of a Table in the database world.
|
||||
|
||||
- Under Full Refresh Sync mode, old output files will be purged before new files are created.
|
||||
- Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
|
||||
* Under Full Refresh Sync mode, old output files will be purged before new files are created.
|
||||
* Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
|
||||
|
||||
### Avro
|
||||
|
||||
@@ -76,28 +73,28 @@ Each stream will be outputted to its dedicated directory according to the config
|
||||
|
||||
Here are the available compression codecs (a usage sketch follows the list):
|
||||
|
||||
- No compression
|
||||
- `deflate`
|
||||
- Compression level
|
||||
- Range `[0, 9]`. Default to 0.
|
||||
- Level 0: no compression & fastest.
|
||||
- Level 9: best compression & slowest.
|
||||
- `bzip2`
|
||||
- `xz`
|
||||
- Compression level
|
||||
- Range `[0, 9]`. Default to 6.
|
||||
- Level 0-3 are fast with medium compression.
|
||||
- Level 4-6 are fairly slow with high compression.
|
||||
- Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is waste of memory to use the presets 7, 8, or 9, respectively.
|
||||
- `zstandard`
|
||||
- Compression level
|
||||
- Range `[-5, 22]`. Default to 3.
|
||||
- Negative levels are 'fast' modes akin to `lz4` or `snappy`.
|
||||
- Levels above 9 are generally for archival purposes.
|
||||
- Levels above 18 use a lot of memory.
|
||||
- Include checksum
|
||||
- If set to `true`, a checksum will be included in each data block.
|
||||
- `snappy`
|
||||
* No compression
|
||||
* `deflate`
|
||||
* Compression level
|
||||
* Range `[0, 9]`. Default to 0.
|
||||
* Level 0: no compression & fastest.
|
||||
* Level 9: best compression & slowest.
|
||||
* `bzip2`
|
||||
* `xz`
|
||||
* Compression level
|
||||
* Range `[0, 9]`. Default to 6.
|
||||
* Level 0-3 are fast with medium compression.
|
||||
* Level 4-6 are fairly slow with high compression.
|
||||
* Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is waste of memory to use the presets 7, 8, or 9, respectively.
|
||||
* `zstandard`
|
||||
* Compression level
|
||||
* Range `[-5, 22]`. Default to 3.
|
||||
* Negative levels are 'fast' modes akin to `lz4` or `snappy`.
|
||||
* Levels above 9 are generally for archival purposes.
|
||||
* Levels above 18 use a lot of memory.
|
||||
* Include checksum
|
||||
* If set to `true`, a checksum will be included in each data block.
|
||||
* `snappy`
|
||||
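The sketch below shows what choosing a codec and compression level means for an Avro file. It uses the Python `fastavro` library purely for illustration; the connector itself is written in Java and exposes these options through its configuration.

```python
# Illustration only: write an Avro file with the `deflate` codec and an explicit
# compression level. The schema and records are made up for the example.
from fastavro import parse_schema, writer

schema = parse_schema({
    "name": "users",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "name", "type": ["null", "string"], "default": None},
    ],
})
records = [{"user_id": 123, "name": "John Doe"}, {"user_id": 456, "name": None}]

with open("users.avro", "wb") as fo:
    writer(fo, schema, records, codec="deflate", codec_compression_level=5)
```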
|
||||
#### Data schema
|
||||
|
||||
@@ -106,7 +103,7 @@ Under the hood, an Airbyte data stream in Json schema is converted to an Avro sc
|
||||
1. Json schema types are mapped to Avro types as follows:
|
||||
|
||||
| Json Data Type | Avro Data Type |
|
||||
| :---: | :---: |
|
||||
| :---: | :---: |
|
||||
| string | string |
|
||||
| number | double |
|
||||
| integer | int |
|
||||
@@ -115,32 +112,33 @@ Under the hood, an Airbyte data stream in Json schema is converted to an Avro sc
|
||||
| object | record |
|
||||
| array | array |
|
||||
|
||||
2. Built-in Json schema formats are not mapped to Avro logical types at this moment.
|
||||
2. Combined restrictions ("allOf", "anyOf", and "oneOf") will be converted to type unions. The corresponding Avro schema can be less stringent. For example, the following Json schema
|
||||
1. Built-in Json schema formats are not mapped to Avro logical types at this moment.
|
||||
2. Combined restrictions \("allOf", "anyOf", and "oneOf"\) will be converted to type unions. The corresponding Avro schema can be less stringent. For example, the following Json schema
|
||||
|
||||
```json
|
||||
{
|
||||
```javascript
|
||||
{
|
||||
"oneOf": [
|
||||
{ "type": "string" },
|
||||
{ "type": "integer" }
|
||||
]
|
||||
}
|
||||
```
|
||||
will become this in Avro schema:
|
||||
}
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
will become this in Avro schema:
|
||||
|
||||
```javascript
|
||||
{
|
||||
"type": ["null", "string", "int"]
|
||||
}
|
||||
```
|
||||
}
|
||||
```
|
||||
|
||||
2. Keyword `not` is not supported, as there is no equivalent validation mechanism in Avro schema.
|
||||
3. Only alphanumeric characters and underscores (`/a-zA-Z0-9_/`) are allowed in a stream or field name. Any special character will be converted to an alphabet or underscore. For example, `spécial:character_names` will become `special_character_names`. The original names will be stored in the `doc` property in this format: `_airbyte_original_name:<original-name>`.
|
||||
4. All field will be nullable. For example, a `string` Json field will be typed as `["null", "string"]` in Avro. This is necessary because the incoming data stream may have optional fields.
|
||||
5. For array fields in Json schema, when the `items` property is an array, it means that each element in the array should follow its own schema sequentially. For example, the following specification means the first item in the array should be a string, and the second a number.
|
||||
3. Keyword `not` is not supported, as there is no equivalent validation mechanism in Avro schema.
|
||||
4. Only alphanumeric characters and underscores \(`/a-zA-Z0-9_/`\) are allowed in a stream or field name. Any special character will be converted to an alphabet or underscore. For example, `spécial:character_names` will become `special_character_names`. The original names will be stored in the `doc` property in this format: `_airbyte_original_name:<original-name>`.
|
||||
5. All fields will be nullable. For example, a `string` Json field will be typed as `["null", "string"]` in Avro. This is necessary because the incoming data stream may have optional fields.
|
||||
6. For array fields in Json schema, when the `items` property is an array, it means that each element in the array should follow its own schema sequentially. For example, the following specification means the first item in the array should be a string, and the second a number.
|
||||
|
||||
```json
|
||||
{
|
||||
```javascript
|
||||
{
|
||||
"array_field": {
|
||||
"type": "array",
|
||||
"items": [
|
||||
@@ -148,12 +146,12 @@ will become this in Avro schema:
|
||||
{ "type": "number" }
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
}
|
||||
```
|
||||
|
||||
This is not supported in Avro schema. As a compromise, the converter creates a union, ["string", "number"], which is less stringent:
|
||||
This is not supported in Avro schema. As a compromise, the converter creates a union, \["string", "number"\], which is less stringent:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"name": "array_field",
|
||||
"type": [
|
||||
@@ -165,20 +163,20 @@ This is not supported in Avro schema. As a compromise, the converter creates a u
|
||||
],
|
||||
"default": null
|
||||
}
|
||||
```
|
||||
```
|
||||
|
||||
6. Two Airbyte specific fields will be added to each Avro record:
|
||||
1. Two Airbyte specific fields will be added to each Avro record:
|
||||
|
||||
| Field | Schema | Document |
|
||||
| :--- | :--- | :---: |
|
||||
| `_airbyte_ab_id` | `uuid` | [link](http://avro.apache.org/docs/current/spec.html#UUID)
|
||||
| :--- | :--- | :---: |
|
||||
| `_airbyte_ab_id` | `uuid` | [link](http://avro.apache.org/docs/current/spec.html#UUID) |
|
||||
| `_airbyte_emitted_at` | `timestamp-millis` | [link](http://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29) |
|
||||
|
||||
7. Currently `additionalProperties` is not supported. This means if the source is schemaless (e.g. Mongo), or has flexible fields, they will be ignored. We will have a solution soon. Feel free to submit a new issue if this is blocking for you.
|
||||
1. Currently `additionalProperties` is not supported. This means if the source is schemaless \(e.g. Mongo\), or has flexible fields, they will be ignored. We will have a solution soon. Feel free to submit a new issue if this is blocking for you.
|
||||
|
||||
For example, given the following Json schema:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"type": "object",
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
@@ -207,7 +205,7 @@ For example, given the following Json schema:
|
||||
|
||||
Its corresponding Avro schema will be:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"name" : "stream_name",
|
||||
"type" : "record",
|
||||
@@ -254,18 +252,18 @@ Its corresponding Avro schema will be:
|
||||
|
||||
### CSV
|
||||
|
||||
Like most of the other Airbyte destination connectors, usually the output has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob to multiple columns.
|
||||
Like most of the other Airbyte destination connectors, usually the output has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize \(flatten\) the data blob to multiple columns.
|
||||
|
||||
| Column | Condition | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `_airbyte_ab_id` | Always exists | A uuid assigned by Airbyte to each processed record. |
|
||||
| `_airbyte_emitted_at` | Always exists. | A timestamp representing when the event was pulled from the data source. |
|
||||
| `_airbyte_data` | When no normalization (flattening) is needed, all data reside under this column as a json blob. |
|
||||
| root level fields | When root level normalization (flattening) is selected, the root level fields are expanded. |
|
||||
| `_airbyte_data` | When no normalization \(flattening\) is needed. | All data reside under this column as a json blob. |
|
||||
| root level fields | When root level normalization \(flattening\) is selected. | The root level fields are expanded. |
|
||||
|
||||
For example, given the following json object from a source:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"user_id": 123,
|
||||
"name": {
|
||||
@@ -287,11 +285,11 @@ With root level normalization, the output CSV is:
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `26d73cde-7eb1-4e1e-b7db-a4c03b4cf206` | 1622135805000 | 123 | `{ "first": "John", "last": "Doe" }` |
|
||||
|
||||
### JSON Lines (JSONL)
|
||||
### JSON Lines \(JSONL\)
|
||||
|
||||
[Json Lines](https://jsonlines.org/) is a text format with one JSON per line. Each line has a structure as follows:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"_airbyte_ab_id": "<uuid>",
|
||||
"_airbyte_emitted_at": "<timestamp-in-millis>",
|
||||
@@ -301,7 +299,7 @@ With root level normalization, the output CSV is:
|
||||
|
||||
For example, given the following two json objects from a source:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
[
|
||||
{
|
||||
"user_id": 123,
|
||||
@@ -322,7 +320,7 @@ For example, given the following two json objects from a source:
|
||||
|
||||
They will be like this in the output file:
|
||||
|
||||
```jsonl
|
||||
```text
|
||||
{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
|
||||
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
|
||||
```
|
||||
@@ -336,17 +334,17 @@ The following configuration is available to configure the Parquet output:
|
||||
| Parameter | Type | Default | Description |
|
||||
| :--- | :---: | :---: | :--- |
|
||||
| `compression_codec` | enum | `UNCOMPRESSED` | **Compression algorithm**. Available candidates are: `UNCOMPRESSED`, `SNAPPY`, `GZIP`, `LZO`, `BROTLI`, `LZ4`, and `ZSTD`. |
|
||||
| `block_size_mb` | integer | 128 (MB) | **Block size (row group size)** in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
|
||||
| `max_padding_size_mb` | integer | 8 (MB) | **Max padding size** in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
|
||||
| `page_size_kb` | integer | 1024 (KB) | **Page size** in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
|
||||
| `dictionary_page_size_kb` | integer | 1024 (KB) | **Dictionary Page Size** in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary. |
|
||||
| `block_size_mb` | integer | 128 \(MB\) | **Block size \(row group size\)** in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
|
||||
| `max_padding_size_mb` | integer | 8 \(MB\) | **Max padding size** in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
|
||||
| `page_size_kb` | integer | 1024 \(KB\) | **Page size** in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
|
||||
| `dictionary_page_size_kb` | integer | 1024 \(KB\) | **Dictionary Page Size** in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary. |
|
||||
| `dictionary_encoding` | boolean | `true` | **Dictionary encoding**. This parameter controls whether dictionary encoding is turned on. |
|
||||
|
||||
These parameters are related to the `ParquetOutputFormat`. See the [Java doc](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.0/org/apache/parquet/hadoop/ParquetOutputFormat.html) for more details. Also see [Parquet documentation](https://parquet.apache.org/documentation/latest/#configurations) for their recommended configurations (512 - 1024 MB block size, 8 KB page size).
|
||||
These parameters are related to the `ParquetOutputFormat`. See the [Java doc](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.0/org/apache/parquet/hadoop/ParquetOutputFormat.html) for more details. Also see [Parquet documentation](https://parquet.apache.org/documentation/latest/#configurations) for their recommended configurations \(512 - 1024 MB block size, 8 KB page size\).
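
For illustration only, the Parquet portion of a GCS destination configuration built from the options above might be expressed as follows; the field names mirror the parameters in the table, while the values are placeholders rather than recommendations:

```python
# Hypothetical sketch of the Parquet "format" section of a GCS destination config.
# Field names mirror the parameters documented above; values are illustrative only.
parquet_format = {
    "format_type": "Parquet",
    "compression_codec": "SNAPPY",       # UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4 or ZSTD
    "block_size_mb": 256,                # row group size buffered in memory while writing
    "max_padding_size_mb": 8,            # maximum padding used to align row groups
    "page_size_kb": 1024,                # smallest unit that must be read fully
    "dictionary_page_size_kb": 1024,     # one dictionary page per column per row group
    "dictionary_encoding": True,         # toggle dictionary encoding
}
```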
|
||||
|
||||
#### Data schema
|
||||
|
||||
Under the hood, an Airbyte data stream in Json schema is first converted to an Avro schema, then the Json object is converted to an Avro record, and finally the Avro record is outputted to the Parquet format. See the `Data schema` section from the [Avro output](#avro) for rules and limitations.
|
||||
Under the hood, an Airbyte data stream in Json schema is first converted to an Avro schema, then the Json object is converted to an Avro record, and finally the Avro record is outputted to the Parquet format. See the `Data schema` section from the [Avro output](gcs.md#avro) for rules and limitations.
|
||||
|
||||
## Getting started
|
||||
|
||||
@@ -373,7 +371,8 @@ Under the hood, an Airbyte data stream in Json schema is first converted to an A
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.2 | 2021-09-12 | [#5720](https://github.com/airbytehq/airbyte/issues/5720) | Added configurable block size for stream. Each stream is limited to 10,000 by GCS |
|
||||
| 0.1.1 | 2021-08-26 | [#5296](https://github.com/airbytehq/airbyte/issues/5296) | Added storing gcsCsvFileLocation property for CSV format. This is used by destination-bigquery (GCS Staging upload type) |
|
||||
| 0.1.0 | 2021-07-16 | [#4329](https://github.com/airbytehq/airbyte/pull/4784) | Initial release. |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.2 | 2021-09-12 | [\#5720](https://github.com/airbytehq/airbyte/issues/5720) | Added configurable block size for stream. Each stream is limited to 10,000 by GCS |
|
||||
| 0.1.1 | 2021-08-26 | [\#5296](https://github.com/airbytehq/airbyte/issues/5296) | Added storing gcsCsvFileLocation property for CSV format. This is used by destination-bigquery \(GCS Staging upload type\) |
|
||||
| 0.1.0 | 2021-07-16 | [\#4329](https://github.com/airbytehq/airbyte/pull/4784) | Initial release. |
|
||||
|
||||
|
||||
@@ -10,8 +10,7 @@ The Airbyte Kafka destination allows you to sync data to Kafka. Each stream is w
|
||||
|
||||
Each stream will be output into a Kafka topic.
|
||||
|
||||
Currently, this connector only writes data with JSON format. More formats (e.g. Apache Avro) will be supported in
|
||||
the future.
|
||||
Currently, this connector only writes data with JSON format. More formats \(e.g. Apache Avro\) will be supported in the future.
|
||||
|
||||
Each record's key will contain the UUID assigned by Airbyte, and its value will contain these 3 fields:
|
||||
|
||||
@@ -29,7 +28,6 @@ Each record will contain in its key the uuid assigned by Airbyte, and in the val
|
||||
| Incremental - Deduped History | No | As this connector does not support dbt, we don't support this sync mode on this destination. |
|
||||
| Namespaces | Yes | |
|
||||
|
||||
|
||||
## Getting started
|
||||
|
||||
### Requirements
|
||||
@@ -46,41 +44,25 @@ Make sure your Kafka brokers can be accessed by Airbyte.
|
||||
|
||||
#### **Permissions**
|
||||
|
||||
Airbyte should be allowed to write messages into topics, and these topics should be created before writing into Kafka
|
||||
or, at least, enable the configuration in the brokers `auto.create.topics.enable` (which is not recommended for
|
||||
production environments).
|
||||
Airbyte should be allowed to write messages into topics, and these topics should be created before writing into Kafka or, at a minimum, the broker configuration `auto.create.topics.enable` should be enabled \(which is not recommended for production environments\).
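
For instance, a target topic can be created ahead of time with any Kafka admin tooling. A minimal sketch with the third-party `kafka-python` package, where the broker address, topic name, and partition settings are placeholders rather than values the connector requires:

```python
from kafka.admin import KafkaAdminClient, NewTopic  # third-party kafka-python package

# Placeholders: point this at your own brokers and pick your own topic settings.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="airbyte_syncs", num_partitions=3, replication_factor=1)])
admin.close()
```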
|
||||
|
||||
Note that if you choose to use dynamic topic names, you will probably need to enable `auto.create.topics.enable`
|
||||
to avoid your connection failing if there was an update to the source connector's schema. Otherwise a hardcoded
|
||||
topic name may be best.
|
||||
Note that if you choose to use dynamic topic names, you will probably need to enable `auto.create.topics.enable` to avoid your connection failing if there was an update to the source connector's schema. Otherwise a hardcoded topic name may be best.
|
||||
|
||||
#### Target topics
|
||||
|
||||
You can determine the topics to which messages are written via the `topic_pattern` configuration parameter.
|
||||
Messages can be written to either a hardcoded, pre-defined topic, or dynamically written to different topics
|
||||
based on the [namespace](https://docs.airbyte.io/understanding-airbyte/namespaces) or stream they came from.
|
||||
You can determine the topics to which messages are written via the `topic_pattern` configuration parameter. Messages can be written to either a hardcoded, pre-defined topic, or dynamically written to different topics based on the [namespace](https://docs.airbyte.io/understanding-airbyte/namespaces) or stream they came from.
|
||||
|
||||
To write all messages to a single hardcoded topic, enter its name in the `topic_pattern` field
|
||||
e.g: setting `topic_pattern` to `my-topic-name` will write all messages from all streams and namespaces to that topic.
|
||||
To write all messages to a single hardcoded topic, enter its name in the `topic_pattern` field, e.g. setting `topic_pattern` to `my-topic-name` will write all messages from all streams and namespaces to that topic.
|
||||
|
||||
To define the output topics dynamically, you can leverage the `{namespace}` and `{stream}` pattern variables,
|
||||
which cause messages to be written to different topics based on the values present when producing the records.
|
||||
For example, setting the `topic_pattern` parameter to `airbyte_syncs/{namespace}/{stream}` means that messages
|
||||
from namespace `n1` and stream `s1` will get written to the topic `airbyte_syncs/n1/s1`, and messages
|
||||
from `s2` to `airbyte_syncs/n1/s2` etc.
|
||||
To define the output topics dynamically, you can leverage the `{namespace}` and `{stream}` pattern variables, which cause messages to be written to different topics based on the values present when producing the records. For example, setting the `topic_pattern` parameter to `airbyte_syncs/{namespace}/{stream}` means that messages from namespace `n1` and stream `s1` will get written to the topic `airbyte_syncs/n1/s1`, and messages from `s2` to `airbyte_syncs/n1/s2` etc.
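
As a rough illustration of how such a pattern resolves, not the connector's actual code, the substitution amounts to:

```python
def resolve_topic(topic_pattern: str, namespace: str, stream: str) -> str:
    """Resolve a topic pattern such as 'airbyte_syncs/{namespace}/{stream}'."""
    return topic_pattern.replace("{namespace}", namespace).replace("{stream}", stream)

# Example: messages from namespace "n1" and stream "s1"
print(resolve_topic("airbyte_syncs/{namespace}/{stream}", "n1", "s1"))
# -> airbyte_syncs/n1/s1
```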
|
||||
|
||||
If you define output topic dynamically, you might want to enable `auto.create.topics.enable` to
|
||||
avoid your connection failing if there was an update to the source connector's schema.
|
||||
Otherwise, you'll need to manually create topics in Kafka as they are added/updated in the source, which is the
|
||||
recommended option for production environments.
|
||||
If you define output topics dynamically, you might want to enable `auto.create.topics.enable` to avoid your connection failing if there was an update to the source connector's schema. Otherwise, you'll need to manually create topics in Kafka as they are added/updated in the source, which is the recommended option for production environments.
|
||||
|
||||
**NOTICE**: a naming convention transformation will be applied to the target topic name using
|
||||
the `StandardNameTransformer` so that some special characters will be replaced.
|
||||
**NOTICE**: a naming convention transformation will be applied to the target topic name using the `StandardNameTransformer` so that some special characters will be replaced.
|
||||
|
||||
### Setup the Kafka destination in Airbyte
|
||||
|
||||
You should now have all the requirements needed to configure Kafka as a destination in the UI. You can configure the
|
||||
following parameters on the Kafka destination (though many of these are optional or have default values):
|
||||
You should now have all the requirements needed to configure Kafka as a destination in the UI. You can configure the following parameters on the Kafka destination \(though many of these are optional or have default values\):
|
||||
|
||||
* **Bootstrap servers**
|
||||
* **Topic pattern**
|
||||
@@ -110,12 +92,13 @@ following parameters on the Kafka destination (though many of these are optional
|
||||
|
||||
More info about this can be found in the [Kafka producer configs documentation site](https://kafka.apache.org/documentation/#producerconfigs).
|
||||
|
||||
*NOTE*: Some configurations for SSL are not available yet.
|
||||
_NOTE_: Some configurations for SSL are not available yet.
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.1.2 | 2021-09-14 | [#6040](https://github.com/airbytehq/airbyte/pull/6040) | Change spec.json and config parser |
|
||||
| 0.1.1 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.0 | 2021-07-21 | [#3746](https://github.com/airbytehq/airbyte/pull/3746) | Initial Release |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.2 | 2021-09-14 | [\#6040](https://github.com/airbytehq/airbyte/pull/6040) | Change spec.json and config parser |
|
||||
| 0.1.1 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.0 | 2021-07-21 | [\#3746](https://github.com/airbytehq/airbyte/pull/3746) | Initial Release |
|
||||
|
||||
|
||||
67
docs/integrations/destinations/keen-1.md
Normal file
@@ -0,0 +1,67 @@
|
||||
---
|
||||
description: Keen is a fully managed event streaming and analytic platform.
|
||||
---
|
||||
|
||||
# Keen
|
||||
|
||||
## Overview
|
||||
|
||||
The Airbyte Keen destination allows you to send/stream data into Keen. Keen is a flexible, fully managed event streaming and analytic platform.
|
||||
|
||||
### Sync overview
|
||||
|
||||
#### Output schema
|
||||
|
||||
Each stream will output an event in Keen. Each collection will inherit the name from the stream with all non-alphanumeric characters removed, except for `.-_` and whitespace characters. When possible, the connector will try to guess the timestamp value for the record and override the special field `keen.timestamp` with it.
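
As a rough sketch of that naming rule, assuming a simple character filter rather than the connector's exact implementation:

```python
import re

def to_keen_collection_name(stream_name: str) -> str:
    # Keep letters, digits, '.', '-', '_' and whitespace; drop everything else.
    return re.sub(r"[^A-Za-z0-9.\-_\s]", "", stream_name)

print(to_keen_collection_name("orders(v2)!"))  # -> "ordersv2"
```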
|
||||
|
||||
#### Features
|
||||
|
||||
| Feature | Supported?\(Yes/No\) | Notes |
|
||||
| :--- | :--- | :--- |
|
||||
| Full Refresh Sync | Yes | |
|
||||
| Incremental - Append Sync | Yes | |
|
||||
| Incremental - Deduped History | No | As this connector does not support dbt, we don't support this sync mode on this destination. |
|
||||
| Namespaces | No | |
|
||||
|
||||
## Getting started
|
||||
|
||||
### Requirements
|
||||
|
||||
To use the Keen destination, you'll need:
|
||||
|
||||
* A Keen Project ID
|
||||
* A Keen Master API key associated with the project
|
||||
|
||||
See the setup guide for more information about how to acquire the required resources.
|
||||
|
||||
### Setup guide
|
||||
|
||||
#### Keen Project
|
||||
|
||||
If you already have the project set up, jump to the "Access" section.
|
||||
|
||||
Log in to your [Keen](https://keen.io/) account, then click the Add New link next to the Projects label on the left-hand side tab. Then give the project a name.
|
||||
|
||||
#### API Key and Project ID
|
||||
|
||||
The Keen connector uses the Keen Kafka Inbound Cluster to stream the data. It requires a `Project ID` and `Master Key` for authentication. To get them, navigate to the `Access` tab from the left-hand side panel and check the `Project Details` section. **Important**: This destination requires the Project's **Master** Key.
|
||||
|
||||
#### Timestamp Inference
|
||||
|
||||
The `Infer Timestamp` field lets you specify whether you want the connector to guess the special `keen.timestamp` field based on the streamed data. It might be useful for historical data synchronization to fully leverage Keen's analytics power. If not selected, `keen.timestamp` will be set to the date when the data was streamed. By default, it is set to `true`.
|
||||
|
||||
### Setup the Keen destination in Airbyte
|
||||
|
||||
Now you should have all the parameters needed to configure the Keen destination.
|
||||
|
||||
* **Project ID**
|
||||
* **Master API Key**
|
||||
* **Infer Timestamp**
|
||||
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.2.0 | 2021-09-10 | [\#5973](https://github.com/airbytehq/airbyte/pull/5973) | Fix timestamp inference for complex schemas |
|
||||
| 0.1.0 | 2021-08-18 | [\#5339](https://github.com/airbytehq/airbyte/pull/5339) | Keen Destination Release! |
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
description: Keen is a fully managed event streaming and analytic platform.
|
||||
---
|
||||
|
||||
# Keen
|
||||
# Chargify
|
||||
|
||||
## Overview
|
||||
|
||||
@@ -12,8 +12,7 @@ The Airbyte Keen destination allows you to send/stream data into Keen. Keen is a
|
||||
|
||||
#### Output schema
|
||||
|
||||
Each stream will output an event in Keen. Each collection will inherit the name from the stream with all non-alphanumeric characters removed, except for `.-_ ` and whitespace characters. When possible, the connector will try to guess the timestamp value for the record and override the special field `keen.timestamp` with it.
|
||||
|
||||
Each stream will output an event in Keen. Each collection will inherit the name from the stream with all non-alphanumeric characters removed, except for `.-_` and whitespace characters. When possible, the connector will try to guess the timestamp value for the record and override the special field `keen.timestamp` with it.
|
||||
|
||||
#### Features
|
||||
|
||||
@@ -43,11 +42,9 @@ If you already have the project set up, jump to the "Access" section.
|
||||
|
||||
Log in to your [Keen](https://keen.io/) account, then click the Add New link next to the Projects label on the left-hand side tab. Then give the project a name.
|
||||
|
||||
#### API Key and Project ID
|
||||
|
||||
#### API Key and Project ID
|
||||
|
||||
Keen connector uses Keen Kafka Inbound Cluster to stream the data. It requires `Project ID` and `Master Key` for the authentication. To get them, navigate to the `Access` tab from the left-hand side panel and check the `Project Details` section.
|
||||
**Important**: This destination requires the Project's **Master** Key.
|
||||
The Keen connector uses the Keen Kafka Inbound Cluster to stream the data. It requires a `Project ID` and `Master Key` for authentication. To get them, navigate to the `Access` tab from the left-hand side panel and check the `Project Details` section. **Important**: This destination requires the Project's **Master** Key.
|
||||
|
||||
#### Timestamp Inference
|
||||
|
||||
@@ -63,8 +60,8 @@ Now you should have all the parameters needed to configure Keen destination.
|
||||
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.2.0 | 2021-09-10 | [#5973](https://github.com/airbytehq/airbyte/pull/5973) | Fix timestamp inference for complex schemas |
|
||||
| 0.1.0 | 2021-08-18 | [#5339](https://github.com/airbytehq/airbyte/pull/5339) | Keen Destination Release! |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.2.0 | 2021-09-10 | [\#5973](https://github.com/airbytehq/airbyte/pull/5973) | Fix timestamp inference for complex schemas |
|
||||
| 0.1.0 | 2021-08-18 | [\#5339](https://github.com/airbytehq/airbyte/pull/5339) | Keen Destination Release! |
|
||||
|
||||
|
||||
@@ -47,8 +47,7 @@ The local mount is mounted by Docker onto `LOCAL_ROOT`. This means the `/local`
|
||||
|
||||
## Access Replicated Data Files
|
||||
|
||||
If your Airbyte instance is running on the same computer that you are navigating with, you can open your browser and enter [file:///tmp/airbyte\_local](file:///tmp/airbyte_local) to look at the replicated data locally.
|
||||
If the first approach fails or if your Airbyte instance is running on a remote server, follow the following steps to access the replicated files:
|
||||
If your Airbyte instance is running on the same computer that you are navigating with, you can open your browser and enter [file:///tmp/airbyte\_local](file:///tmp/airbyte_local) to look at the replicated data locally. If that approach fails, or if your Airbyte instance is running on a remote server, follow these steps to access the replicated files:
|
||||
|
||||
1. Access the scheduler container using `docker exec -it airbyte-scheduler bash`
|
||||
2. Navigate to the default local mount using `cd /tmp/airbyte_local`
|
||||
@@ -58,9 +57,9 @@ If the first approach fails or if your Airbyte instance is running on a remote s
|
||||
|
||||
You can also copy the output file to your host machine; the following command will copy the file to your current working directory:
|
||||
|
||||
```
|
||||
```text
|
||||
docker cp airbyte-scheduler:/tmp/airbyte_local/{destination_path}/{filename}.csv .
|
||||
```
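
Once copied, the file can be inspected with any CSV tool. A minimal sketch in Python, assuming the standard raw columns `_airbyte_ab_id`, `_airbyte_emitted_at` and `_airbyte_data` and a placeholder file name:

```python
import csv
import json

# Placeholder path: the file copied out of the scheduler container above.
with open("./users.csv", newline="") as f:
    for row in csv.DictReader(f):
        record = json.loads(row["_airbyte_data"])  # the raw event payload
        print(row["_airbyte_ab_id"], row["_airbyte_emitted_at"], record)
```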
|
||||
|
||||
|
||||
Note: If you are running Airbyte on Windows with Docker backed by WSL2, you have to use a similar step as above or refer to this [link](../../operator-guides/locating-files-local-destination.md) for an alternative approach.
|
||||
|
||||
|
||||
@@ -47,8 +47,7 @@ The local mount is mounted by Docker onto `LOCAL_ROOT`. This means the `/local`
|
||||
|
||||
## Access Replicated Data Files
|
||||
|
||||
If your Airbyte instance is running on the same computer that you are navigating with, you can open your browser and enter [file:///tmp/airbyte\_local](file:///tmp/airbyte_local) to look at the replicated data locally.
|
||||
If the first approach fails or if your Airbyte instance is running on a remote server, follow the following steps to access the replicated files:
|
||||
If your Airbyte instance is running on the same computer that you are navigating with, you can open your browser and enter [file:///tmp/airbyte\_local](file:///tmp/airbyte_local) to look at the replicated data locally. If that approach fails, or if your Airbyte instance is running on a remote server, follow these steps to access the replicated files:
|
||||
|
||||
1. Access the scheduler container using `docker exec -it airbyte-scheduler bash`
|
||||
2. Navigate to the default local mount using `cd /tmp/airbyte_local`
|
||||
@@ -58,8 +57,9 @@ If the first approach fails or if your Airbyte instance is running on a remote s
|
||||
|
||||
You can also copy the output file to your host machine; the following command will copy the file to your current working directory:
|
||||
|
||||
```
|
||||
```text
|
||||
docker cp airbyte-scheduler:/tmp/airbyte_local/{destination_path}/{filename}.jsonl .
|
||||
```
|
||||
|
||||
Note: If you are running Airbyte on Windows with Docker backed by WSL2, you have to use a similar step as above or refer to this [link](../../operator-guides/locating-files-local-destination.md) for an alternative approach.
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Mongodb
|
||||
# MongoDB
|
||||
|
||||
## Features
|
||||
|
||||
@@ -17,21 +17,24 @@ Each stream will be output into its own collection in MongoDB. Each collection w
|
||||
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The field type in MongoDB is `Timestamp`.
|
||||
* `_airbyte_data`: a JSON blob containing the event data. The field type in MongoDB is `Object`. (A sketch of such a document follows this list.)
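
For illustration, a raw document shaped like the fields above could be written with the third-party `pymongo` driver as follows. The connection string, database and collection names are placeholders, and the exact collection naming and timestamp type used by the connector may differ:

```python
from datetime import datetime, timezone
from pymongo import MongoClient  # third-party driver, not part of Airbyte

client = MongoClient("mongodb://airbyte_user:password@localhost:27017")  # placeholder URI
collection = client["airbyte_db"]["_airbyte_raw_users"]                  # hypothetical names

# Illustrative raw record mirroring the three fields described above; the connector
# itself stores the emitted-at value as a MongoDB timestamp type.
collection.insert_one({
    "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206",
    "_airbyte_emitted_at": datetime.now(timezone.utc),
    "_airbyte_data": {"user_id": 123, "name": {"first": "John", "last": "Doe"}},
})
```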
|
||||
|
||||
## Getting Started (Airbyte Cloud)
|
||||
## Getting Started \(Airbyte Cloud\)
|
||||
|
||||
Airbyte Cloud only supports connecting to your MongoDB instance with TLS encryption. Other than that, you can proceed with the open-source instructions below.
|
||||
|
||||
## Getting Started (Airbyte Open-Source)
|
||||
## Getting Started \(Airbyte Open-Source\)
|
||||
|
||||
#### Requirements
|
||||
|
||||
To use the MongoDB destination, you'll need:
|
||||
|
||||
* A MongoDB server
|
||||
|
||||
|
||||
#### **Permissions**
|
||||
|
||||
You need a MongoDB user that can create collections and write documents. We highly recommend creating an Airbyte-specific user for this purpose.
|
||||
|
||||
#### Target Database
|
||||
|
||||
You will need to choose an existing database or create a new database that will be used to store synced data from Airbyte.
|
||||
|
||||
### Setup the MongoDB destination in Airbyte
|
||||
@@ -39,14 +42,14 @@ You will need to choose an existing database or create a new database that will
|
||||
You should now have all the requirements needed to configure MongoDB as a destination in the UI. You'll need the following information to configure the MongoDB destination:
|
||||
|
||||
* **Standalone MongoDb instance**
|
||||
* Host: URL of the database
|
||||
* Port: Port to use for connecting to the database
|
||||
* TLS: indicates whether to create encrypted connection
|
||||
* Host: URL of the database
|
||||
* Port: Port to use for connecting to the database
|
||||
* TLS: indicates whether to create encrypted connection
|
||||
* **Replica Set**
|
||||
* Server addresses: the members of a replica set
|
||||
* Replica Set: A replica set name
|
||||
* Server addresses: the members of a replica set
|
||||
* Replica Set: A replica set name
|
||||
* **MongoDb Atlas Cluster**
|
||||
* Cluster URL: URL of a cluster to connect to
|
||||
* Cluster URL: URL of a cluster to connect to
|
||||
* **Database**
|
||||
* **Username**
|
||||
* **Password**
|
||||
@@ -63,12 +66,13 @@ Since database names are case insensitive in MongoDB, database names cannot diff
|
||||
|
||||
#### Restrictions on Database Names for Windows
|
||||
|
||||
For MongoDB deployments running on Windows, database names cannot contain any of the following characters: /\. "$*<>:|?*
|
||||
For MongoDB deployments running on Windows, database names cannot contain any of the following characters: /\. "$*<>:|?*
|
||||
|
||||
Also database names cannot contain the null character.
|
||||
|
||||
#### Restrictions on Database Names for Unix and Linux Systems
|
||||
For MongoDB deployments running on Unix and Linux systems, database names cannot contain any of the following characters: /\. "$
|
||||
|
||||
For MongoDB deployments running on Unix and Linux systems, database names cannot contain any of the following characters: /\. "$
|
||||
|
||||
Also database names cannot contain the null character.
|
||||
|
||||
@@ -81,11 +85,13 @@ Database names cannot be empty and must have fewer than 64 characters.
|
||||
Collection names should begin with an underscore or a letter character, and cannot:
|
||||
|
||||
* contain the $.
|
||||
* be an empty string (e.g. "").
|
||||
* be an empty string \(e.g. ""\).
|
||||
* contain the null character.
|
||||
* begin with the system. prefix. (Reserved for internal use.)
|
||||
* begin with the system. prefix. \(Reserved for internal use.\)
|
||||
|
||||
## Changelog
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.1.1 | 2021-09-29 | [6536](https://github.com/airbytehq/airbyte/pull/6536) | Destination MongoDb: added support via TLS/SSL |
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.1 | 2021-09-29 | [6536](https://github.com/airbytehq/airbyte/pull/6536) | Destination MongoDb: added support via TLS/SSL |
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# MS SQL Server
|
||||
# MSSQL
|
||||
|
||||
## Features
|
||||
|
||||
@@ -21,21 +21,23 @@ Each stream will be output into its own table in SQL Server. Each table will con
|
||||
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in SQL Server is `DATETIMEOFFSET(7)`.
|
||||
* `_airbyte_data`: a JSON blob containing the event data. The column type in SQL Server is `NVARCHAR(MAX)`.
|
||||
|
||||
#### Microsoft SQL Server specifics or why NVARCHAR type is used here:
|
||||
#### Microsoft SQL Server specifics or why NVARCHAR type is used here:
|
||||
|
||||
* NVARCHAR is Unicode - 2 bytes per character, therefore max. of 1 billion characters; will handle East Asian, Arabic, Hebrew, Cyrillic etc. characters just fine.
|
||||
* VARCHAR is non-Unicode - 1 byte per character, max. capacity is 2 billion characters, but limited to the character set your SQL Server is using; basically, there is no support for the languages mentioned above.
|
||||
|
||||
## Getting Started (Airbyte Cloud)
|
||||
## Getting Started \(Airbyte Cloud\)
|
||||
|
||||
Airbyte Cloud only supports connecting to your MSSQL instance with TLS encryption. Other than that, you can proceed with the open-source instructions below.
|
||||
|
||||
| Feature | Supported?\(Yes/No\) | Notes |
|
||||
| :--- | :--- | :--- |
|
||||
| Full Refresh Sync | Yes | |
|
||||
| Incremental - Append Sync | Yes | |
|
||||
| Incremental - Deduped History | Yes | |
|
||||
| Incremental - Deduped History | Yes | |
|
||||
| Namespaces | Yes | |
|
||||
|
||||
## Getting Started (Airbyte Open-Source)
|
||||
## Getting Started \(Airbyte Open-Source\)
|
||||
|
||||
### Requirements
|
||||
|
||||
@@ -45,11 +47,10 @@ MS SQL Server: `Azure SQL Database`, `Azure Synapse Analytics`, `Azure SQL Manag
|
||||
|
||||
### Normalization Requirements
|
||||
|
||||
To sync **with** normalization you'll need to use MS SQL Server of the following versions:
|
||||
`SQL Server 2019`, `SQL Server 2017`, `SQL Server 2016`, `SQL Server 2014`.
|
||||
The work of normalization on `SQL Server 2012` and bellow are not guaranteed.
|
||||
To sync **with** normalization you'll need to use one of the following MS SQL Server versions: `SQL Server 2019`, `SQL Server 2017`, `SQL Server 2016`, `SQL Server 2014`. Normalization on `SQL Server 2012` and below is not guaranteed to work.
|
||||
|
||||
### Setup guide
|
||||
|
||||
* MS SQL Server: `Azure SQL Database`, `Azure Synapse Analytics`, `Azure SQL Managed Instance`, `SQL Server 2019`, `SQL Server 2017`, `SQL Server 2016`, `SQL Server 2014`, `SQL Server 2012`, or `PDW 2008R2 AU34`.
|
||||
|
||||
#### Network Access
|
||||
@@ -64,9 +65,9 @@ You need a user configured in SQL Server that can create tables and write rows.
|
||||
|
||||
You will need to choose an existing database or create a new database that will be used to store synced data from Airbyte.
|
||||
|
||||
#### SSL configuration (optional)
|
||||
#### SSL configuration \(optional\)
|
||||
|
||||
Airbyte supports a SSL-encrypted connection to the database. If you want to use SSL to securely access your database, ensure that [the server is configured to use an SSL certificate.](https://support.microsoft.com/en-us/topic/how-to-enable-ssl-encryption-for-an-instance-of-sql-server-by-using-microsoft-management-console-1c7ae22f-8518-2b3e-93eb-d735af9e344c)
|
||||
Airbyte supports an SSL-encrypted connection to the database. If you want to use SSL to securely access your database, ensure that [the server is configured to use an SSL certificate](https://support.microsoft.com/en-us/topic/how-to-enable-ssl-encryption-for-an-instance-of-sql-server-by-using-microsoft-management-console-1c7ae22f-8518-2b3e-93eb-d735af9e344c).
|
||||
|
||||
### Setup the MSSQL destination in Airbyte
|
||||
|
||||
@@ -78,48 +79,53 @@ You should now have all the requirements needed to configure SQL Server as a des
|
||||
* **Password**
|
||||
* **Schema**
|
||||
* **Database**
|
||||
* This database needs to exist within the schema provided.
|
||||
* This database needs to exist within the schema provided.
|
||||
* **SSL Method**:
|
||||
* The SSL configuration supports three modes: Unencrypted, Encrypted (trust server certificate), and Encrypted (verify certificate).
|
||||
* The SSL configuration supports three modes: Unencrypted, Encrypted \(trust server certificate\), and Encrypted \(verify certificate\).
|
||||
* **Unencrypted**: Do not use SSL encryption on the database connection
|
||||
* **Encrypted (trust server certificate)**: Use SSL encryption without verifying the server's certificate. This is useful for self-signed certificates in testing scenarios, but should not be used in production.
|
||||
* **Encrypted (verify certificate)**: Use the server's SSL certificate, after standard certificate verification.
|
||||
* **Host Name In Certificate** (optional): When using certificate verification, this property can be set to specify an expected name for added security. If this value is present, and the server's certificate's host name does not match it, certificate verification will fail.
|
||||
|
||||
* **Encrypted \(trust server certificate\)**: Use SSL encryption without verifying the server's certificate. This is useful for self-signed certificates in testing scenarios, but should not be used in production.
|
||||
* **Encrypted \(verify certificate\)**: Use the server's SSL certificate, after standard certificate verification.
|
||||
* **Host Name In Certificate** \(optional\): When using certificate verification, this property can be set to specify an expected name for added security. If this value is present, and the server's certificate's host name does not match it, certificate verification will fail.
|
||||
|
||||
## Connection via SSH Tunnel
|
||||
|
||||
Airbyte has the ability to connect to the MS SQL Server instance via an SSH Tunnel. The reason you might want to do this because it is not possible
|
||||
(or against security policy) to connect to the database directly (e.g. it does not have a public IP address).
|
||||
Airbyte has the ability to connect to the MS SQL Server instance via an SSH Tunnel. The reason you might want to do this is that it is not possible \(or against security policy\) to connect to the database directly \(e.g. it does not have a public IP address\).
|
||||
|
||||
When using an SSH tunnel, you are configuring Airbyte to connect to an intermediate server (a.k.a. a bastion sever) that have direct access to the database.
|
||||
Airbyte connects to the bastion and then asks the bastion to connect directly to the server.
|
||||
When using an SSH tunnel, you are configuring Airbyte to connect to an intermediate server \(a.k.a. a bastion server\) that has direct access to the database. Airbyte connects to the bastion and then asks the bastion to connect directly to the server.
|
||||
|
||||
Using this feature requires additional configuration when creating the destination. We will talk through what each piece of configuration means (a conceptual sketch follows the list below).
|
||||
|
||||
1. Configure all fields for the source as you normally would, except `SSH Tunnel Method`.
|
||||
2. `SSH Tunnel Method` defaults to `No Tunnel` (meaning a direct connection). If you want to use an SSH Tunnel choose `SSH Key Authentication` or `Password Authentication`.
|
||||
3. Choose `Key Authentication` if you will be using an RSA private key as your secret for establishing the SSH Tunnel (see below for more information on generating this key).
|
||||
2. `SSH Tunnel Method` defaults to `No Tunnel` \(meaning a direct connection\). If you want to use an SSH Tunnel choose `SSH Key Authentication` or `Password Authentication`.
|
||||
3. Choose `Key Authentication` if you will be using an RSA private key as your secret for establishing the SSH Tunnel \(see below for more information on generating this key\).
|
||||
4. Choose `Password Authentication` if you will be using a password as your secret for establishing the SSH Tunnel.
|
||||
5. `SSH Tunnel Jump Server Host` refers to the intermediate (bastion) server that Airbyte will connect to. This should be a hostname or an IP Address.
|
||||
5. `SSH Tunnel Jump Server Host` refers to the intermediate \(bastion\) server that Airbyte will connect to. This should be a hostname or an IP Address.
|
||||
6. `SSH Connection Port` is the port on the bastion server with which to make the SSH connection. The default port for SSH connections is `22`,
|
||||
so unless you have explicitly changed something, go with the default.
|
||||
|
||||
so unless you have explicitly changed something, go with the default.
|
||||
|
||||
7. `SSH Login Username` is the username that Airbyte should use when connecting to the bastion server. This is NOT the MS SQL Server username.
|
||||
8. If you are using `Password Authentication`, then `SSH Login Username` should be set to the password of the User from the previous step.
|
||||
If you are using `SSH Key Authentication` leave this blank. Again, this is not the MS SQL Server password, but the password for the OS-user that
|
||||
Airbyte is using to perform commands on the bastion.
|
||||
|
||||
If you are using `SSH Key Authentication` leave this blank. Again, this is not the MS SQL Server password, but the password for the OS-user that
|
||||
|
||||
Airbyte is using to perform commands on the bastion.
|
||||
|
||||
9. If you are using `SSH Key Authentication`, then `SSH Private Key` should be set to the RSA Private Key that you are using to create the SSH connection.
|
||||
This should be the full contents of the key file starting with `-----BEGIN RSA PRIVATE KEY-----` and ending with `-----END RSA PRIVATE KEY-----`.
|
||||
|
||||
This should be the full contents of the key file starting with `-----BEGIN RSA PRIVATE KEY-----` and ending with `-----END RSA PRIVATE KEY-----`.
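
For readers who want to see what the tunnel amounts to conceptually, here is a minimal sketch using the third-party `sshtunnel` and `pyodbc` Python packages. These are not part of Airbyte, and every host name, port, path and credential below is a placeholder:

```python
import pyodbc                              # third-party ODBC driver binding
from sshtunnel import SSHTunnelForwarder   # third-party package, not part of Airbyte

# Placeholders: bastion address, OS user, key path, and the private SQL Server host.
with SSHTunnelForwarder(
    ("bastion.example.com", 22),
    ssh_username="airbyte",
    ssh_pkey="/path/to/id_rsa",
    remote_bind_address=("mssql.internal.example.com", 1433),
) as tunnel:
    # Connect to the database through the locally forwarded port.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER=127.0.0.1,{tunnel.local_bind_port};"
        "DATABASE=airbyte_db;UID=db_user;PWD=db_password"
    )
    print(conn.cursor().execute("SELECT 1").fetchone())
    conn.close()
```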
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.1.9 | 2021-09-29 | [#5970](https://github.com/airbytehq/airbyte/pull/5970) | Add support & test cases for MSSQL Destination via SSH tunnels |
|
||||
| 0.1.8 | 2021-08-07 | [#5272](https://github.com/airbytehq/airbyte/pull/5272) | Add batch method to insert records |
|
||||
| 0.1.7 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.6 | 2021-06-21 | [#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.1.5 | 2021-07-20 | [#4874](https://github.com/airbytehq/airbyte/pull/4874) | declare object types correctly in spec |
|
||||
| 0.1.4 | 2021-06-17 | [#3744](https://github.com/airbytehq/airbyte/pull/3744) | Fix doc/params in specification file |
|
||||
| 0.1.3 | 2021-05-28 | [#3728](https://github.com/airbytehq/airbyte/pull/3973) | Change dockerfile entrypoint |
|
||||
| 0.1.2 | 2021-05-13 | [#3367](https://github.com/airbytehq/airbyte/pull/3671) | Fix handle symbols unicode |
|
||||
| 0.1.1 | 2021-05-11 | [#3566](https://github.com/airbytehq/airbyte/pull/3195) | MS SQL Server Destination Release! |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.9 | 2021-09-29 | [\#5970](https://github.com/airbytehq/airbyte/pull/5970) | Add support & test cases for MSSQL Destination via SSH tunnels |
|
||||
| 0.1.8 | 2021-08-07 | [\#5272](https://github.com/airbytehq/airbyte/pull/5272) | Add batch method to insert records |
|
||||
| 0.1.7 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.6 | 2021-06-21 | [\#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.1.5 | 2021-07-20 | [\#4874](https://github.com/airbytehq/airbyte/pull/4874) | declare object types correctly in spec |
|
||||
| 0.1.4 | 2021-06-17 | [\#3744](https://github.com/airbytehq/airbyte/pull/3744) | Fix doc/params in specification file |
|
||||
| 0.1.3 | 2021-05-28 | [\#3728](https://github.com/airbytehq/airbyte/pull/3973) | Change dockerfile entrypoint |
|
||||
| 0.1.2 | 2021-05-13 | [\#3367](https://github.com/airbytehq/airbyte/pull/3671) | Fix handle symbols unicode |
|
||||
| 0.1.1 | 2021-05-11 | [\#3566](https://github.com/airbytehq/airbyte/pull/3195) | MS SQL Server Destination Release! |
|
||||
|
||||
|
||||
@@ -18,10 +18,11 @@ Each stream will be output into its own table in MySQL. Each table will contain
|
||||
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in MySQL is `TIMESTAMP(6)`.
|
||||
* `_airbyte_data`: a JSON blob containing the event data. The column type in MySQL is `JSON`.
|
||||
|
||||
## Getting Started (Airbyte Cloud)
|
||||
## Getting Started \(Airbyte Cloud\)
|
||||
|
||||
Airbyte Cloud only supports connecting to your MySQL instance with TLS encryption. Other than that, you can proceed with the open-source instructions below.
|
||||
|
||||
## Getting Started (Airbyte Open-Source)
|
||||
## Getting Started \(Airbyte Open-Source\)
|
||||
|
||||
### Requirements
|
||||
|
||||
@@ -56,29 +57,30 @@ You should now have all the requirements needed to configure MySQL as a destinat
|
||||
|
||||
## Known Limitations
|
||||
|
||||
Note that MySQL documentation discusses identifiers case sensitivity using the `lower_case_table_names` system variable.
|
||||
One of their recommendations is:
|
||||
Note that the MySQL documentation discusses identifier case sensitivity using the `lower_case_table_names` system variable. One of its recommendations is:
|
||||
|
||||
"It is best to adopt a consistent convention, such as always creating and referring to databases and tables using lowercase names.
|
||||
This convention is recommended for maximum portability and ease of use."
|
||||
```text
|
||||
"It is best to adopt a consistent convention, such as always creating and referring to databases and tables using lowercase names.
|
||||
This convention is recommended for maximum portability and ease of use."
|
||||
```
|
||||
|
||||
[Source: MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/identifier-case-sensitivity.html)
|
||||
|
||||
As a result, Airbyte MySQL destination forces all identifier (table, schema and columns) names to be lowercase.
|
||||
As a result, the Airbyte MySQL destination forces all identifier \(table, schema and column\) names to be lowercase.
|
||||
|
||||
## Connection via SSH Tunnel
|
||||
|
||||
Airbyte has the ability to connect to a MySQl instance via an SSH Tunnel. The reason you might want to do this because it is not possible (or against security policy) to connect to the database directly (e.g. it does not have a public IP address).
|
||||
Airbyte has the ability to connect to a MySQL instance via an SSH Tunnel. The reason you might want to do this is that it is not possible \(or against security policy\) to connect to the database directly \(e.g. it does not have a public IP address\).
|
||||
|
||||
When using an SSH tunnel, you are configuring Airbyte to connect to an intermediate server (a.k.a. a bastion sever) that _does_ have direct access to the database. Airbyte connects to the bastion and then asks the bastion to connect directly to the server.
|
||||
When using an SSH tunnel, you are configuring Airbyte to connect to an intermediate server \(a.k.a. a bastion server\) that _does_ have direct access to the database. Airbyte connects to the bastion and then asks the bastion to connect directly to the server.
|
||||
|
||||
Using this feature requires additional configuration when creating the destination. We will talk through what each piece of configuration means.
|
||||
|
||||
1. Configure all fields for the destination as you normally would, except `SSH Tunnel Method`.
|
||||
2. `SSH Tunnel Method` defaults to `No Tunnel` (meaning a direct connection). If you want to use an SSH Tunnel choose `SSH Key Authentication` or `Password Authentication`.
|
||||
1. Choose `Key Authentication` if you will be using an RSA private key as your secret for establishing the SSH Tunnel (see below for more information on generating this key).
|
||||
2. Choose `Password Authentication` if you will be using a password as your secret for establishing the SSH Tunnel.
|
||||
3. `SSH Tunnel Jump Server Host` refers to the intermediate (bastion) server that Airbyte will connect to. This should be a hostname or an IP Address.
|
||||
2. `SSH Tunnel Method` defaults to `No Tunnel` \(meaning a direct connection\). If you want to use an SSH Tunnel choose `SSH Key Authentication` or `Password Authentication`.
|
||||
1. Choose `Key Authentication` if you will be using an RSA private key as your secret for establishing the SSH Tunnel \(see below for more information on generating this key\).
|
||||
2. Choose `Password Authentication` if you will be using a password as your secret for establishing the SSH Tunnel.
|
||||
3. `SSH Tunnel Jump Server Host` refers to the intermediate \(bastion\) server that Airbyte will connect to. This should be a hostname or an IP Address.
|
||||
4. `SSH Connection Port` is the port on the bastion server with which to make the SSH connection. The default port for SSH connections is `22`, so unless you have explicitly changed something, go with the default.
|
||||
5. `SSH Login Username` is the username that Airbyte should use when connection to the bastion server. This is NOT the MySQl username.
|
||||
6. If you are using `Password Authentication`, then `SSH Login Username` should be set to the password of the user from the previous step. If you are using `SSH Key Authentication` leave this blank. Again, this is not the MySQL password, but the password for the OS-user that Airbyte is using to perform commands on the bastion.
|
||||
@@ -87,16 +89,17 @@ Using this feature requires additional configuration, when creating the destinat
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.13 | 2021-09-28 | [#6506](https://github.com/airbytehq/airbyte/pull/6506) | Added support for MySQL destination via TLS/SSL |
|
||||
| 0.1.12 | 2021-09-24 | [#6317](https://github.com/airbytehq/airbyte/pull/6317) | Added option to connect to DB via SSH |
|
||||
| 0.1.11 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.10 | 2021-07-28 | [#5026](https://github.com/airbytehq/airbyte/pull/5026) | Add sanitized json fields in raw tables to handle quotes in column names |
|
||||
| 0.1.7 | 2021-07-09 | [#4651](https://github.com/airbytehq/airbyte/pull/4651) | Switch normalization flag on so users can use normalization. |
|
||||
| 0.1.6 | 2021-07-03 | [#4531](https://github.com/airbytehq/airbyte/pull/4531) | Added normalization for MySQL. |
|
||||
| 0.1.5 | 2021-07-03 | [#3973](https://github.com/airbytehq/airbyte/pull/3973) | Added `AIRBYTE_ENTRYPOINT` for kubernetes support. |
|
||||
| 0.1.4 | 2021-07-03 | [#3290](https://github.com/airbytehq/airbyte/pull/3290) | Switched to get states from destination instead of source. |
|
||||
| 0.1.3 | 2021-07-03 | [#3387](https://github.com/airbytehq/airbyte/pull/3387) | Fixed a bug for message length checking. |
|
||||
| 0.1.2 | 2021-07-03 | [#3327](https://github.com/airbytehq/airbyte/pull/3327) | Fixed LSEP unicode characters. |
|
||||
| 0.1.1 | 2021-07-03 | [#3289](https://github.com/airbytehq/airbyte/pull/3289) | Added support for outputting messages. |
|
||||
| 0.1.0 | 2021-05-06 | [#3242](https://github.com/airbytehq/airbyte/pull/3242) | Added MySQL destination. |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.13 | 2021-09-28 | [\#6506](https://github.com/airbytehq/airbyte/pull/6506) | Added support for MySQL destination via TLS/SSL |
|
||||
| 0.1.12 | 2021-09-24 | [\#6317](https://github.com/airbytehq/airbyte/pull/6317) | Added option to connect to DB via SSH |
|
||||
| 0.1.11 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.10 | 2021-07-28 | [\#5026](https://github.com/airbytehq/airbyte/pull/5026) | Add sanitized json fields in raw tables to handle quotes in column names |
|
||||
| 0.1.7 | 2021-07-09 | [\#4651](https://github.com/airbytehq/airbyte/pull/4651) | Switch normalization flag on so users can use normalization. |
|
||||
| 0.1.6 | 2021-07-03 | [\#4531](https://github.com/airbytehq/airbyte/pull/4531) | Added normalization for MySQL. |
|
||||
| 0.1.5 | 2021-07-03 | [\#3973](https://github.com/airbytehq/airbyte/pull/3973) | Added `AIRBYTE_ENTRYPOINT` for kubernetes support. |
|
||||
| 0.1.4 | 2021-07-03 | [\#3290](https://github.com/airbytehq/airbyte/pull/3290) | Switched to get states from destination instead of source. |
|
||||
| 0.1.3 | 2021-07-03 | [\#3387](https://github.com/airbytehq/airbyte/pull/3387) | Fixed a bug for message length checking. |
|
||||
| 0.1.2 | 2021-07-03 | [\#3327](https://github.com/airbytehq/airbyte/pull/3327) | Fixed LSEP unicode characters. |
|
||||
| 0.1.1 | 2021-07-03 | [\#3289](https://github.com/airbytehq/airbyte/pull/3289) | Added support for outputting messages. |
|
||||
| 0.1.0 | 2021-05-06 | [\#3242](https://github.com/airbytehq/airbyte/pull/3242) | Added MySQL destination. |
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Oracle
|
||||
# Oracle DB
|
||||
|
||||
## Features
|
||||
|
||||
@@ -19,12 +19,13 @@ By default, each stream will be output into its own table in Oracle. Each table
|
||||
* `_AIRBYTE_EMITTED_AT`: a timestamp representing when the event was pulled from the data source. The column type in Oracle is `TIMESTAMP WITH TIME ZONE`.
|
||||
* `_AIRBYTE_DATA`: a JSON blob containing the event data. The column type in Oracle is `NCLOB`.
|
||||
|
||||
Enabling normalization will also create normalized, strongly typed tables.
|
||||
Enabling normalization will also create normalized, strongly typed tables.
|
||||
|
||||
## Getting Started \(Airbyte Cloud\)
|
||||
|
||||
## Getting Started (Airbyte Cloud)
|
||||
The Oracle connector is currently in Alpha on Airbyte Cloud. Only TLS encrypted connections to your DB can be made from Airbyte Cloud. Other than that, follow the open-source instructions below.
|
||||
|
||||
## Getting Started (Airbyte Open-Source)
|
||||
## Getting Started \(Airbyte Open-Source\)
|
||||
|
||||
#### Requirements
|
||||
|
||||
@@ -43,10 +44,10 @@ As Airbyte namespaces allows us to store data into different schemas, we have di
|
||||
|
||||
| Login user | Destination user | Required permissions | Comment |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| DBA User | Any user | - | |
|
||||
| Regular user | Same user as login | Create, drop and write table, create session | |
|
||||
| DBA User | Any user | - | |
|
||||
| Regular user | Same user as login | Create, drop and write table, create session | |
|
||||
| Regular user | Any existing user | Create, drop and write ANY table, create session | Grants can be provided on a system level by DBA or by target user directly |
|
||||
| Regular user | Not existing user | Create, drop and write ANY table, create user, create session | Grants should be provided on a system level by DBA |
|
||||
| Regular user | Not existing user | Create, drop and write ANY table, create user, create session | Grants should be provided on a system level by DBA |
|
||||
|
||||
We highly recommend creating an Airbyte-specific user for this purpose.
|
||||
|
||||
@@ -59,33 +60,34 @@ You should now have all the requirements needed to configure Oracle as a destina
|
||||
* **Username**
|
||||
* **Password**
|
||||
* **Database**
|
||||
*
|
||||
## Connection via SSH Tunnel
|
||||
* **Connection via SSH Tunnel**
|
||||
|
||||
Airbyte has the ability to connect to a Oracle instance via an SSH Tunnel. The reason you might want to do this because it is not possible (or against security policy) to connect to the database directly (e.g. it does not have a public IP address).
|
||||
Airbyte has the ability to connect to an Oracle instance via an SSH Tunnel. The reason you might want to do this is that it is not possible \(or against security policy\) to connect to the database directly \(e.g. it does not have a public IP address\).
|
||||
|
||||
When using an SSH tunnel, you are configuring Airbyte to connect to an intermediate server (a.k.a. a bastion sever) that _does_ have direct access to the database. Airbyte connects to the bastion and then asks the bastion to connect directly to the server.
|
||||
When using an SSH tunnel, you are configuring Airbyte to connect to an intermediate server \(a.k.a. a bastion server\) that _does_ have direct access to the database. Airbyte connects to the bastion and then asks the bastion to connect directly to the server.
|
||||
|
||||
Using this feature requires additional configuration when creating the destination. We will talk through what each piece of configuration means.
|
||||
|
||||
1. Configure all fields for the source as you normally would, except `SSH Tunnel Method`.
|
||||
2. `SSH Tunnel Method` defaults to `No Tunnel` (meaning a direct connection). If you want to use an SSH Tunnel choose `SSH Key Authentication` or `Password Authentication`.
|
||||
1. Choose `Key Authentication` if you will be using an RSA private key as your secret for establishing the SSH Tunnel (see below for more information on generating this key).
|
||||
2. Choose `Password Authentication` if you will be using a password as your secret for establishing the SSH Tunnel.
|
||||
3. `SSH Tunnel Jump Server Host` refers to the intermediate (bastion) server that Airbyte will connect to. This should be a hostname or an IP Address.
|
||||
2. `SSH Tunnel Method` defaults to `No Tunnel` \(meaning a direct connection\). If you want to use an SSH Tunnel choose `SSH Key Authentication` or `Password Authentication`.
|
||||
1. Choose `Key Authentication` if you will be using an RSA private key as your secret for establishing the SSH Tunnel \(see below for more information on generating this key\).
|
||||
2. Choose `Password Authentication` if you will be using a password as your secret for establishing the SSH Tunnel.
|
||||
3. `SSH Tunnel Jump Server Host` refers to the intermediate \(bastion\) server that Airbyte will connect to. This should be a hostname or an IP Address.
|
||||
4. `SSH Connection Port` is the port on the bastion server with which to make the SSH connection. The default port for SSH connections is `22`, so unless you have explicitly changed something, go with the default.
|
||||
5. `SSH Login Username` is the username that Airbyte should use when connecting to the bastion server. This is NOT the Oracle username.
|
||||
6. If you are using `Password Authentication`, then `SSH Login Username` should be set to the password of the User from the previous step. If you are using `SSH Key Authentication` leave this blank. Again, this is not the Oracle password, but the password for the OS-user that Airbyte is using to perform commands on the bastion.
|
||||
7. If you are using `SSH Key Authentication`, then `SSH Private Key` should be set to the RSA Private Key that you are using to create the SSH connection. This should be the full contents of the key file starting with `-----BEGIN RSA PRIVATE KEY-----` and ending with `-----END RSA PRIVATE KEY-----`.
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.9 | 2021-10-06 | [#6611](https://github.com/airbytehq/airbyte/pull/6611)| 🐛 Destination Oracle: maxStringLength should be 128|
|
||||
| 0.1.8 | 2021-09-28 | [#6370](https://github.com/airbytehq/airbyte/pull/6370)| Add SSH Support for Oracle Destination |
|
||||
| 0.1.7 | 2021-08-30 | [#5746](https://github.com/airbytehq/airbyte/pull/5746) | Use default column name for raw tables |
|
||||
| 0.1.6 | 2021-08-23 | [#5542](https://github.com/airbytehq/airbyte/pull/5542) | Remove support for Oracle 11g to allow normalization |
|
||||
| 0.1.5 | 2021-08-10 | [#5307](https://github.com/airbytehq/airbyte/pull/5307) | 🐛 Destination Oracle: Fix destination check for users without dba role |
|
||||
| 0.1.4 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.3 | 2021-07-21 | [#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.9 | 2021-10-06 | [\#6611](https://github.com/airbytehq/airbyte/pull/6611) | 🐛 Destination Oracle: maxStringLength should be 128 |
|
||||
| 0.1.8 | 2021-09-28 | [\#6370](https://github.com/airbytehq/airbyte/pull/6370) | Add SSH Support for Oracle Destination |
|
||||
| 0.1.7 | 2021-08-30 | [\#5746](https://github.com/airbytehq/airbyte/pull/5746) | Use default column name for raw tables |
|
||||
| 0.1.6 | 2021-08-23 | [\#5542](https://github.com/airbytehq/airbyte/pull/5542) | Remove support for Oracle 11g to allow normalization |
|
||||
| 0.1.5 | 2021-08-10 | [\#5307](https://github.com/airbytehq/airbyte/pull/5307) | 🐛 Destination Oracle: Fix destination check for users without dba role |
|
||||
| 0.1.4 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.1.3 | 2021-07-21 | [\#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.1.2 | 2021-07-20 | [4874](https://github.com/airbytehq/airbyte/pull/4874) | Require `sid` instead of `database` in connector specification |
|
||||
|
||||
|
||||
@@ -21,10 +21,11 @@ Each stream will be output into its own table in Postgres. Each table will conta
|
||||
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in Postgres is `TIMESTAMP WITH TIME ZONE`.
|
||||
* `_airbyte_data`: a JSON blob containing the event data. The column type in Postgres is `JSONB`.
|
||||
|
||||
## Getting Started (Airbyte Cloud)
|
||||
## Getting Started \(Airbyte Cloud\)
|
||||
|
||||
Airbyte Cloud only supports connecting to your Postgres instance with SSL or TLS encryption. TLS is used by default. Other than that, you can proceed with the open-source instructions below.
|
||||
|
||||
## Getting Started (Airbyte Open-Source)
|
||||
## Getting Started \(Airbyte Open-Source\)
|
||||
|
||||
#### Requirements
|
||||
|
||||
@@ -45,6 +46,7 @@ You need a Postgres user that can create tables and write rows. We highly recomm
|
||||
You will need to choose an existing database or create a new database that will be used to store synced data from Airbyte.
|
||||
|
||||
### Setup the Postgres Destination in Airbyte
|
||||
|
||||
You should now have all the requirements needed to configure Postgres as a destination in the UI. You'll need the following information to configure the Postgres destination:
|
||||
|
||||
* **Host**
|
||||
@@ -72,7 +74,9 @@ From [Postgres SQL Identifiers syntax](https://www.postgresql.org/docs/9.0/sql-s
|
||||
Therefore, the Airbyte Postgres destination will create tables and schemas using unquoted identifiers when possible, or fall back to quoted identifiers if the names contain special characters.
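
A rough sketch of that fallback rule, not the connector's actual code, could look like this:

```python
import re

# Use the name as-is when it is a valid unquoted Postgres identifier after lowercase
# folding; otherwise wrap it in double quotes, escaping any embedded quotes.
def postgres_identifier(name: str) -> str:
    if re.fullmatch(r"[a-z_][a-z0-9_$]*", name):
        return name
    return '"' + name.replace('"', '""') + '"'

print(postgres_identifier("my_table"))    # -> my_table
print(postgres_identifier("My Table!"))   # -> "My Table!"
```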
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.3.10 | 2021-08-11 | [#5336](https://github.com/airbytehq/airbyte/pull/5336) | 🐛 Destination Postgres: fix \u0000(NULL) value processing |
|
||||
| 0.3.11 | 2021-09-07 | [#5743](https://github.com/airbytehq/airbyte/pull/5743) | Add SSH Tunnel support |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.3.10 | 2021-08-11 | [\#5336](https://github.com/airbytehq/airbyte/pull/5336) | 🐛 Destination Postgres: fix \u0000\(NULL\) value processing |
|
||||
| 0.3.11 | 2021-09-07 | [\#5743](https://github.com/airbytehq/airbyte/pull/5743) | Add SSH Tunnel support |
|
||||
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
---
|
||||
description: 'Pub/Sub is an asynchronous messaging service provided by Google Cloud Provider.'
|
||||
description: >-
|
||||
Pub/Sub is an asynchronous messaging service provided by Google Cloud
|
||||
Provider.
|
||||
---
|
||||
|
||||
# Google PubSub
|
||||
@@ -9,7 +11,7 @@ description: 'Pub/Sub is an asynchronous messaging service provided by Google Cl
|
||||
The Airbyte Google PubSub destination allows you to send/stream data into PubSub. Pub/Sub is an asynchronous messaging service provided by Google Cloud Provider.
|
||||
|
||||
### Sync overview
|
||||
|
||||
|
||||
#### Output schema
|
||||
|
||||
Each stream will be output as PubSubMessages with attributes. The message attributes will be
|
||||
@@ -17,9 +19,10 @@ Each stream will be output a PubSubMessage with attributes. The message attribut
|
||||
* `_stream`: the name of stream where the data is coming from
|
||||
* `_namespace`: namespace if available from the stream
|
||||
|
||||
The data will be a serialized JSON, containing the following fields
|
||||
The message data will be serialized JSON containing the following fields \(a publishing sketch follows this list\):
|
||||
|
||||
* `_airbyte_ab_id`: a uuid string assigned by Airbyte to each event that is processed.
|
||||
* `_airbyte_emitted_at`: a long timestamp(ms) representing when the event was pulled from the data source.
|
||||
* `_airbyte_emitted_at`: a long timestamp\(ms\) representing when the event was pulled from the data source.
|
||||
* `_airbyte_data`: a json string representing source data.
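Purely for illustration, this is roughly what such a message could look like if it were published with the Python `google-cloud-pubsub` client; the project, topic, stream, and namespace names below are placeholders, not values used by the connector:

```python
import json
import time
import uuid

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("my-gcp-project", "airbyte-sync-topic")

record = {"user_id": 123, "name": "John"}
payload = {
    "_airbyte_ab_id": str(uuid.uuid4()),
    "_airbyte_emitted_at": int(time.time() * 1000),  # long timestamp in ms
    "_airbyte_data": json.dumps(record),             # source data as a JSON string
}

future = publisher.publish(
    topic_path,
    data=json.dumps(payload).encode("utf-8"),
    _stream="users",       # attribute: name of the stream the data comes from
    _namespace="public",   # attribute: namespace, if available
)
print(future.result())     # message ID once the publish succeeds
```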
|
||||
|
||||
#### Features
|
||||
@@ -50,8 +53,7 @@ See the setup guide for more information about how to create the required resour
|
||||
|
||||
If you have a Google Cloud Project with PubSub enabled, skip to the "Create a Topic" section.
|
||||
|
||||
First, follow the Google Cloud instructions to [Create a Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#before_you_begin).
|
||||
PubSub is enabled automatically in new projects. If this is not the case for your project, find it in [Marketplace](https://console.cloud.google.com/marketplace/product/google/pubsub.googleapis.com) and enable it.
|
||||
First, follow the Google Cloud instructions to [Create a Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#before_you_begin). PubSub is enabled automatically in new projects. If this is not the case for your project, find it in [Marketplace](https://console.cloud.google.com/marketplace/product/google/pubsub.googleapis.com) and enable it.
|
||||
|
||||
#### PubSub topic for Airbyte syncs
|
||||
|
||||
@@ -86,6 +88,7 @@ Once you've configured PubSub as a destination, delete the Service Account Key f
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.1 | August 13, 2021 | [#4699](https://github.com/airbytehq/airbyte/pull/4699)| Added json config validator |
|
||||
| 0.1.0 | June 24, 2021 | [#4339](https://github.com/airbytehq/airbyte/pull/4339)| Initial release |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.1 | August 13, 2021 | [\#4699](https://github.com/airbytehq/airbyte/pull/4699) | Added json config validator |
|
||||
| 0.1.0 | June 24, 2021 | [\#4339](https://github.com/airbytehq/airbyte/pull/4339) | Initial release |
|
||||
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
The Airbyte Redshift destination allows you to sync data to Redshift.
|
||||
|
||||
This Redshift destination connector has two replication strategies:
|
||||
This Redshift destination connector has two replication strategies:
|
||||
|
||||
1. INSERT: Replicates data via SQL INSERT queries. This is built on top of the destination-jdbc code base and is configured to rely on JDBC 4.2 standard drivers provided by Amazon via Mulesoft [here](https://mvnrepository.com/artifact/com.amazon.redshift/redshift-jdbc42) as described in Redshift documentation [here](https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-install.html). **Not recommended for production workloads as this does not scale well**.
|
||||
2. COPY: Replicates data by first uploading data to an S3 bucket and issuing a COPY command. This is the recommended loading approach described by Redshift [best practices](https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html). Requires an S3 bucket and credentials.
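For reference, the COPY strategy ultimately relies on a statement of the following shape. This is only a sketch: the bucket, table, credentials, and file format below are placeholders, and the exact options the connector issues may differ.

```python
import psycopg2  # Redshift speaks the Postgres wire protocol

# Placeholders throughout: cluster endpoint, database, table, bucket and credentials.
copy_sql = """
    COPY public._airbyte_raw_users
    FROM 's3://my-staging-bucket/airbyte/users/part-0000.csv.gz'
    CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
    CSV GZIP TIMEFORMAT 'auto';
"""

conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="airbyte", password="password")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```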
|
||||
@@ -45,7 +45,7 @@ You will need to choose an existing database or create a new database that will
|
||||
3. A staging S3 bucket with credentials \(for the COPY strategy\).
|
||||
|
||||
{% hint style="info" %}
|
||||
Even if your Airbyte instance is running on a server in the same VPC as your Redshift cluster, you may need to place them in the **same security group** to allow connections between the two.
|
||||
Even if your Airbyte instance is running on a server in the same VPC as your Redshift cluster, you may need to place them in the **same security group** to allow connections between the two.
|
||||
{% endhint %}
|
||||
|
||||
### Setup guide
|
||||
@@ -109,9 +109,10 @@ See [docs](https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.3.14 | 2021-10-08 | [5924](https://github.com/airbytehq/airbyte/pull/5924) | Fixed AWS S3 Staging COPY is writing records from different table in the same raw table |
|
||||
| 0.3.13 | 2021-09-02 | [5745](https://github.com/airbytehq/airbyte/pull/5745) | Disable STATUPDATE flag when using S3 staging to speed up performance |
|
||||
| 0.3.12 | 2021-07-21 | [3555](https://github.com/airbytehq/airbyte/pull/3555) | Enable partial checkpointing for halfway syncs |
|
||||
| 0.3.11 | 2021-07-20 | [4874](https://github.com/airbytehq/airbyte/pull/4874) | allow `additionalProperties` in connector spec |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.3.14 | 2021-10-08 | [5924](https://github.com/airbytehq/airbyte/pull/5924) | Fixed AWS S3 Staging COPY is writing records from different table in the same raw table |
|
||||
| 0.3.13 | 2021-09-02 | [5745](https://github.com/airbytehq/airbyte/pull/5745) | Disable STATUPDATE flag when using S3 staging to speed up performance |
|
||||
| 0.3.12 | 2021-07-21 | [3555](https://github.com/airbytehq/airbyte/pull/3555) | Enable partial checkpointing for halfway syncs |
|
||||
| 0.3.11 | 2021-07-20 | [4874](https://github.com/airbytehq/airbyte/pull/4874) | allow `additionalProperties` in connector spec |
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
| Feature | Support | Notes |
|
||||
| :--- | :---: | :--- |
|
||||
| Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured bucket path. |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Append Sync | ✅ | |
|
||||
| Incremental - Deduped History | ❌ | As this connector does not support dbt, we don't support this sync mode on this destination. |
|
||||
| Namespaces | ❌ | Setting a specific bucket path is equivalent to having separate namespaces. |
|
||||
|
||||
@@ -26,19 +26,19 @@ Check out common troubleshooting issues for the S3 destination connector on our
|
||||
| Access Key ID | string | AWS/Minio credential. |
|
||||
| Secret Access Key | string | AWS/Minio credential. |
|
||||
| Format | object | Format specific configuration. See below for details. |
|
||||
| Part Size | integer | Arg to configure a block size. Max allowed blocks by S3 = 10,000, i.e. max stream size = blockSize * 10,000 blocks. |
|
||||
| Part Size | integer | Arg to configure a block size. Max allowed blocks by S3 = 10,000, i.e. max stream size = blockSize \* 10,000 blocks. |
|
||||
|
||||
⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated S3 resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️
|
||||
|
||||
The full path of the output data is:
|
||||
|
||||
```
|
||||
```text
|
||||
<bucket-name>/<source-namespace-if-exists>/<stream-name>/<upload-date>-<upload-millis>-<partition-id>.<format-extension>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```
|
||||
```text
|
||||
testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv
|
||||
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
|
||||
| | | | | | | format extension
|
||||
@@ -53,10 +53,7 @@ bucket name
|
||||
|
||||
Please note that the stream name may contain a prefix, if it is configured on the connection.
|
||||
|
||||
The rationales behind this naming pattern are:
|
||||
1. Each stream has its own directory.
|
||||
2. The data output files can be sorted by upload time.
|
||||
3. The upload time is composed of a date part and a millis part so that it is both readable and unique.
|
||||
The rationales behind this naming pattern are: 1. Each stream has its own directory. 2. The data output files can be sorted by upload time. 3. The upload time is composed of a date part and a millis part so that it is both readable and unique.
|
||||
|
||||
Currently, each data sync will only create one file per stream. In the future, the output file can be partitioned by size. Each partition is identifiable by the partition ID, which is always 0 for now.
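As an illustrative sketch (these helper and variable names are not the connector's internals), an object key following this pattern could be assembled like so; note that the example file names join the date, millis, and partition parts with underscores:

```python
from datetime import datetime, timezone

def build_object_key(namespace: str, stream: str, fmt: str, partition_id: int = 0) -> str:
    """Build <namespace>/<stream>/<upload-date>_<upload-millis>_<partition-id>.<ext>."""
    now = datetime.now(timezone.utc)
    upload_date = now.strftime("%Y_%m_%d")
    upload_millis = int(now.timestamp() * 1000)
    prefix = f"{namespace}/" if namespace else ""
    return f"{prefix}{stream}/{upload_date}_{upload_millis}_{partition_id}.{fmt}"

# e.g. public/users/2021_10_11_1633910400000_0.csv
print(build_object_key("public", "users", "csv"))
```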
|
||||
|
||||
@@ -64,39 +61,39 @@ Currently, each data sync will only create one file per stream. In the future, t
|
||||
|
||||
Each stream will be output to its dedicated directory according to the configuration. The complete datastore of each stream includes all the output files under that directory. You can think of the directory as the equivalent of a table in the database world.
|
||||
|
||||
- Under Full Refresh Sync mode, old output files will be purged before new files are created.
|
||||
- Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
|
||||
* Under Full Refresh Sync mode, old output files will be purged before new files are created.
|
||||
* Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
|
||||
|
||||
### Avro
|
||||
|
||||
[Apache Avro](https://avro.apache.org/) serializes data in a compact binary format. Currently, the Airbyte S3 Avro connector always uses the [binary encoding](http://avro.apache.org/docs/current/spec.html#binary_encoding), and assumes that all data records follow the same schema.
|
||||
[Apache Avro](https://avro.apache.org/) serializes data in a compact binary format. Currently, the Airbyte S3 Avro connector always uses the [binary encoding](http://avro.apache.org/docs/current/spec.html#binary_encoding), and assumes that all data records follow the same schema.
|
||||
|
||||
#### Configuration
|
||||
|
||||
Here is the available compression codecs:
|
||||
|
||||
- No compression
|
||||
- `deflate`
|
||||
- Compression level
|
||||
- Range `[0, 9]`. Default to 0.
|
||||
- Level 0: no compression & fastest.
|
||||
- Level 9: best compression & slowest.
|
||||
- `bzip2`
|
||||
- `xz`
|
||||
- Compression level
|
||||
- Range `[0, 9]`. Default to 6.
|
||||
- Level 0-3 are fast with medium compression.
|
||||
- Level 4-6 are fairly slow with high compression.
|
||||
    - Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use the presets 7, 8, or 9, respectively.
|
||||
- `zstandard`
|
||||
- Compression level
|
||||
- Range `[-5, 22]`. Default to 3.
|
||||
- Negative levels are 'fast' modes akin to `lz4` or `snappy`.
|
||||
- Levels above 9 are generally for archival purposes.
|
||||
- Levels above 18 use a lot of memory.
|
||||
- Include checksum
|
||||
- If set to `true`, a checksum will be included in each data block.
|
||||
- `snappy`
|
||||
* No compression
|
||||
* `deflate`
|
||||
* Compression level
|
||||
* Range `[0, 9]`. Default to 0.
|
||||
* Level 0: no compression & fastest.
|
||||
* Level 9: best compression & slowest.
|
||||
* `bzip2`
|
||||
* `xz`
|
||||
* Compression level
|
||||
* Range `[0, 9]`. Default to 6.
|
||||
* Level 0-3 are fast with medium compression.
|
||||
* Level 4-6 are fairly slow with high compression.
|
||||
    * Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use the presets 7, 8, or 9, respectively.
|
||||
* `zstandard`
|
||||
* Compression level
|
||||
* Range `[-5, 22]`. Default to 3.
|
||||
* Negative levels are 'fast' modes akin to `lz4` or `snappy`.
|
||||
* Levels above 9 are generally for archival purposes.
|
||||
* Levels above 18 use a lot of memory.
|
||||
* Include checksum
|
||||
* If set to `true`, a checksum will be included in each data block.
|
||||
* `snappy`
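The connector itself is written in Java, but as a rough illustration of what the codec and level options above control, here is the same choice expressed with the Python `fastavro` library:

```python
import fastavro

schema = {
    "type": "record",
    "name": "users",
    "fields": [
        {"name": "_airbyte_ab_id", "type": "string"},
        {"name": "_airbyte_emitted_at", "type": "long"},
        {"name": "user_id", "type": ["null", "int"], "default": None},
    ],
}

records = [{
    "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206",
    "_airbyte_emitted_at": 1622135805000,
    "user_id": 123,
}]

with open("users.avro", "wb") as out:
    # deflate with level 5: a middle ground between speed and output size
    fastavro.writer(out, schema, records, codec="deflate", codec_compression_level=5)
```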
|
||||
|
||||
#### Data schema
|
||||
|
||||
@@ -104,43 +101,44 @@ Under the hood, an Airbyte data stream in Json schema is converted to an Avro sc
|
||||
|
||||
1. Json schema types are mapped to Avro types as follows:
|
||||
|
||||
| Json Data Type | Avro Data Type |
|
||||
| :---: | :---: |
|
||||
| string | string |
|
||||
| number | double |
|
||||
| integer | int |
|
||||
| boolean | boolean |
|
||||
| null | null |
|
||||
| object | record |
|
||||
| array | array |
|
||||
| Json Data Type | Avro Data Type |
|
||||
| :---: | :---: |
|
||||
| string | string |
|
||||
| number | double |
|
||||
| integer | int |
|
||||
| boolean | boolean |
|
||||
| null | null |
|
||||
| object | record |
|
||||
| array | array |
|
||||
|
||||
2. Built-in Json schema formats are not mapped to Avro logical types at this moment.
|
||||
2. Combined restrictions ("allOf", "anyOf", and "oneOf") will be converted to type unions. The corresponding Avro schema can be less stringent. For example, the following Json schema
|
||||
3. Combined restrictions \("allOf", "anyOf", and "oneOf"\) will be converted to type unions. The corresponding Avro schema can be less stringent. For example, the following Json schema
|
||||
|
||||
```json
|
||||
{
|
||||
```javascript
|
||||
{
|
||||
"oneOf": [
|
||||
{ "type": "string" },
|
||||
{ "type": "integer" }
|
||||
]
|
||||
}
|
||||
```
|
||||
will become this in Avro schema:
|
||||
}
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
will become this in Avro schema:
|
||||
|
||||
```javascript
|
||||
{
|
||||
"type": ["null", "string", "int"]
|
||||
}
|
||||
```
|
||||
}
|
||||
```
|
||||
|
||||
2. Keyword `not` is not supported, as there is no equivalent validation mechanism in Avro schema.
|
||||
3. Only alphanumeric characters and underscores (`/a-zA-Z0-9_/`) are allowed in a stream or field name. Any special character will be converted to a letter or an underscore. For example, `spécial:character_names` will become `special_character_names`. The original names will be stored in the `doc` property in this format: `_airbyte_original_name:<original-name>`.
|
||||
4. The field name cannot start with a number, so an underscore will be added to the field name at the beginning.
|
||||
5. All fields will be nullable. For example, a `string` Json field will be typed as `["null", "string"]` in Avro. This is necessary because the incoming data stream may have optional fields.
|
||||
6. For array fields in Json schema, when the `items` property is an array, it means that each element in the array should follow its own schema sequentially. For example, the following specification means the first item in the array should be a string, and the second a number.
|
||||
4. Keyword `not` is not supported, as there is no equivalent validation mechanism in Avro schema.
|
||||
5. Only alphanumeric characters and underscores \(`/a-zA-Z0-9_/`\) are allowed in a stream or field name. Any special character will be converted to a letter or an underscore. For example, `spécial:character_names` will become `special_character_names`. The original names will be stored in the `doc` property in this format: `_airbyte_original_name:<original-name>`.
|
||||
6. The field name cannot start with a number, so an underscore will be added to the field name at the beginning.
|
||||
7. All fields will be nullable. For example, a `string` Json field will be typed as `["null", "string"]` in Avro. This is necessary because the incoming data stream may have optional fields.
|
||||
8. For array fields in Json schema, when the `items` property is an array, it means that each element in the array should follow its own schema sequentially. For example, the following specification means the first item in the array should be a string, and the second a number.
|
||||
|
||||
```json
|
||||
{
|
||||
```javascript
|
||||
{
|
||||
"array_field": {
|
||||
"type": "array",
|
||||
"items": [
|
||||
@@ -148,13 +146,13 @@ Under the hood, an Airbyte data stream in Json schema is converted to an Avro sc
|
||||
{ "type": "number" }
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
}
|
||||
```
|
||||
|
||||
This is not supported in Avro schema. As a compromise, the converter creates a union, ["string", "number"], which is less stringent:
|
||||
This is not supported in Avro schema. As a compromise, the converter creates a union, \["string", "number"\], which is less stringent:
|
||||
|
||||
```json
|
||||
{
|
||||
```javascript
|
||||
{
|
||||
"name": "array_field",
|
||||
"type": [
|
||||
"null",
|
||||
@@ -164,21 +162,21 @@ Under the hood, an Airbyte data stream in Json schema is converted to an Avro sc
|
||||
}
|
||||
],
|
||||
"default": null
|
||||
}
|
||||
```
|
||||
}
|
||||
```
|
||||
|
||||
6. Two Airbyte specific fields will be added to each Avro record:
|
||||
9. Two Airbyte specific fields will be added to each Avro record:
|
||||
|
||||
| Field | Schema | Document |
|
||||
| :--- | :--- | :---: |
|
||||
| `_airbyte_ab_id` | `uuid` | [link](http://avro.apache.org/docs/current/spec.html#UUID)
|
||||
| `_airbyte_emitted_at` | `timestamp-millis` | [link](http://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29) |
|
||||
| Field | Schema | Document |
|
||||
| :--- | :--- | :---: |
|
||||
| `_airbyte_ab_id` | `uuid` | [link](http://avro.apache.org/docs/current/spec.html#UUID) |
|
||||
| `_airbyte_emitted_at` | `timestamp-millis` | [link](http://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29) |
|
||||
|
||||
7. Currently `additionalProperties` is not supported. This means if the source is schemaless (e.g. Mongo), or has flexible fields, they will be ignored. We will have a solution soon. Feel free to submit a new issue if this is blocking for you.
|
||||
10. Currently `additionalProperties` is not supported. This means if the source is schemaless \(e.g. Mongo\), or has flexible fields, they will be ignored. We will have a solution soon. Feel free to submit a new issue if this is blocking for you.
|
||||
|
||||
For example, given the following Json schema:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"type": "object",
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
@@ -207,7 +205,7 @@ For example, given the following Json schema:
|
||||
|
||||
Its corresponding Avro schema will be:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"name" : "stream_name",
|
||||
"type" : "record",
|
||||
@@ -254,18 +252,18 @@ Its corresponding Avro schema will be:
|
||||
|
||||
### CSV
|
||||
|
||||
Like most of the other Airbyte destination connectors, the output usually has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob into multiple columns.
|
||||
Like most of the other Airbyte destination connectors, the output usually has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize \(flatten\) the data blob into multiple columns.
|
||||
|
||||
| Column | Condition | Description |
|
||||
| :--- | :--- | :--- |
|
||||
| `_airbyte_ab_id` | Always exists | A uuid assigned by Airbyte to each processed record. |
|
||||
| `_airbyte_emitted_at` | Always exists. | A timestamp representing when the event was pulled from the data source. |
|
||||
| `_airbyte_data` | When no normalization (flattening) is needed, all data reside under this column as a json blob. |
|
||||
| root level fields | When root level normalization (flattening) is selected, the root level fields are expanded. |
|
||||
| `_airbyte_data` | When no normalization \(flattening\) is needed, all data reside under this column as a json blob. | |
|
||||
| root level fields | When root level normalization \(flattening\) is selected, the root level fields are expanded. | |
|
||||
|
||||
For example, given the following json object from a source:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"user_id": 123,
|
||||
"name": {
|
||||
@@ -287,11 +285,11 @@ With root level normalization, the output CSV is:
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `26d73cde-7eb1-4e1e-b7db-a4c03b4cf206` | 1622135805000 | 123 | `{ "first": "John", "last": "Doe" }` |
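As an illustrative sketch (not the connector's implementation), root-level flattening amounts to promoting the top-level keys of the JSON blob to their own columns while nested objects stay serialized:

```python
import csv
import io
import json
import time
import uuid

record = {"user_id": 123, "name": {"first": "John", "last": "Doe"}}

row = {
    "_airbyte_ab_id": str(uuid.uuid4()),
    "_airbyte_emitted_at": int(time.time() * 1000),
}
for key, value in record.items():  # expand only the root level fields
    row[key] = json.dumps(value) if isinstance(value, (dict, list)) else value

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```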
|
||||
|
||||
### JSON Lines (JSONL)
|
||||
### JSON Lines \(JSONL\)
|
||||
|
||||
[Json Lines](https://jsonlines.org/) is a text format with one JSON per line. Each line has a structure as follows:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
{
|
||||
"_airbyte_ab_id": "<uuid>",
|
||||
"_airbyte_emitted_at": "<timestamp-in-millis>",
|
||||
@@ -301,7 +299,7 @@ With root level normalization, the output CSV is:
|
||||
|
||||
For example, given the following two json objects from a source:
|
||||
|
||||
```json
|
||||
```javascript
|
||||
[
|
||||
{
|
||||
"user_id": 123,
|
||||
@@ -322,7 +320,7 @@ For example, given the following two json objects from a source:
|
||||
|
||||
They will be like this in the output file:
|
||||
|
||||
```jsonl
|
||||
```text
|
||||
{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
|
||||
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
|
||||
```
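A minimal sketch of producing this layout (purely illustrative; the connector itself is written in Java):

```python
import json
import time
import uuid

records = [
    {"user_id": 123, "name": {"first": "John", "last": "Doe"}},
    {"user_id": 456, "name": {"first": "Jane", "last": "Roe"}},
]

with open("stream.jsonl", "w") as out:
    for record in records:
        line = {
            "_airbyte_ab_id": str(uuid.uuid4()),
            "_airbyte_emitted_at": int(time.time() * 1000),
            "_airbyte_data": record,
        }
        out.write(json.dumps(line) + "\n")  # one JSON object per line
```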
|
||||
@@ -336,19 +334,19 @@ The following configuration is available to configure the Parquet output:
|
||||
| Parameter | Type | Default | Description |
|
||||
| :--- | :---: | :---: | :--- |
|
||||
| `compression_codec` | enum | `UNCOMPRESSED` | **Compression algorithm**. Available candidates are: `UNCOMPRESSED`, `SNAPPY`, `GZIP`, `LZO`, `BROTLI`, `LZ4`, and `ZSTD`. |
|
||||
| `block_size_mb` | integer | 128 (MB) | **Block size (row group size)** in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
|
||||
| `max_padding_size_mb` | integer | 8 (MB) | **Max padding size** in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
|
||||
| `page_size_kb` | integer | 1024 (KB) | **Page size** in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
|
||||
| `dictionary_page_size_kb` | integer | 1024 (KB) | **Dictionary Page Size** in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary. |
|
||||
| `block_size_mb` | integer | 128 \(MB\) | **Block size \(row group size\)** in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
|
||||
| `max_padding_size_mb` | integer | 8 \(MB\) | **Max padding size** in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
|
||||
| `page_size_kb` | integer | 1024 \(KB\) | **Page size** in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
|
||||
| `dictionary_page_size_kb` | integer | 1024 \(KB\) | **Dictionary Page Size** in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary. |
|
||||
| `dictionary_encoding` | boolean | `true` | **Dictionary encoding**. This parameter controls whether dictionary encoding is turned on. |
|
||||
|
||||
These parameters are related to the `ParquetOutputFormat`. See the [Java doc](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.0/org/apache/parquet/hadoop/ParquetOutputFormat.html) for more details. Also see [Parquet documentation](https://parquet.apache.org/documentation/latest/#configurations) for their recommended configurations (512 - 1024 MB block size, 8 KB page size).
|
||||
These parameters are related to the `ParquetOutputFormat`. See the [Java doc](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.12.0/org/apache/parquet/hadoop/ParquetOutputFormat.html) for more details. Also see [Parquet documentation](https://parquet.apache.org/documentation/latest/#configurations) for their recommended configurations \(512 - 1024 MB block size, 8 KB page size\).
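For a rough Python analogue of these knobs, the sketch below uses `pyarrow` rather than the connector's Java `ParquetOutputFormat`; pyarrow expresses page sizes in bytes and the row-group limit in rows, so the mapping to the MB/KB parameters above is only approximate:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [123, 456], "name": ["John Doe", "Jane Roe"]})

pq.write_table(
    table,
    "stream.parquet",
    compression="snappy",                    # ~ compression_codec
    use_dictionary=True,                     # ~ dictionary_encoding
    data_page_size=1024 * 1024,              # ~ page_size_kb (1024 KB)
    dictionary_pagesize_limit=1024 * 1024,   # ~ dictionary_page_size_kb
)
```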
|
||||
|
||||
#### Data schema
|
||||
|
||||
Under the hood, an Airbyte data stream in Json schema is first converted to an Avro schema, then the Json object is converted to an Avro record, and finally the Avro record is written out in the Parquet format. See the `Data schema` section from the [Avro output](#avro) for rules and limitations.
|
||||
Under the hood, an Airbyte data stream in Json schema is first converted to an Avro schema, then the Json object is converted to an Avro record, and finally the Avro record is written out in the Parquet format. See the `Data schema` section from the [Avro output](s3.md#avro) for rules and limitations.
|
||||
|
||||
## Getting Started (Airbyte Open-Source / Airbyte Cloud)
|
||||
## Getting Started \(Airbyte Open-Source / Airbyte Cloud\)
|
||||
|
||||
#### Requirements
|
||||
|
||||
@@ -356,6 +354,7 @@ Under the hood, an Airbyte data stream in Json schema is first converted to an A
|
||||
2. An S3 bucket with credentials.
|
||||
|
||||
#### Setup Guide
|
||||
|
||||
* Fill up S3 info
|
||||
* **S3 Endpoint**
|
||||
* Leave empty if using AWS S3, fill in S3 URL if using Minio S3.
|
||||
@@ -375,17 +374,18 @@ Under the hood, an Airbyte data stream in Json schema is first converted to an A
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.12 | 2021-09-13 | [#5720](https://github.com/airbytehq/airbyte/issues/5720) | Added configurable block size for stream. Each stream is limited to 10,000 by S3 |
|
||||
| 0.1.11 | 2021-09-10 | [#5729](https://github.com/airbytehq/airbyte/pull/5729) | For field names that start with a digit, a `_` will be appended at the beginning for the `Parquet` and `Avro` formats. |
|
||||
| 0.1.10 | 2021-08-17 | [#4699](https://github.com/airbytehq/airbyte/pull/4699) | Added json config validator |
|
||||
| 0.1.9 | 2021-07-12 | [#4666](https://github.com/airbytehq/airbyte/pull/4666) | Fix MinIO output for Parquet format. |
|
||||
| 0.1.8 | 2021-07-07 | [#4613](https://github.com/airbytehq/airbyte/pull/4613) | Patched schema converter to support combined restrictions. |
|
||||
| 0.1.7 | 2021-06-23 | [#4227](https://github.com/airbytehq/airbyte/pull/4227) | Added Avro and JSONL output. |
|
||||
| 0.1.6 | 2021-06-16 | [#4130](https://github.com/airbytehq/airbyte/pull/4130) | Patched the check to verify prefix access instead of full-bucket access. |
|
||||
| 0.1.5 | 2021-06-14 | [#3908](https://github.com/airbytehq/airbyte/pull/3908) | Fixed default `max_padding_size_mb` in `spec.json`. |
|
||||
| 0.1.4 | 2021-06-14 | [#3908](https://github.com/airbytehq/airbyte/pull/3908) | Added Parquet output. |
|
||||
| 0.1.3 | 2021-06-13 | [#4038](https://github.com/airbytehq/airbyte/pull/4038) | Added support for alternative S3. |
|
||||
| 0.1.2 | 2021-06-10 | [#4029](https://github.com/airbytehq/airbyte/pull/4029) | Fixed `_airbyte_emitted_at` field to be a UTC instead of local timestamp for consistency. |
|
||||
| 0.1.1 | 2021-06-09 | [#3973](https://github.com/airbytehq/airbyte/pull/3973) | Added `AIRBYTE_ENTRYPOINT` in base Docker image for Kubernetes support. |
|
||||
| 0.1.0 | 2021-06-03 | [#3672](https://github.com/airbytehq/airbyte/pull/3672) | Initial release with CSV output. |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.12 | 2021-09-13 | [\#5720](https://github.com/airbytehq/airbyte/issues/5720) | Added configurable block size for stream. Each stream is limited to 10,000 by S3 |
|
||||
| 0.1.11 | 2021-09-10 | [\#5729](https://github.com/airbytehq/airbyte/pull/5729) | For field names that start with a digit, a `_` will be appended at the beginning for the `Parquet` and `Avro` formats. |
|
||||
| 0.1.10 | 2021-08-17 | [\#4699](https://github.com/airbytehq/airbyte/pull/4699) | Added json config validator |
|
||||
| 0.1.9 | 2021-07-12 | [\#4666](https://github.com/airbytehq/airbyte/pull/4666) | Fix MinIO output for Parquet format. |
|
||||
| 0.1.8 | 2021-07-07 | [\#4613](https://github.com/airbytehq/airbyte/pull/4613) | Patched schema converter to support combined restrictions. |
|
||||
| 0.1.7 | 2021-06-23 | [\#4227](https://github.com/airbytehq/airbyte/pull/4227) | Added Avro and JSONL output. |
|
||||
| 0.1.6 | 2021-06-16 | [\#4130](https://github.com/airbytehq/airbyte/pull/4130) | Patched the check to verify prefix access instead of full-bucket access. |
|
||||
| 0.1.5 | 2021-06-14 | [\#3908](https://github.com/airbytehq/airbyte/pull/3908) | Fixed default `max_padding_size_mb` in `spec.json`. |
|
||||
| 0.1.4 | 2021-06-14 | [\#3908](https://github.com/airbytehq/airbyte/pull/3908) | Added Parquet output. |
|
||||
| 0.1.3 | 2021-06-13 | [\#4038](https://github.com/airbytehq/airbyte/pull/4038) | Added support for alternative S3. |
|
||||
| 0.1.2 | 2021-06-10 | [\#4029](https://github.com/airbytehq/airbyte/pull/4029) | Fixed `_airbyte_emitted_at` field to be a UTC instead of local timestamp for consistency. |
|
||||
| 0.1.1 | 2021-06-09 | [\#3973](https://github.com/airbytehq/airbyte/pull/3973) | Added `AIRBYTE_ENTRYPOINT` in base Docker image for Kubernetes support. |
|
||||
| 0.1.0 | 2021-06-03 | [\#3672](https://github.com/airbytehq/airbyte/pull/3672) | Initial release with CSV output. |
|
||||
|
||||
|
||||
@@ -187,12 +187,11 @@ The final query should show a `STORAGE_GCP_SERVICE_ACCOUNT` property with an ema
|
||||
|
||||
Finally, you need to add read/write permissions to your bucket with that email.
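One way to grant those permissions programmatically is sketched below with the `google-cloud-storage` client; the bucket name and service-account email are placeholders, and the same change can equally be made in the Cloud Console:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-gcs-staging-bucket")  # placeholder bucket name

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectAdmin",  # read/write on objects
    "members": {"serviceAccount:sf-service-account@example.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)
```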
|
||||
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.3.14 | 2021-09-08 | [#5924](https://github.com/airbytehq/airbyte/pull/5924) | Fixed AWS S3 Staging COPY is writing records from different table in the same raw table |
|
||||
| 0.3.13 | 2021-09-01 | [#5784](https://github.com/airbytehq/airbyte/pull/5784) | Updated query timeout from 30 minutes to 3 hours |
|
||||
| 0.3.12 | 2021-07-30 | [#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.3.11 | 2021-07-21 | [#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.3.10 | 2021-07-12 | [#4713](https://github.com/airbytehq/airbyte/pull/4713)| Tag traffic with `airbyte` label to enable optimization opportunities from Snowflake |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.3.14 | 2021-09-08 | [\#5924](https://github.com/airbytehq/airbyte/pull/5924) | Fixed AWS S3 Staging COPY is writing records from different table in the same raw table |
|
||||
| 0.3.13 | 2021-09-01 | [\#5784](https://github.com/airbytehq/airbyte/pull/5784) | Updated query timeout from 30 minutes to 3 hours |
|
||||
| 0.3.12 | 2021-07-30 | [\#5125](https://github.com/airbytehq/airbyte/pull/5125) | Enable `additionalPropertities` in spec.json |
|
||||
| 0.3.11 | 2021-07-21 | [\#3555](https://github.com/airbytehq/airbyte/pull/3555) | Partial Success in BufferedStreamConsumer |
|
||||
| 0.3.10 | 2021-07-12 | [\#4713](https://github.com/airbytehq/airbyte/pull/4713) | Tag traffic with `airbyte` label to enable optimization opportunities from Snowflake |
|
||||
|
||||
|
||||
@@ -56,27 +56,27 @@ Information about expected report generation waiting time you may find [here](ht
|
||||
|
||||
### Requirements
|
||||
|
||||
* client_id
|
||||
* client_secret
|
||||
* refresh_token
|
||||
* client\_id
|
||||
* client\_secret
|
||||
* refresh\_token
|
||||
* scope
|
||||
* profiles
|
||||
* region
|
||||
* start_date (optional)
|
||||
* start\_date \(optional\)
|
||||
|
||||
More information on how to get client_id and client_secret can be found in the [AWS docs](https://advertising.amazon.com/API/docs/en-us/setting-up/step-1-create-lwa-app).
|
||||
More information on how to get client\_id and client\_secret can be found in the [AWS docs](https://advertising.amazon.com/API/docs/en-us/setting-up/step-1-create-lwa-app).
|
||||
|
||||
The refresh token is generated according to the standard [AWS OAuth 2.0 flow](https://developer.amazon.com/docs/login-with-amazon/conceptual-overview.html).
|
||||
|
||||
The scope usually has the value "advertising::campaign_management", but some customers may need to set it to "cpc_advertising:campaign_management".
|
||||
|
||||
The start date is used to generate reports starting from the specified date. It should be in YYYY-MM-DD format and no more than 60 days in the past. If it is not specified, today's date is used. The date for a specific profile is calculated according to its timezone, so this parameter should be specified in UTC. Since it makes no sense to generate a report for the current day (the metrics could still change), the report is generated for the day before (e.g. if start_date is 2021-10-11, 20211010 would be used as the reportDate parameter for the request).
|
||||
The scope usually has the value "advertising::campaign\_management", but some customers may need to set it to "cpc\_advertising:campaign\_management".
|
||||
|
||||
The start date is used to generate reports starting from the specified date. It should be in YYYY-MM-DD format and no more than 60 days in the past. If it is not specified, today's date is used. The date for a specific profile is calculated according to its timezone, so this parameter should be specified in UTC. Since it makes no sense to generate a report for the current day \(the metrics could still change\), the report is generated for the day before \(e.g. if start\_date is 2021-10-11, 20211010 would be used as the reportDate parameter for the request\).
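As an illustrative sketch (these names are not the connector's internals), the day-before behaviour described above could be computed per profile like this:

```python
from datetime import datetime, timedelta
from typing import Optional
from zoneinfo import ZoneInfo

def report_date(start_date: Optional[str], profile_tz: str) -> str:
    """Return the reportDate request parameter (YYYYMMDD) for one profile."""
    if start_date:
        day = datetime.strptime(start_date, "%Y-%m-%d").date()
    else:
        # No start_date configured: fall back to "today" in the profile's timezone.
        day = datetime.now(ZoneInfo(profile_tz)).date()
    return (day - timedelta(days=1)).strftime("%Y%m%d")

print(report_date("2021-10-11", "America/Los_Angeles"))  # 20211010
```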
|
||||
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| `0.1.2` | 2021-10-01 | [#6367](https://github.com/airbytehq/airbyte/pull/6461) | `Add option to pull data for different regions. Add option to choose profiles we want to pull data. Add lookback` |
|
||||
| `0.1.1` | 2021-09-22 | [#6367](https://github.com/airbytehq/airbyte/pull/6367) | `Add seller and vendor filters to profiles stream` |
|
||||
| `0.1.0` | 2021-08-13 | [#5023](https://github.com/airbytehq/airbyte/pull/5023) | `Initial version` |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `0.1.2` | 2021-10-01 | [\#6367](https://github.com/airbytehq/airbyte/pull/6461) | `Add option to pull data for different regions. Add option to choose profiles we want to pull data. Add lookback` |
|
||||
| `0.1.1` | 2021-09-22 | [\#6367](https://github.com/airbytehq/airbyte/pull/6367) | `Add seller and vendor filters to profiles stream` |
|
||||
| `0.1.0` | 2021-08-13 | [\#5023](https://github.com/airbytehq/airbyte/pull/5023) | `Initial version` |
|
||||
|
||||
|
||||
@@ -8,18 +8,17 @@ This source can sync data for the [Amazon Seller Partner API](https://github.com
|
||||
|
||||
This source is capable of syncing the following streams:
|
||||
|
||||
* [GET_FLAT_FILE_ALL_ORDERS_DATA_BY_ORDER_DATE_GENERAL](https://sellercentral.amazon.com/gp/help/help.html?itemID=201648780)
|
||||
* [GET_MERCHANT_LISTINGS_ALL_DATA](https://github.com/amzn/selling-partner-api-docs/blob/main/references/reports-api/reporttype-values.md#inventory-reports)
|
||||
* [GET_FBA_INVENTORY_AGED_DATA](https://sellercentral.amazon.com/gp/help/200740930)
|
||||
* [GET_AMAZON_FULFILLED_SHIPMENTS_DATA_GENERAL](https://sellercentral.amazon.com/gp/help/help.html?itemID=200453120)
|
||||
* [GET_FLAT_FILE_OPEN_LISTINGS_DATA](https://github.com/amzn/selling-partner-api-docs/blob/main/references/reports-api/reporttype-values.md#inventory-reports)
|
||||
* [GET_FBA_FULFILLMENT_REMOVAL_ORDER_DETAIL_DATA](https://sellercentral.amazon.com/gp/help/help.html?itemID=200989110)
|
||||
* [GET_FBA_FULFILLMENT_REMOVAL_SHIPMENT_DETAIL_DATA](https://sellercentral.amazon.com/gp/help/help.html?itemID=200989100)
|
||||
* [GET_VENDOR_INVENTORY_HEALTH_AND_PLANNING_REPORT](https://github.com/amzn/selling-partner-api-docs/blob/main/references/reports-api/reporttype-values.md#vendor-retail-analytics-reports)
|
||||
* [Orders](https://github.com/amzn/selling-partner-api-docs/blob/main/references/orders-api/ordersV0.md) (incremental)
|
||||
* [GET\_FLAT\_FILE\_ALL\_ORDERS\_DATA\_BY\_ORDER\_DATE\_GENERAL](https://sellercentral.amazon.com/gp/help/help.html?itemID=201648780)
|
||||
* [GET\_MERCHANT\_LISTINGS\_ALL\_DATA](https://github.com/amzn/selling-partner-api-docs/blob/main/references/reports-api/reporttype-values.md#inventory-reports)
|
||||
* [GET\_FBA\_INVENTORY\_AGED\_DATA](https://sellercentral.amazon.com/gp/help/200740930)
|
||||
* [GET\_AMAZON\_FULFILLED\_SHIPMENTS\_DATA\_GENERAL](https://sellercentral.amazon.com/gp/help/help.html?itemID=200453120)
|
||||
* [GET\_FLAT\_FILE\_OPEN\_LISTINGS\_DATA](https://github.com/amzn/selling-partner-api-docs/blob/main/references/reports-api/reporttype-values.md#inventory-reports)
|
||||
* [GET\_FBA\_FULFILLMENT\_REMOVAL\_ORDER\_DETAIL\_DATA](https://sellercentral.amazon.com/gp/help/help.html?itemID=200989110)
|
||||
* [GET\_FBA\_FULFILLMENT\_REMOVAL\_SHIPMENT\_DETAIL\_DATA](https://sellercentral.amazon.com/gp/help/help.html?itemID=200989100)
|
||||
* [GET\_VENDOR\_INVENTORY\_HEALTH\_AND\_PLANNING\_REPORT](https://github.com/amzn/selling-partner-api-docs/blob/main/references/reports-api/reporttype-values.md#vendor-retail-analytics-reports)
|
||||
* [Orders](https://github.com/amzn/selling-partner-api-docs/blob/main/references/orders-api/ordersV0.md) \(incremental\)
|
||||
* [VendorDirectFulfillmentShipping](https://github.com/amzn/selling-partner-api-docs/blob/main/references/vendor-direct-fulfillment-shipping-api/vendorDirectFulfillmentShippingV1.md)
|
||||
|
||||
|
||||
### Data type mapping
|
||||
|
||||
| Integration Type | Airbyte Type | Notes |
|
||||
@@ -47,14 +46,14 @@ Information about rate limits you may find [here](https://github.com/amzn/sellin
|
||||
|
||||
### Requirements
|
||||
|
||||
* replication_start_date
|
||||
* refresh_token
|
||||
* lwa_app_id
|
||||
* lwa_client_secret
|
||||
* aws_access_key
|
||||
* aws_secret_key
|
||||
* role_arn
|
||||
* aws_environment
|
||||
* replication\_start\_date
|
||||
* refresh\_token
|
||||
* lwa\_app\_id
|
||||
* lwa\_client\_secret
|
||||
* aws\_access\_key
|
||||
* aws\_secret\_key
|
||||
* role\_arn
|
||||
* aws\_environment
|
||||
* region
|
||||
|
||||
### Setup guide
|
||||
@@ -64,8 +63,9 @@ Information about how to get credentials you may find [here](https://github.com/
|
||||
## CHANGELOG
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| `0.2.1` | 2021-09-17 | [#5248](https://github.com/airbytehq/airbyte/pull/5248) | `Added extra stream support. Updated reports streams logics` |
|
||||
| `0.2.0` | 2021-08-06 | [#4863](https://github.com/airbytehq/airbyte/pull/4863) | `Rebuild source with airbyte-cdk` |
|
||||
| `0.1.3` | 2021-06-23 | [#4288](https://github.com/airbytehq/airbyte/pull/4288) | `Bugfix failing connection check` |
|
||||
| `0.1.2` | 2021-06-15 | [#4108](https://github.com/airbytehq/airbyte/pull/4108) | `Fixed: Sync fails with timeout when create report is CANCELLED` |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `0.2.1` | 2021-09-17 | [\#5248](https://github.com/airbytehq/airbyte/pull/5248) | `Added extra stream support. Updated reports streams logics` |
|
||||
| `0.2.0` | 2021-08-06 | [\#4863](https://github.com/airbytehq/airbyte/pull/4863) | `Rebuild source with airbyte-cdk` |
|
||||
| `0.1.3` | 2021-06-23 | [\#4288](https://github.com/airbytehq/airbyte/pull/4288) | `Bugfix failing connection check` |
|
||||
| `0.1.2` | 2021-06-15 | [\#4108](https://github.com/airbytehq/airbyte/pull/4108) | `Fixed: Sync fails with timeout when create report is CANCELLED` |
|
||||
|
||||
|
||||
@@ -43,8 +43,9 @@ Please read [How to get your API key and Secret key](https://help.amplitude.com/
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.1.2 | 2021-09-21 | [6353](https://github.com/airbytehq/airbyte/pull/6353) | Correct output schemas on cohorts, events, active_users, and average_session_lengths streams |
|
||||
| 0.1.1 | 2021-06-09 | [3973](https://github.com/airbytehq/airbyte/pull/3973) | Add AIRBYTE_ENTRYPOINT for kubernetes support |
|
||||
| 0.1.0 | 2021-06-08 | [3664](https://github.com/airbytehq/airbyte/pull/3664) | New Source: Amplitude |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.2 | 2021-09-21 | [6353](https://github.com/airbytehq/airbyte/pull/6353) | Correct output schemas on cohorts, events, active\_users, and average\_session\_lengths streams |
|
||||
| 0.1.1 | 2021-06-09 | [3973](https://github.com/airbytehq/airbyte/pull/3973) | Add AIRBYTE\_ENTRYPOINT for kubernetes support |
|
||||
| 0.1.0 | 2021-06-08 | [3664](https://github.com/airbytehq/airbyte/pull/3664) | New Source: Amplitude |
|
||||
|
||||
|
||||
@@ -1,23 +1,20 @@
|
||||
---
|
||||
description: >-
|
||||
Web scraping and automation platform.
|
||||
description: Web scraping and automation platform.
|
||||
---
|
||||
|
||||
# Apify dataset
|
||||
# Apify Dataset
|
||||
|
||||
## Overview
|
||||
|
||||
[Apify](https://www.apify.com) is a web scraping and web automation platform providing both ready-made and custom solutions, an open-source [SDK](https://sdk.apify.com/) for web scraping, proxies, and many other tools to help you build and run web automation jobs at scale.
|
||||
|
||||
The results of a scraping job are usually stored in [Apify Dataset](https://docs.apify.com/storage/dataset). This Airbyte connector allows you
|
||||
to automatically sync the contents of a dataset to your chosen destination using Airbyte.
|
||||
The results of a scraping job are usually stored in [Apify Dataset](https://docs.apify.com/storage/dataset). This Airbyte connector allows you to automatically sync the contents of a dataset to your chosen destination using Airbyte.
|
||||
|
||||
To sync data from a dataset, all you need to know is its ID. You will find it in [Apify console](https://my.apify.com/) under storages.
|
||||
|
||||
### Running Airbyte sync from Apify webhook
|
||||
When your Apify job (aka [actor run](https://docs.apify.com/actors/running)) finishes, it can trigger an Airbyte sync by calling the Airbyte
|
||||
[API](https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/sync) manual
|
||||
connection trigger (`POST /v1/connections/sync`). The API can be called from an Apify [webhook](https://docs.apify.com/webhooks) which is
|
||||
executed when your Apify run finishes.
|
||||
|
||||
When your Apify job \(aka [actor run](https://docs.apify.com/actors/running)\) finishes, it can trigger an Airbyte sync by calling the Airbyte [API](https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/sync) manual connection trigger \(`POST /v1/connections/sync`\). The API can be called from an Apify [webhook](https://docs.apify.com/webhooks) which is executed when your Apify run finishes.
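As a hypothetical example of what the webhook's target could do, the snippet below calls the manual sync trigger with Python `requests`; the Airbyte URL and connection id are placeholders, and the exact API path may differ for your deployment:

```python
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"            # assumed Airbyte server URL
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder connection id

response = requests.post(
    f"{AIRBYTE_API}/connections/sync",
    json={"connectionId": CONNECTION_ID},
)
response.raise_for_status()
print(response.json())  # details of the sync job that was kicked off
```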
|
||||
|
||||

|
||||
|
||||
@@ -43,6 +40,8 @@ The Apify dataset connector uses [Apify Python Client](https://docs.apify.com/ap
|
||||
* Apify [dataset](https://docs.apify.com/storage/dataset) ID
|
||||
|
||||
### Changelog
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.1.0 | 2021-07-29 | [PR#5069](https://github.com/airbytehq/airbyte/pull/5069) | Initial version of the connector |
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.0 | 2021-07-29 | [PR\#5069](https://github.com/airbytehq/airbyte/pull/5069) | Initial version of the connector |
|
||||
|
||||
|
||||
@@ -60,6 +60,7 @@ Generate/Find all requirements using this [external article](https://leapfin.com
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.2.4 | 2021-07-06 | [4539](https://github.com/airbytehq/airbyte/pull/4539) | Add `AIRBYTE_ENTRYPOINT` for Kubernetes support |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.2.4 | 2021-07-06 | [4539](https://github.com/airbytehq/airbyte/pull/4539) | Add `AIRBYTE_ENTRYPOINT` for Kubernetes support |
|
||||
|
||||
|
||||
@@ -2,8 +2,7 @@
|
||||
|
||||
## Sync overview
|
||||
|
||||
This source can sync data for the [Asana API](https://developers.asana.com/docs). It supports only Full Refresh syncs.
|
||||
|
||||
This source can sync data for the [Asana API](https://developers.asana.com/docs). It supports only Full Refresh syncs.
|
||||
|
||||
### Output schema
|
||||
|
||||
@@ -53,15 +52,14 @@ The Asana connector should not run into Asana API limitations under normal usage
|
||||
|
||||
### Setup guide
|
||||
|
||||
Please follow these [steps](https://developers.asana.com/docs/personal-access-token)
|
||||
to obtain a Personal Access Token for your account.
|
||||
|
||||
Please follow these [steps](https://developers.asana.com/docs/personal-access-token) to obtain a Personal Access Token for your account.
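Once you have the token, a quick way to verify it outside of Airbyte is to call the Asana API directly; the token value below is a placeholder:

```python
import requests

PAT = "<your-personal-access-token>"  # placeholder

resp = requests.get(
    "https://app.asana.com/api/1.0/users/me",
    headers={"Authorization": f"Bearer {PAT}"},
)
resp.raise_for_status()
print(resp.json()["data"]["name"])  # the account the token belongs to
```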
|
||||
|
||||
## Changelog
|
||||
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :------ | :-------- | :----- | :------ |
|
||||
| 0.1.3 | 2021-10-06 | [](https://github.com/airbytehq/airbyte/pull/) | Add oauth init flow parameters support |
|
||||
| 0.1.2 | 2021-09-24 | [6402](https://github.com/airbytehq/airbyte/pull/6402) | Fix SAT tests: update schemas and invalid_config.json file |
|
||||
| 0.1.1 | 2021-06-09 | [3973](https://github.com/airbytehq/airbyte/pull/3973) | Add entrypoint and bump version for connector |
|
||||
| 0.1.0 | 2021-05-25 | [3510](https://github.com/airbytehq/airbyte/pull/3510) | New Source: Asana |
|
||||
| Version | Date | Pull Request | Subject |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| 0.1.3 | 2021-10-06 | | Add oauth init flow parameters support |
|
||||
| 0.1.2 | 2021-09-24 | [6402](https://github.com/airbytehq/airbyte/pull/6402) | Fix SAT tests: update schemas and invalid\_config.json file |
|
||||
| 0.1.1 | 2021-06-09 | [3973](https://github.com/airbytehq/airbyte/pull/3973) | Add entrypoint and bump version for connector |
|
||||
| 0.1.0 | 2021-05-25 | [3510](https://github.com/airbytehq/airbyte/pull/3510) | New Source: Asana |
|
||||
|
||||
|
||||