11059 multi-instance, versioned docs (#58095)
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
docs/platform/connector-development/README.md
@@ -0,0 +1,35 @@
# Connector Development

If you'd like to build a connector that doesn't yet exist in Airbyte's catalog, in most cases you should use [Connector Builder](./connector-builder-ui/overview.md)!

Builder works for most API source connectors as long as you can read the data with HTTP requests (REST, GraphQL) and get results in JSON or JSONL formats; CSV and XML support is coming soon.

In rare cases when you need something more complex, you can use the Low-Code CDK directly. Other options and SDKs are described below.

:::note

Before building a new connector, review [Airbyte's data protocol specification](../understanding-airbyte/airbyte-protocol.md). As you begin, you should also familiarize yourself with our guide to [Best Practices for Connector Development](./best-practices.md).

If you need support along the way, visit the [Slack channel](https://airbytehq.slack.com/archives/C027KKE4BCZ) we have dedicated to helping users with connector development, where you can search previous discussions or ask a question of your own.

:::

### Process overview

1. **Pick the technology and build**. The first step in creating a new connector is to choose the tools you'll use to build it. For _most_ cases, you should start in Connector Builder.
2. **Publish as a custom connector**. After building and testing your connector, you'll need to publish it. This makes it available in your workspace. At that point, you can use the connector you've built to move some data!
3. **Contribute back to Airbyte**. If you want to contribute what you've built to the Airbyte Cloud and OSS connector catalog, follow the steps provided in the [contribution guide for submitting new connectors](../contributing-to-airbyte/submit-new-connector.md).

### Connector development options

| Tool | Description |
| --- | --- |
| [Connector Builder](./connector-builder-ui/overview.md) | We recommend Connector Builder for developing a connector for an API source. If you're using Airbyte Cloud, no local developer environment is required to create a new connection with the Connector Builder because you configure it directly in the Airbyte web UI. This tool guides you through creating and testing a connection. Refer to our [tutorial](./connector-builder-ui/tutorial.mdx) on the Connector Builder to guide you through the basics. |
| [Low Code Connector Development Kit (CDK)](./config-based/low-code-cdk-overview.md) | This framework lets you build source connectors for HTTP API sources. The Low-code CDK is a declarative framework that allows you to describe the connector using a [YAML schema](./schema-reference) without writing Python code. It's flexible enough to include [custom Python components](./config-based/advanced-topics/custom-components.md) in conjunction with this method if necessary. |
| [Python Connector Development Kit (CDK)](./cdk-python/basic-concepts.md) | While this method provides the most flexibility to developers, it also requires the most code and maintenance. This library provides classes that work out-of-the-box for most scenarios you'll encounter, along with generators to create the connector scaffolds for you. We maintain an [in-depth guide](./tutorials/custom-python-connector/0-getting-started.md) to building a connector using the Python CDK. |
| [Java CDK](./tutorials/building-a-java-destination.md) | If you're building a source or a destination against a traditional database (not an HTTP API, not a vector database), you should use the Java CDK instead. |

### Community maintained CDKs

- The [Typescript CDK](https://github.com/faros-ai/airbyte-connectors) is actively maintained by Faros.ai for use in their product.
- The [Airbyte Dotnet CDK](https://github.com/mrhamburg/airbyte.cdk.dotnet) in C#.
docs/platform/connector-development/best-practices.md
@@ -0,0 +1,54 @@
# Best Practices

To guarantee the highest quality for connectors, we've compiled the following best practices for connector development. Connectors which follow these best practices will be labelled as "Airbyte Certified" to indicate they've passed a high quality bar and will perform reliably in all production use cases. Following these guidelines is **not required** for your contribution to Airbyte to be accepted, as they add a barrier to entry for contribution \(though adopting them certainly doesn't hurt!\).

## Principles of developing connectors

1. **Reliability + usability > more features.** It is better to support 1 feature that works reliably and has a great UX than 2 that are unreliable or hard to use. One solid connector is better than 2 finicky ones.
2. **Fail fast.** A user should not be able to configure something that will not work.
3. **Fail actionably.** If a failure is actionable by the user, clearly let them know what they can do. Otherwise, make it very easy for them to give us necessary debugging information \(logs etc.\)

From these principles we extrapolate the following goals for connectors, in descending priority order:

1. **Correct user input should result in a successful sync.** If there is an issue, it should be extremely easy for the user to see and report.
2. **Issues arising from bad user input should print an actionable error message.** "Invalid credentials" is not an actionable message. "Please verify your username/password is correct" is better.
3. **Wherever possible, a connector should support incremental sync.** This prevents excessive load on the underlying data source.
4. **When running a sync, a connector should communicate its status frequently to provide clear feedback that it is working.** Output a log message at least every 5 minutes.
5. **A connector should allow reading or writing as many entities as is feasible.** Supporting syncing all entities from an API is preferred to only supporting a small subset which would satisfy narrow use cases. Similarly, a database should support as many data types as is feasible.

Note that in the above list, the _least_ important consideration is the number of features a connector has \(e.g: whether an API connector supports all entities in the API\). The most important thing is that for its declared features, it is reliable and usable. The only exceptions are "minimum viability" features, e.g: for some sources, it's not feasible to pull data without incremental sync due to rate limiting issues. In that case, those are considered usability issues.

## Quality certification checklist

When reviewing connectors, we'll use the following "checklist" to verify whether the connector is considered "Airbyte certified" or closer to beta or alpha:

### Integration Testing

**As much as possible, prove functionality via testing**. This means slightly different things depending on the type of connector:

- **All connectors** must test all the sync modes they support during integration tests
- **Database connectors** should test that they can replicate **all** supported data types in both `read` and `discover` operations
- **API connectors** should validate that every stream outputs records
  - If this causes rate limiting problems, there should be a periodic CI build which tests this on a less frequent cadence to avoid rate limiting

**Thoroughly test edge cases.** While Airbyte provides a [Standard Test Suite](testing-connectors/connector-acceptance-tests-reference.md) that all connectors must pass, it's not possible for the standard test suite to cover all edge cases. When in doubt about whether the standard tests provide sufficient evidence of functionality, write a custom test case for your connector.

### Check Connection

- **Verify permissions upfront**. The "check connection" operation should verify any necessary permissions upfront e.g: the provided API token has read access to the API entities.
  - In some cases it's not possible to verify permissions without knowing which streams the user wants to replicate. For example, a provided API token only needs read access to the "Employees" entity if the user wants to replicate the "Employees" stream. In this case, the CheckConnection operation should verify the minimum needed requirements \(e.g: the API token exists\), and the "read" or "write" operation should verify all needed permissions based on the provided catalog, failing if a required permission is not granted.
- **Provide actionable feedback for incorrect input.**
  - Examples of non actionable error messages
    - "Can't connect". The only recourse this gives the user is to guess whether they need to dig through logs or guess which field of their input configuration is incorrect.
  - Examples of actionable error messages
    - "Your username/password combination is incorrect"
    - "Unable to reach Database host: please verify that there are no firewall rules preventing Airbyte from connecting to the database"
    - etc...

### Rate Limiting

Most APIs enforce rate limits. Your connector should gracefully handle those \(i.e: without failing the connector process\). The most common way to handle rate limits is to implement backoff.

## Maintaining connectors

Once a connector has been published for use within Airbyte, we must take special care to account for the customer impact of updates to the connector.
docs/platform/connector-development/cdk-python/README.md
@@ -0,0 +1,48 @@
# Connector Development Kit

:::info
This section is for the Python CDK. See our [community-maintained CDKs section](../README.md#community-maintained-cdks) if you want to write connectors in other languages.
:::

The Airbyte Python CDK is a framework for rapidly developing production-grade Airbyte connectors. The CDK currently offers helpers specific for creating Airbyte source connectors for:

- HTTP APIs \(REST APIs, GraphQL, etc..\)
- Generic Python sources \(anything not covered by the above\)

This document is a general introduction to the CDK. Readers should have basic familiarity with the [Airbyte Specification](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol/) before proceeding.

If you have any issues with troubleshooting or want to learn more about the CDK from the Airbyte team, head to [the Connector Development section of our Airbyte Forum](https://github.com/airbytehq/airbyte/discussions) to inquire further!

## Getting Started

In most cases, you won't need to use the CDK directly, and should start building connectors in Connector Builder, an IDE that is powered by the Airbyte Python CDK. If you do need customization beyond what it offers, you can do so by using `airbyte_cdk` as a dependency in your Python project.

[Airbyte CDK reference documentation](https://airbytehq.github.io/airbyte-python-cdk/airbyte_cdk.html) is published automatically with each new CDK release. The rest of this document explains the most basic concepts applicable to any Airbyte API connector.

### Concepts & Documentation

#### Basic Concepts

If you want to learn more about the classes required to implement an Airbyte Source, head to our [basic concepts doc](basic-concepts.md).

#### Full Refresh Streams

If you have questions or are running into issues creating your first full refresh stream, head over to our [full refresh stream doc](full-refresh-stream.md). If you have questions about implementing a `path` or `parse_response` function, this doc is for you.

#### Incremental Streams

Having trouble figuring out how to write a `stream_slices` function or aren't sure what a `cursor_field` is? Head to our [incremental stream doc](incremental-stream.md).

#### Practical Tips

You can find a complete tutorial for implementing an HTTP source connector in [this tutorial](../tutorials/custom-python-connector/0-getting-started.md).

## Contributing

We welcome all contributions to the Airbyte Python CDK! The CONTRIBUTING.md in the [`airbytehq/airbyte-python-cdk` GitHub repository](https://github.com/airbytehq/airbyte-python-cdk) is the best place to find an up-to-date guide on how to get started.
@@ -0,0 +1,52 @@
# Basic Concepts

## The Airbyte Specification

As a quick recap, the Airbyte Specification requires an Airbyte Source to support 4 distinct operations:

| Operation | Description |
| --- | --- |
| `Spec` | The required configuration in order to interact with the underlying technical system e.g. database information, authentication information etc. |
| `Check` | Validate that the provided configuration is valid with sufficient permissions for one to perform all required operations on the Source. |
| `Discover` | Discover the Source's schema. This lets users select which subset of the data to sync. Useful if users require only a subset of the data. |
| `Read` | Perform the actual syncing process. Data is read from the Source, parsed into `AirbyteRecordMessage`s and sent to the Airbyte Destination. Depending on how the Source is implemented, this sync can be incremental or a full-refresh. |

A core concept discussed here is the **Source**.

The Source contains one or more **Streams** \(or **Airbyte Streams**\). A **Stream** is the other concept key to understanding how Airbyte models the data syncing process. A **Stream** models the logical data groups that make up the larger **Source**. If the **Source** is an RDBMS, each **Stream** is a table. In a REST API setting, each **Stream** corresponds to one resource within the API. e.g. a **Stripe Source** would have one **Stream** for `Transactions`, one for `Charges` and so on.

## The `Source` class

Airbyte provides abstract base classes which make it much easier to perform certain categories of tasks e.g: `HttpStream` makes it easy to create HTTP API-based streams. However, if those do not satisfy your use case \(for example, if you're pulling data from a relational database\), you can always directly implement the Airbyte Protocol by subclassing the CDK's `Source` class.

The `Source` class implements the `Spec` operation by looking for a file named `spec.yaml` (or `spec.json`) in the module's root by default. This is expected to be a json schema file that specifies the required configuration. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-exchange-rates/source_exchange_rates/spec.yaml) from the Exchange Rates source.

Note that while this is the most flexible way to implement a source connector, it is also the most toilsome as you will be required to manually manage state, input validation, correctly conforming to the Airbyte Protocol message formats, and more. We recommend using a subclass of `Source`, such as the `AbstractSource` described below, unless you cannot fulfill your use case otherwise.

## The `AbstractSource` Object

`AbstractSource` is a more opinionated implementation of `Source`. It implements `Source`'s 4 methods as follows:

`Check` delegates to the `AbstractSource`'s `check_connection` function. The function's `config` parameter contains the user-provided configuration, specified in the `spec.yaml` returned by `Spec`. `check_connection` uses this configuration to validate access and permissioning. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-exchange-rates/source_exchange_rates/source.py#L90) from the same Exchange Rates API.

### The `Stream` Abstract Base Class

An `AbstractSource` also owns a set of `Stream`s. This is populated via the `AbstractSource`'s `streams` [function](https://github.com/airbytehq/airbyte-python-cdk/blob/main//airbyte_cdk/sources/abstract_source.py#L63). `Discover` and `Read` rely on this populated set.

`Discover` returns an `AirbyteCatalog` representing all the distinct resources the underlying API supports. Here is the [entrypoint](https://github.com/airbytehq/airbyte-python-cdk/blob/main//airbyte_cdk/sources/abstract_source.py#L74) for those interested in reading the code. See [schemas](https://github.com/airbytehq/airbyte/tree/21116cad97f744f936e503f9af5a59ed3ac59c38/docs/contributing-to-airbyte/python/concepts/schemas.md) for more information on how to declare the schema of a stream.

`Read` creates an in-memory stream reading from each of the `AbstractSource`'s streams. Here is the [entrypoint](https://github.com/airbytehq/airbyte-python-cdk/blob/main//airbyte_cdk/sources/abstract_source.py#L90) for those interested.

As the code examples show, the `AbstractSource` delegates to the set of `Stream`s it owns to fulfill both `Discover` and `Read`. Thus, implementing `AbstractSource`'s `streams` function is required when using the CDK.

A summary of what we've covered so far on how to use the Airbyte CDK:

- A concrete implementation of the `AbstractSource` object is required.
- This involves:
  1. Implementing the `check_connection` function.
  2. Creating the appropriate `Stream` classes and returning them in the `streams` function.
  3. Placing the above mentioned `spec.yaml` file in the right place \(a minimal sketch combining these pieces follows below\).
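
To make the summary concrete, here is a minimal sketch of a source wired together from these pieces. The `SourceExample` class, the `Customers` stream, and the `https://api.example.com` endpoint are hypothetical placeholders, not taken from a real connector; a real connector would also ship the `spec.yaml` and a schema file as described above:

```python
from typing import Any, Iterable, List, Mapping, Optional, Tuple

import requests
from airbyte_cdk.sources import AbstractSource
from airbyte_cdk.sources.streams import Stream
from airbyte_cdk.sources.streams.http import HttpStream


class Customers(HttpStream):
    # Hypothetical stream reading a single page of customers from an example API.
    url_base = "https://api.example.com/v1/"
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "customers"

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
        yield from response.json()["customers"]

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # no pagination in this sketch


class SourceExample(AbstractSource):
    def check_connection(self, logger, config: Mapping[str, Any]) -> Tuple[bool, Any]:
        # Verify access upfront; return (False, reason) with an actionable message on failure.
        try:
            requests.get(f"{Customers.url_base}customers", timeout=10).raise_for_status()
            return True, None
        except requests.RequestException as error:
            return False, f"Unable to reach the API: {error}"

    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
        return [Customers()]
```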

## HTTP Streams

We've covered how the `AbstractSource` works with the `Stream` interface in order to fulfill the Airbyte Specification. Although developers are welcome to implement their own object, the CDK saves developers the hassle of doing so in the case of HTTP APIs with the [`HTTPStream`](http-streams.md) object.
@@ -0,0 +1,48 @@
# Full Refresh Streams

As mentioned in the [Basic Concepts Overview](basic-concepts.md), a `Stream` is the atomic unit for reading data from a Source. A stream can read data from anywhere: a relational database, an API, or even scrape a web page! \(although that might be stretching the limits of what a connector should do\).

To implement a stream, there are two minimum requirements:

1. Define the stream's schema
2. Implement the logic for reading records from the underlying data source

## Defining the stream's schema

Your connector must describe the schema of each stream it can output using [JSONSchema](https://json-schema.org).

The simplest way to do this is to describe the schema of your streams using one `.json` file per stream. You can also dynamically generate the schema of your stream in code, or you can combine both approaches: start with a `.json` file and dynamically add properties to it.

The schema of a stream is the return value of `Stream.get_json_schema`.

### Static schemas

By default, `Stream.get_json_schema` reads a `.json` file in the `schemas/` directory whose name is equal to the value of the `Stream.name` property. In turn, `Stream.name` by default returns the name of the class in snake case. Therefore, if you have a class `class EmployeeBenefits(HttpStream)` the default behavior will look for a file called `schemas/employee_benefits.json`. You can override any of these behaviors as you need.

Important note: any objects referenced via `$ref` should be placed in the `shared/` directory in their own `.json` files.

### Dynamic schemas

If you'd rather define your schema in code, override `Stream.get_json_schema` in your stream class to return a `dict` describing the schema using [JSONSchema](https://json-schema.org).

### Dynamically modifying static schemas

Place a `.json` file in the `schemas` folder containing the basic schema as described in the static schemas section. Then, override `Stream.get_json_schema` to run the default behavior, edit the returned value, then return the edited value:

```python
def get_json_schema(self):
    schema = super().get_json_schema()
    schema['dynamically_determined_property'] = "property"
    return schema
```

## Reading records from the data source

If custom functionality is required for reading a stream, you may need to override `Stream.read_records`. Given some information about how the stream should be read, this method should output an iterable object containing records from the data source. We recommend using generators as they are very efficient with regards to memory requirements.
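
For illustration, here is a minimal sketch of a `read_records` override for a non-HTTP source. The `Employees` stream and the `client.list_rows` call are hypothetical stand-ins for whatever client library your data source provides:

```python
from typing import Any, Iterable, Mapping

from airbyte_cdk.sources.streams import Stream


class Employees(Stream):
    primary_key = "id"

    def __init__(self, client):
        # `client` is a hypothetical wrapper around the underlying data source.
        self._client = client

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
        # Yield records one at a time so the whole result set never sits in memory.
        for row in self._client.list_rows(table="employees"):
            yield dict(row)
```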

## Incremental Streams

We highly recommend implementing Incremental when feasible. See the [incremental streams page](incremental-stream.md) for more information.

## Resumable Full Refresh Streams

Another alternative to Incremental and Full Refresh streams is [resumable full refresh](resumable-full-refresh-stream.md). This is a stream that uses API endpoints that cannot reliably retrieve data in an incremental fashion. However, it can offer improved resilience against errors by checkpointing the stream's page number or cursor.
@@ -0,0 +1,81 @@
# HTTP-API-based Connectors

The CDK offers base classes that greatly simplify writing HTTP API-based connectors. Some of the most useful features include helper functionality for:

- Authentication \(basic auth, Oauth2, or any custom auth method\)
- Pagination
- Handling rate limiting with static or dynamic backoff timing
- Caching

All these features have sane off-the-shelf defaults but are completely customizable depending on your use case. They can also be combined with other stream features described in the [full refresh streams](full-refresh-stream.md) and [incremental streams](incremental-stream.md) sections.

## Overview of HTTP Streams

Just like any general HTTP request, the basic `HTTPStream` requires a URL to perform the request, and instructions on how to parse the resulting response.

The full request path is broken up into two parts, the base URL and the path. This makes it easy for developers to create a Source-specific base `HTTPStream` class, with the base URL filled in, and individual streams for each available HTTP resource. The [Stripe CDK implementation](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py) is a reification of this pattern.

The base URL is set via the `url_base` property, while the path is set by implementing the abstract `path` function.

The `parse_response` function instructs the stream how to parse the API response. This returns an `Iterable`, whose elements are each later transformed into an `AirbyteRecordMessage`. API routes whose response contains a single record generally have a `parse_response` function that returns a list containing just that one record. Routes that return a list usually have a `parse_response` function that returns the received list with all its elements. Pulling the data out of the response is sufficient; any deserialization is handled by the CDK.

Lastly, the `HTTPStream` must describe the schema of the records it outputs using JsonSchema. The simplest way to do this is by placing a `.json` file per stream in the `schemas` directory in the generated python module. The name of the `.json` file must match the lower snake case name of the corresponding Stream. Here are [examples](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-stripe/source_stripe/schemas) from the Stripe API.

You can also dynamically set your schema. See the [schema docs](full-refresh-stream.md#defining-the-streams-schema) for more information.

These four elements - the `url_base` property, the `path` function, the `parse_response` function and the schema file - are the bare minimum required to implement the `HTTPStream`, and can be seen in the same [Stripe example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py#L38).

This basic implementation gives us a Full-Refresh Airbyte Stream. We say Full-Refresh since the stream does not have state and will always indiscriminately read all data from the underlying API resource.
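
For example, a minimal sketch of those four elements might look like the following. The `DailyPrices` stream, its endpoint, and the response shape are hypothetical, and the matching schema file would live in `schemas/daily_prices.json`:

```python
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class DailyPrices(HttpStream):
    # Hypothetical API; the matching schema file would be schemas/daily_prices.json.
    url_base = "https://api.example.com/v1/"
    primary_key = "date"

    def path(self, **kwargs) -> str:
        # Appended to url_base to form the full request URL.
        return "prices"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        # This endpoint returns everything in one response, so never paginate.
        return None

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
        # Each element yielded here becomes an AirbyteRecordMessage.
        yield response.json()
```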

## Authentication

The CDK supports Basic and OAuth2.0 authentication via the `TokenAuthenticator` and `Oauth2Authenticator` classes respectively. Both authentication strategies are identical in that they place the API token in the `Authorization` header. The `OAuth2Authenticator` goes a step further and has mechanisms to, given a refresh token, refresh the current access token. Note that the `OAuth2Authenticator` currently only supports refresh tokens and not the full OAuth2.0 loop.

Using either authenticator is as simple as passing the created authenticator into the relevant `HTTPStream` constructor. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py#L242) from the Stripe API.
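
As a sketch, wiring a `TokenAuthenticator` into the hypothetical `DailyPrices` stream from the previous example could look like this (the `api_key` config field is an assumption):

```python
from airbyte_cdk.sources.streams.http.requests_native_auth import TokenAuthenticator

# In a real connector this value comes from the user-provided config described by spec.yaml.
config = {"api_key": "my-secret-token"}

# The authenticator adds an "Authorization: Bearer <token>" header to every request.
authenticator = TokenAuthenticator(token=config["api_key"])
stream = DailyPrices(authenticator=authenticator)
```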

## Pagination

Most APIs return large result sets in pages. The CDK accommodates paging via the `next_page_token` function. This function is meant to extract the next page "token" from the latest response. The contents of a "token" are completely up to the developer: it can be an ID, a page number, a partial URL, etc. The CDK will continue making requests as long as `next_page_token` returns non-`None` results. The token can then be used in `request_params` and other methods in `HttpStream` to page through API responses. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/streams.py#L34) from the Stripe API.
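
Here is a minimal sketch of cursor-based pagination, assuming a hypothetical endpoint that returns a `next_cursor` field and accepts a `cursor` query parameter:

```python
from typing import Any, Mapping, MutableMapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class Invoices(HttpStream):
    # Hypothetical endpoint that accepts a `cursor` query parameter and
    # returns {"data": [...], "next_cursor": "..."} in each response.
    url_base = "https://api.example.com/v1/"
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "invoices"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        cursor = response.json().get("next_cursor")
        # Returning None tells the CDK to stop requesting further pages.
        return {"cursor": cursor} if cursor else None

    def request_params(self, stream_state=None, stream_slice=None, next_page_token=None) -> MutableMapping[str, Any]:
        # Inject the token extracted by next_page_token into the next request.
        return dict(next_page_token or {})

    def parse_response(self, response: requests.Response, **kwargs):
        yield from response.json()["data"]
```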

## Rate Limiting

By default, the CDK performs exponential backoff on HTTP status code 429 and any 5XX responses, and fails after 5 tries.

Retries are governed by the `should_retry` and the `backoff_time` methods. Override these methods to customise retry behavior. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py#L72) from the Slack API.
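
As a sketch, a stream talking to a hypothetical API that returns a `Retry-After` header on 429 responses could override these methods like so (reusing the hypothetical `Invoices` stream from the pagination sketch above):

```python
from typing import Optional

import requests


class RateLimitedInvoices(Invoices):
    def should_retry(self, response: requests.Response) -> bool:
        # Retry on 429 and server errors, mirroring the CDK's default behavior.
        return response.status_code == 429 or 500 <= response.status_code < 600

    def backoff_time(self, response: requests.Response) -> Optional[float]:
        # Honor the server-provided Retry-After header; returning None falls
        # back to the CDK's default exponential backoff.
        retry_after = response.headers.get("Retry-After")
        return float(retry_after) if retry_after else None
```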

Note that Airbyte will always attempt to make as many requests as possible and only slow down if there are errors. It is not currently possible to specify a rate limit Airbyte should adhere to when making requests.

### Stream Slicing

When implementing [stream slicing](incremental-stream.md#streamstream_slices) in an `HTTPStream`, each Slice is equivalent to an HTTP request; the stream will make one request per element returned by the `stream_slices` function. The current slice being read is passed into every other method in `HttpStream` e.g: `request_params`, `request_headers`, `path`, etc., to be injected into a request. This allows `request_params`, `path`, and other functions to read the input slice and return the appropriate value.

## Nested Streams & Caching

It's possible to cache data from a stream onto a temporary file on disk.

This is especially useful when dealing with streams that depend on the results of another stream e.g: `/employees/{id}/details`. In this case, we can use caching to write the data of the parent stream to a file and use this data when the child stream synchronizes, rather than performing a full HTTP request again.

The caching mechanism works as follows: the first time a request is made, the returned value is written to disk \(all requests made by the `read_records` method are written to the cache file\). When the same request is made again, the result is read from the cache file instead of making another HTTP request. If the request is not found in the cache file, a new request is made and its result is added to the cache file.

Caching can be enabled by overriding the `use_cache` property of the `HttpStream` class to return `True`.

The caching mechanism applies to parent streams. For child streams, there is an `HttpSubStream` class inheriting from `HttpStream` and overriding the `stream_slices` method to return a generator of all parent entries.

To use caching in the parent/child relationship, perform the following steps:

1. Turn on parent stream caching by overriding the `use_cache` property.
2. Inherit the child stream class from the `HttpSubStream` class.

#### Example

```python
class Employees(HttpStream):
    ...

    @property
    def use_cache(self) -> bool:
        return True


class EmployeeDetails(HttpSubStream):
    ...
```
@@ -0,0 +1,104 @@
# Incremental Streams

An incremental Stream is a stream which reads data incrementally. That is, it only reads data that was generated or updated since the last time it ran, and is thus far more efficient than a stream which reads all the source data every time it runs. If possible, developers are encouraged to implement incremental streams to reduce sync times and resource usage.

Several new pieces are essential to understand how incrementality works with the CDK:

- `AirbyteStateMessage`
- cursor fields
- `IncrementalMixin`
- `Stream.get_updated_state` (deprecated)

as well as a few other optional concepts.

### `AirbyteStateMessage`

The `AirbyteStateMessage` persists state between syncs, and allows a new sync to pick up from where the previous sync last finished. See the [incremental sync guide](https://docs.airbyte.com/understanding-airbyte/connections/incremental-append) for more information.

### Cursor fields

The `cursor_field` refers to the field in the stream's output records used to determine the "recency" or ordering of records. An example is a `created_at` or `updated_at` field in an API or DB table.

Cursor fields can be input by the user \(e.g: a user can choose to use an auto-incrementing `id` column in a DB table\) or they can be defined by the source e.g: where an API defines that `updated_at` is what determines the ordering of records.

In the context of the CDK, setting the `Stream.cursor_field` property to any truthy value informs the framework that this stream is incremental.
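
For example, a stream fragment declaring a hypothetical `updated_at` cursor field could look like this (the other required stream methods are omitted):

```python
from airbyte_cdk.sources.streams.http import HttpStream


class Employees(HttpStream):
    # Setting cursor_field to a truthy value marks this stream as incremental.
    cursor_field = "updated_at"
    primary_key = "id"
    url_base = "https://api.example.com/v1/"
    ...
```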

### `IncrementalMixin`

This class mixin adds a `state` property with an abstract getter and setter.
The `state` attribute helps the CDK figure out the current state of the sync at any moment \(in contrast to the deprecated `Stream.get_updated_state` method\).
The setter typically deserializes state saved by the CDK and initializes the internal state of the stream.
The getter should serialize the internal state of the stream.

```python
@property
def state(self) -> Mapping[str, Any]:
    return {self.cursor_field: str(self._cursor_value)}

@state.setter
def state(self, value: Mapping[str, Any]):
    self._cursor_value = value[self.cursor_field]
```

The actual logic for updating state during reading is implemented elsewhere, usually as part of the `read_records` method, right after the latest record matching the new state has been returned.
The state therefore represents the latest checkpoint successfully achieved, and all subsequent records should match the next state after that one.

```python
def read_records(self, ...):
    ...
    yield record
    yield record
    yield record
    self._cursor_value = max(record[self.cursor_field], self._cursor_value)
    yield record
    yield record
    yield record
    self._cursor_value = max(record[self.cursor_field], self._cursor_value)
```

### `Stream.get_updated_state`

(deprecated since 1.48.0, see `IncrementalMixin`)

This function helps the stream keep track of the latest state by inspecting every record output by the stream \(as returned by the `Stream.read_records` method\) and comparing it against the most recent state object. This allows a sync to resume from where the previous sync last stopped, regardless of success or failure. This function typically compares the state object's and the latest record's cursor field, picking the latest one.

## Checkpointing state

There are two ways to checkpoint state \(i.e: control the timing of when state is saved\) while reading data from a connector:

1. Interval-based checkpointing
2. Stream Slices

### Interval-based checkpointing

This is the simplest method for checkpointing. When the interval is set to a truthy value e.g: 100, state is persisted after every 100 records output by the connector e.g: state is saved after reading 100 records, then 200, 300, etc.

While this is very simple, **it requires that records are output in ascending order with regards to the cursor field**. For example, if your stream outputs records in ascending order of the `updated_at` field, then this is a good fit for your use case. But if the stream outputs records in a random order, then you cannot use this method, because we can only be certain that we read records after a particular `updated_at` timestamp once all records have been fully read.

Interval-based checkpointing can be implemented by setting the `Stream.state_checkpoint_interval` property e.g:

```python
class MyAmazingStream(Stream):
    # Save the state every 100 records
    state_checkpoint_interval = 100
```

### `Stream.stream_slices`

Stream slices can be used to achieve finer grained control of when state is checkpointed.

Conceptually, a Stream Slice is a subset of the records in a stream which represents the smallest unit of data which can be re-synced. Once a full slice is read, an `AirbyteStateMessage` will be output, causing state to be saved. If a connector fails while reading the Nth slice of a stream, then the next time it retries, it will begin reading at the beginning of the Nth slice again, rather than re-reading slices `1...N-1`.

A Slice object is not typed, and the developer is free to include any information necessary to make the request. This function is called when the `Stream` is about to be read. Typically, the `stream_slices` function, via inspecting the state object, generates a Slice for every request to be made.

As an example, suppose an API is able to dispense data hourly. If the last sync was exactly 24 hours ago, we can either make an API call retrieving all data at once, or make 24 calls each retrieving an hour's worth of data. In the latter case, the `stream_slices` function sees that the previous state contains yesterday's timestamp, and returns a list of 24 Slices, each with a different hourly timestamp to be used when creating requests. If the stream fails halfway through \(at the 12th slice\), then the next time it starts reading, it will read from the beginning of the 12th slice.
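
A minimal sketch of that hourly pattern, assuming the state stores an ISO-formatted `updated_at` cursor; the stream name, endpoint, and field names are illustrative, and the other required stream methods are omitted:

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, Mapping

from airbyte_cdk.sources.streams.http import HttpStream


class HourlyReports(HttpStream):
    # Hypothetical incremental stream; path/parse_response/next_page_token omitted for brevity.
    url_base = "https://api.example.com/v1/"
    primary_key = "id"
    cursor_field = "updated_at"

    def stream_slices(self, sync_mode=None, cursor_field=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
        # Start from the last checkpointed cursor, or from 24 hours ago on the first sync.
        now = datetime.now(timezone.utc)
        start = (
            datetime.fromisoformat(stream_state[self.cursor_field])
            if stream_state and self.cursor_field in stream_state
            else now - timedelta(hours=24)
        )
        # Yield one slice per hour; the stream makes one request per slice and
        # state is checkpointed after each slice is fully read.
        while start < now:
            end = min(start + timedelta(hours=1), now)
            yield {"start_time": start.isoformat(), "end_time": end.isoformat()}
            start = end
```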

For a more in-depth description of stream slicing, see the [Stream Slices guide](https://github.com/airbytehq/airbyte/tree/8500fef4133d3d06e16e8b600d65ebf2c58afefd/docs/connector-development/cdk-python/stream-slices.md).

## Conclusion

In summary, an incremental stream requires:

- the `cursor_field` property
- inheriting from `IncrementalMixin` and implementing its state getter and setter
- optionally, the `stream_slices` function
@@ -0,0 +1,78 @@
# Migration Guide: How to make a Python connector use our base image

We currently require our certified Python connectors to use our [base image](https://hub.docker.com/r/airbyte/python-connector-base).
This guide will help connector developers migrate their connector to use our base image.

N.B: This guide currently only applies to Python CDK connectors.

## Prerequisite

[Install the airbyte-ci tool](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md#L1)

## Definition of a successful migration

1. The connector `Dockerfile` is removed from the connector folder
2. The connector `metadata.yaml` is referencing the latest base image in the `data.connectorBuildOptions.baseImage` key
3. The connector version is bumped by a patch increment
4. A changelog entry is added to the connector documentation file
5. The connector is successfully built and tested by our CI
6. If you add `build_customization.py` to your connector, the Connector Operations team has reviewed and approved your changes.

In order for a connector to use our base image it has to declare it in its `metadata.yaml` file under the `data.connectorBuildOptions.baseImage` key:

Example:

```yaml
connectorBuildOptions:
  baseImage: docker.io/airbyte/python-connector-base:3.0.0@sha256:1a0845ff2b30eafa793c6eee4e8f4283c2e52e1bbd44eed6cb9e9abd5d34d844
```

### Why are we using long addresses instead of tags?

**For build reproducibility!**
Using the full image address allows us to have a more deterministic build process.
If we used tags, our connector could get built with a different base image if the tag was overwritten.
In other words, using the image digest (sha256), we have the guarantee that a build on the same commit will always use the same base image.

### What if my connector needs specific system dependencies?

Declaring the base image in the `metadata.yaml` file makes the Dockerfile obsolete, and the connector will be built using our internal build process declared [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/build_image/steps/python_connectors.py#L55).
If your connector has specific system dependencies, or has to set environment variables, we have a pre/post build hook framework for that.

You can customize our build process by adding a `build_customization.py` module to your connector.
This module should contain `pre_connector_install` and `post_connector_install` async functions that will mutate the base image and the connector container respectively.
It will be imported at runtime by our build process and the functions will be called if they exist.

Here is an example of a `build_customization.py` module:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Feel free to check the dagger documentation for more information on the Container object and its methods.
    # https://dagger-io.readthedocs.io/en/sdk-python-v0.6.4/
    from dagger import Container


async def pre_connector_install(base_image_container: Container) -> Container:
    return await base_image_container.with_env_variable("MY_PRE_BUILD_ENV_VAR", "my_pre_build_env_var_value")


async def post_connector_install(connector_container: Container) -> Container:
    return await connector_container.with_env_variable("MY_POST_BUILD_ENV_VAR", "my_post_build_env_var_value")
```

### Listing migrated / non migrated connectors

To list all migrated certified connectors you can run:

```bash
airbyte-ci connectors --support-level=certified --metadata-query="data.connectorBuildOptions.baseImage is not None" list
```

To list all non migrated certified connectors you can run:

```bash
airbyte-ci connectors --metadata-query="data.supportLevel == 'certified' and 'connectorBuildOptions' not in data.keys()" list
```
@@ -0,0 +1,92 @@
# Resumable Full Refresh Streams

:::warning
This feature is currently in-development. CDK interfaces and classes relating to this feature may change without notice.
:::

A resumable full refresh stream is one that cannot offer incremental sync functionality because the API endpoint does not offer a way to retrieve data relative to a specific point in time. Being able to only fetch records after a specific timestamp (i.e. 2024-04-01) is an example of an API endpoint that supports incremental sync. An API that only supports pagination using an arbitrary page number is a candidate for resumable full refresh.

## Synthetic cursors

Unlike Incremental stream cursors, which rely on values such as a date (i.e. `2024-04-30`) to reliably partition the data retrieved from an API after the provided point, Resumable Full Refresh streams define cursors according to values like a page number or next record cursor. Some APIs provide no guarantee that the records returned for a given set of pagination parameters won't have changed relative to others in between requests. We refer to the artificial page values used to checkpoint state in between resumable full refresh sync attempts as synthetic cursors.

## Criteria for Resumable Full Refresh

:::warning
Resumable full refresh in the Python CDK does not currently support substreams. This work is currently in progress.
:::

Determining if a stream can implement checkpointing state using resumable full refresh is based on criteria of the API endpoint being used to fetch data. This can be done either by reading the API documentation or making cURL requests to the API endpoint itself:

1. The API endpoint must support pagination. If records are only returned within a single page request, there is no suitable checkpoint value. The synthetic cursor should be based on a value included in the request to fetch the next set of records.
2. When requesting a page of records, the same request should yield the same records in the response. Because RFR relies on getting records after the last checkpointed pagination cursor, it relies on the API to return roughly the same records on a subsequent attempt. An API that returns a different set of records for a specific page each time a request is made would not be compatible with RFR.

An example of an endpoint compatible with resumable full refresh is the [Hubspot GET /contacts](https://legacydocs.hubspot.com/docs/methods/contacts/get_contacts) API endpoint. This endpoint does not support getting records relative to a timestamp. However, it does allow for cursor-based pagination using `vidOffset`, and records are always returned on the same page and in the same order if a request is retried.

## Implementing Resumable Full Refresh streams

### `StateMixin`

This class mixin adds a `state` property with an abstract getter and setter.
The `state` attribute helps the CDK figure out the current state of the sync at any moment.
The setter typically deserializes state saved by the CDK and initializes the internal state of the stream.
The getter should serialize the internal state of the stream.

```python
@property
def state(self) -> Mapping[str, Any]:
    return {self.cursor_field: str(self._cursor_value)}

@state.setter
def state(self, value: Mapping[str, Any]):
    self._cursor_value = value[self.cursor_field]
```

### `Stream.read_records()`

To implement resumable full refresh, the stream must override its `Stream.read_records()` method. This implementation is responsible for the following \(see the sketch after this list\):

1. Reading the stream's current state and assigning it to `next_page_token`, which populates the pagination page parameter for the next request
2. Making the outbound API request to retrieve the next page of records
3. Transforming (if needed) and emitting each response record
4. Updating the stream's state to the page of records to retrieve using the stream's `next_page_token()` method
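
Here is a minimal sketch of those four steps against a hypothetical page-numbered endpoint; the class, endpoint, and field names are illustrative, and the state getter and setter are inlined so the example is self-contained:

```python
from typing import Any, Iterable, Mapping

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class Contacts(HttpStream):
    # Hypothetical endpoint returning {"contacts": [...], "next_page": 2}.
    url_base = "https://api.example.com/v1/"
    primary_key = "id"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._state: Mapping[str, Any] = {}

    @property
    def state(self) -> Mapping[str, Any]:
        return self._state

    @state.setter
    def state(self, value: Mapping[str, Any]):
        self._state = value

    def path(self, **kwargs) -> str:
        return "contacts"

    def next_page_token(self, response: requests.Response) -> Mapping[str, Any]:
        page = response.json().get("next_page")
        # The empty mapping {} signals that there are no more records to sync.
        return {"page": page} if page else {}

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
        yield from response.json()["contacts"]

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
        # 1. Read the stream's current state to find the next page to request.
        next_page_token = self.state or None
        # 2. Make the outbound API request for a single page of records.
        response = requests.get(f"{self.url_base}{self.path()}", params=dict(next_page_token or {}))
        # 3. Transform (if needed) and emit each response record.
        yield from self.parse_response(response)
        # 4. Update the state to the next page to retrieve, or {} when done.
        self.state = self.next_page_token(response)
```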

### State object format

In the `Stream.read_records()` implementation, the stream must structure the state object representing the next page to request according to a certain format.

Stream state that invokes a subsequent request to retrieve more records should be formatted with a single `key:value` pair:

```json
{
  "page": 25
}
```

The empty object `{}` indicates that a resumable full refresh stream has no more records to sync.

### `AirbyteStateMessage`

The `AirbyteStateMessage` persists state between sync attempts after a prior attempt fails. Subsequent sync attempts of a job can pick up from the last checkpoint of the previous one. For resumable full refresh syncs, state is passed in between sync attempts, but deleted at the beginning of new sync jobs.

## Conclusion

In summary, a resumable full refresh stream requires:

- inheriting from `StateMixin` and implementing its state getter and setter
- implementing `Stream.read_records()` to get the Stream's current state, request a single page of records, and update the Stream's state with the next page to fetch or `{}`
docs/platform/connector-development/cdk-python/schemas.md
@@ -0,0 +1,164 @@
|
||||
# Defining Stream Schemas
|
||||
|
||||
Your connector must describe the schema of each stream it can output using [JSONSchema](https://json-schema.org).
|
||||
|
||||
The simplest way to do this is to describe the schema of your streams using one `.json` file per stream. You can also dynamically generate the schema of your stream in code, or you can combine both approaches: start with a `.json` file and dynamically add properties to it.
|
||||
|
||||
The schema of a stream is the return value of `Stream.get_json_schema`.
|
||||
|
||||
## Static schemas
|
||||
|
||||
By default, `Stream.get_json_schema` reads a `.json` file in the `schemas/` directory whose name is equal to the value of the `Stream.name` property. In turn `Stream.name` by default returns the name of the class in snake case. Therefore, if you have a class `class EmployeeBenefits(HttpStream)` the default behavior will look for a file called `schemas/employee_benefits.json`. You can override any of these behaviors as you need.
|
||||
|
||||
Important note: any objects referenced via `$ref` should be placed in the `shared/` directory in their own `.json` files.
|
||||
|
||||
### Generating schemas from OpenAPI definitions
|
||||
|
||||
If you are implementing a connector to pull data from an API which publishes an [OpenAPI/Swagger spec](https://swagger.io/specification/), you can use a tool we've provided for generating JSON schemas from the OpenAPI definition file. Detailed information can be found [here](https://github.com/airbytehq/airbyte/tree/master/tools/openapi2jsonschema/).
|
||||
|
||||
### Generating schemas using the output of your connector's read command
|
||||
|
||||
We also provide a tool for generating schemas using a connector's `read` command output. Detailed information can be found [here](https://github.com/airbytehq/airbyte/tree/master/tools/schema_generator/).
|
||||
|
||||
### Backwards Compatibility
|
||||
|
||||
Because statically defined schemas explicitly define how data is represented in a destination, updates to a schema must be backwards compatible with prior versions. More information about breaking changes can be found [here](../best-practices.md#schema-breaking-changes)
|
||||
|
||||
## Dynamic schemas
|
||||
|
||||
If you'd rather define your schema in code, override `Stream.get_json_schema` in your stream class to return a `dict` describing the schema using [JSONSchema](https://json-schema.org).
|
||||
|
||||
## Dynamically modifying static schemas
|
||||
|
||||
Override `Stream.get_json_schema` to run the default behavior, edit the returned value, then return the edited value:
|
||||
|
||||
```text
|
||||
def get_json_schema(self):
|
||||
schema = super().get_json_schema()
|
||||
schema['dynamically_determined_property'] = "property"
|
||||
return schema
|
||||
```
|
||||
|
||||
## Type transformation
|
||||
|
||||
It is important to ensure output data conforms to the declared json schema. This is because the destination receiving this data to load into tables may strictly enforce schema \(e.g. when data is stored in a SQL database, you can't put CHAR type into INTEGER column\). In the case of changes to API output \(which is almost guaranteed to happen over time\) or a minor mistake in jsonschema definition, data syncs could thus break because of mismatched datatype schemas.
|
||||
|
||||
To remain robust in operation, the CDK provides a transformation ability to perform automatic object mutation to align with desired schema before outputting to the destination. All streams inherited from airbyte*cdk.sources.streams.core.Stream class have this transform configuration available. It is \_disabled* by default and can be configured per stream within a source connector.
|
||||
|
||||
### Default type transformation
|
||||
|
||||
Here's how you can configure the TypeTransformer:
|
||||
|
||||
```python
|
||||
from airbyte_cdk.sources.utils.transform import TransformConfig, Transformer
|
||||
from airbyte_cdk.sources.streams.core import Stream
|
||||
|
||||
class MyStream(Stream):
|
||||
...
|
||||
transformer = Transformer(TransformConfig.DefaultSchemaNormalization)
|
||||
...
|
||||
```
|
||||
|
||||
In this case default transformation will be applied. For example if you have schema like this
|
||||
|
||||
```javascript
|
||||
{"type": "object", "properties": {"value": {"type": "string"}}}
|
||||
```
|
||||
|
||||
and source API returned object with non-string type, it would be casted to string automaticaly:
|
||||
|
||||
```javascript
|
||||
{"value": 12} -> {"value": "12"}
|
||||
```
|
||||
|
||||
Also it works on complex types:
|
||||
|
||||
```javascript
|
||||
{"value": {"unexpected_object": "value"}} -> {"value": "{'unexpected_object': 'value'}"}
|
||||
```
|
||||
|
||||
And objects inside array of referenced by $ref attribute.
|
||||
|
||||
If the value cannot be cast \(e.g. string "asdf" cannot be casted to integer\), the field would retain its original value. Schema type transformation support any jsonschema types, nested objects/arrays and reference types. Types described as array of more than one type \(except "null"\), types under oneOf/anyOf keyword wont be transformed.
|
||||
|
||||
_Note:_ This transformation is done by the source, not the stream itself. I.e. if you have overriden "read_records" method in your stream it wont affect object transformation. All transformation are done in-place by modifing output object before passing it to "get_updated_state" method, so "get_updated_state" would receive the transformed object.
|
||||
|
||||
### Custom schema type transformation
|
||||
|
||||
Default schema type transformation performs simple type casting. Sometimes you want to perform more sophisticated transform like making "date-time" field compliant to rcf3339 standard. In this case you can use custom schema type transformation:
|
||||
|
||||
```python
|
||||
class MyStream(Stream):
|
||||
...
|
||||
transformer = Transformer(TransformConfig.CustomSchemaNormalization)
|
||||
...
|
||||
|
||||
@transformer.registerCustomTransform
|
||||
def transform_function(original_value: Any, field_schema: Dict[str, Any]) -> Any:
|
||||
# transformed_value = ...
|
||||
return transformed_value
|
||||
```
|
||||
|
||||
Where original_value is initial field value and field_schema is part of jsonschema describing field type. For schema
|
||||
|
||||
```javascript
|
||||
{"type": "object", "properties": {"value": {"type": "string", "format": "date-time"}}}
|
||||
```
|
||||
|
||||
field_schema variable would be equal to
|
||||
|
||||
```javascript
|
||||
{"type": "string", "format": "date-time"}
|
||||
```
|
||||
|
||||
In this case default transformation would be skipped and only custom transformation apply. If you want to run both default and custom transformation you can configure transdormer object by combining config flags:
|
||||
|
||||
```python
|
||||
transformer = Transformer(TransformConfig.DefaultSchemaNormalization | TransformConfig.CustomSchemaNormalization)
|
||||
```
|
||||
|
||||
In this case custom transformation will be applied after default type transformation function. Note that order of flags doesn't matter, default transformation will always be run before custom.
|
||||
|
||||
In some specific cases, you might want to make your custom transform not static, e.g. Formatting a field according to the connector configuration.
|
||||
To do so, we suggest you to declare a function to generate another, a.k.a a closure:
|
||||
|
||||
```python
|
||||
class MyStream(Stream):
|
||||
...
|
||||
transformer = TypeTransformer(TransformConfig.CustomSchemaNormalization)
|
||||
...
|
||||
def __init__(self, config_based_date_format):
|
||||
self.config_based_date_format = config_based_date_format
|
||||
transform_function = self.get_custom_transform()
|
||||
self.transformer.registerCustomTransform(transform_function)
|
||||
|
||||
def get_custom_transform(self):
|
||||
def custom_transform_function(original_value, field_schema):
|
||||
if original_value and "format" in field_schema and field_schema["format"] == "date":
|
||||
transformed_value = pendulum.from_format(original_value, self.config_based_date_format).to_date_string()
|
||||
return transformed_value
|
||||
return original_value
|
||||
return custom_transform_function
|
||||
```
|
||||
|
||||
### Performance consideration
|
||||
|
||||
Transforming each object on the fly would add some time for each object processing. This time is depends on object/schema complexity and hardware configuration.
|
||||
|
||||
There are some performance benchmarks we've done with ads_insights facebook schema \(it is complex schema with objects nested inside arrays ob object and a lot of references\) and example object. Here is the average transform time per single object, seconds:
|
||||
|
||||
```text
regular transform:
0.0008423403530008121

transform without type casting (but value still being written to dict/array):
0.000776215762666349

transform without actual value setting (but iterating through object properties):
0.0006788729513330812

just traverse/validate through json schema and object fields:
0.0006139181846665452
```
|
||||
|
||||
On my PC \(AMD Ryzen 7 5800X\) it took about 0.8 milliseconds per object. As you can see, most of the time \(~75%\) is taken by the jsonschema traverse/validation routine and very little \(less than 10%\) by the actual conversion. Processing time can be reduced by skipping jsonschema type checking, but then there would be no warnings about possible inconsistencies between an object and its jsonschema.
|
||||
@@ -0,0 +1,27 @@
|
||||
# Stream Slices
|
||||
|
||||
## Stream Slicing
|
||||
|
||||
A Stream Slice is a subset of the records in a stream.
|
||||
|
||||
When a stream is being read incrementally, Slices can be used to control when state is saved.
|
||||
|
||||
When slicing is enabled, a state message will be output by the connector after reading every slice. Slicing is completely optional and is provided as a way for connectors to checkpoint state in a more granular way than basic interval-based state checkpointing. Slicing is typically used when reading a large amount of data or when the underlying data source imposes strict rate limits that make it difficult to re-read the same data over and over again. This being said, interval-based checkpointing is compatible with slicing with one difference: intervals are counted within a slice rather than across all records. In other words, the counter used to determine if the interval has been reached \(e.g: every 10k records\) resets at the beginning of every slice.
|
||||
|
||||
The relationship between records in a slice is up to the developer, but the list of slices must be yielded in ascending order, using the cursor field as context for the ordering. This is to ensure that the state can't be updated to a timestamp that is ahead of other slices yet to be processed. Slices are typically used to implement date-based checkpointing, for example to group records generated within a particular hour, day, or month etc.
|
||||
|
||||
Slices can be hard-coded or generated dynamically \(e.g: by making a query\).
|
||||
|
||||
An important restriction imposed on slices is that they must be described with a list of `dict`s returned from the `Stream.stream_slices()` method, where each `dict` describes a slice. The `dict`s may have any schema, and are passed as input to each stream's `read_records` method. This way, the connector can read the current slice description \(the input `dict`\) and use that to make queries as needed. As described above, this list of dicts must be in appropriate ascending order based on the cursor field.
|
||||
|
||||
### Use cases
|
||||
|
||||
If your use case requires saving state based on an interval e.g: only 10,000 records but nothing more sophisticated, then slicing is not necessary and you can instead set the `state_checkpoint_interval` property on a stream.
|
||||
|
||||
#### The Slack connector: time-based slicing for large datasets
|
||||
|
||||
Slack is a chat platform for businesses. Collectively, a company can easily post tens or hundreds of thousands of messages in a single Slack instance per day. So when writing a connector to pull chats from Slack, it's easy to run into rate limits or for the sync to take a very long time to complete because of the large amount of data. So we want a way to frequently "save" which data we already read from the connector so that if there is a halfway failure, we pick up reading where we left off. In addition, the Slack API does not return messages sorted by timestamp, so we cannot use `state_checkpoint_interval`s.
|
||||
|
||||
This is a great use case for stream slicing. The `messages` stream, which outputs one record per chat message, can slice records by time e.g: hourly. It implements this by specifying the beginning and end timestamp of each hour that it wants to pull data from. Then after all the records in a given hour \(i.e: slice\) have been read, the connector outputs a STATE message to indicate that state should be saved. This way, if the connector ever fails during a sync \(for example if the API goes down\) then at most, it will reread only one hour's worth of messages.
|
||||
|
||||
See the implementation of the Slack connector [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py).
|
||||
@@ -0,0 +1,3 @@
|
||||
# Component Schema Reference
|
||||
|
||||
A JSON schema representation of the relationships between the components that can be used in the YAML configuration can be found [here](https://github.com/airbytehq/airbyte-python-cdk/blob/main/airbyte_cdk/sources/declarative/declarative_component_schema.yaml).
|
||||
@@ -0,0 +1,58 @@
|
||||
# Custom Components
|
||||
|
||||
:::info
|
||||
Please help us improve the low code CDK! If you find yourself needing to build a custom component, please [create a feature request issue](https://github.com/airbytehq/airbyte/issues/new?assignees=&labels=type%2Fenhancement%2C+%2Cneeds-triage%2C+area%2Flow-code%2Fcomponents&template=feature-request.md&title=Low%20Code%20Feature:). If appropriate, we'll add it directly to the framework (or you can submit a PR)!
|
||||
|
||||
If an issue already exists for the missing feature you need, please upvote or comment on it so we can prioritize the issue accordingly.
|
||||
:::
|
||||
|
||||
Any built-in component can be overridden by a custom Python class.
|
||||
To create a custom component, define a new class in a new file in the connector's module.
|
||||
The class must implement the interface of the component it is replacing. For instance, a pagination strategy must implement `airbyte_cdk.sources.declarative.requesters.paginators.strategies.pagination_strategy.PaginationStrategy`.
|
||||
The class must also be a dataclass where each field represents an argument to configure from the YAML file, and it must declare an `InitVar` named `parameters`.
|
||||
|
||||
For example:
|
||||
|
||||
```python
from dataclasses import InitVar, dataclass
from typing import Any, List, Mapping, Optional, Union

import requests
from airbyte_cdk.sources.declarative.interpolation.interpolated_string import InterpolatedString
from airbyte_cdk.sources.declarative.requesters.paginators.strategies.pagination_strategy import PaginationStrategy


@dataclass
class MyPaginationStrategy(PaginationStrategy):
    my_field: Union[InterpolatedString, str]
    parameters: InitVar[Mapping[str, Any]]

    def __post_init__(self, parameters: Mapping[str, Any]):
        pass

    def next_page_token(self, response: requests.Response, last_records: List[Mapping[str, Any]]) -> Optional[Any]:
        pass

    def reset(self):
        pass
```
|
||||
|
||||
This class can then be referred to from the YAML file by specifying the type of custom component and using its fully qualified class name:
|
||||
|
||||
```yaml
|
||||
pagination_strategy:
|
||||
type: "CustomPaginationStrategy"
|
||||
class_name: "my_connector_module.MyPaginationStrategy"
|
||||
my_field: "hello world"
|
||||
```
|
||||
|
||||
### Custom Components that pass fields to child components
|
||||
|
||||
There are certain scenarios where a child subcomponent might rely on a field defined on a parent component. For regular components, we perform this propagation of fields from the parent component to the child automatically.
|
||||
However, custom components do not support this behavior. If you have a child subcomponent of your custom component that falls under this use case, you will see an error message like:
|
||||
|
||||
```
|
||||
Error creating component 'DefaultPaginator' with parent custom component source_example.components.CustomRetriever: Please provide DefaultPaginator.$parameters.url_base
|
||||
```
|
||||
|
||||
When you receive this error, you can address this by defining the missing field within the `$parameters` block of the child component.
|
||||
|
||||
```yaml
|
||||
paginator:
|
||||
type: "DefaultPaginator"
|
||||
<...>
|
||||
$parameters:
|
||||
url_base: "https://example.com"
|
||||
```
|
||||
@@ -0,0 +1,10 @@
|
||||
# How the Framework Works
|
||||
|
||||
1. Given the connection config and an optional stream state, the `PartitionRouter` computes the partitions to read data from.
|
||||
2. Iterate over all the partitions defined by the stream's partition router.
|
||||
3. For each partition,
|
||||
1. Submit a request to the partner API as defined by the requester
|
||||
2. Select the records from the response
|
||||
3. Repeat for as long as the paginator points to a next page
|
||||
|
||||

|
||||
@@ -0,0 +1,46 @@
|
||||
# Object Instantiation
|
||||
|
||||
If the component is a literal, then it is returned as is:
|
||||
|
||||
```
|
||||
3
|
||||
```
|
||||
|
||||
will result in
|
||||
|
||||
```
|
||||
3
|
||||
```
|
||||
|
||||
If the component definition is a mapping with a "type" field,
|
||||
the factory will look up the [CLASS_TYPES_REGISTRY](https://github.com/airbytehq/airbyte-python-cdk/blob/main//airbyte_cdk/sources/declarative/parsers/class_types_registry.py) and replace the "type" field with "class_name" -> CLASS_TYPES_REGISTRY[type]
|
||||
and instantiate the object from the resulting mapping
|
||||
|
||||
If the component definition is a mapping with neither a "class_name" nor a "type" field,
|
||||
the factory will do a best-effort attempt at inferring the component type by looking up the parent object's constructor type hints.
|
||||
If the type hint is an interface present in [DEFAULT_IMPLEMENTATIONS_REGISTRY](https://github.com/airbytehq/airbyte-python-cdk/blob/main//airbyte_cdk/sources/declarative/parsers/default_implementation_registry.py),
|
||||
then the factory will create an object of its default implementation.
|
||||
|
||||
If the component definition is a list, then the factory will iterate over the elements of the list,
|
||||
instantiate its subcomponents, and return a list of instantiated objects.
|
||||
|
||||
If the component has subcomponents, the factory will create the subcomponents before instantiating the top level object
|
||||
|
||||
```
{
  "type": "TopLevel",
  "param": {
    "type": "ParamType",
    "k": "v"
  }
}
```
|
||||
|
||||
will result in
|
||||
|
||||
```
|
||||
TopLevel(param=ParamType(k="v"))
|
||||
```
|
||||
|
||||
More details on object instantiation can be found [here](https://airbyte-cdk.readthedocs.io/en/latest/api/airbyte_cdk.sources.declarative.parsers.html?highlight=factory#airbyte_cdk.sources.declarative.parsers.factory.DeclarativeComponentFactory).
|
||||
@@ -0,0 +1,50 @@
|
||||
# Parameters
|
||||
|
||||
Parameters can be passed down from a parent component to its subcomponents using the $parameters key.
|
||||
This can be used to avoid repetitions.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
"$parameters":
|
||||
type: object
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
outer:
|
||||
$parameters:
|
||||
MyKey: MyValue
|
||||
inner:
|
||||
k2: v2
|
||||
```
|
||||
|
||||
In the example above, if both outer and inner are types with a "MyKey" field, both of them will evaluate to "MyValue".
|
||||
|
||||
These parameters can be overwritten by subcomponents as a form of specialization:
|
||||
|
||||
```yaml
|
||||
outer:
|
||||
$parameters:
|
||||
MyKey: MyValue
|
||||
inner:
|
||||
$parameters:
|
||||
MyKey: YourValue
|
||||
k2: v2
|
||||
```
|
||||
|
||||
In this example, "outer.MyKey" will evaluate to "MyValue", and "inner.MyKey" will evaluate to "YourValue".
|
||||
|
||||
The value can also be used for string interpolation:
|
||||
|
||||
```yaml
|
||||
outer:
|
||||
$parameters:
|
||||
MyKey: MyValue
|
||||
inner:
|
||||
k2: "MyKey is {{ parameters['MyKey'] }}"
|
||||
```
|
||||
|
||||
In this example, `outer.inner.k2` will evaluate to "MyKey is MyValue".
|
||||
@@ -0,0 +1,102 @@
|
||||
# References
|
||||
|
||||
Strings can contain references to previously defined values.
|
||||
The parser will dereference these values to produce a complete object definition.
|
||||
|
||||
References can be defined using a `#/{arg}` string.
|
||||
|
||||
```yaml
|
||||
key: 1234
|
||||
reference: "#/key"
|
||||
```
|
||||
|
||||
will produce the following definition:
|
||||
|
||||
```yaml
|
||||
key: 1234
|
||||
reference: 1234
|
||||
```
|
||||
|
||||
This also works with objects:
|
||||
|
||||
```yaml
|
||||
key_value_pairs:
|
||||
k1: v1
|
||||
k2: v2
|
||||
same_key_value_pairs: "#/key_value_pairs"
|
||||
```
|
||||
|
||||
will produce the following definition:
|
||||
|
||||
```yaml
|
||||
key_value_pairs:
|
||||
k1: v1
|
||||
k2: v2
|
||||
same_key_value_pairs:
|
||||
k1: v1
|
||||
k2: v2
|
||||
```
|
||||
|
||||
The `$ref` keyword can be used to refer to an object and enhance it with additional key-value pairs:
|
||||
|
||||
```yaml
|
||||
key_value_pairs:
|
||||
k1: v1
|
||||
k2: v2
|
||||
same_key_value_pairs:
|
||||
$ref: "#/key_value_pairs"
|
||||
k3: v3
|
||||
```
|
||||
|
||||
will produce the following definition:
|
||||
|
||||
```yaml
|
||||
key_value_pairs:
|
||||
k1: v1
|
||||
k2: v2
|
||||
same_key_value_pairs:
|
||||
k1: v1
|
||||
k2: v2
|
||||
k3: v3
|
||||
```
|
||||
|
||||
References can also point to nested values.
Nested references are ambiguous because one could define a key containing a `/`.
In this example, we want to refer to the `limit` key in the `dict` object:
|
||||
|
||||
```yaml
|
||||
dict:
|
||||
limit: 50
|
||||
limit_ref: "#/dict/limit"
|
||||
```
|
||||
|
||||
will produce the following definition:
|
||||
|
||||
```yaml
dict:
  limit: 50
limit_ref: 50
```
|
||||
|
||||
whereas here we want to access the `nested/path` value.
|
||||
|
||||
```yaml
nested:
  path: "first one"
nested/path: "uh oh"
value: "#/nested/path"
```
|
||||
|
||||
will produce the following definition:
|
||||
|
||||
```yaml
|
||||
nested:
|
||||
path: "first one"
|
||||
nested/path: "uh oh"
|
||||
value: "uh oh"
|
||||
```
|
||||
|
||||
To resolve the ambiguity, we try looking for the reference key at the top-level, and then traverse the structs downward
|
||||
until we find a key with the given path, or until there is nothing to traverse.
|
||||
|
||||
More details on referencing values can be found [here](https://airbyte-cdk.readthedocs.io/en/latest/api/airbyte_cdk.sources.declarative.parsers.html?highlight=yamlparser#airbyte_cdk.sources.declarative.parsers.yaml_parser.YamlParser).
|
||||
@@ -0,0 +1,36 @@
|
||||
# String Interpolation
|
||||
|
||||
String values can be evaluated as Jinja2 templates.
|
||||
|
||||
If the input string is a raw string, the interpolated string will be the same.
|
||||
`"hello world" -> "hello world"`
|
||||
|
||||
The engine will evaluate the content passed within `{{...}}`, interpolating the keys from context-specific arguments.
|
||||
The "parameters" keyword [see ($parameters)](./parameters.md) can be referenced.
|
||||
|
||||
For example, `some_object.inner_object.key` will evaluate to "Hello airbyte" at runtime.
|
||||
|
||||
```yaml
|
||||
some_object:
|
||||
$parameters:
|
||||
name: "airbyte"
|
||||
inner_object:
|
||||
key: "Hello {{ parameters.name }}"
|
||||
```
|
||||
|
||||
Some components also pass in additional arguments to the context.
|
||||
This is the case for the [record selector](../understanding-the-yaml-file/record-selector.md), which passes in an additional `response` argument.
|
||||
|
||||
Both dot notation and bracket notation (with single quotes `'`) are interchangeable.
|
||||
This means that both these string templates will evaluate to the same string:
|
||||
|
||||
1. `"{{ parameters.name }}"`
|
||||
2. `"{{ parameters['name'] }}"`
|
||||
|
||||
In addition to passing additional values through the $parameters argument, macros can be called from within the string interpolation.
|
||||
For example,
|
||||
`"{{ max(2, 3) }}" -> 3`
|
||||
|
||||
The macros and variables available in all possible contexts are documented in the [YAML Reference](../understanding-the-yaml-file/reference.md#variables).
|
||||
|
||||
Additional information on jinja templating can be found at [https://jinja.palletsprojects.com/en/3.1.x/templates/#](https://jinja.palletsprojects.com/en/3.1.x/templates/#)
|
||||
|
@@ -0,0 +1,99 @@
|
||||
# Low-code connector development
|
||||
|
||||
Airbyte's low-code framework enables you to build source connectors for REST APIs via a [connector builder UI](../connector-builder-ui/overview.md) or by modifying boilerplate YAML files via terminal or text editor. The low-code CDK is part of the Python CDK and provides a mapping from connector manifest YAML files to the actual behavior implementations.
|
||||
|
||||
## Why low-code?
|
||||
|
||||
### API Connectors are common and formulaic
|
||||
|
||||
In building and maintaining hundreds of connectors at Airbyte, we've observed that while API source connectors constitute the overwhelming majority of connectors, they are also the most formulaic. API connector code almost always solves small variations of these problems:
|
||||
|
||||
1. Making requests to various endpoints under the same API URL e.g: `https://api.stripe.com/customers`, `https://api.stripe.com/transactions`, etc..
|
||||
2. Authenticating using a common auth strategy such as OAuth or API keys
|
||||
3. Pagination using one of the 4 ubiquitous pagination strategies: limit-offset, page-number, cursor pagination, and header link pagination
|
||||
4. Gracefully handling rate limiting by implementing exponential backoff, fixed-time backoff, or variable-time backoff
|
||||
5. Describing the schema of the data returned by the API, so that downstream warehouses can create normalized tables
|
||||
6. Decoding the format of the data returned by the API (e.g JSON, XML, CSV, etc..) and handling compression (GZIP, BZIP, etc..)
|
||||
7. Supporting incremental data exports by remembering what data was already synced, usually using date-based cursors
|
||||
|
||||
and so on.
|
||||
|
||||
### A declarative, low-code paradigm commoditizes solving formulaic problems
|
||||
|
||||
Given that these problems each have a very finite number of solutions, we can remove the need for writing the code to build these API connectors by providing configurable off-the-shelf components to solve them. In doing so, we significantly decrease development effort and bugs while improving maintainability and accessibility. In this paradigm, instead of having to write the exact lines of code to solve this problem over and over, a developer can pick the solution to each problem from an available component, and rely on the framework to run the logic for them.
|
||||
|
||||
|
||||
|
||||
## Overview of the process
|
||||
|
||||
To use the low-code framework to build a REST API source connector:
|
||||
|
||||
1. Generate the API key or credentials for the source you want to build a connector for
|
||||
2. Set up the project on your local machine
|
||||
3. Set up your local development environment
|
||||
4. Use the connector builder UI to define the connector YAML manifest and test the connector
|
||||
5. Specify stream schemas
|
||||
6. Add the connector to the Airbyte platform
|
||||
|
||||
For a step-by-step tutorial, refer to the [Getting Started with the Connector Builder](../connector-builder-ui/tutorial.mdx) or the [video tutorial](https://youtu.be/i7VSL2bDvmw)
|
||||
|
||||
## Configuring the YAML file
|
||||
|
||||
The low-code framework involves editing the Connector Manifest, which is a boilerplate YAML file. The general structure of the YAML file is as follows:
|
||||
|
||||
```
|
||||
version: "0.1.0"
|
||||
definitions:
|
||||
<key-value pairs defining objects which will be reused in the YAML connector>
|
||||
streams:
|
||||
<list stream definitions>
|
||||
check:
|
||||
<definition of connection checker>
|
||||
spec:
|
||||
<connector spec>
|
||||
```
|
||||
|
||||
The following table describes the components of the YAML file:
|
||||
|
||||
| Component | Description |
|
||||
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `version` | Indicates the framework version |
|
||||
| `definitions` | Describes the objects to be reused in the YAML connector |
|
||||
| `streams` | Lists the streams of the source |
|
||||
| `check` | Describes how to test the connection to the source by trying to read a record from a specified list of streams and failing if no records could be read |
|
||||
| `spec` | A [connector specification](../../understanding-airbyte/airbyte-protocol#actor-specification) which describes the required and optional parameters which can be input by the end user to configure this connector |
|
||||
|
||||
:::tip
|
||||
Streams define the schema of the data to sync, as well as how to read it from the underlying API source. A stream generally corresponds to a resource within the API. They are analogous to tables for a relational database source.
|
||||
:::
|
||||
|
||||
For each stream, configure the following components:
|
||||
|
||||
| Component | Sub-component | Description |
|
||||
| ---------------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| Name | | Name of the stream |
|
||||
| Primary key (Optional) | | Used to uniquely identify records, enabling deduplication. Can be a string for single primary keys, a list of strings for composite primary keys, or a list of lists of strings for composite primary keys consisting of nested fields |
|
||||
| Schema | | Describes the data to sync |
|
||||
| Incremental sync | | Describes the behavior of an incremental sync which enables checkpointing and replicating only the data that has changed since the last sync to a destination. |
|
||||
| Data retriever | | Describes how to retrieve data from the API |
|
||||
| | Requester | Describes how to prepare HTTP requests to send to the source API and defines the base URL and path, the request options provider, the HTTP method, authenticator, error handler components |
|
||||
| | Pagination | Describes how to navigate through the API's pages |
|
||||
| | Record Selector | Describes how to extract records from a HTTP response |
|
||||
| | Partition Router | Describes how to partition the stream, enabling incremental syncs and checkpointing |
|
||||
| Cursor field | | Field to use as stream cursor. Can either be a string, or a list of strings if the cursor is a nested field. |
|
||||
| Transformations | | A set of transformations to be applied on the records read from the source before emitting them to the destination |
|
||||
|
||||
For a deep dive into each of the components, refer to [Understanding the YAML file](./understanding-the-yaml-file/yaml-overview.md) or the [full YAML Schema definition](https://github.com/airbytehq/airbyte-python-cdk/blob/main/airbyte_cdk/sources/declarative/declarative_component_schema.yaml)
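To see how these components fit together, here is a minimal, hypothetical stream sketch; the `https://api.example.com` base URL, the `/customers` path, and the `data` field path are placeholders rather than a real API:

```yaml
streams:
  - type: DeclarativeStream
    name: "customers"
    retriever:
      type: SimpleRetriever
      requester:
        type: HttpRequester
        url_base: "https://api.example.com"
        path: "/customers"
        http_method: "GET"
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path: ["data"]
```

A real connector would typically also configure authentication, pagination, and incremental sync, which are covered in the following sections.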
|
||||
|
||||
|
||||
## Sample connectors
|
||||
|
||||
For examples of production-ready config-based connectors, refer to:
|
||||
|
||||
- [Greenhouse](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-greenhouse/source_greenhouse/manifest.yaml)
|
||||
- [Sendgrid](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-sendgrid/source_sendgrid/manifest.yaml)
|
||||
- [Sentry](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-sentry/source_sentry/manifest.yaml)
|
||||
|
||||
## Reference
|
||||
|
||||
The full schema definition for the YAML file can be found [here](https://raw.githubusercontent.com/airbytehq/airbyte-python-cdk/main/airbyte_cdk/sources/declarative/declarative_component_schema.yaml).
|
||||
@@ -0,0 +1,293 @@
|
||||
# Authentication
|
||||
|
||||
The `Authenticator` defines how to configure outgoing HTTP requests to authenticate on the API source.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
Authenticator:
|
||||
type: object
|
||||
description: "Authenticator type"
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/OAuth"
|
||||
- "$ref": "#/definitions/ApiKeyAuthenticator"
|
||||
- "$ref": "#/definitions/BearerAuthenticator"
|
||||
- "$ref": "#/definitions/BasicHttpAuthenticator"
|
||||
```
|
||||
|
||||
## Authenticators
|
||||
|
||||
### ApiKeyAuthenticator
|
||||
|
||||
The `ApiKeyAuthenticator` sets an HTTP header on outgoing requests.
|
||||
The following definition will set the header "Authorization" with a value "Bearer hello":
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
ApiKeyAuthenticator:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- header
|
||||
- api_token
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
header:
|
||||
type: string
|
||||
api_token:
|
||||
type: string
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
authenticator:
|
||||
type: "ApiKeyAuthenticator"
|
||||
header: "Authorization"
|
||||
api_token: "Bearer hello"
|
||||
```
|
||||
|
||||
For more information see [ApiKeyAuthenticator Reference](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/ApiKeyAuthenticator)
|
||||
|
||||
### BearerAuthenticator
|
||||
|
||||
The `BearerAuthenticator` is a specialized `ApiKeyAuthenticator` that always sets the header "Authorization" with the value `Bearer {token}`.
|
||||
The following definition will set the header "Authorization" with a value "Bearer hello"
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
BearerAuthenticator:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- api_token
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
api_token:
|
||||
type: string
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
authenticator:
|
||||
type: "BearerAuthenticator"
|
||||
api_token: "hello"
|
||||
```
|
||||
|
||||
More information on bearer authentication can be found [here](https://swagger.io/docs/specification/authentication/bearer-authentication/).
|
||||
|
||||
For more information see [BearerAuthenticator Reference](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/BearerAuthenticator)
|
||||
|
||||
### BasicHttpAuthenticator
|
||||
|
||||
The `BasicHttpAuthenticator` sets the "Authorization" header with a (USER ID/password) pair, encoded using base64 as per [RFC 7617](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication#basic_authentication_scheme).
|
||||
The following definition will set the header "Authorization" with a value `Basic {encoded credentials}`
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
BasicHttpAuthenticator:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- username
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
username:
|
||||
type: string
|
||||
password:
|
||||
type: string
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
authenticator:
|
||||
type: "BasicHttpAuthenticator"
|
||||
username: "hello"
|
||||
password: "world"
|
||||
```
|
||||
|
||||
The password is optional. Authenticating with APIs using Basic HTTP and a single API key can be done as:
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
authenticator:
|
||||
type: "BasicHttpAuthenticator"
|
||||
username: "hello"
|
||||
```
|
||||
|
||||
For more information see [BasicHttpAuthenticator Reference](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/BasicHttpAuthenticator)
|
||||
|
||||
### OAuth
|
||||
|
||||
The OAuth authenticator is a declarative way to authenticate with an API using OAuth 2.0.
|
||||
|
||||
To learn more about the OAuth authenticator, see the [OAuth 2.0](../advanced-topics/oauth.md) documentation.
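As a rough sketch, an `OAuthAuthenticator` that exchanges a refresh token for an access token could look like the following; the token endpoint URL and the config keys are hypothetical placeholders:

```yaml
authenticator:
  type: OAuthAuthenticator
  token_refresh_endpoint: "https://api.example.com/oauth/token"
  client_id: "{{ config['client_id'] }}"
  client_secret: "{{ config['client_secret'] }}"
  refresh_token: "{{ config['refresh_token'] }}"
```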
|
||||
|
||||
### JWT Authenticator
|
||||
|
||||
JSON Web Token (JWT) authentication is supported through the `JwtAuthenticator`.
|
||||
|
||||
Schema
|
||||
|
||||
```yaml
|
||||
JwtAuthenticator:
|
||||
title: JWT Authenticator
|
||||
description: Authenticator for requests using JWT authentication flow.
|
||||
type: object
|
||||
required:
|
||||
- type
|
||||
- secret_key
|
||||
- algorithm
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [JwtAuthenticator]
|
||||
secret_key:
|
||||
type: string
|
||||
description: Secret used to sign the JSON web token.
|
||||
examples:
|
||||
- "{{ config['secret_key'] }}"
|
||||
base64_encode_secret_key:
|
||||
type: boolean
|
||||
description: When set to true, the secret key will be base64 encoded prior to being encoded as part of the JWT. Only set to "true" when required by the API.
|
||||
default: False
|
||||
algorithm:
|
||||
type: string
|
||||
description: Algorithm used to sign the JSON web token.
|
||||
enum:
|
||||
[
|
||||
"HS256",
|
||||
"HS384",
|
||||
"HS512",
|
||||
"ES256",
|
||||
"ES256K",
|
||||
"ES384",
|
||||
"ES512",
|
||||
"RS256",
|
||||
"RS384",
|
||||
"RS512",
|
||||
"PS256",
|
||||
"PS384",
|
||||
"PS512",
|
||||
"EdDSA",
|
||||
]
|
||||
examples:
|
||||
- ES256
|
||||
- HS256
|
||||
- RS256
|
||||
- "{{ config['algorithm'] }}"
|
||||
token_duration:
|
||||
type: integer
|
||||
title: Token Duration
|
||||
description: The amount of time in seconds a JWT token can be valid after being issued.
|
||||
default: 1200
|
||||
examples:
|
||||
- 1200
|
||||
- 3600
|
||||
header_prefix:
|
||||
type: string
|
||||
title: Header Prefix
|
||||
description: The prefix to be used within the Authentication header.
|
||||
examples:
|
||||
- "Bearer"
|
||||
- "Basic"
|
||||
jwt_headers:
|
||||
type: object
|
||||
title: JWT Headers
|
||||
description: JWT headers used when signing JSON web token.
|
||||
additionalProperties: false
|
||||
properties:
|
||||
kid:
|
||||
type: string
|
||||
title: Key Identifier
|
||||
description: Private key ID for user account.
|
||||
examples:
|
||||
- "{{ config['kid'] }}"
|
||||
typ:
|
||||
type: string
|
||||
title: Type
|
||||
description: The media type of the complete JWT.
|
||||
default: JWT
|
||||
examples:
|
||||
- JWT
|
||||
cty:
|
||||
type: string
|
||||
title: Content Type
|
||||
description: Content type of JWT header.
|
||||
examples:
|
||||
- JWT
|
||||
additional_jwt_headers:
|
||||
type: object
|
||||
title: Additional JWT Headers
|
||||
description: Additional headers to be included with the JWT headers object.
|
||||
additionalProperties: true
|
||||
jwt_payload:
|
||||
type: object
|
||||
title: JWT Payload
|
||||
description: JWT Payload used when signing JSON web token.
|
||||
additionalProperties: false
|
||||
properties:
|
||||
iss:
|
||||
type: string
|
||||
title: Issuer
|
||||
description: The user/principal that issued the JWT. Commonly a value unique to the user.
|
||||
examples:
|
||||
- "{{ config['iss'] }}"
|
||||
sub:
|
||||
type: string
|
||||
title: Subject
|
||||
description: The subject of the JWT. Commonly defined by the API.
|
||||
aud:
|
||||
type: string
|
||||
title: Audience
|
||||
description: The recipient that the JWT is intended for. Commonly defined by the API.
|
||||
examples:
|
||||
- "appstoreconnect-v1"
|
||||
additional_jwt_payload:
|
||||
type: object
|
||||
title: Additional JWT Payload Properties
|
||||
description: Additional properties to be added to the JWT payload.
|
||||
additionalProperties: true
|
||||
$parameters:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
authenticator:
|
||||
type: JwtAuthenticator
|
||||
secret_key: "{{ config['secret_key'] }}"
|
||||
base64_encode_secret_key: True
|
||||
algorithm: RS256
|
||||
token_duration: 3600
|
||||
header_prefix: Bearer
|
||||
jwt_headers:
|
||||
kid: "{{ config['kid'] }}"
|
||||
cty: "JWT"
|
||||
additional_jwt_headers:
|
||||
test: "{{ config['test']}}"
|
||||
jwt_payload:
|
||||
iss: "{{ config['iss'] }}"
|
||||
sub: "sub value"
|
||||
aud: "aud value"
|
||||
additional_jwt_payload:
|
||||
test: "test custom payload"
|
||||
```
|
||||
|
||||
For more information see [JwtAuthenticator Reference](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/JwtAuthenticator)
|
||||
|
||||
## More readings
|
||||
|
||||
- [Requester](./requester.md)
|
||||
- [Request options](./request-options.md)
|
||||
@@ -0,0 +1,354 @@
|
||||
# Error handling
|
||||
|
||||
By default, only server errors (HTTP 5XX) and too many requests (HTTP 429) will be retried up to 5 times with exponential backoff.
|
||||
Other HTTP errors will result in a failed read.
|
||||
|
||||
Other behaviors can be configured through the `Requester`'s `error_handler` field.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
ErrorHandler:
|
||||
type: object
|
||||
description: "Error handler"
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/DefaultErrorHandler"
|
||||
- "$ref": "#/definitions/CompositeErrorHandler"
|
||||
```
|
||||
|
||||
## Default error handler
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
DefaultErrorHandler:
|
||||
type: object
|
||||
required:
|
||||
- max_retries
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
response_filters:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/HttpResponseFilter"
|
||||
max_retries:
|
||||
type: integer
|
||||
default: 5
|
||||
backoff_strategies:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/BackoffStrategy"
|
||||
default: []
|
||||
```
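For illustration, a sketch of a handler that retries up to 10 times with a constant 10-second backoff might look like this (the values are placeholders):

```yaml
requester:
  <...>
  error_handler:
    type: "DefaultErrorHandler"
    max_retries: 10
    backoff_strategies:
      - type: "ConstantBackoffStrategy"
        backoff_time_in_seconds: 10
```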
|
||||
|
||||
## Defining errors
|
||||
|
||||
### From status code
|
||||
|
||||
Response filters can be used to define how to handle requests resulting in responses with a specific HTTP status code.
|
||||
For instance, this example will configure the handler to also retry responses with a 404 error:
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
HttpResponseFilter:
|
||||
type: object
|
||||
required:
|
||||
- action
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
action:
|
||||
"$ref": "#/definitions/ResponseAction"
|
||||
http_codes:
|
||||
type: array
|
||||
items:
|
||||
type: integer
|
||||
default: []
|
||||
error_message_contains:
|
||||
type: string
|
||||
predicate:
|
||||
type: string
|
||||
ResponseAction:
|
||||
type: string
|
||||
enum:
|
||||
- SUCCESS
|
||||
- FAIL
|
||||
- IGNORE
|
||||
- RETRY
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
response_filters:
|
||||
- http_codes: [ 404 ]
|
||||
action: RETRY
|
||||
```
|
||||
|
||||
Response filters can be used to specify HTTP errors to ignore.
|
||||
For instance, this example will configure the handler to ignore responses with a 404 error:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
response_filters:
|
||||
- http_codes: [ 404 ]
|
||||
action: IGNORE
|
||||
```
|
||||
|
||||
### From error message
|
||||
|
||||
Errors can also be defined by parsing the error message.
|
||||
For instance, this error handler will ignore responses if the error message contains the string "ignorethisresponse"
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
response_filters:
|
||||
- error_message_contains: "ignorethisresponse"
|
||||
action: IGNORE
|
||||
```
|
||||
|
||||
This can also be done through a more generic string interpolation strategy with the following parameters:
|
||||
|
||||
- response: the decoded response
|
||||
|
||||
This example ignores errors where the response contains a "code" field:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
response_filters:
|
||||
- predicate: "{{ 'code' in response }}"
|
||||
action: IGNORE
|
||||
```
|
||||
|
||||
The error handler can have multiple response filters.
|
||||
The following example is configured to ignore 404 errors, and retry 429 errors:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
response_filters:
|
||||
- http_codes: [ 404 ]
|
||||
action: IGNORE
|
||||
- http_codes: [ 429 ]
|
||||
action: RETRY
|
||||
```
|
||||
|
||||
## Backoff Strategies
|
||||
|
||||
The error handler supports a few backoff strategies, which are described in the following sections.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
BackoffStrategy:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/ExponentialBackoffStrategy"
|
||||
- "$ref": "#/definitions/ConstantBackoffStrategy"
|
||||
- "$ref": "#/definitions/WaitTimeFromHeader"
|
||||
- "$ref": "#/definitions/WaitUntilTimeFromHeader"
|
||||
```
|
||||
|
||||
### Exponential backoff
|
||||
|
||||
This is the default backoff strategy. The requester will backoff with an exponential backoff interval
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
ExponentialBackoffStrategy:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
factor:
|
||||
type: integer
|
||||
default: 5
|
||||
```
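For example, a sketch that increases the backoff multiplier by setting `factor` (the value below is illustrative):

```yaml
requester:
  <...>
  error_handler:
    <...>
    backoff_strategies:
      - type: "ExponentialBackoffStrategy"
        factor: 10
```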
|
||||
|
||||
### Constant Backoff
|
||||
|
||||
When using the `ConstantBackoffStrategy` strategy, the requester will backoff with a constant interval.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
ConstantBackoffStrategy:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- backoff_time_in_seconds
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
backoff_time_in_seconds:
|
||||
type: number
|
||||
```
|
||||
|
||||
### Wait time defined in header
|
||||
|
||||
When using the `WaitTimeFromHeader`, the requester will backoff by an interval specified in the response header.
|
||||
In this example, the requester will backoff by the response's "wait_time" header value:
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
WaitTimeFromHeader:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- header
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
header:
|
||||
type: string
|
||||
regex:
|
||||
type: string
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
<...>
|
||||
backoff_strategies:
|
||||
- type: "WaitTimeFromHeader"
|
||||
header: "wait_time"
|
||||
```
|
||||
|
||||
Optionally, a regular expression can be configured to extract the wait time from the header value.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
<...>
|
||||
backoff_strategies:
|
||||
- type: "WaitTimeFromHeader"
|
||||
header: "wait_time"
|
||||
regex: "[-+]?\d+"
|
||||
```
|
||||
|
||||
### Wait until time defined in header
|
||||
|
||||
When using the `WaitUntilTimeFromHeader` backoff strategy, the requester will backoff until the time specified in the response header.
|
||||
In this example, the requester will wait until the time specified in the "wait_until" header value:
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
WaitUntilTimeFromHeader:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- header
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
header:
|
||||
type: string
|
||||
regex:
|
||||
type: string
|
||||
min_wait:
|
||||
type: number
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
<...>
|
||||
backoff_strategies:
|
||||
- type: "WaitUntilTimeFromHeader"
|
||||
header: "wait_until"
|
||||
regex: "[-+]?\d+"
|
||||
min_wait: 5
|
||||
```
|
||||
|
||||
The strategy accepts an optional regular expression to extract the time from the header value, and a minimum time to wait.
|
||||
|
||||
## Advanced error handling
|
||||
|
||||
The error handler can have multiple backoff strategies, allowing it to fallback if a strategy cannot be evaluated.
|
||||
For instance, the following defines an error handler that will read the backoff time from a header, and default to a constant backoff if the wait time could not be extracted from the response:
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
<...>
|
||||
backoff_strategies:
|
||||
- type: "WaitTimeFromHeader"
|
||||
header: "wait_time"
|
||||
- type: "ConstantBackoff"
|
||||
backoff_time_in_seconds: 5
|
||||
```
|
||||
|
||||
The `requester` can be configured to use a `CompositeErrorHandler`, which sequentially iterates over a list of error handlers, enabling different retry mechanisms for different types of errors.
|
||||
|
||||
In this example, a constant backoff of 5 seconds will be applied if the response contains a "code" field, and an exponential backoff will be applied if the error code is 403:
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
CompositeErrorHandler:
|
||||
type: object
|
||||
required:
|
||||
- error_handlers
|
||||
additionalProperties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
error_handlers:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/ErrorHandler"
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
<...>
|
||||
error_handler:
|
||||
type: "CompositeErrorHandler"
|
||||
error_handlers:
|
||||
- response_filters:
|
||||
- predicate: "{{ 'code' in response }}"
|
||||
action: RETRY
|
||||
backoff_strategies:
|
||||
- type: "ConstantBackoffStrategy"
|
||||
backoff_time_in_seconds: 5
|
||||
- response_filters:
|
||||
- http_codes: [ 403 ]
|
||||
action: RETRY
|
||||
backoff_strategies:
|
||||
- type: "ExponentialBackoffStrategy"
|
||||
```
|
||||
|
||||
## More readings
|
||||
|
||||
- [Requester](./requester.md)
|
||||
@@ -0,0 +1,152 @@
|
||||
# Incremental Syncs
|
||||
|
||||
An incremental sync is a sync which pulls only the data that has changed since the previous sync (as opposed to all the data available in the data source).
|
||||
|
||||
Incremental syncs are usually implemented using a cursor value (like a timestamp) that delineates which data was pulled and which data is new. A very common cursor value is an `updated_at` timestamp. This cursor means that records whose `updated_at` value is less than or equal to that cursor value have been synced already, and that the next sync should only export records whose `updated_at` value is greater than the cursor value.
|
||||
|
||||
On a stream, `incremental_sync` defines the connector behavior to support cursor based replication.
|
||||
|
||||
When a stream is read incrementally, a state message will be output by the connector after reading all the records, which allows for [checkpointing](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol/#state--checkpointing). On the next incremental sync, the prior state message will be used to determine the next set of records to read.
|
||||
|
||||
## DatetimeBasedCursor
|
||||
|
||||
The `DatetimeBasedCursor` is used to read records from the underlying data source (e.g: an API) according to a specified datetime range. This time range is partitioned into time windows according to the `step`. For example, if you have `start_time=2022-01-01T00:00:00`, `end_time=2022-01-05T12:00:00`, `step=P1D` and `cursor_granularity=PT1S`, the following partitions will be created:
|
||||
|
||||
| Start | End |
|
||||
| ------------------- | ------------------- |
|
||||
| 2022-01-01T00:00:00 | 2022-01-01T23:59:59 |
|
||||
| 2022-01-02T00:00:00 | 2022-01-02T23:59:59 |
|
||||
| 2022-01-03T00:00:00 | 2022-01-03T23:59:59 |
|
||||
| 2022-01-04T00:00:00 | 2022-01-04T23:59:59 |
|
||||
| 2022-01-05T00:00:00 | 2022-01-05T12:00:00 |
|
||||
|
||||
During the sync, records are read from the API according to these time windows and the `cursor_field` indicates where the datetime value is stored on a record. This cursor is progressed as these partitions of records are successfully transmitted to the destination.
|
||||
|
||||
Upon a successful sync, the final stream state will be the datetime of the last record emitted. On the subsequent sync, the connector will fetch records whose cursor value is on or after that datetime.
|
||||
|
||||
Refer to the schema for both [`DatetimeBasedCursor`](reference.md#/definitions/DatetimeBasedCursor) and [`MinMaxDatetime`](reference.md#/definitions/MinMaxDatetime) in the YAML reference for more details.
|
||||
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
incremental_sync:
|
||||
type: DatetimeBasedCursor
|
||||
start_datetime: "2022-01-01T00:00:00"
|
||||
end_datetime: "2022-01-05T12:00:00"
|
||||
datetime_format: "%Y-%m-%dT%H:%M:%S"
|
||||
cursor_granularity: "PT1S"
|
||||
step: "P1D"
|
||||
```
|
||||
|
||||
will result in the datetime partition windows in the example mentioned earlier.
|
||||
|
||||
### Lookback Windows
|
||||
|
||||
The `DatetimeBasedCursor` also supports an optional lookback window, specifying how many days before the start_datetime to read data for.
|
||||
|
||||
```yaml
|
||||
incremental_sync:
|
||||
type: DatetimeBasedCursor
|
||||
start_datetime: "2022-02-01T00:00:00.000000+0000"
|
||||
end_datetime: "2022-03-01T00:00:00.000000+0000"
|
||||
datetime_format: "%Y-%m-%dT%H:%M:%S.%f%z"
|
||||
cursor_granularity: "PT0.000001S"
|
||||
lookback_window: "P31D"
|
||||
step: "P1D"
|
||||
```
|
||||
|
||||
will read data from `2022-01-01` to `2022-03-01`.
|
||||
|
||||
The stream partitions will be of the form `{"start_date": "2022-01-01T00:00:00.000000+0000", "end_date": "2022-01-01T23:59:59.999999+0000"}`.
|
||||
The stream partitions' field names can be customized through the `partition_field_start` and `partition_field_end` parameters.
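For example, a sketch that renames the partition fields (the `from_date`/`to_date` names are illustrative):

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  <...>
  partition_field_start: "from_date"
  partition_field_end: "to_date"
```

With this configuration, the stream partitions would take the form `{"from_date": "...", "to_date": "..."}`.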
|
||||
|
||||
The `datetime_format` can be used to specify the format of the start and end time. It is [RFC3339](https://datatracker.ietf.org/doc/html/rfc3339#section-5.6) by default.
|
||||
|
||||
The Stream's state will be derived by reading the record's `cursor_field`.
|
||||
If the `cursor_field` is `updated_at`, and the record is `{"id": 1234, "updated_at": "2021-02-02T00:00:00.000000+0000"}`, then the state after reading that record is `"updated_at": "2021-02-02T00:00:00.000000+0000"`. [^1]
|
||||
|
||||
Note that all durations are expressed as [ISO 8601 durations](https://en.wikipedia.org/wiki/ISO_8601#Durations).
|
||||
|
||||
### Filtering according to Cursor Field
|
||||
|
||||
If an API supports filtering data based on the cursor field, the `start_time_option` and `end_time_option` parameters can be used to configure this filtering.
|
||||
For instance, if the API supports filtering using the request parameters `created[gte]` and `created[lte]`, then the component can specify the request parameters as
|
||||
|
||||
```yaml
|
||||
incremental_sync:
|
||||
type: DatetimeBasedCursor
|
||||
<...>
|
||||
start_time_option:
|
||||
type: RequestOption
|
||||
field_name: "created[gte]"
|
||||
inject_into: "request_parameter"
|
||||
end_time_option:
|
||||
type: RequestOption
|
||||
field_name: "created[lte]"
|
||||
inject_into: "request_parameter"
|
||||
```
|
||||
|
||||
### Nested Streams
|
||||
|
||||
Nested streams, subresources, or streams that depend on other streams can be implemented using a [`SubstreamPartitionRouter`](#SubstreamPartitionRouter)
|
||||
|
||||
The default state format is **per partition with fallback to global**, but there are options to enhance efficiency depending on your use case: **incremental_dependency** and **global_substream_cursor**. Here's when and how to use each option, with examples:
|
||||
|
||||
#### Per Partition with Fallback to Global (Default)
|
||||
- **Description**: This is the default state format, where each partition has its own cursor. However, when the number of records in the parent sync exceeds two times the set limit, the cursor automatically falls back to a global state to manage efficiency and scalability.
|
||||
- **Limitation**: The per partition state has a limit of 10,000 partitions. Once this limit is exceeded, the global cursor takes over, aggregating the state across partitions to avoid inefficiencies.
|
||||
- **When to Use**: Use this as the default option for most cases. It provides the flexibility of managing partitions while preventing performance degradation when large numbers of records are involved.
|
||||
- **Example State**:
|
||||
```json
|
||||
{
|
||||
"states": [
|
||||
{"partition_key": "partition_1", "cursor_field": "2021-01-15"},
|
||||
{"partition_key": "partition_2", "cursor_field": "2021-02-14"}
|
||||
],
|
||||
"state": {
|
||||
"cursor_field": "2021-02-15"
|
||||
},
|
||||
"use_global_cursor": false
|
||||
}
|
||||
```
|
||||
#### Incremental Dependency
|
||||
- **Description**: This option allows the parent stream to be read incrementally, ensuring that only new data is synced.
|
||||
- **Requirement**: The API must ensure that the parent record's cursor is updated whenever child records are added or updated. If this requirement is not met, child records added to older parent records will be lost.
|
||||
- **When to Use**: Use this option if the parent stream is incremental and you want to read it with the state. The parent state is updated after processing all the child records for the parent record.
|
||||
- **Example State**:
|
||||
```json
|
||||
{
|
||||
"parent_state": {
|
||||
"parent_stream": { "timestamp": "2024-08-01T00:00:00" }
|
||||
},
|
||||
"states": [
|
||||
{ "partition": "A", "timestamp": "2024-08-01T00:00:00" },
|
||||
{ "partition": "B", "timestamp": "2024-08-01T01:00:00" }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
#### Global Substream Cursor
|
||||
- **Description**: This option uses a single global cursor for all partitions, significantly reducing the state size. It enforces a minimal lookback window based on the previous sync's duration to avoid losing records added or updated during the sync. Since the global cursor is already part of the per partition with fallback to global approach, it should only be used cautiously for custom connectors with exceptionally large parent streams to avoid managing state per partition.
|
||||
- **When to Use**: Use this option cautiously for custom connectors where the number of partitions in the parent stream is extremely high (e.g., millions of records per sync). The global cursor avoids the inefficiency of managing state per partition but sacrifices some granularity, which may not be suitable for every use case.
|
||||
- **Operational Detail**: The global cursor is updated only at the end of the sync. If the sync fails, only the parent state is updated, provided that the incremental dependency is enabled. The global cursor ensures that records are captured through a lookback window, even if they were added during the sync.
|
||||
- **Example State**:
|
||||
```json
|
||||
[
|
||||
{ "timestamp": "2024-08-01"}
|
||||
]
|
||||
```
|
||||
|
||||
### Summary
|
||||
|
||||
- **Per Partition with Fallback to Global (Default)**: Best for managing scalability and optimizing state size. Starts with per partition cursors, and automatically falls back to a global cursor if the number of records in the parent sync exceeds two times the partition limit.
|
||||
- **Incremental Dependency**: Use for incremental parent streams with a dependent child cursor. Ensure the API updates the parent cursor when child records are added or updated.
|
||||
- **Global Substream Cursor**: Use cautiously for custom connectors with very large parent streams. Avoids per partition state management but sacrifices some granularity.
|
||||
|
||||
Choose the option that best fits your data structure and sync requirements to optimize performance and data integrity.
|
||||
|
||||
## More readings
|
||||
|
||||
- [Incremental reads](../../cdk-python/incremental-stream.md)
|
||||
- Many of the concepts discussed here are described in the [No-Code Connector Builder docs](../../connector-builder-ui/incremental-sync.md) as well, with more examples.
|
||||
@@ -0,0 +1,231 @@
|
||||
# Pagination
|
||||
|
||||
Given a page size and a pagination strategy, the `DefaultPaginator` will point to pages of results for as long as its strategy returns a `next_page_token`.
|
||||
|
||||
Iterating over pages of results is different from iterating over stream slices.
|
||||
Stream slices have semantic value, for instance, a Datetime stream slice defines data for a specific date range. Two stream slices will have data for different date ranges.
|
||||
Conversely, pages don't have semantic value. More pages simply means that more records are to be read, without specifying any meaningful difference between the records of the first and later pages.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
Paginator:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/DefaultPaginator"
|
||||
- "$ref": "#/definitions/NoPagination"
|
||||
NoPagination:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
## Default paginator
|
||||
|
||||
The default paginator is defined by
|
||||
|
||||
- `page_size_option`: How to specify the page size in the outgoing HTTP request
|
||||
- `pagination_strategy`: How to compute the next page to fetch
|
||||
- `page_token_option`: How to specify the next page to fetch in the outgoing HTTP request
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
DefaultPaginator:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- page_token_option
|
||||
- pagination_strategy
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
page_size:
|
||||
type: integer
|
||||
page_size_option:
|
||||
"$ref": "#/definitions/RequestOption"
|
||||
page_token_option:
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/RequestOption"
|
||||
- "$ref": "#/definitions/RequestPath"
|
||||
pagination_strategy:
|
||||
"$ref": "#/definitions/PaginationStrategy"
|
||||
```
|
||||
|
||||
3 pagination strategies are supported
|
||||
|
||||
1. Page increment
|
||||
2. Offset increment
|
||||
3. Cursor-based
|
||||
|
||||
## Pagination Strategies
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
PaginationStrategy:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/CursorPagination"
|
||||
- "$ref": "#/definitions/OffsetIncrement"
|
||||
- "$ref": "#/definitions/PageIncrement"
|
||||
```
|
||||
|
||||
### Page increment
|
||||
|
||||
When using the `PageIncrement` strategy, the page number will be set as part of the `page_token_option`.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
PageIncrement:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- page_size
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
page_size:
|
||||
type: integer
|
||||
```
|
||||
|
||||
The following paginator example will fetch 5 records per page, and specify the page number as a request_parameter:
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
paginator:
|
||||
type: "DefaultPaginator"
|
||||
page_size_option:
|
||||
type: "RequestOption"
|
||||
inject_into: "request_parameter"
|
||||
field_name: "page_size"
|
||||
pagination_strategy:
|
||||
type: "PageIncrement"
|
||||
page_size: 5
|
||||
page_token_option:
|
||||
type: "RequestOption"
|
||||
inject_into: "request_parameter"
|
||||
field_name: "page"
|
||||
```
|
||||
|
||||
If the page contains less than 5 records, then the paginator knows there are no more pages to fetch.
|
||||
If the API returns more records than requested, all records will be processed.
|
||||
|
||||
Assuming the endpoint to fetch data from is `https://cloud.airbyte.com/api/get_data`,
|
||||
the first request will be sent as `https://cloud.airbyte.com/api/get_data?page_size=5&page=0`
|
||||
and the second request as `https://cloud.airbyte.com/api/get_data?page_size=5&page=1`,
|
||||
|
||||
### Offset increment
|
||||
|
||||
When using the `OffsetIncrement` strategy, the number of records read will be set as part of the `page_token_option`.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
OffsetIncrement:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- page_size
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
page_size:
|
||||
type: integer
|
||||
```
|
||||
|
||||
The following paginator example will fetch 5 records per page, and specify the offset as a request_parameter:
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
paginator:
|
||||
type: "DefaultPaginator"
|
||||
page_size_option:
|
||||
type: "RequestOption"
|
||||
inject_into: "request_parameter"
|
||||
field_name: "page_size"
|
||||
pagination_strategy:
|
||||
type: "OffsetIncrement"
|
||||
page_size: 5
|
||||
page_token_option:
|
||||
type: "RequestOption"
|
||||
field_name: "offset"
|
||||
inject_into: "request_parameter"
|
||||
```
|
||||
|
||||
Assuming the endpoint to fetch data from is `https://cloud.airbyte.com/api/get_data`,
|
||||
the first request will be sent as `https://cloud.airbyte.com/api/get_data?page_size=5&offset=0`
|
||||
and the second request as `https://cloud.airbyte.com/api/get_data?page_size=5&offset=5`.
|
||||
|
||||
### Cursor
|
||||
|
||||
The `CursorPagination` outputs a token by evaluating its `cursor_value` string with the following parameters:
|
||||
|
||||
- `response`: The decoded response
|
||||
- `headers`: HTTP headers on the response
|
||||
- `last_records`: List of records selected from the last response
|
||||
|
||||
This cursor value can be used to request the next page of records.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
CursorPagination:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- cursor_value
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
cursor_value:
|
||||
type: string
|
||||
stop_condition:
|
||||
type: string
|
||||
page_size:
|
||||
type: integer
|
||||
```
|
||||
|
||||
#### Cursor paginator in request parameters
|
||||
|
||||
In this example, the next page of records is defined by setting the `from` request parameter to the id of the last record read:
|
||||
|
||||
```yaml
|
||||
paginator:
|
||||
type: "DefaultPaginator"
|
||||
<...>
|
||||
pagination_strategy:
|
||||
type: "CursorPagination"
|
||||
cursor_value: "{{ last_records[-1]['id'] }}"
|
||||
page_token_option:
|
||||
type: "RequestPath"
|
||||
field_name: "from"
|
||||
inject_into: "request_parameter"
|
||||
```
|
||||
|
||||
Assuming the endpoint to fetch data from is `https://cloud.airbyte.com/api/get_data`,
|
||||
the first request will be sent as `https://cloud.airbyte.com/api/get_data`.
|
||||
Assuming the id of the last record fetched is 1000,
|
||||
the next request will be sent as `https://cloud.airbyte.com/api/get_data?from=1000`.
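The `CursorPagination` strategy also accepts an optional `stop_condition`. As a minimal sketch, assuming the API signals the last page by returning an empty page of records (and that the same interpolation context is available as for `cursor_value`), the strategy above could be extended as follows:

```yaml
pagination_strategy:
  type: "CursorPagination"
  cursor_value: "{{ last_records[-1]['id'] }}"
  # Stop paginating once the last response contained no records
  stop_condition: "{{ not last_records }}"
```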
|
||||
|
||||
#### Cursor paginator in path
|
||||
|
||||
Some APIs directly point to the URL of the next page to fetch. In this example, the URL of the next page is extracted from the response headers:
|
||||
|
||||
```yaml
|
||||
paginator:
|
||||
type: "DefaultPaginator"
|
||||
<...>
|
||||
pagination_strategy:
|
||||
type: "CursorPagination"
|
||||
cursor_value: "{{ headers['link']['next']['url'] }}"
|
||||
page_token_option:
|
||||
type: "RequestPath"
|
||||
```
|
||||
|
||||
Assuming the endpoint to fetch data from is `https://cloud.airbyte.com/api/get_data`,
the first request will be sent as `https://cloud.airbyte.com/api/get_data`.
Assuming the response's next URL is `https://cloud.airbyte.com/api/get_data?page=1&page_size=100`,
the next request will be sent as `https://cloud.airbyte.com/api/get_data?page=1&page_size=100`.
|
||||
@@ -0,0 +1,161 @@
|
||||
# Retrieving Records Spread Across Partitions
|
||||
|
||||
In some cases, the data you are replicating is spread across multiple partitions. You can specify a set of parameters to be iterated over and used while requesting all of your data. On each iteration, using the current element being iterated upon, the connector will perform a cycle of requesting data from your source.
|
||||
|
||||
`PartitionRouter`s give you the ability to specify either a static or dynamic set of elements that will be iterated over one at a time. This in turn is used to route requests to a partition of your data according to the elements iterated over.
|
||||
|
||||
The most common use case for the `PartitionRouter` component is the retrieval of data from an API endpoint that requires extra request inputs to indicate which partition of data to fetch.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
partition_router:
|
||||
default: []
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/CustomPartitionRouter"
|
||||
- "$ref": "#/definitions/ListPartitionRouter"
|
||||
- "$ref": "#/definitions/SubstreamPartitionRouter"
|
||||
- type: array
|
||||
items:
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/CustomPartitionRouter"
|
||||
- "$ref": "#/definitions/ListPartitionRouter"
|
||||
- "$ref": "#/definitions/SubstreamPartitionRouter"
|
||||
```
|
||||
|
||||
Notice that you can specify one or more `PartitionRouter`s on a Retriever. When multiple are defined, the result will be the Cartesian product of all partitions, and a request cycle will be performed for each combination.
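For instance, two list-based routers can be combined by passing them as a list. The sketch below (using the `ListPartitionRouter` fields described in the next section, with illustrative values) would produce 2 x 2 = 4 partitions and therefore 4 request cycles:

```yaml
partition_router:
  - type: ListPartitionRouter
    values: ["airbyte", "airbyte-secret"]
    cursor_field: "repository"
  - type: ListPartitionRouter
    values: ["open", "closed"]
    cursor_field: "state"
```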
|
||||
|
||||
## ListPartitionRouter
|
||||
|
||||
`ListPartitionRouter` iterates over values from a given list. It is defined by
|
||||
|
||||
- The partition values, which are the valid values for the cursor field
- The cursor field on a record
- An optional `request_option` specifying how to inject the partition value into outgoing HTTP requests
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
ListPartitionRouter:
|
||||
description: Partition router that is used to retrieve records that have been partitioned according to a list of values
|
||||
type: object
|
||||
required:
|
||||
- type
|
||||
- cursor_field
|
||||
    - values
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [ListPartitionRouter]
|
||||
cursor_field:
|
||||
type: string
|
||||
    values:
|
||||
anyOf:
|
||||
- type: string
|
||||
- type: array
|
||||
items:
|
||||
type: string
|
||||
request_option:
|
||||
"$ref": "#/definitions/RequestOption"
|
||||
$parameters:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
As an example, this partition router will iterate over the two repositories ("airbyte" and "airbyte-secret") and set a request parameter on outgoing HTTP requests.
|
||||
|
||||
```yaml
|
||||
partition_router:
|
||||
type: ListPartitionRouter
|
||||
values:
|
||||
- "airbyte"
|
||||
- "airbyte-secret"
|
||||
cursor_field: "repository"
|
||||
request_option:
|
||||
type: RequestOption
|
||||
field_name: "repository"
|
||||
inject_into: "request_parameter"
|
||||
```
|
||||
|
||||
## SubstreamPartitionRouter
|
||||
|
||||
Substreams are streams that depend on the records of another stream. We might, for instance, want to read all the commits for a given repository (parent stream).

Substreams are implemented by defining their partition router as a `SubstreamPartitionRouter`.

A `SubstreamPartitionRouter` routes requests to fetch data that has been partitioned according to a parent stream's records. It is defined by:
|
||||
|
||||
- The parent stream
- The key of the records in the parent stream (`parent_key`)
- The attribute on the parent record used to partition the substream data (`partition_field`)
- How to specify that attribute on an outgoing HTTP request to retrieve the partitioned records
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
SubstreamPartitionRouter:
|
||||
description: Partition router that is used to retrieve records that have been partitioned according to records from the specified parent streams
|
||||
type: object
|
||||
required:
|
||||
- type
|
||||
- parent_stream_configs
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [SubstreamPartitionRouter]
|
||||
parent_stream_configs:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/ParentStreamConfig"
|
||||
$parameters:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
partition_router:
|
||||
type: SubstreamPartitionRouter
|
||||
parent_stream_configs:
|
||||
- stream: "#/repositories_stream"
|
||||
parent_key: "id"
|
||||
partition_field: "repository"
|
||||
request_option:
|
||||
type: RequestOption
|
||||
field_name: "repository"
|
||||
inject_into: "request_parameter"
|
||||
```
|
||||
|
||||
REST APIs often nest sub-resources in the URL path.
|
||||
If the URL to fetch commits was "/repositories/:id/commits", then the `Requester`'s path would need to refer to the stream slice's value and no `request_option` would be set:
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
retriever:
|
||||
<...>
|
||||
requester:
|
||||
<...>
|
||||
path: "/respositories/{{ stream_slice.repository }}/commits"
|
||||
partition_router:
|
||||
type: SubstreamPartitionRouter
|
||||
    parent_stream_configs:
|
||||
- stream: "#/repositories_stream"
|
||||
parent_key: "id"
|
||||
partition_field: "repository"
|
||||
incremental_dependency: true
|
||||
```
|
||||
|
||||
## Nested streams
|
||||
|
||||
Nested streams, subresources, or streams that depend on other streams can be implemented using a [`SubstreamPartitionRouter`](#SubstreamPartitionRouter).
|
||||
|
||||
## More readings
|
||||
|
||||
- [Incremental streams](../../cdk-python/incremental-stream.md)
|
||||
- [Stream slices](../../cdk-python/stream-slices.md)
|
||||
|
||||
[^1] This is a slight oversimplification. See [update cursor section](#cursor-update) for more details on how the cursor is updated.
|
||||
@@ -0,0 +1,292 @@
|
||||
# Rate limiting (API Budget)
|
||||
|
||||
In order to prevent sending too many requests to the API in a short period of time, you can configure an API Budget. This budget determines the maximum number of calls that can be made within a specified time interval (or intervals). This mechanism is particularly useful for respecting third-party API rate limits and avoiding potential throttling or denial of service.
|
||||
|
||||
When using an **HTTPAPIBudget**, rate limit updates can be automatically extracted from HTTP response headers such as _remaining calls_ or _time-to-reset_ values.
|
||||
|
||||
## Schema
|
||||
|
||||
```yaml
|
||||
HTTPAPIBudget:
|
||||
type: object
|
||||
title: HTTP API Budget
|
||||
description: >
|
||||
An HTTP-specific API budget that extends APIBudget by updating rate limiting information based
|
||||
on HTTP response headers. It extracts available calls and the next reset timestamp from the HTTP responses.
|
||||
required:
|
||||
- type
|
||||
- policies
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [HTTPAPIBudget]
|
||||
policies:
|
||||
type: array
|
||||
description: List of call rate policies that define how many calls are allowed.
|
||||
items:
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/FixedWindowCallRatePolicy"
|
||||
- "$ref": "#/definitions/MovingWindowCallRatePolicy"
|
||||
- "$ref": "#/definitions/UnlimitedCallRatePolicy"
|
||||
ratelimit_reset_header:
|
||||
type: string
|
||||
default: "ratelimit-reset"
|
||||
description: The HTTP response header name that indicates when the rate limit resets.
|
||||
ratelimit_remaining_header:
|
||||
type: string
|
||||
default: "ratelimit-remaining"
|
||||
description: The HTTP response header name that indicates the number of remaining allowed calls.
|
||||
status_codes_for_ratelimit_hit:
|
||||
type: array
|
||||
default: [429]
|
||||
items:
|
||||
type: integer
|
||||
description: List of HTTP status codes that indicate a rate limit has been hit.
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
An `HTTPAPIBudget` may contain one or more rate policies. These policies define how rate limits should be enforced.
|
||||
|
||||
## Example usage
|
||||
```yaml
|
||||
api_budget:
|
||||
type: "HTTPAPIBudget"
|
||||
ratelimit_reset_header: "X-RateLimit-Reset"
|
||||
ratelimit_remaining_header: "X-RateLimit-Remaining"
|
||||
status_codes_for_ratelimit_hit: [ 429 ]
|
||||
policies:
|
||||
- type: "UnlimitedCallRatePolicy"
|
||||
matchers: []
|
||||
- type: "FixedWindowCallRatePolicy"
|
||||
period: "PT1H"
|
||||
call_limit: 1000
|
||||
matchers:
|
||||
- method: "GET"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/users"
|
||||
- type: "MovingWindowCallRatePolicy"
|
||||
rates:
|
||||
- limit: 100
|
||||
interval: "PT1M"
|
||||
matchers:
|
||||
- method: "POST"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/users"
|
||||
```
|
||||
Above, we define:
|
||||
|
||||
1. **UnlimitedCallRatePolicy**: A policy with no limit on requests.
|
||||
2. **FixedWindowCallRatePolicy**: Allows a set number of calls within a fixed time window (in the example, 1000 calls per 1 hour).
|
||||
3. **MovingWindowCallRatePolicy**: Uses a moving time window to track how many calls were made in the last interval. In the example, up to 100 calls per 1 minute for `POST /users`.
|
||||
|
||||
|
||||
## Rate Policies
|
||||
### Unlimited call rate policy
|
||||
Use this policy if you want to allow unlimited calls for a subset of requests.
|
||||
For instance, the policy below will not limit requests that match its `matchers`:
|
||||
|
||||
```yaml
|
||||
UnlimitedCallRatePolicy:
|
||||
type: object
|
||||
title: Unlimited Call Rate Policy
|
||||
description: A policy that allows unlimited calls for specific requests.
|
||||
required:
|
||||
- type
|
||||
- matchers
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [UnlimitedCallRatePolicy]
|
||||
matchers:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/HttpRequestRegexMatcher"
|
||||
```
|
||||
|
||||
#### Example
|
||||
```yaml
|
||||
api_budget:
|
||||
type: "HTTPAPIBudget"
|
||||
policies:
|
||||
- type: "UnlimitedCallRatePolicy"
|
||||
# For any GET request on https://api.example.com/sandbox
|
||||
matchers:
|
||||
- method: "GET"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/sandbox"
|
||||
```
|
||||
Here, any request matching the above matcher is not rate-limited.
|
||||
|
||||
### Fixed Window Call Rate Policy
|
||||
This policy allows **n** calls per specified interval (for example, 1000 calls per hour). After the time window ends (the “fixed window”), it resets, and you can make new calls.
|
||||
|
||||
```yaml
|
||||
FixedWindowCallRatePolicy:
|
||||
type: object
|
||||
title: Fixed Window Call Rate Policy
|
||||
description: A policy that allows a fixed number of calls within a specific time window.
|
||||
required:
|
||||
- type
|
||||
- period
|
||||
- call_limit
|
||||
- matchers
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [FixedWindowCallRatePolicy]
|
||||
period:
|
||||
type: string
|
||||
format: duration
|
||||
call_limit:
|
||||
type: integer
|
||||
matchers:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/HttpRequestRegexMatcher"
|
||||
additionalProperties: true
|
||||
```
|
||||
- **period**: In ISO 8601 duration format (e.g. `PT1H` for 1 hour, `PT15M` for 15 minutes).
|
||||
- **call_limit**: Maximum allowed calls within that period.
|
||||
- **matchers**: A list of request matchers (by HTTP method, URL path, etc.) that this policy applies to.
|
||||
|
||||
#### Example
|
||||
```yaml
|
||||
api_budget:
|
||||
type: "HTTPAPIBudget"
|
||||
policies:
|
||||
- type: "FixedWindowCallRatePolicy"
|
||||
period: "PT1H"
|
||||
call_limit: 1000
|
||||
matchers:
|
||||
- method: "GET"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/users"
|
||||
```
|
||||
|
||||
### Moving Window Call Rate Policy
|
||||
This policy allows a certain number of calls in a “sliding” or “moving” window, using timestamps for each call. For example, 100 requests allowed within the last 60 seconds.
|
||||
|
||||
```yaml
|
||||
MovingWindowCallRatePolicy:
|
||||
type: object
|
||||
title: Moving Window Call Rate Policy
|
||||
description: A policy that allows a fixed number of calls within a moving time window.
|
||||
required:
|
||||
- type
|
||||
- rates
|
||||
- matchers
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [MovingWindowCallRatePolicy]
|
||||
rates:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/Rate"
|
||||
matchers:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/HttpRequestRegexMatcher"
|
||||
additionalProperties: true
|
||||
```
|
||||
|
||||
- **rates**: A list of `Rate` objects, each specifying a `limit` and `interval`.
|
||||
- **interval**: An ISO 8601 duration (e.g., `"PT1M"` is 1 minute).
|
||||
- **limit**: Number of calls allowed within that interval.
|
||||
|
||||
#### Example
|
||||
```yaml
|
||||
api_budget:
|
||||
type: "HTTPAPIBudget"
|
||||
policies:
|
||||
- type: "MovingWindowCallRatePolicy"
|
||||
rates:
|
||||
- limit: 100
|
||||
interval: "PT1M"
|
||||
matchers:
|
||||
- method: "GET"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/orders"
|
||||
```
|
||||
In this example, at most 100 requests to `GET /orders` can be made in any rolling 1-minute period.
|
||||
|
||||
## Matching requests with matchers
|
||||
Each policy has a `matchers` array of objects defining which requests it applies to. The schema for each matcher:
|
||||
|
||||
```yaml
|
||||
HttpRequestRegexMatcher:
|
||||
type: object
|
||||
properties:
|
||||
method:
|
||||
type: string
|
||||
description: The HTTP method (e.g. GET, POST).
|
||||
url_base:
|
||||
type: string
|
||||
description: The base URL to match (e.g. "https://api.example.com" without trailing slash).
|
||||
url_path_pattern:
|
||||
type: string
|
||||
description: A regular expression to match the path portion.
|
||||
params:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
headers:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
additionalProperties: true
|
||||
```
|
||||
- **method**: Matches if the request method equals the one in the matcher (case-insensitive).
|
||||
- **url_base**: Matches the scheme + host portion (no trailing slash).
|
||||
- **url_path_pattern**: Regex is tested against the request path.
|
||||
- **params**: The query parameters must match.
|
||||
- **headers**: The headers must match.
|
||||
A request is rate-limited by the first policy whose matchers it satisfies. If no policy matches, the request is not rate-limited unless you define a catch-all policy that covers everything else.
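For instance, a matcher can also constrain query parameters and headers. The following sketch (with purely illustrative values) only applies to `GET /search` requests that carry a `tier=premium` query parameter and an `X-Api-Version: 2` header:

```yaml
matchers:
  - method: "GET"
    url_base: "https://api.example.com"
    url_path_pattern: "^/search"
    params:
      tier: "premium"
    headers:
      X-Api-Version: "2"
```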
|
||||
|
||||
## Putting it all together
|
||||
You may define multiple policies for different endpoints. For example:
|
||||
|
||||
```yaml
|
||||
api_budget:
|
||||
type: "HTTPAPIBudget"
|
||||
# Use standard rate limit headers from your API
|
||||
ratelimit_reset_header: "X-RateLimit-Reset"
|
||||
ratelimit_remaining_header: "X-RateLimit-Remaining"
|
||||
status_codes_for_ratelimit_hit: [429, 420]
|
||||
|
||||
policies:
|
||||
# Policy 1: Unlimited
|
||||
- type: "UnlimitedCallRatePolicy"
|
||||
matchers:
|
||||
- url_base: "https://api.example.com"
|
||||
method: "GET"
|
||||
url_path_pattern: "^/sandbox"
|
||||
|
||||
# Policy 2: 1000 calls per hour
|
||||
- type: "FixedWindowCallRatePolicy"
|
||||
period: "PT1H"
|
||||
call_limit: 1000
|
||||
matchers:
|
||||
- method: "GET"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/users"
|
||||
|
||||
# Policy 3: 500 calls per hour
|
||||
- type: "FixedWindowCallRatePolicy"
|
||||
period: "PT1H"
|
||||
call_limit: 500
|
||||
matchers:
|
||||
- method: "POST"
|
||||
url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/orders"
|
||||
|
||||
# Policy 4: 20 calls every 5 minutes (moving window).
|
||||
- type: "MovingWindowCallRatePolicy"
|
||||
rates:
|
||||
- limit: 20
|
||||
interval: "PT5M"
|
||||
matchers:
|
||||
- url_base: "https://api.example.com"
|
||||
url_path_pattern: "^/internal"
|
||||
```
|
||||
1. The request attempts to match the first policy (unlimited on `GET /sandbox`). If it matches, it’s unlimited.
|
||||
2. Otherwise, it checks the second policy (1000/hour for `GET /users`), etc.
|
||||
3. If still no match, it is not rate-limited by these defined policies (unless you add a “catch-all” policy).
|
||||
@@ -0,0 +1,392 @@
|
||||
# Record selector
|
||||
|
||||
The record selector is responsible for translating an HTTP response into a list of Airbyte records by extracting records from the response and optionally filtering and shaping records based on a heuristic.
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
HttpSelector:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/RecordSelector"
|
||||
RecordSelector:
|
||||
type: object
|
||||
required:
|
||||
- extractor
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
extractor:
|
||||
"$ref": "#/definitions/RecordExtractor"
|
||||
record_filter:
|
||||
"$ref": "#/definitions/RecordFilter"
|
||||
```
|
||||
|
||||
The current record extraction implementation uses [dpath](https://pypi.org/project/dpath/) to select records from the json-decoded HTTP response.
|
||||
For nested structures `*` can be used to iterate over array elements.
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
DpathExtractor:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- field_path
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
field_path:
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
```
|
||||
|
||||
## Common recipes
|
||||
|
||||
Here are some common patterns:
|
||||
|
||||
### Selecting the whole response
|
||||
|
||||
If the root of the response is an array containing the records, the records can be extracted using the following definition:
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
extractor:
|
||||
field_path: []
|
||||
```
|
||||
|
||||
If the root of the response is a json object representing a single record, the record can be extracted and wrapped in an array.
|
||||
For example, given a response body of the form
|
||||
|
||||
```json
|
||||
{
|
||||
"id": 1
|
||||
}
|
||||
```
|
||||
|
||||
and a selector
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
extractor:
|
||||
field_path: []
|
||||
```
|
||||
|
||||
The selected records will be
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": 1
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Selecting a field
|
||||
|
||||
Given a response body of the form
|
||||
|
||||
```json
|
||||
{
|
||||
"data": [{"id": 0}, {"id": 1}],
|
||||
"metadata": {"api-version": "1.0.0"}
|
||||
}
|
||||
```
|
||||
|
||||
and a selector
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
extractor:
|
||||
field_path: ["data"]
|
||||
```
|
||||
|
||||
The selected records will be
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": 0
|
||||
},
|
||||
{
|
||||
"id": 1
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Selecting an inner field
|
||||
|
||||
Given a response body of the form
|
||||
|
||||
```json
|
||||
{
|
||||
"data": {
|
||||
"records": [
|
||||
{
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
and a selector
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
extractor:
|
||||
field_path: ["data", "records"]
|
||||
```
|
||||
|
||||
The selected records will be
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Selecting fields nested in arrays
|
||||
|
||||
Given a response body of the form
|
||||
|
||||
```json
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"record": {
|
||||
"id": "1"
|
||||
}
|
||||
},
|
||||
{
|
||||
"record": {
|
||||
"id": "2"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
and a selector
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
extractor:
|
||||
field_path: ["data", "*", "record"]
|
||||
```
|
||||
|
||||
The selected records will be
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
## Filtering records
|
||||
|
||||
Records can be filtered by adding a record_filter to the selector.
|
||||
The expression in the filter will be evaluated to a boolean returning true if the record should be included.
|
||||
|
||||
In this example, all records with a `created_at` field greater than the stream slice's `start_time` will be filtered out:
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
extractor:
|
||||
field_path: []
|
||||
record_filter:
|
||||
condition: "{{ record['created_at'] < stream_slice['start_time'] }}"
|
||||
```
|
||||
|
||||
## Transformations
|
||||
|
||||
Fields can be added or removed from records by adding `Transformation`s to a stream's definition.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
RecordTransformation:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/AddFields"
|
||||
- "$ref": "#/definitions/RemoveFields"
|
||||
```
|
||||
|
||||
### Adding fields
|
||||
|
||||
Fields can be added with the `AddFields` transformation.
|
||||
This example adds a top-level field "field1" with the value "static_value".
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
AddFields:
|
||||
type: object
|
||||
required:
|
||||
- fields
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
fields:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/AddedFieldDefinition"
|
||||
AddedFieldDefinition:
|
||||
type: object
|
||||
required:
|
||||
- path
|
||||
- value
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
path:
|
||||
"$ref": "#/definitions/FieldPointer"
|
||||
value:
|
||||
type: string
|
||||
FieldPointer:
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
stream:
|
||||
<...>
|
||||
transformations:
|
||||
- type: AddFields
|
||||
fields:
|
||||
- path: [ "field1" ]
|
||||
value: "static_value"
|
||||
```
|
||||
|
||||
This example adds a top-level field "start_date", whose value is evaluated from the stream slice:
|
||||
|
||||
```yaml
|
||||
stream:
|
||||
<...>
|
||||
transformations:
|
||||
- type: AddFields
|
||||
fields:
|
||||
- path: [ "start_date" ]
|
||||
          value: "{{ stream_slice['start_date'] }}"
|
||||
```
|
||||
|
||||
Fields can also be added to a nested object by writing the field's path as a list.
|
||||
|
||||
Given a record of the following shape:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": 0,
|
||||
"data":
|
||||
{
|
||||
"field0": "some_data"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
this definition will add a field in the "data" nested object:
|
||||
|
||||
```yaml
|
||||
stream:
|
||||
<...>
|
||||
transformations:
|
||||
- type: AddFields
|
||||
fields:
|
||||
- path: [ "data", "field1" ]
|
||||
value: "static_value"
|
||||
```
|
||||
|
||||
resulting in the following record:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": 0,
|
||||
"data":
|
||||
{
|
||||
"field0": "some_data",
|
||||
"field1": "static_value"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Removing fields
|
||||
|
||||
Fields can be removed from records with the `RemoveFields` transformation.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
RemoveFields:
|
||||
type: object
|
||||
required:
|
||||
- field_pointers
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
field_pointers:
|
||||
type: array
|
||||
items:
|
||||
"$ref": "#/definitions/FieldPointer"
|
||||
```
|
||||
|
||||
Given a record of the following shape:
|
||||
|
||||
```json
|
||||
{
|
||||
"path":
|
||||
{
|
||||
"to":
|
||||
{
|
||||
"field1": "data_to_remove",
|
||||
"field2": "data_to_keep"
|
||||
}
|
||||
},
|
||||
"path2": "data_to_remove",
|
||||
"path3": "data_to_keep"
|
||||
}
|
||||
```
|
||||
|
||||
this definition will remove the two instances of "data_to_remove", which are found at "path2" and "path.to.field1":
|
||||
|
||||
```yaml
|
||||
the_stream:
|
||||
<...>
|
||||
transformations:
|
||||
- type: RemoveFields
|
||||
field_pointers:
|
||||
- [ "path", "to", "field1" ]
|
||||
- [ "path2" ]
|
||||
```
|
||||
|
||||
resulting in the following record:
|
||||
|
||||
```json
|
||||
{
|
||||
"path":
|
||||
{
|
||||
"to":
|
||||
{
|
||||
"field2": "data_to_keep"
|
||||
}
|
||||
},
|
||||
"path3": "data_to_keep"
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,58 @@
|
||||
import ManifestYamlDefinitions from '@site/src/components/ManifestYamlDefinitions';
|
||||
import schema from "@site/src/data/declarative_component_schema.yaml";
|
||||
|
||||
# YAML Reference
|
||||
|
||||
This page lists all components, interpolation variables and interpolation macros that can be used when defining a low code YAML file.
|
||||
|
||||
For the technical JSON schema definition that low code manifests are validated against, see [here](https://github.com/airbytehq/airbyte-python-cdk/blob/main/airbyte_cdk/sources/declarative/declarative_component_schema.yaml).
|
||||
|
||||
<ManifestYamlDefinitions />
|
||||
|
||||
export const toc = [
|
||||
{
|
||||
"value": "Components:",
|
||||
"id": "components",
|
||||
"level": 2
|
||||
},
|
||||
{
|
||||
value: "DeclarativeSource",
|
||||
id: "/definitions/DeclarativeSource",
|
||||
level: 3
|
||||
},
|
||||
...Object.keys(schema.definitions).map((id) => ({
|
||||
value: id,
|
||||
id: `/definitions/${id}`,
|
||||
level: 3
|
||||
})),
|
||||
{
|
||||
"value": "Interpolation variables:",
|
||||
"id": "variables",
|
||||
"level": 2
|
||||
},
|
||||
...schema.interpolation.variables.map((def) => ({
|
||||
value: def.title,
|
||||
id: `/variables/${def.title}`,
|
||||
level: 3
|
||||
})),
|
||||
{
|
||||
"value": "Interpolation macros:",
|
||||
"id": "macros",
|
||||
"level": 2
|
||||
},
|
||||
...schema.interpolation.macros.map((def) => ({
|
||||
value: def.title,
|
||||
id: `/macros/${def.title}`,
|
||||
level: 3
|
||||
})),
|
||||
{
|
||||
"value": "Interpolation filters:",
|
||||
"id": "filters",
|
||||
"level": 2
|
||||
},
|
||||
...schema.interpolation.filters.map((def) => ({
|
||||
value: def.title,
|
||||
id: `/filters/${def.title}`,
|
||||
level: 3
|
||||
}))
|
||||
];
|
||||
@@ -0,0 +1,215 @@
|
||||
# Request Options
|
||||
|
||||
The primary way to set request parameters and headers is to define them as key-value pairs using a `RequestOptionsProvider`.
|
||||
Other components, such as an `Authenticator`, can also set additional request parameters or headers as needed.
|
||||
|
||||
Additionally, some stateful components use a `RequestOption` to configure how their values are injected into outgoing requests. Examples of such components are [Paginators](./pagination.md) and [DatetimeBasedCursors](./incremental-syncs.md#DatetimeBasedCursor).
|
||||
|
||||
## Request Options Provider
|
||||
|
||||
The primary way to set request options is through the `Requester`'s `RequestOptionsProvider`.
|
||||
The options can be configured as key value pairs:
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
RequestOptionsProvider:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/InterpolatedRequestOptionsProvider"
|
||||
InterpolatedRequestOptionsProvider:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
request_parameters:
|
||||
"$ref": "#/definitions/RequestInput"
|
||||
request_headers:
|
||||
"$ref": "#/definitions/RequestInput"
|
||||
request_body_data:
|
||||
"$ref": "#/definitions/RequestInput"
|
||||
request_body_json:
|
||||
"$ref": "#/definitions/RequestInput"
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
type: HttpRequester
|
||||
url_base: "https://api.exchangeratesapi.io/v1/"
|
||||
http_method: "GET"
|
||||
request_options_provider:
|
||||
request_parameters:
|
||||
k1: v1
|
||||
k2: v2
|
||||
request_headers:
|
||||
header_key1: header_value1
|
||||
header_key2: header_value2
|
||||
```
|
||||
|
||||
It is also possible to add a JSON-encoded body to outgoing requests.
|
||||
|
||||
```yaml
|
||||
requester:
|
||||
type: HttpRequester
|
||||
url_base: "https://api.exchangeratesapi.io/v1/"
|
||||
http_method: "GET"
|
||||
request_options_provider:
|
||||
request_body_json:
|
||||
key: value
|
||||
```
|
||||
|
||||
### Request Option Component
|
||||
|
||||
Some components can be configured to inject additional request options to the requests sent to the API endpoint.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
RequestOption:
|
||||
description: A component that specifies the key field and where in the request a component's value should be inserted into.
|
||||
type: object
|
||||
required:
|
||||
- type
|
||||
- inject_into
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [RequestOption]
|
||||
inject_into:
|
||||
enum:
|
||||
- request_parameter
|
||||
- header
|
||||
- body_data
|
||||
- body_json
|
||||
oneOf:
|
||||
- properties:
|
||||
field_name:
|
||||
type: string
|
||||
description: The key where the value will be injected. Used for non-nested injection
|
||||
field_path:
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
          description: For body_json injection, specifies the nested path at which to inject values. Particularly useful for GraphQL queries where values need to be injected into the variables object.
|
||||
```
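As a minimal sketch, a component's value could be injected as an HTTP header under an illustrative header name:

```yaml
request_option:
  type: RequestOption
  inject_into: header
  field_name: "X-Page-Token"
```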
|
||||
|
||||
### GraphQL request injection
|
||||
|
||||
For `body_json` injections, the `field_path` property is used to provide a list of strings representing a path to a nested key to inject. This is particularly useful when working with GraphQL APIs. GraphQL queries typically accept variables as a separate object in the request body, allowing values to be parameterized without string manipulation of the query itself. As an example, to inject a page size option into a GraphQL query, you might need to provide a `limit` key in the request's `variables` as:
|
||||
|
||||
```yaml
|
||||
page_size_option:
|
||||
request_option:
|
||||
type: RequestOption
|
||||
inject_into: body_json
|
||||
field_path:
|
||||
- variables
|
||||
- limit
|
||||
```
|
||||
|
||||
This would inject the following value in the request body:
|
||||
|
||||
```json
|
||||
{ "variables": { "limit": value }}
|
||||
```
|
||||
|
||||
Here's an example of what your final request might look like:
|
||||
|
||||
```json
|
||||
{
|
||||
"query": "query($limit: Int) { users(limit: $limit) { id name } }",
|
||||
"variables": {
|
||||
"limit": 10
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
:::note
|
||||
Nested key injection is ONLY available for `body_json` injection. All other injection types use the top-level `field_name` instead.
|
||||
The `field_name` field is slated to be deprecated in favor of `field_path` in the future.
|
||||
:::
|
||||
|
||||
### Request Path
|
||||
|
||||
As an alternative to adding various options to the request being sent, some components can be configured to
|
||||
modify the HTTP path of the API endpoint being accessed.
|
||||
|
||||
Schema:
|
||||
|
||||
```yaml
|
||||
RequestPath:
|
||||
description: A component that specifies where in the request path a component's value should be inserted into.
|
||||
type: object
|
||||
required:
|
||||
- type
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [RequestPath]
|
||||
```
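Because the component's value replaces the request path itself, a `RequestPath` does not take a `field_name`. For example, a cursor-based paginator can point the next request at a URL extracted from the response (see the [pagination section](./pagination.md) for the full walkthrough):

```yaml
paginator:
  type: "DefaultPaginator"
  pagination_strategy:
    type: "CursorPagination"
    cursor_value: "{{ headers['link']['next']['url'] }}"
  page_token_option:
    type: "RequestPath"
```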
|
||||
|
||||
## Authenticators
|
||||
|
||||
It is also possible for authenticators to set request parameters or headers as needed.
|
||||
For instance, the `BearerAuthenticator` will always set the `Authorization` header.
|
||||
|
||||
More details on the various authenticators can be found in the [authentication section](authentication.md).
|
||||
|
||||
## Paginators
|
||||
|
||||
The `DefaultPaginator` can optionally set request options through the `page_size_option` and the `page_token_option`.
|
||||
The respective values can be set on the outgoing HTTP requests by specifying where it should be injected.
|
||||
|
||||
The following example will set the "page" request parameter value to the page to fetch, and the "page_size" request parameter to 5:
|
||||
|
||||
```yaml
|
||||
paginator:
|
||||
type: "DefaultPaginator"
|
||||
page_size_option:
|
||||
type: "RequestOption"
|
||||
inject_into: request_parameter
|
||||
field_name: page_size
|
||||
pagination_strategy:
|
||||
type: "PageIncrement"
|
||||
page_size: 5
|
||||
  page_token_option:
|
||||
type: "RequestOption"
|
||||
inject_into: "request_parameter"
|
||||
field_name: "page"
|
||||
```
|
||||
|
||||
More details on paginators can be found in the [pagination section](./pagination.md).
|
||||
|
||||
## Incremental syncs
|
||||
|
||||
The `DatetimeBasedCursor` can optionally set request options through the `start_time_option` and `end_time_option` fields.
|
||||
The respective values can be set on the outgoing HTTP requests by specifying where it should be injected.
|
||||
|
||||
The following example will set the "created[gte]" request parameter value to the start of the time window, and "created[lte]" to the end of the time window.
|
||||
|
||||
```yaml
|
||||
incremental_sync:
|
||||
type: DatetimeBasedCursor
|
||||
  start_datetime: "2021-02-01T00:00:00.000000+0000"
|
||||
  end_datetime: "2021-03-01T00:00:00.000000+0000"
|
||||
step: "P1D"
|
||||
start_time_option:
|
||||
type: "RequestOption"
|
||||
field_name: "created[gte]"
|
||||
inject_into: "request_parameter"
|
||||
end_time_option:
|
||||
type: "RequestOption"
|
||||
field_name: "created[lte]"
|
||||
inject_into: "request_parameter"
|
||||
```
|
||||
|
||||
More details on incremental syncs can be found in the [incremental syncs section](./incremental-syncs.md).
|
||||
|
||||
## More readings
|
||||
|
||||
- [Requester](./requester.md)
|
||||
- [Pagination](./pagination.md)
|
||||
- [Incremental Syncs](./incremental-syncs.md)
|
||||
@@ -0,0 +1,60 @@
|
||||
# Requester
|
||||
|
||||
The `Requester` defines how to prepare HTTP requests to send to the source API.
|
||||
There is currently only one implementation, the `HttpRequester`, which is defined by
|
||||
|
||||
1. A base url: The root of the API source
|
||||
2. A path: The specific endpoint to fetch data from for a resource
|
||||
3. The HTTP method: the HTTP method to use (GET or POST)
|
||||
4. [A request options provider](./request-options.md#request-options-provider): Defines the request parameters (query parameters), headers, and request body to set on outgoing HTTP requests
|
||||
5. [An authenticator](./authentication.md): Defines how to authenticate to the source
|
||||
6. [An error handler](./error-handling.md): Defines how to handle errors
|
||||
|
||||
The schema of a requester object is:
|
||||
|
||||
```yaml
|
||||
Requester:
|
||||
type: object
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/HttpRequester"
|
||||
HttpRequester:
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- url_base
|
||||
- path
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
url_base:
|
||||
type: string
|
||||
description: "base url"
|
||||
path:
|
||||
type: string
|
||||
description: "path"
|
||||
http_method:
|
||||
"$ref": "#/definitions/HttpMethod"
|
||||
default: "GET"
|
||||
request_options_provider:
|
||||
"$ref": "#/definitions/RequestOptionsProvider"
|
||||
authenticator:
|
||||
"$ref": "#/definitions/Authenticator"
|
||||
error_handler:
|
||||
"$ref": "#/definitions/ErrorHandler"
|
||||
HttpMethod:
|
||||
type: string
|
||||
enum:
|
||||
- GET
|
||||
- POST
|
||||
```
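As a minimal sketch against a hypothetical API, a requester could be configured as:

```yaml
requester:
  type: HttpRequester
  url_base: "https://api.example.com/v1/"  # hypothetical base URL
  path: "/customers"                       # hypothetical resource path
  http_method: "GET"
```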
|
||||
|
||||
## Configuring request parameters and headers
|
||||
|
||||
The primary way to set request parameters and headers is to define them as key-value pairs using a `RequestOptionsProvider`.
|
||||
Other components, such as an `Authenticator`, can also set additional request parameters or headers as needed.
|
||||
|
||||
Additionally, some stateful components use a `RequestOption` to configure how their values are injected into outgoing requests. Examples of such components are [Paginators](./pagination.md) and [Partition routers](./partition-router.md).
|
||||
|
||||
## More readings
|
||||
|
||||
- [Request options](./request-options.md)
|
||||
@@ -0,0 +1,138 @@
|
||||
# Understanding the YAML file
|
||||
|
||||
The low-code framework involves editing a boilerplate [YAML file](../low-code-cdk-overview.md#configuring-the-yaml-file). This section takes a deep dive into the components of the YAML file.
|
||||
|
||||
## Stream
|
||||
|
||||
Streams define the schema of the data to sync, as well as how to read it from the underlying API source.
|
||||
A stream generally corresponds to a resource within the API. They are analogous to tables for a relational database source.
|
||||
|
||||
By default, the schema of a stream's data is defined as a [JSONSchema](https://json-schema.org/) file in `<source_connector_name>/schemas/<stream_name>.json`.
|
||||
|
||||
Alternatively, the stream's data schema can be declared inline in the YAML file by including the optional `schema_loader` key. If the data schema is provided inline, any schema on disk for that stream will be ignored.
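As a minimal sketch, an inline schema declaring a single illustrative `id` field could look like this (assuming the `InlineSchemaLoader` variant referenced in the stream schema below, whose `schema` key holds the JSON schema):

```yaml
schema_loader:
  type: InlineSchemaLoader
  schema:
    $schema: "http://json-schema.org/draft-07/schema#"
    type: object
    properties:
      id:
        type: string
```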
|
||||
|
||||
More information on how to define a stream's schema can be found [here](https://github.com/airbytehq/airbyte-python-cdk/blob/main/airbyte_cdk/sources/declarative/declarative_component_schema.yaml).
|
||||
|
||||
The stream object is represented in the YAML file as:
|
||||
|
||||
```yaml
|
||||
DeclarativeStream:
|
||||
description: A stream whose behavior is described by a set of declarative low code components
|
||||
type: object
|
||||
additionalProperties: true
|
||||
required:
|
||||
- type
|
||||
- retriever
|
||||
properties:
|
||||
type:
|
||||
type: string
|
||||
enum: [DeclarativeStream]
|
||||
retriever:
|
||||
"$ref": "#/definitions/Retriever"
|
||||
schema_loader:
|
||||
      description: The schema loader used to retrieve the schema for the current stream
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/InlineSchemaLoader"
|
||||
- "$ref": "#/definitions/JsonFileSchemaLoader"
|
||||
stream_cursor_field:
|
||||
      description: The field of the records being read that will be used during checkpointing
|
||||
anyOf:
|
||||
- type: string
|
||||
- type: array
|
||||
items:
|
||||
- type: string
|
||||
transformations:
|
||||
      description: A list of transformations to be applied to each output record
|
||||
type: array
|
||||
items:
|
||||
anyOf:
|
||||
- "$ref": "#/definitions/AddFields"
|
||||
- "$ref": "#/definitions/CustomTransformation"
|
||||
- "$ref": "#/definitions/RemoveFields"
|
||||
$parameters:
|
||||
type: object
|
||||
      additionalProperties: true
|
||||
```
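Assembling the required pieces, a minimal stream sketch for a hypothetical endpoint (a real manifest would typically also declare a schema and other fields, and this sketch omits optional `type` keys the way other examples in these pages do) could look like:

```yaml
customers_stream:
  type: DeclarativeStream
  retriever:
    requester:
      type: HttpRequester
      url_base: "https://api.example.com/v1/"
      path: "/customers"
      http_method: "GET"
    record_selector:
      extractor:
        field_path: ["data"]
```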
|
||||
|
||||
More details on streams and sources can be found in the [basic concepts section](../../cdk-python/basic-concepts.md).
|
||||
|
||||
### Configuring a stream for incremental syncs
|
||||
|
||||
If you want to allow your stream to be configured so that only data that has changed since the prior sync is replicated to a destination, you can specify a `DatetimeBasedCursor` on your `Stream`'s `incremental_sync` field.
|
||||
|
||||
Given a start time, an end time, and a step function, it will partition the interval [start, end] into small windows of the size described by the step.
|
||||
|
||||
More information on `incremental_sync` configurations and the `DatetimeBasedCursor` component can be found in the [incremental syncs](./incremental-syncs.md) section.
|
||||
|
||||
## Data retriever
|
||||
|
||||
The data retriever defines how to read the data for a Stream and acts as an orchestrator for the data retrieval flow.
|
||||
|
||||
It is described by:
|
||||
|
||||
1. [Requester](./requester.md): Describes how to submit requests to the API source
|
||||
2. [Paginator](./pagination.md): Describes how to navigate through the API's pages
|
||||
3. [Record selector](./record-selector.md): Describes how to extract records from a HTTP response
|
||||
4. [Partition router](./partition-router.md): Describes how to retrieve data across multiple resource locations
|
||||
|
||||
Each of those components (and their subcomponents) are defined by an explicit interface and one or many implementations.
|
||||
The developer can choose and configure the implementation they need depending on specifications of the integration they are building against.
|
||||
|
||||
Since the `Retriever` is defined as part of the Stream configuration, different Streams for a given Source can use different `Retriever` definitions if needed.
|
||||
|
||||
The schema of a retriever object is:
|
||||
|
||||
```yaml
|
||||
retriever:
|
||||
description: Retrieves records by synchronously sending requests to fetch records. The retriever acts as an orchestrator between the requester, the record selector, the paginator, and the partition router.
|
||||
type: object
|
||||
required:
|
||||
- requester
|
||||
- record_selector
|
||||
|
||||
properties:
|
||||
"$parameters":
|
||||
"$ref": "#/definitions/$parameters"
|
||||
requester:
|
||||
"$ref": "#/definitions/Requester"
|
||||
record_selector:
|
||||
"$ref": "#/definitions/HttpSelector"
|
||||
paginator:
|
||||
"$ref": "#/definitions/Paginator"
|
||||
stream_slicer:
|
||||
"$ref": "#/definitions/StreamSlicer"
|
||||
PrimaryKey:
|
||||
type: string
|
||||
```
|
||||
|
||||
### Routing to Data that is Partitioned in Multiple Locations
|
||||
|
||||
Some sources might require specifying additional parameters that are needed to retrieve data. Using the `PartitionRouter` component, you can specify a static or dynamic set of elements which will be iterated upon and made available for use when a connector dispatches requests to get data from a source.
|
||||
|
||||
More information on how to configure the `partition_router` field on a Retriever to retrieve data from multiple locations can be found in the [iteration](./partition-router.md) section.
|
||||
|
||||
### Combining Incremental Syncs and Iterable Locations
|
||||
|
||||
A stream can be configured to support incrementally syncing data that is spread across multiple partitions by defining `incremental_sync` on the `Stream` and `partition_router` on the `Retriever`.
|
||||
|
||||
During a sync where both are configured, the Cartesian product of these parameters will be calculated and the connector will repeat requests to the source using the different combinations of parameters to get all of the data.
|
||||
|
||||
For example, suppose we have a `DatetimeBasedCursor` requesting data over a 3-day range partitioned by day, and a `ListPartitionRouter` with the locations `A`, `B`, and `C`. This would result in the following combinations being used to request data (a YAML sketch follows the table).
|
||||
|
||||
| Partition | Date Range |
|
||||
| --------- | ----------------------------------------- |
|
||||
| A | 2022-01-01T00:00:00 - 2022-01-01T23:59:59 |
|
||||
| B | 2022-01-01T00:00:00 - 2022-01-01T23:59:59 |
|
||||
| C | 2022-01-01T00:00:00 - 2022-01-01T23:59:59 |
|
||||
| A | 2022-01-02T00:00:00 - 2022-01-02T23:59:59 |
|
||||
| B | 2022-01-02T00:00:00 - 2022-01-02T23:59:59 |
|
||||
| C | 2022-01-02T00:00:00 - 2022-01-02T23:59:59 |
|
||||
| A | 2022-01-03T00:00:00 - 2022-01-03T23:59:59 |
|
||||
| B | 2022-01-03T00:00:00 - 2022-01-03T23:59:59 |
|
||||
| C | 2022-01-03T00:00:00 - 2022-01-03T23:59:59 |
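In the YAML file, this corresponds to defining both components on the same stream. A sketch with placeholder values, using the `<...>` convention from the examples above for elided configuration:

```yaml
my_stream:
  type: DeclarativeStream
  retriever:
    <...>
    partition_router:
      type: ListPartitionRouter
      values: ["A", "B", "C"]
      cursor_field: "location"
  incremental_sync:
    type: DatetimeBasedCursor
    start_datetime: "2022-01-01T00:00:00"
    end_datetime: "2022-01-03T23:59:59"
    step: "P1D"
    <...>
```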
|
||||
|
||||
## More readings
|
||||
|
||||
- [Requester](./requester.md)
|
||||
- [Incremental Syncs](./incremental-syncs.md)
|
||||
- [Partition Router](./partition-router.md)
|
||||
@@ -0,0 +1,68 @@
|
||||
---
|
||||
title: AI Assistant
|
||||
---
|
||||
# AI Assistant for the Connector Builder (Beta)
|
||||
|
||||
Welcome to the **AI Assistant**, your personal helper for creating Airbyte connectors through our Connector Builder. While still in beta, this AI tool promises to significantly speed up your development time by automating and simplifying the process of building connectors.
|
||||
|
||||
Check out our [Hands-on Tutorial](https://airbyte.com/blog/hands-on-with-the-new-ai-assistant) to get started.
|
||||
|
||||
## Key Features
|
||||
|
||||
1. **Pre-fill and Configure Connectors**: When starting a new connector, AI Assistant can automatically prefill and configure a number of fields and sections in the Airbyte Connector Builder, drastically reducing setup time. This currently includes:
|
||||
- Base URL
|
||||
- Authentication
|
||||
- Pagination
|
||||
- Primary Key
|
||||
- Record Selection
|
||||
- Available streams
|
||||
- Stream configuration
|
||||
2. **Ongoing Field and Section Suggestions**: As you continue working on a connector, AI Assistant will provide intelligent suggestions, helping you fine-tune your connector’s configuration.
|
||||
|
||||
## Why You’ll Love It
|
||||
|
||||
- **Faster Development**: The AI Assistant automates much of the setup and configuration, cutting down the time it takes to build connectors.
|
||||
- **Less Time Spent Understanding the API Documentation**: The AI Assistant can read and understand the API documentation for you, so you don't have to.
|
||||
|
||||
## What Should You Expect?
|
||||
|
||||
- **Human Oversight Required**: Since it's an AI-based tool, you should still review the output to ensure everything is set up correctly. As it’s in beta, it won’t always be perfect, but it will save you significant time by handling most of the tedious setup tasks.
|
||||
- **Optimized for Common API Types**: While the AI Assistant supports a wide range of APIs, like any AI feature, it works best with common use cases. It performs best with **REST APIs** that return JSON responses. However, you can also use it with less common APIs like **GraphQL**.
|
||||
|
||||
## How It Works
|
||||
|
||||
### Provide API Documentation
|
||||
|
||||

|
||||
|
||||
Start by pasting a link to the API documentation or an OpenAPI spec into the Assistant.
|
||||
|
||||
### Automatic Configuration
|
||||
|
||||
AI Assistant will scan the documentation, finding critical information like the base URL, authentication methods, and pagination schemes.
|
||||
|
||||
### Suggestions for Fields and Sections
|
||||
|
||||

|
||||
|
||||
As you progress, the Assistant will offer suggestions for other fields and sections within the Connector Builder, making it easier to complete your setup.
|
||||
|
||||
### Stream Configuration
|
||||
|
||||

|
||||
|
||||
The Assistant will also help you set up your streams, providing you with a list of available streams and their likely configurations.
|
||||
|
||||
### Test & Review
|
||||
|
||||
After configuration, you can run tests to ensure the setup is correct. If the Assistant misses anything (like headers or pagination), you can adjust these manually and re-test.
|
||||
|
||||
## Where Can I Use It?
|
||||
|
||||
You can use the AI Assistant in the following scenarios:
|
||||
- **When creating a new connector** from scratch in the Airbyte Connector Builder.
|
||||
- **Within your existing connectors** by clicking the "AI Assist" button at the top of the builder.
|
||||
|
||||
---
|
||||
|
||||
We’re excited to see how much time AI Assistant can save you during the beta phase. While it's not perfect yet, it already simplifies the process of building and managing connectors.
|
||||
|
After Width: | Height: | Size: 133 KiB |
|
After Width: | Height: | Size: 54 KiB |
|
After Width: | Height: | Size: 3.3 MiB |
|
After Width: | Height: | Size: 211 KiB |
|
After Width: | Height: | Size: 266 KiB |
|
After Width: | Height: | Size: 106 KiB |
|
After Width: | Height: | Size: 104 KiB |
|
After Width: | Height: | Size: 338 KiB |
|
After Width: | Height: | Size: 401 KiB |
|
After Width: | Height: | Size: 46 KiB |
|
After Width: | Height: | Size: 40 KiB |
|
After Width: | Height: | Size: 104 KiB |
|
After Width: | Height: | Size: 93 KiB |
|
After Width: | Height: | Size: 206 KiB |
|
After Width: | Height: | Size: 232 KiB |
|
After Width: | Height: | Size: 171 KiB |
@@ -0,0 +1,223 @@
|
||||
# Asynchronous Job streams
|
||||
|
||||
In the Connector Builder UI, you can create two types of streams: **Synchronous Request** and **Asynchronous Job**. Understanding the difference is important for efficiently extracting data from APIs that use asynchronous processing.
|
||||
|
||||
## Synchronous Streams
|
||||
|
||||
Synchronous streams operate in real-time, where:
|
||||
- The connector makes a request to an API endpoint
|
||||
- The API responds immediately with data
|
||||
- The connector processes and returns that data in the same operation
|
||||
|
||||
This is the simpler, more common pattern used for most APIs that can return data immediately.
|
||||
|
||||
## Asynchronous Streams
|
||||
|
||||
Asynchronous streams handle scenarios where data extraction happens over multiple steps:
|
||||
1. **Creation**: You request a job to be created (like a report generation)
|
||||
2. **Polling**: You periodically check if the job is complete
|
||||
3. **Download**: Once the job is complete, you download the results
|
||||
|
||||
This approach is necessary for APIs that handle large datasets or resource-intensive operations that cannot be completed in a single request-response cycle.
|
||||
|
||||
## When to Use Asynchronous Streams
|
||||
|
||||
Use asynchronous streams when:
|
||||
- The API requires you to trigger a job and wait for it to complete
|
||||
- You're working with large datasets that need server-side processing
|
||||
- The API documentation mentions job creation, status checking, and result download
|
||||
- Data generation takes too long to be handled in a single request
|
||||
|
||||
Common examples include analytics report generation, large data exports (like SendGrid contacts), complex data processing operations, and batch-processed operations.
|
||||
|
||||
## Configuring an Asynchronous Stream
|
||||
|
||||
To make an existing stream asynchronous, at the top-right of the stream configuration, select `Request type` > `Asynchronous Job`.
|
||||
|
||||
To create a new stream as an asynchronous stream, click the `+` add stream button, and select `Request type` > `Asynchronous Job`.
|
||||
|
||||
An asynchronous stream in the Connector Builder UI is divided into three main tabs:
|
||||
|
||||
### 1. Creation Tab
|
||||
|
||||
The Creation tab configures how to request that a job be created on the server.
|
||||
|
||||

|
||||
|
||||
#### Key Components:
|
||||
|
||||
- **URL**: The full URL to which the job creation request should be sent
|
||||
- **HTTP Method**: Typically POST for job creation, but this can vary by API
|
||||
- **HTTP Response Format**: Format of the response from the job creation request. This will also be used for the polling response.
|
||||
- **Authentication**: Authentication method for the creation request
|
||||
- **Request Options**: Headers, query parameters, and request body for the creation request
|
||||
|
||||
#### Example Configuration (SendGrid):
|
||||
|
||||
In the UI, for the [SendGrid contacts export](https://www.twilio.com/docs/sendgrid/api-reference/contacts/export-contacts), you would configure:
|
||||
|
||||
- **URL** field: `https://api.sendgrid.com/v3/marketing/contacts/exports`
|
||||
- **HTTP Method** dropdown: `POST`
|
||||
- In the **Authentication** section:
|
||||
- Select **Bearer Token** authentication type
|
||||
- Fill out the **API Key** user input with your SendGrid API key
|
||||
|
||||
### 2. Polling Tab
|
||||
|
||||
The Polling tab defines how to check the status of a running job.
|
||||
|
||||

|
||||
|
||||
#### Key Components:
|
||||
|
||||
- **URL**: The full URL to which the status-check request should be sent. Use the `{{ creation_response }}` variable to reference the response from the creation request when constructing this URL.
|
||||
- **HTTP Method**: Typically GET for status checking, but this can vary by API
|
||||
- **Status Extractor**: Extracts the job status from the response
|
||||
- **Status Mapping**: Maps API-specific status values to standard statuses
|
||||
- **Download Target Extractor**: Extracts the URL or ID for downloading results
|
||||
|
||||
#### Status Extractor and Status Mapping Explained
|
||||
|
||||
The **Status Extractor** and **Status Mapping** work together:
|
||||
|
||||
1. **Status Extractor** defines a path to extract the status value from the API response. It uses a field path to point into the JSON response to find the status value.
|
||||
|
||||
2. **Status Mapping** maps the extracted status values to standard connector statuses:
|
||||
- **Completed**: Job finished successfully and data is ready for download
|
||||
- **Failed**: Job encountered an error and cannot be completed
|
||||
- **Running**: Job is still processing, need to keep polling
|
||||
- **Timeout**: Job took too long to complete, should be aborted
|
||||
|
||||
For each of these, list all of the possible status values the API might return that indicate the job is in that state.
|
||||
|
||||
The connector first uses the Status Extractor to get the raw status value, then uses the Status Mapping to determine what action to take next.
|
||||
|
||||
#### Download Target Extractor Explained
|
||||
|
||||
The **Download Target Extractor** works similarly to the Status Extractor but extracts a download URL or identifier from the successful API response. This extracted value will be used in the Download stage to retrieve the data.
|
||||
|
||||
#### Example Configuration (SendGrid):
|
||||
|
||||
In the UI, for the [SendGrid contacts export](https://www.twilio.com/docs/sendgrid/api-reference/contacts/export-contacts), you would configure:
|
||||
|
||||
- **URL** field: `https://api.sendgrid.com/v3/marketing/contacts/exports/{{creation_response['id']}}`
|
||||
- **HTTP Method** dropdown: `GET`
|
||||
- In the **Status Extractor** section:
|
||||
- Set the **Field Path** to: `status`
|
||||
- In the **Status Mapping** section:
|
||||
- Set **Completed** to: `ready`
|
||||
- Set **Failed** to: `failed`
|
||||
- Set **Running** to: `pending`
|
||||
- Set **Timeout** to: `timeout`
|
||||
- In the **Download Target Extractor** section:
|
||||
- Set the **Field Path** to: `urls`
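With this configuration, the polling request and a completed response would look roughly like the following sketch (the response body is abbreviated and the field values are illustrative):

```
curl -X GET \
  -H "Authorization: Bearer <sendgrid api key>" \
  https://api.sendgrid.com/v3/marketing/contacts/exports/<job id>

# Example response once the job is ready:
# {"id": "<job id>", "status": "ready", "urls": ["<download url>"]}
```

The Status Extractor reads `status` from this response, the Status Mapping translates `ready` to Completed, and the Download Target Extractor reads the `urls` value for the Download stage.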
|
||||
|
||||
### 3. Download Tab
|
||||
|
||||
The Download tab configures how to retrieve the results once the job is complete.
|
||||
|
||||

|
||||
|
||||
#### Key Components:
|
||||
|
||||
- **URL**: The full URL to which the download request is sent. Use the `{{ download_target }}` variable to reference the value extracted by the Download Target Extractor in the Polling tab.
|
||||
- **HTTP Method**: Typically GET for downloading, but this can vary by API
|
||||
- **HTTP Response Format**: Format of the downloaded data
|
||||
- **Download Extractor**: Optional field path that selects a specific part of the download response to construct records from.
|
||||
- **Primary Key**: Unique identifier for records
|
||||
- **Record Selector**: Identifies individual records in the response
|
||||
|
||||
#### Example Configuration (SendGrid):
|
||||
|
||||
In the UI, for the [SendGrid contacts export](https://www.twilio.com/docs/sendgrid/api-reference/contacts/export-contacts), you would configure:
|
||||
|
||||
- **URL** field: `{{ download_target }}`
|
||||
- **HTTP Method** dropdown: `GET`
|
||||
- **HTTP Response Format** dropdown: `CSV`
|
||||
- In the **Download Extractor** section:
|
||||
- Leave the **Field Path** empty since we want to use the entire CSV content
|
||||
- Set **Primary Key** to: `CONTACT_ID`
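The final download is then a plain GET against the URL extracted in the Polling stage, roughly like this sketch (depending on the API, the same authentication headers may also be required):

```
curl -X GET "<url extracted by the Download Target Extractor>"
```

The CSV body of this response is parsed into records, using `CONTACT_ID` as the primary key.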
|
||||
|
||||
## Testing Asynchronous Streams in the UI
|
||||
|
||||
Click the "Test" button in the top right corner of the Connector Builder UI to test your connector.
|
||||
|
||||
Asynchronous streams can take longer to test than synchronous streams, so be patient. However, you can use the `Cancel` button to stop the test at any time.
|
||||
|
||||
After the test completes, you'll see several important panels that help you understand what's happening during the asynchronous process:
|
||||
|
||||
### Main Records / Request / Response Tabs
|
||||
|
||||
The main Records / Request / Response tabs in the testing panel show the **final download phase** of your asynchronous stream:
|
||||
|
||||
1. **Records Tab**: Shows the records that were created from the final download response.
|
||||
|
||||
2. **Request Tab**: Shows the HTTP request sent to download the final data, including the download URL, headers, and any request parameters.
|
||||
|
||||
3. **Response Tab**: Shows the actual data returned from the API, which is used to produce the records you're trying to sync.
|
||||
|
||||
For the SendGrid example, the main Request tab would show a GET request to the URL extracted from the polling response, and the Response tab would show the CSV data containing the contacts.
|
||||
|
||||
### Other Requests Panel
|
||||
|
||||
The **Other Requests** panel can be helpful for debugging asynchronous streams. This panel shows the intermediate requests and responses that occurred during the asynchronous process:
|
||||
|
||||
1. **Creation Requests**: Shows the request sent to create the job and the API's response, including any job ID or token returned.
|
||||
|
||||

|
||||
|
||||
2. **Polling Requests**: Shows each polling request sent to check the job status and the API's response, including the status values returned.
|
||||
|
||||

|
||||
|
||||
To view the details of the creation and polling stages:
|
||||
|
||||
1. After running a test, look for the "Other Requests" panel at the bottom of the screen
|
||||
2. Select a request from the dropdown menu (they are typically labeled like "Async Job -- Create" or "Async Job -- Polling")
|
||||
3. Use the tabs to switch between the request and response details
|
||||
- Request tab: The request URL, method, headers, and body
|
||||
- Response tab: The response status, headers, and body
|
||||
|
||||
## How Asynchronous Data Extraction Works
|
||||
|
||||
Let's walk through the complete flow using the [SendGrid contacts export](https://www.twilio.com/docs/sendgrid/api-reference/contacts/export-contacts) example:
|
||||
|
||||
1. **Job Creation**:
|
||||
- The connector sends a POST request to the exports endpoint with the user's API key
|
||||
- SendGrid begins generating the export and returns a response with a job ID
|
||||
- In the Other Requests panel, you'll see this creation request and its response
|
||||
|
||||
2. **Status Checking**:
|
||||
- The connector uses the job ID to construct the polling URL
|
||||
- It periodically sends GET requests to this URL to check the status
|
||||
- SendGrid returns updates about the job status
|
||||
- In the Other Requests panel, you'll see these polling requests and responses
|
||||
|
||||
3. **Status Evaluation**:
|
||||
- The connector extracts the status value (e.g., "pending") using the Status Extractor
|
||||
- It maps this to a standard status (e.g., "running") using the Status Mapping
|
||||
- It continues polling until the status changes to "ready" (mapped to "completed")
|
||||
|
||||
4. **Download URL Extraction**:
|
||||
- When the job is complete, the connector extracts the download URL using the Download Target Extractor
|
||||
- In the Other Requests panel, you'll see the final polling response that contains this URL
|
||||
|
||||
5. **Result Download**:
|
||||
- The connector sends a GET request to the extracted URL
|
||||
- SendGrid returns a CSV file with the contacts data
|
||||
- In the main Request/Response tabs, you'll see this download request and the resulting data
|
||||
|
||||
## Best Practices for Asynchronous Streams
|
||||
|
||||
1. **Verify Each Stage**: Use the Other Requests panel to verify that each stage is working correctly:
|
||||
- Confirm the job creation request is successful
|
||||
- Verify that the status appears at the expected path, with the expected values, in the polling response
|
||||
- Ensure the download target is correctly extracted by checking the main Request tab URL
|
||||
|
||||
2. **Configure Status Mapping Properly**: Include all possible API status values in your status mapping.
|
||||
|
||||
3. **Use the Right HTTP Response Format**: Make sure you select the correct HTTP Response Format (JSON, CSV, XML, etc.) for both the creation and download requests.
|
||||
|
||||
4. **Check for Rate Limits**: Many APIs limit how frequently you can poll for job status. Configure appropriate error handling if you hit these limits.
|
||||
|
||||
Remember that asynchronous streams often take longer to test than synchronous streams, especially if the API takes time to process jobs.
|
||||
@@ -0,0 +1,239 @@
|
||||
# Authentication
|
||||
|
||||
Authentication allows the connector to check whether it has sufficient permission to fetch data and communicate its identity to the API. The authentication feature provides a secure way to configure authentication using a variety of methods.
|
||||
|
||||
The credentials themselves (e.g. username and password) are _not_ specified as part of the connector; instead, they are part of the configuration that the end user specifies when setting up a source based on the connector. During development, it's possible to provide testing credentials in the "Testing values" menu, but those are not saved along with the connector. Credentials that are part of the source configuration are stored in a secure way in your Airbyte instance while the connector configuration is saved in the regular database.
|
||||
|
||||
In the "Authentication" section on the "Global Configuration" page in the connector builder, the authentication method can be specified. This configuration is shared for all streams - it's not possible to use different authentication methods for different streams in the same connector. In case your API uses multiple or custom authentication methods, you can use the [low-code CDK](/platform/connector-development/config-based/low-code-cdk-overview) or [Python CDK](/platform/connector-development/cdk-python/).
|
||||
|
||||
If your API doesn't need authentication, leave it set at "No auth". This means the connector will be able to make requests to the API without providing any credentials which might be the case for some public open APIs or private APIs only available in local networks.
|
||||
|
||||
<iframe width="640" height="430" src="https://www.loom.com/embed/4e65a2090134478d920764b43d1eaef4" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
## Authentication methods
|
||||
|
||||
Check the documentation of the API you want to integrate with to find out which authentication method it uses. The following methods are supported in the connector builder:
|
||||
|
||||
- [Basic HTTP](#basic-http)
|
||||
- [Bearer Token](#bearer-token)
|
||||
- [API Key](#api-key)
|
||||
- [OAuth](#oauth)
|
||||
- [Session Token](#session-token)
|
||||
|
||||
Select the matching authentication method for your API and check the sections below for more information about individual methods.
|
||||
|
||||
### Basic HTTP
|
||||
|
||||
If requests are authenticated using the Basic HTTP authentication method, the documentation page will likely contain one of the following keywords:
|
||||
|
||||
- "Basic Auth"
|
||||
- "Basic HTTP"
|
||||
- "Authorization: Basic"
|
||||
|
||||
The Basic HTTP authentication method is a standard and doesn't require any further configuration. Username and password are set via "Testing values" in the connector builder and by the end user when configuring this connector as a Source.
|
||||
|
||||
#### Example
|
||||
|
||||
The [Greenhouse API](https://developers.greenhouse.io/harvest.html#authentication) is an API using basic authentication.
|
||||
|
||||
Sometimes, only a username and no password is required, like for the [Chargebee API](https://apidocs.chargebee.com/docs/api/auth?prod_cat_ver=2) - in these cases simply leave the password input empty.
|
||||
|
||||
In the basic authentication scheme, the supplied username and password are concatenated with a colon `:` and encoded using the base64 algorithm. For username `user` and password `passwd`, the base64-encoding of `user:passwd` is `dXNlcjpwYXNzd2Q=`.
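You can verify this encoding yourself with a quick shell one-liner:

```
echo -n 'user:passwd' | base64
# dXNlcjpwYXNzd2Q=
```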
|
||||
|
||||
When fetching records, this string is sent as part of the `Authorization` header:
|
||||
|
||||
```
|
||||
curl -X GET \
|
||||
-H "Authorization: Basic dXNlcjpwYXNzd2Q=" \
|
||||
https://harvest.greenhouse.io/v1/<stream path>
|
||||
```
|
||||
|
||||
### Bearer Token
|
||||
|
||||
If requests are authenticated using Bearer authentication, the documentation will probably mention "bearer token" or "token authentication". In this scheme, the `Authorization` header of the HTTP request is set to `Bearer <token>`.
|
||||
|
||||
Like the Basic HTTP authentication it does not require further configuration. The bearer token can be set via "Testing values" in the connector builder and by the end user when configuring this connector as a Source.
|
||||
|
||||
#### Example
|
||||
|
||||
The [Sendgrid API](https://docs.sendgrid.com/api-reference/how-to-use-the-sendgrid-v3-api/authentication) and the [Square API](https://developer.squareup.com/docs/build-basics/access-tokens) support Bearer authentication.
|
||||
|
||||
When fetching records, the token is sent along as the `Authorization` header:
|
||||
|
||||
```
|
||||
curl -X GET \
|
||||
-H "Authorization: Bearer <bearer token>" \
|
||||
https://api.sendgrid.com/<stream path>
|
||||
```
|
||||
|
||||
### API Key
|
||||
|
||||
The API key authentication method is similar to Bearer authentication, but it lets you configure where to inject the API key (header, request parameter, or request body) as well as under which field name. The injection mechanism and the field name are part of the connector definition, while the API key itself can be set via "Testing values" in the connector builder as well as when configuring this connector as a Source.
|
||||
|
||||
The following table helps you choose the right injection mechanism for your API:
|
||||
|
||||
| Description | Injection mechanism |
|
||||
| ------------------------------------------------------------------ | ------------------- |
|
||||
| (HTTP) header | `header` |
|
||||
| Query parameter / query string / request parameter / URL parameter | `request_parameter` |
|
||||
| Form encoded request body / form data | `body_data` |
|
||||
| JSON encoded request body | `body_json` |
|
||||
|
||||
#### Example
|
||||
|
||||
The [CoinAPI.io API](https://docs.coinapi.io/market-data/rest-api#authorization) is using API key authentication via the `X-CoinAPI-Key` header.
|
||||
|
||||
When fetching records, the API key is included in the request using the configured header:
|
||||
|
||||
```
|
||||
curl -X GET \
|
||||
-H "X-CoinAPI-Key: <api-key>" \
|
||||
https://rest.coinapi.io/v1/<stream path>
|
||||
```
|
||||
|
||||
In this case the injection mechanism is `header` and the field name is `X-CoinAPI-Key`.
|
||||
|
||||
### OAuth
|
||||
|
||||
#### Declarative OAuth 2.0
|
||||
Declarative OAuth 2.0 provides a flexible way to configure any OAuth 2.0 flow by describing the exact structure and behavior of the authentication endpoints. This allows integration with OAuth 2.0 implementations that may deviate from the standard specification or have custom requirements.
|
||||
|
||||
The configuration consists of three main components:
|
||||
|
||||
1. **Consent URL Configuration** - Defines how to construct the URL where users grant consent:
|
||||
- URL template with placeholders for client ID, redirect URI, and other parameters
|
||||
- Specification of where parameters should be injected (query params, headers, etc.)
|
||||
- Support for custom scopes and additional parameters
|
||||
|
||||
2. **Access Token URL Configuration** - Specifies how to exchange the authorization code for an access token:
|
||||
- Token endpoint URL
|
||||
- Where to place client ID, client secret, and authorization code
|
||||
- Response parsing configuration for access token and expiration
|
||||
- Support for custom parameters and headers
|
||||
|
||||
3. **Optional Refresh Token URL Configuration** - Defines how to refresh expired access tokens:
|
||||
- Refresh token endpoint URL (if different from access token URL)
|
||||
- Parameter placement for client credentials and refresh token
|
||||
- Custom header and body configurations
|
||||
|
||||
To learn more about Declarative OAuth 2.0 and see a variety of example implementations, refer to the [Declarative OAuth 2.0](/platform/connector-development/config-based/advanced-topics/oauth) documentation.
|
||||
|
||||
|
||||
#### Partial OAuth (legacy)
|
||||
|
||||
The partial OAuth flow is a simpler flow that allows you to use an existing access (or refresh) token to authenticate with the API.
|
||||
|
||||
|
||||
The catch is that the user needs to implement the start of the OAuth flow manually to obtain the `access_token` and the `refresh_token` and pass them to the connector as a part of the `config` object.
|
||||
|
||||
|
||||
The OAuth authentication method implements authentication using an OAuth 2.0 flow with a [refresh token grant type](https://oauth.net/2/grant-types/refresh-token/) or a [client credentials grant type](https://oauth.net/2/grant-types/client-credentials/).
|
||||
|
||||
In this scheme, the OAuth endpoint of an API is called with the client id and client secret and/or a long-lived refresh token that's provided by the end user when configuring this connector as a Source. These credentials are used to obtain a short-lived access token that's used to make the requests that actually extract records. If the access token expires, the connection will automatically request a new one.
|
||||
|
||||
The connector needs to be configured with the endpoint to call to obtain access tokens with the client id/secret and/or the refresh token. OAuth client id/secret and the refresh token are provided via "Testing values" in the connector builder as well as when configuring this connector as a Source.
|
||||
|
||||
Depending on how the refresh endpoint is implemented exactly, additional configuration might be necessary to specify how to request an access token with the right permissions (configuring OAuth scopes and grant type) and how to extract the access token and the expiry date out of the response (configuring expiry date format and property name as well as the access key property name):
|
||||
|
||||
- **Scopes** - the [OAuth scopes](https://oauth.net/2/scope/) the access token will have access to. If not specified, no scopes are sent along with the refresh token request
|
||||
- **Grant type** - the used OAuth grant type (either refresh token or client credentials). In case of refresh_token, a refresh token has to be provided by the end user when configuring the connector as a Source.
|
||||
- **Token expiry property name** - the name of the property in the response that contains token expiry information. If not specified, it's set to `expires_in`
|
||||
- **Token expire property date format** - if not specified, the expiry property is interpreted as the number of seconds the access token will be valid
|
||||
- **Access token property name** - the name of the property in the response that contains the access token to do requests. If not specified, it's set to `access_token`
|
||||
|
||||
If the API uses other grant types (like PKCE), it's not possible to use the connector builder with OAuth authentication.
|
||||
|
||||
Keep in mind that the OAuth authentication method does not implement a single-click authentication experience for the end user configuring the connector - it will still be necessary to obtain client id, client secret and refresh token from the API and manually enter them into the configuration form.
|
||||
|
||||
#### Example
|
||||
|
||||
The [Square API](https://developer.squareup.com/docs/build-basics/access-tokens#get-an-oauth-access-token) supports OAuth.
|
||||
|
||||
In this case, the authentication method has to be configured like this:
|
||||
|
||||
- "Token refresh endpoint" is `https://connect.squareup.com/oauth2/token`
|
||||
- "Token expiry property name" is `expires_at`
|
||||
|
||||
When running a sync, the connector first sends the client id, client secret and refresh token to the token refresh endpoint:
|
||||
|
||||
```
|
||||
|
||||
curl -X POST \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"client_id": "<client id>", "client_secret": "<client secret>", "refresh_token": "<refresh token>", "grant_type": "refresh_token" }' \
|
||||
<token refresh endpoint>
|
||||
```
|
||||
|
||||
The response is a JSON object containing an `access_token` property and an `expires_at` property:
|
||||
|
||||
```
|
||||
{"access_token":"<access-token>", "expires_at": "2023-12-12T00:00:00"}
|
||||
```
|
||||
|
||||
The `expires_at` date tells the connector how long the access token can be used - if this point in time is passed, a new access token is requested automatically.
|
||||
|
||||
When fetching records, the access token is sent along as part of the `Authorization` header:
|
||||
|
||||
```
|
||||
curl -X GET \
|
||||
-H "Authorization: Bearer <access-token>" \
|
||||
https://connect.squareup.com/v2/<stream path>
|
||||
```
|
||||
|
||||
#### Update refresh token from authentication response
|
||||
|
||||
In a lot of cases, OAuth refresh tokens are long-lived and can be used to create access tokens for every sync. In some cases however, a refresh token becomes invalid after it has been used to create an access token. In these situations, a new refresh token is returned along with the access token. One example of this behavior is the [Smartsheets API](https://smartsheet.redoc.ly/#section/OAuth-Walkthrough/Get-or-Refresh-an-Access-Token). In these cases, it's necessary to update the refresh token in the configuration every time an access token is generated, so the next sync will still succeed.
|
||||
|
||||
This can be done using the "Overwrite config with refresh token response" setting. If enabled, the authenticator expects a new refresh token to be returned from the token refresh endpoint. By default, the property `refresh_token` is used to extract the new refresh token, but this can be configured using the "Refresh token property name" setting. The connector then updates its own configuration with the new refresh token and uses it the next time an access token needs to be generated. If this option is used, it's necessary to specify an initial access token along with its expiry date in the "Testing values" menu.
|
||||
|
||||
### Session Token
|
||||
|
||||
Some APIs require callers to first fetch a unique token from one endpoint, then make the rest of their calls to all other endpoints using that token to authenticate themselves. These tokens usually have an expiration time, after which a new token needs to be re-fetched to continue making requests. This flow can be achieved using the Session Token Authenticator.
|
||||
|
||||
If requests are authenticated using the Session Token authentication method, the API documentation page will likely contain one of the following keywords:
|
||||
|
||||
- "Session Token"
|
||||
- "Session ID"
|
||||
- "Auth Token"
|
||||
- "Access Token"
|
||||
- "Temporary Token"
|
||||
|
||||
#### Configuration
|
||||
|
||||
The configuration of a Session Token authenticator is a bit more involved than other authenticators, as you need to configure both how to make requests to the session token retrieval endpoint (which requires its own authentication method), as well as how the token is extracted from that response and used for the data requests.
|
||||
|
||||
We will walk through each part of the configuration below. Throughout this, we will refer to the [Metabase API](https://www.metabase.com/learn/administration/metabase-api#authenticate-your-requests-with-a-session-token) as an example of an API that uses session token authentication.
|
||||
|
||||
- `Session Token Retrieval` - this is a group of fields which configures how the session token is fetched from the session token endpoint in your API. Once the session token is retrieved, your connector will reuse that token until it expires, at which point it will retrieve a new session token using this configuration.
|
||||
- `URL` - the full URL of the session token endpoint
|
||||
- For Metabase, this would be `https://<app_name>.metabaseapp.com/api/session`.
|
||||
- `HTTP Method` - the HTTP method that should be used when retrieving the session token endpoint, either `GET` or `POST`
|
||||
- Metabase requires `POST` for its `/api/session` requests.
|
||||
- `Authentication Method` - configures the method of authentication to use **for the session token retrieval request only**
|
||||
- Note that this is separate from the parent Session Token Authenticator. It contains the same options as the parent Authenticator Method dropdown, except for OAuth (which is unlikely to be used for obtaining session tokens) and Session Token (as it does not make sense to nest).
|
||||
- For Metabase, the `/api/session` endpoint takes in a `username` and `password` in the request body. Since this is a non-standard authentication method, we must set this inner `Authentication Method` to `No Auth`, and instead configure the `Request Body` to pass these credentials (discussed below).
|
||||
- `Query Parameters` - used to attach query parameters to the session token retrieval request
|
||||
- Metabase does not require any query parameters in the `/api/session` request, so this is left unset.
|
||||
- `Request Headers` - used to attach headers to the session token retrieval request
|
||||
- Metabase does not require any headers in the `/api/session` request, so this is left unset.
|
||||
- `Request Body` - used to attach a request body to the session token retrieval request
|
||||
- As mentioned above, Metabase requires the username and password to be sent in the request body, so we select `JSON (key-value pairs)` here and set the username and password fields, using User Inputs for the values to make the connector reusable. The configuration would end up looking like:
|
||||
- Key: `username`, Value: `{{ config['username'] }}`
|
||||
- Key: `password`, Value: `{{ config['password'] }}`
|
||||
- `Error Handler` - used to handle errors encountered when retrieving the session token
|
||||
- See the [Error Handling](/platform/connector-development/connector-builder-ui/error-handling) page for more info about configuring this component.
|
||||
- `Session Token Path` - an array of values to form a path into the session token retrieval response which points to the session token value
|
||||
- For Metabase, the `/api/session` response looks like `{"id":"<session-token-value>"}`, so the value here would simply be `id`.
|
||||
- `Expiration Duration` - an [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations) indicating how long the session token has until it expires
|
||||
- Once this duration is reached, your connector will automatically fetch a new session token, and continue making data requests with that new one.
|
||||
- If this is left unset, the session token will be refreshed before every single data request. This is **not recommended** if it can be avoided, since it causes the connector to run much slower by making an extra token request for every data request.
|
||||
- Note: this **does _not_ support dynamic expiration durations of session tokens**. If your token expiration duration is dynamic, you should set the `Expiration Duration` field to the expected minimum duration to avoid problems during syncing.
|
||||
- For Metabase, the token retrieved from the `/api/session` endpoint expires after 14 days by default, so this value can be set to `P2W` or `P14D`.
|
||||
- `Data Request Authentication` - configures how the session token is used to authenticate the data requests made to the API
|
||||
- Choose `API Key` if your session token needs to be injected into a query parameter or header of the data requests.
|
||||
- Metabase takes in the session token through a specific header, so this would be set to `API Key`, Inject Session Token into outgoing HTTP Request would be set to `Header`, and Header Name would be set to `X-Metabase-Session`.
|
||||
- Choose `Bearer` if your session token needs to be sent as a standard Bearer token.
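Putting the Metabase example together, the resulting requests would look roughly like the following sketch (the app name, username, and password are placeholders):

```
# Retrieve a session token
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"username": "<username>", "password": "<password>"}' \
  https://<app_name>.metabaseapp.com/api/session

# Response: {"id":"<session-token-value>"}

# Use the session token for subsequent data requests
curl -X GET \
  -H "X-Metabase-Session: <session-token-value>" \
  https://<app_name>.metabaseapp.com/api/<stream path>
```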
|
||||
|
||||
### Custom authentication methods
|
||||
|
||||
Some APIs require complex custom authentication schemes involving signing requests or doing multiple requests to authenticate. In these cases, it's required to use the [low-code CDK](/platform/connector-development/config-based/low-code-cdk-overview) or [Python CDK](/platform/connector-development/cdk-python/).
|
||||
@@ -0,0 +1,125 @@
|
||||
---
|
||||
products: oss-community, oss-enterprise
|
||||
---
|
||||
|
||||
# Custom components for the Connector Builder
|
||||
|
||||
Use Custom Components to extend the Connector Builder with your own Python implementations when Airbyte's built-in components don't meet your specific needs.
|
||||
|
||||
This feature enables you to:
|
||||
|
||||
- Override any built-in component with a custom Python class
|
||||
|
||||
- Implement specialized logic for handling complex API behaviors
|
||||
|
||||
- Maintain full control over the connection process while still leveraging the Connector Builder framework
|
||||
|
||||
The following example shows a simple RecordTransformation component that appends text to a record's name field.
|
||||
|
||||

|
||||
|
||||
## What are Custom Components?
|
||||
|
||||
Custom Components are Python classes that implement specific interfaces from the Airbyte CDK. They follow a consistent pattern:
|
||||
|
||||
- A dataclass that implements the interface of the component it's replacing
|
||||
|
||||
- Fields representing configurable arguments from the YAML configuration
|
||||
|
||||
- Implementation of required methods to handle the component's specific capability
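As an illustration of this pattern, a custom record transformation might look roughly like the sketch below. The import path and method signature are assumptions based on the Airbyte CDK's `RecordTransformation` interface; check the CDK source for your installed version before relying on them.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Assumed import path; verify against your airbyte_cdk version.
from airbyte_cdk.sources.declarative.transformations import RecordTransformation


@dataclass
class AppendSuffixToName(RecordTransformation):
    """Appends a configurable suffix to each record's `name` field."""

    # Configurable argument, populated from the YAML configuration.
    suffix: str = "!"

    def transform(self, record: Dict[str, Any], **kwargs: Any) -> None:
        # Assumption: the CDK calls transform() with the record plus keyword
        # context (config, stream_state, stream_slice) and expects the record
        # to be mutated in place.
        if "name" in record:
            record["name"] = f"{record['name']}{self.suffix}"
```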
|
||||
|
||||
## Why Custom Components are powerful
|
||||
|
||||
When enabled, Custom Components bring the full flexibility of the Low-Code CDK into the simpler Connector Builder UI. Custom Components provide significant advantages when building complex connectors, and they equip you to integrate with virtually any API, regardless of complexity or your unique requirements.
|
||||
|
||||
1. **Handle Edge Cases**: Address unique API behaviors that aren't covered by built-in components, such as unusual pagination patterns, complex authentication schemes, or specialized data transformation needs.
|
||||
|
||||
2. **Extend Functionality**: When standard components don't offer the precise capabilities you need, Custom Components let you implement exactly what's required without compromising.
|
||||
|
||||
3. **Maintain Framework Benefits**: While providing customization, you still benefit from the structure, testing capabilities, and deployment options of the Connector Builder framework.
|
||||
|
||||
4. **Iterative Development**: You can start with built-in components and gradually replace only the specific parts that need customization, rather than building an entire connector from scratch.
|
||||
|
||||
5. **Specialized Transformations**: Implement complex data manipulation, normalization, or enrichment that goes beyond what declarative configuration can provide.
|
||||
|
||||
## How to enable Custom Components
|
||||
|
||||
:::danger Security Warning
|
||||
Custom Components are currently considered **UNSAFE** and **EXPERIMENTAL**. Airbyte doesn't provide any sandboxing guarantees. This feature could execute arbitrary code in your Airbyte environment. Enable it at your own risk.
|
||||
:::
|
||||
|
||||
Airbyte disables Custom Components by default due to their experimental nature and security implications. Administrators can enable this feature in Self-Managed Community and Self-Managed Enterprise deployments using one of the following methods:
|
||||
|
||||
### Using abctl
|
||||
|
||||
If you deploy Airbyte with abctl, follow the steps below to update your values and redeploy Airbyte.
|
||||
|
||||
1. Edit your existing `values.yaml` file or create a new override file with this configuration:
|
||||
|
||||
```yaml title="values.yaml"
|
||||
workload-launcher:
|
||||
extraEnv:
|
||||
- name: AIRBYTE_ENABLE_UNSAFE_CODE
|
||||
value: "true"
|
||||
connector-builder-server:
|
||||
extraEnv:
|
||||
- name: AIRBYTE_ENABLE_UNSAFE_CODE
|
||||
value: "true"
|
||||
```
|
||||
|
||||
2. Use this file during deployment with the abctl command:
|
||||
|
||||
```bash
|
||||
abctl local install --values values.yaml
|
||||
```
|
||||
|
||||
### Using Helm charts
|
||||
|
||||
If you're deploying Airbyte using public Helm charts without abctl, follow the steps below to update your values and redeploy Airbyte.
|
||||
|
||||
1. Edit your existing `values.yaml` file or create a new override file with this configuration:
|
||||
|
||||
```yaml title="values.yaml"
|
||||
workload-launcher:
|
||||
extraEnv:
|
||||
- name: AIRBYTE_ENABLE_UNSAFE_CODE
|
||||
value: "true"
|
||||
connector-builder-server:
|
||||
extraEnv:
|
||||
- name: AIRBYTE_ENABLE_UNSAFE_CODE
|
||||
value: "true"
|
||||
```
|
||||
|
||||
2. Apply the configuration during Helm installation or upgrade:
|
||||
|
||||
```bash
|
||||
helm upgrade --install airbyte airbyte/airbyte -f values.yaml
|
||||
```
|
||||
|
||||
:::caution
|
||||
Monitor your deployment for any security or performance issues. Remember that this feature allows execution of arbitrary code in your Airbyte environment.
|
||||
:::
|
||||
|
||||
## How to use Custom Components
|
||||
|
||||
Custom Components in the Connector Builder UI extend the capabilities available in the Low-Code CDK. For detailed implementation information, please refer to the [Custom Components documentation](../config-based/advanced-topics/custom-components.md).
|
||||
|
||||
Key implementation steps include:
|
||||
|
||||
1. Create a Python class that implements the interface of the component you want to customize
|
||||
|
||||
2. Define the necessary fields and methods required by that interface
|
||||
|
||||
3. Reference your custom component in the connector configuration using its fully qualified class name
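For illustration, a custom transformation could then be referenced from the YAML roughly like this (the `class_name` value is hypothetical and depends on how your custom code is packaged; additional fields are passed to the class as constructor arguments):

```yaml
transformations:
  - type: CustomTransformation
    class_name: "source_declarative_manifest.components.AppendSuffixToName"
    suffix: "!"
```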
|
||||
|
||||
The existing documentation provides examples of:
|
||||
|
||||
- How to create custom component classes
|
||||
|
||||
- Required implementation interfaces
|
||||
|
||||
- Properly referencing custom components in your configuration
|
||||
|
||||
- Handling parameter propagation between parent and child components
|
||||
|
||||
While using the Connector Builder UI, you need to switch to the YAML editor view to implement custom components. You can't configure them through the visual interface.
|
||||
@@ -0,0 +1,123 @@
|
||||
# Error Handling
|
||||
|
||||
:::warning
|
||||
When using the "Test" button to run a test sync of the connector, the Connector Builder UI will not retry failed requests. This is done to reduce the amount of waiting time in between test syncs.
|
||||
:::
|
||||
|
||||
Error handlers allow the connector to decide how to continue fetching data based on the contents of the response from the partner API. Depending on attributes of the response such as status code, text body, or headers, the connector can continue making requests, retry unsuccessful attempts, or fail the sync.
|
||||
|
||||
An error handler is made of two parts: a "Backoff strategy" and a "Response filter". When the conditions of the response filter are met, the connector will proceed with the sync according to the specified behavior. See the [Response filter](#response-filter) section for a detailed breakdown of possible response filter actions. In the event of a failed request that needs to be retried, the backoff strategy determines how long the connector should wait before attempting the request again.
|
||||
|
||||
When an error handler is not configured for a stream, the connector will default to retrying requests that receive a 429 or 5XX status code in the response up to 5 times, using an exponential backoff with a 5-second multiplier. This default retry behavior is recommended if the API documentation does not specify error handling or retry behavior.
|
||||
|
||||
Refer to the documentation of the API you are building a connector for to determine how to handle response errors. There can either be a dedicated section listing expected error responses (ex. [Delighted](https://app.delighted.com/docs/api#http-status-codes)) or API endpoints will list their error responses individually (ex. [Intercom](https://developers.intercom.com/intercom-api-reference/reference/listcompaniesforacontact)). There is also typically a section on rate limiting that summarizes how rate limits are communicated in the response and when to retry.
|
||||
|
||||
## Backoff strategies
|
||||
|
||||
The API documentation will usually cover when to reattempt a failed request that is retryable. This is often through a `429 Too Many Requests` response status code, but it can vary for different APIs. The following backoff strategies are supported in the connector builder:
|
||||
|
||||
- [Constant](#constant)
|
||||
- [Exponential](#exponential)
|
||||
- [Wait time from header](#wait-time-from-header)
|
||||
- [Wait until time from header](#wait-until-time-from-header)
|
||||
|
||||
### Constant
|
||||
|
||||
When the API documentation recommends that requests be retried after waiting a constant amount of time, the "Constant" backoff strategy should be set on the error handler.
|
||||
|
||||
#### Example
|
||||
|
||||
The [Intercom API](https://developers.intercom.com/intercom-api-reference/reference/http-responses) is an API that recommends a constant backoff strategy when retrying requests.
|
||||
|
||||
### Exponential
|
||||
|
||||
When the API documentation recommends that requests be retried after waiting an exponentially increasing amount of time, the "Exponential" backoff strategy should be set on the error handler.
|
||||
|
||||
The exponential backoff strategy is similar to the constant strategy in that the connector waits to retry a request based on a numeric "Multiplier" value defined on the connector. For a backoff strategy with "Multiplier" set to 5 seconds, when the connector receives an API response that should be retried, it will wait 5 seconds before reattempting the request. Upon receiving subsequent failed responses, the connector will wait 10, 20, 40, and 80 seconds, permanently stopping after a total of 5 retries.
|
||||
|
||||
Note: When no backoff strategy is defined, the connector defaults to using an exponential backoff to retry requests.
|
||||
|
||||
#### Example
|
||||
|
||||
The [Delighted API](https://app.delighted.com/docs/api#rate-limits) is an API that recommends using an exponential backoff. In this case, the API documentation recommends retrying requests after 2 seconds, 4 seconds, then 8 seconds and so on.
|
||||
|
||||
Although a lot of API documentation does not explicitly call for an exponential backoff, some APIs like the [Posthog API](https://posthog.com/docs/api) document rate limits that make an exponential backoff advantageous. In this case, the rate limit of 240 requests/min should work for most syncs. However, if there is a spike in traffic, the exponential backoff allows the connector to avoid sending more requests than the endpoint can support.
|
||||
|
||||
### Wait time from header
|
||||
|
||||
The "Wait time from header" backoff strategy allows the connector to wait before retrying a request based on the value specified in the API response.
|
||||
|
||||
<iframe width="640" height="545" src="https://www.loom.com/embed/84b65299b5cd4f83a8e3b6abdfa0ebd2" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
#### Example
|
||||
|
||||
The [Chargebee API](https://apidocs.chargebee.com/docs/api/error-handling) documentation recommends using the `Retry-After` response header to determine when to retry the request.
|
||||
|
||||
When running a sync, if the connector receives a response from the Chargebee API with a 429 status code and the `Retry-After` header set to 60, it reads that value from the `Retry-After` header and pauses the sync for 60 seconds before retrying.
|
||||
|
||||
### Wait until time from header
|
||||
|
||||
The "Wait until time from header" backoff strategy allows the connector to wait until a specific time before retrying a request according to the API response.
|
||||
|
||||
<iframe width="640" height="562" src="https://www.loom.com/embed/023bc8a5e5464b2fba125f9344e3f02f" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
#### Example
|
||||
|
||||
The [Recurly API](https://recurly.com/developers/api/v2021-02-25/index.html#section/Getting-Started/Limits) is an API that defines a header `X-RateLimit-Reset` which specifies when the request rate limit will be reset.
|
||||
|
||||
Take for example a connector that makes a request at 25/04/2023 01:00:00 GMT and receives a response with a 429 status code and the header `X-RateLimit-Reset` set to 1682413200. This epoch time is equivalent to 25/04/2023 02:00:00 GMT. Using the `X-RateLimit-Reset` header value, the connector will pause the sync for one hour before attempting subsequent requests to the Recurly API.
|
||||
|
||||
## Response filter
|
||||
|
||||
A response filter should be used when a connector needs to interpret an API response to decide how the sync should proceed. Common use cases for this feature include ignoring error codes to continue fetching data, retrying requests for specific error codes, and stopping a sync based on the response received from the API.
|
||||
|
||||
<iframe width="640" height="716" src="https://www.loom.com/embed/dc86147384204156a2b79442a00c0dd3" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
### Response conditions
|
||||
|
||||
The following conditions can be specified on the "Response filter" and are used to determine if attributes of the response match the filter. When more than one condition is specified, the filter will take action if the response satisfies any of the conditions:
|
||||
|
||||
- [If error message matches](#if-error-message-matches)
|
||||
- [and predicate is fulfilled](#and-predicate-is-fulfilled)
|
||||
- [and HTTP codes match](#and-http-codes-match)
|
||||
|
||||
#### If error message matches
|
||||
|
||||
For a response filter that defines the "If error message matches" field, the connector will check if the provided text exists within the text body of the API response. If the text is present, the response filter will carry out the specified action.
|
||||
|
||||
##### Example
|
||||
|
||||
For the Chargebee API, some endpoints are only available for a specific API version, and if an endpoint is unavailable, the response text will contain `"This API operation is not enabled for this site"`. The Airbyte Chargebee integration allows customers to configure which API version to use when retrieving data for a stream. When the connector makes requests to Chargebee using an unsupported version, the response filter matches the response text and the connector proceeds based on the "Then execute action".
|
||||
|
||||
#### and predicate is fulfilled
|
||||
|
||||
This field allows for more granular control over how the response filter matches against attributes of an API response. For a filter that defines the "and predicate is fulfilled" field, the connector evaluates the interpolation expression against an API response's text body or headers.
|
||||
|
||||
##### Example
|
||||
|
||||
For the Zoom API, the response text body can include a special non-error status code under the `code` field. An example response text body would look like `{"code": 300}`. The "If error message matches" condition is too broad because there could be record data containing the text "300". Instead, for a response filter defining "and predicate is fulfilled" as `{{ response.code == 300 }}`, the predicate expression evaluates to true during a sync and the connector proceeds based on the "Then execute action".
|
||||
|
||||
#### and HTTP codes match
|
||||
|
||||
A response filter can specify for the "and HTTP codes match" field a set of numeric HTTP status codes (ex. 200, 404, 500). When receiving an API response, the connector will check to see if the status code of the response is in the provided set of HTTP status codes.
|
||||
|
||||
##### Example
|
||||
|
||||
The Pocket API reports rate limiting errors using a 403 status code. The default error handler interprets 403 errors as non-retryable and will fail the sync when they are encountered. To handle this, the connector can configure a response filter whose "and HTTP codes match" field contains 403. When a 403 error response from the API is encountered, the connector proceeds based on the "Then execute action".
|
||||
|
||||
### Then execute action
|
||||
|
||||
If a response from the API matches the predicates of the response filter, the connector will continue the sync according to the "Then execute action" definition. These are the actions a connector can take:
|
||||
|
||||
- SUCCESS: The response was successful and the connector will extract records from the response and emit them to a destination. The connector will continue fetching the next set of records from the API.
|
||||
- RETRY: The response was unsuccessful, but the error is transient and may be successful on subsequent attempts. The request will be retried according to the backoff policy defined on the error handler.
|
||||
- IGNORE: The response was unsuccessful, but the error should be ignored. The connector will not emit any records for the current response. The connector will continue fetching the next set of records from the API.
|
||||
- FAIL: The response was unsuccessful and the connector should stop syncing records and indicate that it failed to retrieve the complete set of records.
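As a concrete illustration, the Pocket scenario described above could be expressed in the YAML view roughly like this (component and field names follow the low-code CDK reference; verify them against your connector's generated YAML):

```yaml
error_handler:
  type: DefaultErrorHandler
  backoff_strategies:
    - type: ConstantBackoffStrategy
      backoff_time_in_seconds: 60
  response_filters:
    - type: HttpResponseFilter
      action: RETRY
      http_codes: [403]
      error_message: "Rate limited by the API, retrying after a pause."
```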
|
||||
|
||||
### Error message
|
||||
|
||||
The "Error message" field is used to customize the message that is relayed back to users when the API response matches a response filter that returns an error.
|
||||
|
||||
## Multiple error handlers
|
||||
|
||||
In the "Error handlers" section of a stream, one or more handlers can be defined. In the case multiple error handlers are specified, the response will be evaluated against each error handler in the order they are defined. The connector will take the action of the first error handler that matches the response and ignore subsequent handlers.
|
||||
@@ -0,0 +1,172 @@
|
||||
# Incremental sync
|
||||
|
||||
An incremental sync is a sync which pulls only the data that has changed since the previous sync (as opposed to all the data available in the data source).
|
||||
|
||||
This is especially important if there are a large number of records to sync and/or the API has tight request limits which makes a full sync of all records on a regular schedule too expensive or too slow.
|
||||
|
||||
Incremental syncs are usually implemented using a cursor value (like a timestamp) that delineates which data was pulled and which data is new. A very common cursor value is an `updated_at` timestamp. This cursor means that records whose `updated_at` value is less than or equal to that cursor value have been synced already, and that the next sync should only export records whose `updated_at` value is greater than the cursor value.
|
||||
|
||||
To use incremental syncs, the API endpoint needs to fulfill the following requirements:
|
||||
|
||||
- Records contain a top-level date/time field that defines when this record was last updated (the "cursor field")
|
||||
- If the record's cursor field is nested, you can use an "Add Field" transformation to copy it to the top level and a "Remove Field" transformation to remove it from its original location. This effectively moves the field to the top level of the record
|
||||
- It's possible to filter/request records by the cursor field
|
||||
|
||||
The knowledge of a cursor value also allows the Airbyte system to automatically keep a history of changes to records in the destination. To learn more about the different modes of incremental sync, check out the [Incremental Sync - Append](/platform/using-airbyte/core-concepts/sync-modes/incremental-append/) and [Incremental Sync - Append + Deduped](/platform/using-airbyte/core-concepts/sync-modes/incremental-append-deduped) pages.
|
||||
|
||||
## Configuration
|
||||
|
||||
To configure incremental syncs for a stream in the connector builder, you have to specify how the records will represent the **"last changed" / "updated at" timestamp**, the **initial time range** to fetch records for and **how to request records from a certain time range**.
|
||||
|
||||
In the builder UI, these things are specified like this:
|
||||
|
||||
- The "Cursor field" is the property in the record that defines the date and time when the record got changed. It's used to decide which records are synced already and which records are "new"
|
||||
- The "Datetime format" specifies the format the cursor field is using to specify date and time. Check out the [YAML reference](/platform/connector-development/config-based/understanding-the-yaml-file/reference#/definitions/DatetimeBasedCursor) for a full list of supported formats.
|
||||
- "API time filtering capabilities" specifies if the API allows filtering by start and end datetime or whether it's a "feed" of data going from newest to oldest records. See the ["Incremental sync without time filtering"](#incremental-sync-without-time-filtering) section below for details.
|
||||
- The "Start datetime" is the initial start date of the time range to fetch records for. When doing incremental syncs, the second sync will overwrite this date with the last record that got synced so far.
|
||||
- The "End datetime" is the end date of the time range to fetch records for. In most cases it's set to the current date and time when the sync is started to sync all changes that happened so far.
|
||||
- The "Inject start/end time into outgoing HTTP request" defines how to request records that got changed in the time range to sync. In most cases the start and end time is added as a query parameter or body parameter.
|
||||
|
||||
## Example
|
||||
|
||||
The [API of The Guardian](https://open-platform.theguardian.com/documentation/search) has a `/search` endpoint that allows extracting a list of articles.
|
||||
|
||||
The `/search` endpoint has a `from-date` and a `to-date` query parameter which can be used to only request data for a certain time range.
|
||||
|
||||
Content records have the following form:
|
||||
|
||||
```
|
||||
{
|
||||
"id": "world/2022/oct/21/russia-ukraine-war-latest-what-we-know-on-day-240-of-the-invasion",
|
||||
"type": "article",
|
||||
"sectionId": "world",
|
||||
"sectionName": "World news",
|
||||
"webPublicationDate": "2022-10-21T14:06:14Z",
|
||||
"webTitle": "Russia-Ukraine war latest: what we know on day 240 of the invasion",
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
As this fulfills the requirements for incremental syncs, we can configure the "Incremental sync" section in the following way:
|
||||
|
||||
- "Cursor field" is set to `webPublicationDate`
|
||||
- "Datetime format" is set to `%Y-%m-%dT%H:%M:%SZ`
|
||||
- "Start datetime" is set to "user input" to allow the user of the connector configuring a Source to specify the time to start syncing
|
||||
- "End datetime" is set to "now" to fetch all articles up to the current date
|
||||
- "Inject start time into outgoing HTTP request" is set to `request_parameter` with "Field" set to `from-date`
|
||||
- "Inject end time into outgoing HTTP request" is set to `request_parameter` with "Field" set to `to-date`
|
||||
|
||||
<iframe width="640" height="835" src="https://www.loom.com/embed/78eb5da26e2e4f4aa9c3a48573d9ed3b" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
Setting the start date in the "Testing values" to a date in the past like **2023-04-09T00:00:00Z** results in the following request:
|
||||
|
||||
<pre>
|
||||
curl 'https://content.guardianapis.com/search?from-date=<b>2023-04-09T00:00:00Z</b>&to-date={`now`}'
|
||||
</pre>
|
||||
|
||||
The most recent encountered date will be saved as the [*state*](../../understanding-airbyte/airbyte-protocol.md#state--checkpointing) of the connection - when the next sync runs, it picks up from that cutoff date as the new start date. Let's assume the last encountered article looked like this:
|
||||
|
||||
<pre>
|
||||
{`{
|
||||
"id": "business/live/2023/apr/15/uk-bosses-more-optimistic-energy-prices-fall-ai-spending-boom-economics-business-live",
|
||||
"type": "liveblog",
|
||||
"sectionId": "business",
|
||||
"sectionName": "Business",
|
||||
"webPublicationDate": `}<b>"2023-04-15T07:30:58Z"</b>{`,
|
||||
}`}
|
||||
</pre>
|
||||
|
||||
Then when a sync is triggered for the same connection the next day, the following request is made:
|
||||
|
||||
<pre>
|
||||
curl 'https://content.guardianapis.com/search?from-date=<b>2023-04-15T07:30:58Z</b>&to-date={`<now>`}'
|
||||
</pre>
|
||||
|
||||
:::info
|
||||
If the last record read has a datetime earlier than the end time of the stream interval, the end time of the interval will be stored in the state.
|
||||
:::
|
||||
|
||||
The `from-date` is set to the cutoff date of articles synced already and the `to-date` is set to the current date.
|
||||
|
||||
:::info
|
||||
In some cases, it's helpful to reference the start and end date of the interval that's currently synced, for example if it needs to be injected into the URL path of the current stream. In these cases it can be referenced using the `{{ stream_interval.start_time }}` and `{{ stream_interval.end_time }}` [placeholders](/platform/connector-development/config-based/understanding-the-yaml-file/reference#variables). Check out [the tutorial](./tutorial.mdx#adding-incremental-reads) for such a case.
|
||||
:::
|
||||
|
||||
## Incremental sync without time filtering
|
||||
|
||||
Some APIs do not allow filtering records by a date field, but instead only provide a paginated "feed" of data that is ordered from newest to oldest. In these cases, the "API time filtering capabilities" option needs to be set to "No filter". As they can't be applied in this situation, the "Inject start time into outgoing HTTP request" and "Inject end time into outgoing HTTP request" options as well as the "Split up interval" option are disabled automatically.
|
||||
|
||||
The `/new` endpoint of the [Reddit API](https://www.reddit.com/dev/api/#GET_new) is such an API. By configuring pagination and setting time filtering capabilities to the "No filter" option, the connector will automatically request the next page of records until the cutoff datetime is encountered. This is done by comparing the cursor value of the records with either the configured start date or the latest cursor value that was encountered in a previous sync - if the cursor value is less than or equal to that cutoff date, the sync is finished. The latest cursor value is saved as part of the connection and used as the cutoff date for the next sync.
|
||||
|
||||
:::warning
|
||||
The "No filter" option can only be used if the data is sorted from newest to oldest across pages. If the data is sorted differently, the connector will stop syncing records too late or too early. In these cases it's better to disable incremental syncs and sync the full set of records on a regular schedule.
|
||||
:::
|
||||
|
||||
## Advanced settings
|
||||
|
||||
The description above is sufficient for a lot of APIs. However, there are some more subtle configurations that sometimes become relevant.
|
||||
|
||||
### Split up interval
|
||||
|
||||
When incremental syncs are enabled and "Split Up Interval" is set, the connector does not fetch all records since the cutoff date at once - instead, it splits up the time range between the cutoff date and the desired end date into intervals based on the "Step" configuration, expressed as an [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations).
|
||||
|
||||
The "Cursor Granularity" also needs to be set to an ISO 8601 duration - it represents the smallest possible time unit the API supports to filter records by. It's used to ensure the start of a interval does not overlap with the end of the previous one.
|
||||
|
||||
For example if the "Step" is set to 10 days (`P10D`) and the "Cursor granularity" set to one second (`PT1S`) for the Guardian articles stream described above and a longer time range, then the following requests will be performed:
|
||||
|
||||
<pre>
|
||||
curl 'https://content.guardianapis.com/search?from-date=<b>2023-01-01T00:00:00Z</b>&to-date=<b>2023-01-09T23:59:59Z</b>'{`\n`}
|
||||
curl 'https://content.guardianapis.com/search?from-date=<b>2023-01-10T00:00:00Z</b>&to-date=<b>2023-01-19T23:59:59Z</b>'{`\n`}
|
||||
curl 'https://content.guardianapis.com/search?from-date=<b>2023-01-20T00:00:00Z</b>&to-date=<b>2023-01-29T23:59:59Z</b>'{`\n`}
|
||||
...
|
||||
</pre>
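In the YAML view, this corresponds to adding `step` and `cursor_granularity` to the cursor configuration sketched earlier (again assuming the low-code `DatetimeBasedCursor` field names):

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  # ...same fields as in the earlier sketch...
  step: "P10D"
  cursor_granularity: "PT1S"
```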
|
||||
|
||||
After an interval is processed, the cursor value of the last record will be saved as part of the connection as the new cutoff date, as described in the [example above](#example).
|
||||
|
||||
If "Split Up Interval" is left unset, the connector will not split up the time range at all but will instead just request all records for the entire target time range. This configuration works for all connectors, but there are two reasons to change it:
|
||||
|
||||
- **To protect a connection against intermittent failures** - if the "Step" size is a day, the cutoff date is saved after all records associated with a day are processed. If a sync fails halfway through because the API, the Airbyte system, the destination or the network between these components has a failure, then at most one day's worth of data needs to be resynced. However, a smaller step size might cause more requests to the API and more load on the system. The optimal step size depends on the expected amount of data and the load characteristics of the API, but for a lot of applications the default of one month is a good starting point.
|
||||
- **The API requires the connector to fetch data in pre-specified chunks** - for example, the [Exchange Rates API](https://exchangeratesapi.io/documentation/) makes the date to fetch data for part of the URL path and only allows fetching data for a single day at a time.
|
||||
|
||||
### Lookback window
|
||||
|
||||
The "Lookback window" specifies a duration that is subtracted from the last cutoff date before starting to sync.
|
||||
|
||||
Some APIs update records over time but do not allow filtering or searching by modification date, only by creation date. For example, the API of The Guardian might change the title of an article after it got published, but the `webPublicationDate` still shows the date the article was originally published.
|
||||
|
||||
In these cases, there are two options:
|
||||
|
||||
- **Do not use incremental sync** and always sync the full set of records to keep a consistent state, losing the advantages of reduced load and [automatic history keeping in the destination](/platform/using-airbyte/core-concepts/sync-modes/incremental-append-deduped)
|
||||
- **Configure the "Lookback window"** to not only sync exclusively new records, but resync some portion of records before the cutoff date to catch changes that were made to existing records, trading off data consistency and the amount of synced records. In the case of the API of The Guardian, news articles tend to only be updated for a few days after the initial release date, so this strategy should be able to catch most updates without having to resync all articles.
|
||||
|
||||
Reiterating the example from above with a "Lookback window" of 2 days configured, let's assume the last encountered article looked like this:
|
||||
|
||||
<pre>
|
||||
{`{
|
||||
"id": "business/live/2023/apr/15/uk-bosses-more-optimistic-energy-prices-fall-ai-spending-boom-economics-business-live",
|
||||
"type": "liveblog",
|
||||
"sectionId": "business",
|
||||
"sectionName": "Business",
|
||||
"webPublicationDate": `}<b>{`"2023-04-15T07:30:58Z"`}</b>{`,
|
||||
}`}
|
||||
</pre>
|
||||
|
||||
Then when a sync is triggered for the same connection the next day, the following request is made:
|
||||
|
||||
<pre>
|
||||
curl 'https://content.guardianapis.com/search?from-date=<b>2023-04-13T07:30:58Z</b>&to-date={`<now>`}'
|
||||
</pre>
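
In the exported YAML manifest, the lookback is expressed by a `lookback_window` duration on the same incremental sync component (a sketch with illustrative values):

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: webPublicationDate
  datetime_format: "%Y-%m-%dT%H:%M:%SZ"
  start_datetime: "{{ config['start_date'] }}"
  lookback_window: P2D   # resync the two days before the saved cutoff on every sync
```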
|
||||
|
||||
## Custom parameter injection
|
||||
|
||||
Using the "Inject start time / end time into outgoing HTTP request" option in the incremental sync form works for most cases, but sometimes the API has special requirements that can't be handled this way:
|
||||
|
||||
- The API requires adding a prefix or a suffix to the actual value
|
||||
- Multiple values need to be put together in a single parameter
|
||||
- The value needs to be injected into the URL path
|
||||
- Some conditional logic needs to be applied
|
||||
|
||||
To handle these cases, disable injection in the incremental sync form and use the generic parameter section at the bottom of the stream configuration form to freely configure query parameters, headers and properties of the JSON body, by using jinja expressions and [available variables](/platform/connector-development/config-based/understanding-the-yaml-file/reference/#/variables). You can also use these variables as part of the URL path.
|
||||
|
||||
For example the [Sendgrid API](https://docs.sendgrid.com/api-reference/e-mail-activity/filter-all-messages) requires setting both start and end time in a `query` parameter.
|
||||
For this case, you can use the `stream_interval` variable to configure a query parameter with "key" `query` and "value" `last_event_time BETWEEN TIMESTAMP "{{stream_interval.start_time}}" AND TIMESTAMP "{{stream_interval.end_time}}"` to filter down to the right window in time.
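
In YAML terms, this amounts to defining a plain request parameter on the requester that interpolates the `stream_interval` variable. The sketch below is illustrative; the URL base and path are placeholders rather than the exact Sendgrid endpoint:

```yaml
requester:
  type: HttpRequester
  url_base: https://api.example.com/v3
  path: /messages
  http_method: GET
  request_parameters:
    # both interval bounds are combined into the single `query` parameter
    query: >-
      last_event_time BETWEEN TIMESTAMP "{{ stream_interval.start_time }}" AND TIMESTAMP "{{ stream_interval.end_time }}"
```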
|
||||
@@ -0,0 +1,40 @@
|
||||
# Connector Builder Intro
|
||||
|
||||
Connector Builder is a no-code tool that’s part of the Airbyte UI.
|
||||
It provides an intuitive user interface on top of the [low-code YAML format](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/yaml-overview) and lets you develop a connector to use in data syncs without ever needing to leave your Airbyte workspace.
|
||||
Connector Builder offers the most straightforward method for building, contributing, and maintaining connectors.
|
||||
|
||||
## When should I use Connector Builder?
|
||||
|
||||
First, check if the API you want to use has an available connector in the [catalog](/integrations). If you find it there, you can use it as is.
|
||||
If the connector you're looking for doesn't already exist and you'd like to try creating your own implementation, the Connector Builder should be your first destination.
|
||||
|
||||
## Getting started
|
||||
|
||||
The high-level process for using Connector Builder is as follows:
|
||||
|
||||
1. Access Connector Builder in the Airbyte web app by selecting "Builder" in the left-hand sidebar
|
||||
2. Iterate on the connector by providing details for the global configuration, user inputs, and streams
|
||||
3. Once the connector is ready, publish it to your workspace, or contribute it to the Airbyte catalog
|
||||
4. Configure a Source based on the released connector
|
||||
5. Use the Source in a connection to sync data
|
||||
|
||||
The concept pages in this section of the docs share more details related to the following topics: [authentication](./authentication.md), [record processing](./record-processing.mdx), [pagination](./pagination.md), [incremental sync](./incremental-sync.md), [partitioning](./partitioning.md), and [error handling](./error-handling.md).
|
||||
|
||||
:::tip
|
||||
Do not hardcode things like API keys or passwords while configuring a connector in the builder. They will be used, but not saved, during development when you provide them as Testing Values. For use in production, these should be passed in as user inputs after publishing the connector to the workspace, when you configure a source using your connector.
|
||||
|
||||
Follow [the tutorial](./tutorial.mdx) for an example of what this looks like in practice.
|
||||
:::
|
||||
|
||||
## Contributing the connector
|
||||
|
||||
If you'd like to share your connector with other Airbyte users, you can contribute it to Airbyte's GitHub repository right from the Builder.
|
||||
|
||||
1. Click "Publish" chevron -> "Contribute to Marketplace"
|
||||
2. Fill out the form: add the connector description, and provide your GitHub PAT (Personal Access Token) to create a pull request
|
||||
3. Click "Contribute" to submit the connector to the Airbyte catalog
|
||||
|
||||
Reviews typically take under a week.
|
||||
|
||||
You can also export the YAML manifest file for your connector and share it with others. The manifest file contains all the information about the connector, including the global configuration, streams, and user inputs.
|
||||
@@ -0,0 +1,283 @@
|
||||
# Pagination
|
||||
|
||||
Pagination is a mechanism used by APIs in which data is split up into "pages" when returning results, so that the entire response data doesn't need to be returned all at once.
|
||||
|
||||
The Connector Builder offers a Pagination section which implements the most common pagination methods used by APIs. When enabled, the connector will use the pagination configuration you have provided to request consecutive pages of data from the API until there are no more pages to fetch.
|
||||
|
||||
If your API doesn't support pagination, simply leave the Pagination section disabled.
|
||||
|
||||
## Pagination methods
|
||||
|
||||
Check the documentation of the API you want to integrate to find which type of pagination it uses. Many API docs have a "Pagination" or "Paging" section that describes this.
|
||||
|
||||
The following pagination mechanisms are supported in the connector builder:
|
||||
|
||||
- [Offset Increment](#offset-increment)
|
||||
- [Page Increment](#page-increment)
|
||||
- [Cursor Pagination](#cursor-pagination)
|
||||
|
||||
Select the matching pagination method for your API and check the sections below for more information about individual methods. If none of these pagination methods work for your API, you will need to use the [low-code CDK](../config-based/low-code-cdk-overview) or [Python CDK](../cdk-python/) instead.
|
||||
|
||||
### Offset Increment
|
||||
|
||||
If your API paginates using offsets, the API docs will likely contain one of the following keywords:
|
||||
|
||||
- `offset`
|
||||
- `limit`
|
||||
|
||||
In this method of pagination, the "limit" specifies the maximum number of records to return per page, while the "offset" indicates the starting position or index from which to retrieve records.
|
||||
|
||||
For example, say that the API has the following dataset:
|
||||
|
||||
```
|
||||
[
|
||||
{"id": 1, "name": "Product A"},
|
||||
{"id": 2, "name": "Product B"},
|
||||
{"id": 3, "name": "Product C"},
|
||||
{"id": 4, "name": "Product D"},
|
||||
{"id": 5, "name": "Product E"}
|
||||
]
|
||||
```
|
||||
|
||||
Then the API may take in a request like this: `GET https://api.example.com/products?limit=2&offset=3`, which could result in the following response:
|
||||
|
||||
```
|
||||
{
|
||||
"data": [
|
||||
{"id": 4, "name": "Product D"},
|
||||
{"id": 5, "name": "Product E"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Normally, the caller of the API would need to implement some logic to then increment the `offset` by the `limit` amount and then submit another call with the updated `offset`, and continue on this pattern until all of the records have been retrieved.
|
||||
|
||||
The Offset Increment pagination mode in the Connector Builder does this for you. You just need to decide on a `limit` value to set (the general recommendation is to use the largest limit the API supports in order to minimize the number of API requests), and configure how the limit and offset are injected into the HTTP requests. Most APIs accept these values as query parameters like in the above example, but this can differ depending on the API. If an API does not accept a `limit`, the injection configuration for the limit can be disabled.
|
||||
|
||||
Either way, your connector will automatically increment the `offset` for subsequent requests based on the number of records it receives, and will continue until it receives fewer records than the limit you configured.
|
||||
|
||||
So for the example API and dataset above, you could apply the following Pagination configurations in the Connector Builder:
|
||||
|
||||
<iframe width="640" height="548" src="https://www.loom.com/embed/ec18b3c4e6db4007b4ef10ee808ab873" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
- Mode: `Offset Increment`
|
||||
- Limit: `2`
|
||||
- Inject limit into outgoing HTTP request:
|
||||
- Inject into: `request_parameter`
|
||||
- Field name: `limit`
|
||||
- Inject offset into outgoing HTTP request:
|
||||
- Inject into: `request_parameter`
|
||||
- Field name: `offset`
|
||||
|
||||
and this would cause your connector to make the following requests to the API in order to paginate through all of its data:
|
||||
|
||||
```
|
||||
GET https://api.example.com/products?limit=2&offset=0
|
||||
-> [
|
||||
{"id": 1, "name": "Product A"},
|
||||
{"id": 2, "name": "Product B"}
|
||||
]
|
||||
|
||||
GET https://api.example.com/products?limit=2&offset=2
|
||||
-> [
|
||||
{"id": 3, "name": "Product C"},
|
||||
{"id": 4, "name": "Product D"}
|
||||
]
|
||||
|
||||
GET https://api.example.com/products?limit=2&offset=4
|
||||
-> [
|
||||
{"id": 5, "name": "Product E"}
|
||||
]
|
||||
// less than 2 records returned -> stop
|
||||
```
|
||||
|
||||
The Connector Builder currently supports injecting these values into the query parameters (i.e. request parameters), headers, or body.
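
For reference, the equivalent paginator in the exported low-code YAML manifest looks roughly like the following sketch (the values mirror the example configuration above):

```yaml
paginator:
  type: DefaultPaginator
  pagination_strategy:
    type: OffsetIncrement
    page_size: 2                 # the configured limit
  page_size_option:              # how the limit is injected
    type: RequestOption
    inject_into: request_parameter
    field_name: limit
  page_token_option:             # how the offset is injected
    type: RequestOption
    inject_into: request_parameter
    field_name: offset
```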
|
||||
|
||||
#### Examples
|
||||
|
||||
The following APIs accept offset and limit pagination values as query parameters like in the above example:
|
||||
|
||||
- [Spotify API](https://developer.spotify.com/documentation/web-api/concepts/api-calls#pagination)
|
||||
- [GIPHY API](https://developers.giphy.com/docs/api/endpoint#trending)
|
||||
- [Twilio SendGrid API](https://docs.sendgrid.com/api-reference/how-to-use-the-sendgrid-v3-api/responses#pagination)
|
||||
|
||||
### Page Increment
|
||||
|
||||
If your API paginates using page increments, the API docs will likely contain one of the following keywords:
|
||||
|
||||
- `page size` / `page_size` / `pagesize` / `per_page`
|
||||
- `page number` / `page_number` / `pagenum` / `page`
|
||||
|
||||
In this method of pagination, the "page size" specifies the maximum number of records to return per request, while the "page number" indicates the specific page of data to retrieve.
|
||||
|
||||
This is similar to Offset Increment pagination, but instead of increasing the offset parameter by the number of records per page for the next request, the page number is simply increased by one to fetch the next page, iterating through all of them.
|
||||
|
||||
For example, say that the API has the following dataset:
|
||||
|
||||
```
|
||||
[
|
||||
{"id": 1, "name": "Product A"},
|
||||
{"id": 2, "name": "Product B"},
|
||||
{"id": 3, "name": "Product C"},
|
||||
{"id": 4, "name": "Product D"},
|
||||
{"id": 5, "name": "Product E"},
|
||||
{"id": 6, "name": "Product F"}
|
||||
]
|
||||
```
|
||||
|
||||
Then the API may take in a request like this: `GET https://api.example.com/products?page_size=2&page=1`, which could result in the following response:
|
||||
|
||||
```
|
||||
{
|
||||
"data": [
|
||||
{"id": 1, "name": "Product A"},
|
||||
{"id": 2, "name": "Product B"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
then incrementing the `page` by 1 to call it with `GET https://api.example.com/products?page_size=2&page=2` would result in:
|
||||
|
||||
```
|
||||
{
|
||||
"data": [
|
||||
{"id": 3, "name": "Product C"},
|
||||
{"id": 4, "name": "Product D"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
and so on.
|
||||
|
||||
The Connector Builder abstracts this away so that you only need to decide what page size to set (the general recommendation is to use the largest limit the API supports in order to minimize the number of API requests), what the starting page number should be (usually either 0 or 1, depending on the API), and how the page size and number are injected into the API requests. Similar to Offset Increment pagination, the page size injection can be disabled if the API does not accept a page size value.
|
||||
|
||||
Either way, your connector will automatically increment the page number by 1 for each subsequent request, and continue until it receives fewer records than the page size you configured.
|
||||
|
||||
So for the example API and dataset above, you could apply the following configurations in the Connector Builder:
|
||||
|
||||
<iframe width="640" height="554" src="https://www.loom.com/embed/c6187b4e21534b9a825e93a002c33d06" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
- Mode: `Page Increment`
|
||||
- Page size: `2`
|
||||
- Start from page: `1`
|
||||
- Inject page size into outgoing HTTP request:
|
||||
- Inject into: `request_parameter`
|
||||
- Field name: `page_size`
|
||||
- Inject page number into outgoing HTTP request:
|
||||
- Inject into: `request_parameter`
|
||||
- Field name: `page`
|
||||
|
||||
and this would cause your connector to make the following requests to the API in order to paginate through all of its data:
|
||||
|
||||
```
|
||||
GET https://api.example.com/products?page_size=2&page=1
|
||||
-> [
|
||||
{"id": 1, "name": "Product A"},
|
||||
{"id": 2, "name": "Product B"}
|
||||
]
|
||||
|
||||
GET https://api.example.com/products?page_size=2&page=2
|
||||
-> [
|
||||
{"id": 3, "name": "Product C"},
|
||||
{"id": 4, "name": "Product D"}
|
||||
]
|
||||
|
||||
GET https://api.example.com/products?page_size=2&page=3
-> [
    {"id": 5, "name": "Product E"},
    {"id": 6, "name": "Product F"}
]

GET https://api.example.com/products?page_size=2&page=4
-> []

// no records returned -> stop
|
||||
```
|
||||
|
||||
The Connector Builder currently supports injecting these values into the query parameters (i.e. request parameters), headers, or body.
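
As a reference point, the exported low-code YAML for this example would look roughly like the sketch below:

```yaml
paginator:
  type: DefaultPaginator
  pagination_strategy:
    type: PageIncrement
    page_size: 2
    start_from_page: 1
  page_size_option:
    type: RequestOption
    inject_into: request_parameter
    field_name: page_size
  page_token_option:             # carries the page number
    type: RequestOption
    inject_into: request_parameter
    field_name: page
```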
|
||||
|
||||
#### Examples
|
||||
|
||||
The following APIs accept page size/number pagination values as query parameters like in the above example:
|
||||
|
||||
- [WooCommerce API](https://woocommerce.github.io/woocommerce-rest-api-docs/#pagination)
|
||||
- [FreshDesk API](https://developers.freshdesk.com/api/)
|
||||
|
||||
### Cursor Pagination
|
||||
|
||||
If your API paginates using cursor pagination, the API docs will likely contain one of the following keywords:
|
||||
|
||||
- `cursor`
|
||||
- `link`
|
||||
- `next_token`
|
||||
|
||||
In this method of pagination, some identifier (e.g. a timestamp or record ID) is used to navigate through the API's records, rather than relying on fixed indices or page numbers like in the above methods. When making a request, clients provide a cursor value, and the API returns a subset of records starting from the specified cursor, along with the cursor for the next page. This can be especially helpful in preventing issues like duplicate or skipped records that can arise when using the above pagination methods.
|
||||
|
||||
Using the [Twitter API](https://developer.twitter.com/en/docs/twitter-api/pagination) as an example, a request is made to the `/tweets` endpoint, with the page size (called `max_results` in this case) set to 100. This will return a response like
|
||||
|
||||
```
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"created_at": "2020-12-11T20:44:52.000Z",
|
||||
"id": "1337498609819021312",
|
||||
"text": "Thanks to everyone who tuned in today..."
|
||||
},
|
||||
{
|
||||
"created_at": "2020-05-06T17:24:31.000Z",
|
||||
"id": "1258085245091368960",
|
||||
"text": "It’s now easier to understand Tweet impact..."
|
||||
},
|
||||
...
|
||||
|
||||
],
|
||||
"meta": {
|
||||
...
|
||||
"result_count": 100,
|
||||
"next_token": "7140w"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `meta.next_token` value of that response can then be set as the `pagination_token` in the next request, causing the API to return the next 100 tweets.
|
||||
|
||||
To integrate with such an API in the Connector Builder, you must configure how this "Next page cursor" is obtained for each request. In most cases, the next page cursor is either part of the response body or part of the HTTP headers. Select the respective type and define the property (or nested property) that holds the cursor value, for example "`meta`, `next_token`" for the Twitter API.
|
||||
|
||||
You can also configure how the cursor value is injected into the API Requests. In the above example, this would be set as a `request_parameter` with the field name `pagination_token`, but this is dependent on the API - check the docs to see if they describe how to set the cursor/token for subsequent requests. For cursor pagination, if `path` is selected as the `Inject into` option, then the entire request URL for the subsequent request will be replaced by the cursor value. This can be useful for APIs that return a full URL that should be requested for the next page of results, such as the [GitHub API](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28).
|
||||
|
||||
<iframe width="640" height="563" src="https://www.loom.com/embed/c4f657153baa407b993bfadf6ea51532" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
The "Page size" can optionally be specified as well; if so, how this page size gets injected into the HTTP requests can be configured similar to the above pagination methods.
|
||||
|
||||
When using the "response" or "headers" option for obtaining the next page cursor, the connector will stop requesting more pages as soon as no value can be found at the specified location. In some situations, this is not sufficient. If you need more control over how to obtain the cursor value and when to stop requesting pages, use the "custom" option and specify the "stop condition" using a jinja placeholder. For example if your API also has a boolean `more_results` property included in the response to indicate if there are more items to be retrieved, the stop condition should be `{{ response.more_results is false }}`
|
||||
|
||||
:::info
|
||||
|
||||
One potential variant of cursor pagination is an API that takes in some sort of record identifier to "start after". For example, the [PartnerStack API](https://docs.partnerstack.com/docs/partner-api#pagination) endpoints accept a `starting_after` parameter to which a record `key` is supposed to be passed.
|
||||
|
||||
In order to configure cursor pagination for this API in the connector builder, you will need to extract the `key` off of the last record returned by the previous request, using a "custom" next page cursor.
|
||||
This can be done in a couple different ways:
|
||||
|
||||
1. If you want to access fields on the records that you have defined through the record selector, you can use the `{{ last_records }}` object; so accessing the `key` field of the last record would look like `{{ last_records[-1]['key'] }}`. The `[-1]` syntax points to the last item in that `last_records` array.
|
||||
2. If you want to instead access a field on the raw API response body (e.g. your record selector filtered out the field you need), then you can use the `{{ response }}` object; so accessing the `key` field of the last item would look like `{{ response['data']['items'][-1]['key'] }}`.
|
||||
|
||||
This API also has a boolean `has_more` property included in the response to indicate if there are more items to be retrieved, so the stop condition in this case should be `{{ response.data.has_more is false }}`.
|
||||
|
||||
:::
|
||||
|
||||
#### Examples
|
||||
|
||||
The following APIs implement cursor pagination in various ways:
|
||||
|
||||
- [Twitter API](https://developer.twitter.com/en/docs/twitter-api/pagination) - includes `next_token` IDs in its responses which are passed in as query parameters to subsequent requests
|
||||
- [GitHub API](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28) - includes full-URL `link`s to subsequent pages of results
|
||||
- [FourSquare API](https://location.foursquare.com/developer/reference/pagination) - includes full-URL `link`s to subsequent pages of results
|
||||
|
||||
## Custom parameter injection
|
||||
|
||||
Using the "Inject page size / limit / offset into outgoing HTTP request" option in the pagination form works for most cases, but sometimes the API has special requirements that can't be handled this way:
|
||||
|
||||
- The API requires adding a prefix or a suffix to the actual value
|
||||
- Multiple values need to be put together in a single parameter
|
||||
- The value needs to be injected into the URL path
|
||||
- Some conditional logic needs to be applied
|
||||
|
||||
To handle these cases, disable injection in the pagination form and use the generic parameter section at the bottom of the stream configuration form to freely configure query parameters, headers and properties of the JSON body, by using jinja expressions and [available variables](/platform/connector-development/config-based/understanding-the-yaml-file/reference/#/variables). You can also use these variables as part of the URL path.
|
||||
|
||||
For example, the [Prestashop API](https://devdocs.prestashop-project.org/8/webservice/cheat-sheet/#list-options) requires setting offset and limit, separated by a comma, in a single query parameter (`?limit=<offset>,<limit>`).
|
||||
For this case, you can use the `next_page_token` variable to configure a query parameter with key `limit` and value `{{ next_page_token['next_page_token'] or '0' }},50` to inject the offset from the pagination strategy and a hardcoded limit of 50 into the same parameter.
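
In the exported YAML manifest, such a hand-rolled parameter is simply a request parameter on the requester that interpolates `next_page_token` (a sketch; the base URL and path are placeholders):

```yaml
requester:
  type: HttpRequester
  url_base: https://example.com/api
  path: /products
  http_method: GET
  request_parameters:
    # offset from the pagination strategy plus a hardcoded limit of 50, in one parameter
    limit: "{{ next_page_token['next_page_token'] or '0' }},50"
```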
|
||||
@@ -0,0 +1,149 @@
|
||||
# Partitioning
|
||||
|
||||
Partitioning is required if the records of a stream are grouped into buckets based on an attribute or parent resources that need to be queried separately to extract the records.
|
||||
|
||||
Sometimes records belonging to a single stream are partitioned into subsets that need to be fetched separately. In most cases, these partitions are a parent resource type of the resource type targeted by the connector. The partitioning feature can be used to configure your connector to iterate through all partitions. In API documentation, this concept can show up as mandatory parameters that need to be set on the path, query parameters or request body of the request.
|
||||
|
||||
Common API structures look like this:
|
||||
|
||||
- The [SurveySparrow API](https://developers.surveysparrow.com/rest-apis/response#getV3Responses) allows fetching a list of responses to surveys. For the `/responses` endpoint, the id of the survey to fetch responses for needs to be specified via the query parameter `survey_id`. The API does not allow fetching responses for all available surveys in a single request; there needs to be a separate request per survey. The surveys represent the partitions of the responses stream.
|
||||
- The [Woocommerce API](https://woocommerce.github.io/woocommerce-rest-api-docs/#order-notes) includes an endpoint to fetch notes of webshop orders via the `/orders/<id>/notes` endpoint. The `<id>` placeholder needs to be set to the id of the order to fetch the notes for. The orders represent the partitions of the notes stream.
|
||||
|
||||
There are some cases that require multiple requests to fetch all records as well, but partitioning is not the right tool to configure these in the connector builder:
|
||||
|
||||
- If your records are spread out across multiple pages that need to be requested individually, use the Pagination feature.
|
||||
- If your records are spread out over time and multiple requests are necessary to fetch all data (for example one request per day), use the Incremental sync feature.
|
||||
|
||||
## Dynamic and static partitioning
|
||||
|
||||
There are three possible sources for the partitions that need to be queried - they can be defined in the connector itself, supplied by the end user when configuring a Source based on the connector, or provided by the API via another endpoint (for example, the Woocommerce API also includes an `/orders` endpoint that returns all orders).
|
||||
|
||||
The first two options are a "static" form of partition routing (because the partitions won't change as long as the Airbyte configuration isn't changed). This can be achieved by configuring the Parameterized Requests component in the Connector Builder.
|
||||
|
||||
The API providing the partitions via one or multiple separate requests is a "dynamic" form of partitioning because the partitions can change any time. This can be achieved by configuring the Parent Stream partition component in the Connector Builder.
|
||||
|
||||
### Parameterized Requests
|
||||
|
||||
To configure static partitioning, enable the `Parameterized Requests` component. The following fields have to be configured:
|
||||
|
||||
- The "Parameter Values" can either be set to a list of strings, making the partitions part of the connector itself, or delegated to a user input so the end user configuring a Source based on the connector can control which partitions to fetch. When using "user input" mode for the parameter values, create a user input of type array and reference it as the value using the [placeholder](/platform/connector-development/config-based/understanding-the-yaml-file/reference#variables) value using `{{ config['<your chosen user input name>'] }}`
|
||||
- The "Current Parameter Value Identifier" can be freely choosen and is the identifier of the variable holding the current parameter value. It can for example be used in the path of the stream using the `{{ stream_partition.<identifier> }}` syntax.
|
||||
- The "Inject Parameter Value into outgoing HTTP Request" option allows you to configure how to add the current parameter value to the requests
|
||||
|
||||
#### Example
|
||||
|
||||
To enable static partitioning defined as part of the connector for the [SurveySparrow API](https://developers.surveysparrow.com/rest-apis/response#getV3Responses) responses, the Parameterized Requests component needs to be configured as follows:
|
||||
|
||||
- "Parameter Values" are set to the list of survey ids to fetch
|
||||
- "Current Parameter Value Identifier" is set to `survey` (this is not used for this example)
|
||||
- "Inject Parameter Value into outgoing HTTP Request" is set to `request_parameter` for the field name `survey_id`
|
||||
|
||||
When the parameter values are set to `123`, `456` and `789`, the following requests are executed:
|
||||
|
||||
```
|
||||
curl -X GET https://api.surveysparrow.com/v3/responses?survey_id=123
|
||||
curl -X GET https://api.surveysparrow.com/v3/responses?survey_id=456
|
||||
curl -X GET https://api.surveysparrow.com/v3/responses?survey_id=789
|
||||
```
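
In the exported low-code YAML, this static configuration corresponds roughly to a `ListPartitionRouter` (a sketch mirroring the values above):

```yaml
partition_router:
  type: ListPartitionRouter
  values:
    - "123"
    - "456"
    - "789"
  cursor_field: survey            # the "Current Parameter Value Identifier"
  request_option:
    type: RequestOption
    inject_into: request_parameter
    field_name: survey_id
```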
|
||||
|
||||
To enable user-configurable static partitions for the [Woocommerce API](https://woocommerce.github.io/woocommerce-rest-api-docs/#order-notes) order notes, the configuration would look like this:
|
||||
|
||||
- Set "Parameter Values" to "User input"
|
||||
- In the "Value" input, click the user icon and create a new user input
|
||||
- Name it `Order IDs`, set type to `array` and click create
|
||||
- Set "Current Parameter Value Identifier" to `order`
|
||||
- "Inject Parameter Value into outgoing HTTP Request" is disabled, because the order id needs to be injected into the path
|
||||
- In the general section of the stream configuration, the "URL Path" is set to `/orders/{{ stream_partition.order }}/notes`
|
||||
|
||||
<iframe width="640" height="612" src="https://www.loom.com/embed/5a0b36e3269b4e548230013099a56983?sid=6a21bb0a-f351-4f13-ac7e-b29c91e17248" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
When Order IDs are set to `123`, `456` and `789` in the testing values, the following requests are executed:
|
||||
|
||||
```
|
||||
curl -X GET https://example.com/wp-json/wc/v3/orders/123/notes
|
||||
curl -X GET https://example.com/wp-json/wc/v3/orders/456/notes
|
||||
curl -X GET https://example.com/wp-json/wc/v3/orders/789/notes
|
||||
```
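
The user-input variant corresponds roughly to a `ListPartitionRouter` whose values are interpolated from the configuration (a sketch; the user input is assumed to be named `order_ids`):

```yaml
# stream URL path: /orders/{{ stream_partition.order }}/notes
partition_router:
  type: ListPartitionRouter
  values: "{{ config['order_ids'] }}"   # user input of type array
  cursor_field: order
```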
|
||||
|
||||
### Parent Stream
|
||||
|
||||
To fetch the list of partitions (in this example surveys or orders) from the API itself, the "Parent Stream" component has to be used. It allows you to select another stream of the same connector to serve as the source for partitions to fetch. Each record of the parent stream is used as a partition for the current stream.
|
||||
|
||||
The following fields have to be configured to use the Parent Stream component:
|
||||
|
||||
- The "Parent Stream" defines the records of which stream should be used as partitions
|
||||
- The "Parent Key" is the property on the parent stream record that should become the partition value (in most cases this is some form of id)
|
||||
- The "Current Parent Key Value Identifier" can be freely choosen and is the identifier of the variable holding the current partition value. It can for example be used in the path of the stream using the `{{ stream_partition.<identifier> }}` [interpolation placeholder](/platform/connector-development/config-based/understanding-the-yaml-file/reference#variables).
|
||||
|
||||
#### Example
|
||||
|
||||
To enable dynamic partitioning for the [Woocommerce API](https://woocommerce.github.io/woocommerce-rest-api-docs/#order-notes) order notes, first an orders stream needs to be configured for the `/orders` endpoint to fetch a list of orders. Once this is done, the Parent Stream component of the order notes stream has to be configured like this:
|
||||
|
||||
- "Parent Key" is set to `id`
|
||||
- "Current Parent Key Value Identifier" is set to `order`
|
||||
- In the general section of the stream configuration, the "URL Path" is set to `/orders/{{ stream_partition.order }}/notes`
|
||||
|
||||
<iframe width="640" height="612" src="https://www.loom.com/embed/e10a9128d2d9461b999d3d278669cc20?sid=abc9a9a4-5608-4887-adcc-de83ccb73137" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
|
||||
|
||||
When triggering a sync, the connector will first fetch all records of the orders stream. The records will look like this:
|
||||
|
||||
```
|
||||
{ "id": 123, "currency": "EUR", "shipping_total": "12.23", ... }
|
||||
{ "id": 456, "currency": "EUR", "shipping_total": "45.56", ... }
|
||||
{ "id": 789, "currency": "EUR", "shipping_total": "78.89", ... }
|
||||
```
|
||||
|
||||
To turn a record into a partition value, the "parent key" is extracted, resulting in the partition values `123`, `456` and `789`. In turn, this results in the following requests to fetch the records of the notes stream:
|
||||
|
||||
```
|
||||
curl -X GET https://example.com/wp-json/wc/v3/orders/123/notes
|
||||
curl -X GET https://example.com/wp-json/wc/v3/orders/456/notes
|
||||
curl -X GET https://example.com/wp-json/wc/v3/orders/789/notes
|
||||
```
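
In the exported low-code YAML, this setup corresponds roughly to a `SubstreamPartitionRouter` (a sketch; the reference to the orders stream definition is illustrative):

```yaml
partition_router:
  type: SubstreamPartitionRouter
  parent_stream_configs:
    - type: ParentStreamConfig
      parent_key: id             # field on each parent record that becomes the partition value
      partition_field: order     # available as {{ stream_partition.order }}
      stream:
        $ref: "#/definitions/streams/orders"   # the orders stream defined elsewhere in the manifest
```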
|
||||
|
||||
## Multiple partition routers
|
||||
|
||||
It is possible to configure multiple partitioning mechanisms on a single stream - if this is the case, all possible combinations of partition values are requested separately.
|
||||
|
||||
For example, the [Google Pagespeed API](https://developers.google.com/speed/docs/insights/v5/reference/pagespeedapi/runpagespeed) allows specifying the URL and the "strategy" to run an analysis for. To allow a user to trigger an analysis for multiple URLs and strategies at the same time, two Parameterized Request lists can be used (one injecting the parameter value into the `url` parameter, one injecting it into the `strategy` parameter).
|
||||
|
||||
If a user configures the URLs `example.com` and `example.org` and the strategies `desktop` and `mobile`, the following requests will be triggered:
|
||||
|
||||
```
|
||||
curl -X GET https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=example.com&strategy=desktop
|
||||
curl -X GET https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=example.com&strategy=mobile
|
||||
curl -X GET https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=example.org&strategy=desktop
|
||||
curl -X GET https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=example.org&strategy=mobile
|
||||
```
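
In YAML terms, multiple routers are expressed as a list under `partition_router`, roughly like the sketch below (here both lists are hardcoded for brevity; either could also reference a user input):

```yaml
partition_router:
  - type: ListPartitionRouter
    values:
      - example.com
      - example.org
    cursor_field: url
    request_option:
      type: RequestOption
      inject_into: request_parameter
      field_name: url
  - type: ListPartitionRouter
    values:
      - desktop
      - mobile
    cursor_field: strategy
    request_option:
      type: RequestOption
      inject_into: request_parameter
      field_name: strategy
```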
|
||||
|
||||
## Adding the partition value to the record
|
||||
|
||||
Sometimes it's helpful to attach, to each record, the partition it belongs to, so it can be used during analysis in the destination. This can be done using a transformation that adds a field with the `{{ stream_partition.<identifier> }}` interpolation placeholder.
|
||||
|
||||
For example, when fetching the order notes via the [Woocommerce API](https://woocommerce.github.io/woocommerce-rest-api-docs/#order-notes), the order id itself is not included in the note record, which means it won't be possible to tell which note belongs to which order:
|
||||
|
||||
```
|
||||
{ "id": 999, "author": "Jon Doe", "note": "Great product!" }
|
||||
```
|
||||
|
||||
However the order id can be added by taking the following steps:
|
||||
|
||||
- Make sure the "Current Parameter Value Identifier" is set to `order`
|
||||
- Add an "Add field" transformation with "Path" `order_id` and "Value" `{{ stream_partition.order }}`
|
||||
|
||||
Using this configuration, the notes record looks like this:
|
||||
|
||||
```
|
||||
{ "id": 999, "author": "Jon Doe", "note": "Great product!", "order_id": 123 }
|
||||
```
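
In the exported low-code YAML, this corresponds roughly to an `AddFields` transformation (sketch):

```yaml
transformations:
  - type: AddFields
    fields:
      - path: ["order_id"]
        value: "{{ stream_partition.order }}"   # the current partition value
```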
|
||||
|
||||
## Custom parameter injection
|
||||
|
||||
Using the "Inject Parameter / Parent Key Value into outgoing HTTP Request" option in the Parameterized Requests and Parent Stream components works for most cases, but sometimes the API has special requirements that can't be handled this way:
|
||||
|
||||
- The API requires adding a prefix or a suffix to the actual value
|
||||
- Multiple values need to be put together in a single parameter
|
||||
- The value needs to be injected into the URL path
|
||||
- Some conditional logic needs to be applied
|
||||
|
||||
To handle these cases, disable injection in the component and use the generic parameter section at the bottom of the stream configuration form to freely configure query parameters, headers and properties of the JSON body, by using jinja expressions and [available variables](/platform/connector-development/config-based/understanding-the-yaml-file/reference/#/variables). You can also use these variables (like `stream_partition`) as part of the URL path as shown in the Woocommerce example above.
|
||||
@@ -0,0 +1,828 @@
|
||||
import Diff from "./assets/record-processing-schema-diff.png";
|
||||
|
||||
# Record processing
|
||||
|
||||
Connectors built with the connector builder always make HTTP requests, receive the responses and emit records. Besides making the right requests, it's important to properly hand over the records to the system:
|
||||
|
||||
- Decode the response body (HTTP response format)
|
||||
- Extract the records (record selection)
|
||||
- Do optional post-processing (transformations)
|
||||
- Provide record metadata to the system to inform downstream processes (primary key and declared schema)
|
||||
|
||||
## Response Decoding
|
||||
|
||||
The first step in converting an HTTP response into records is decoding the response body into normalized JSON objects, as the rest of the record processing logic performed by the connector expects to operate on JSON objects.
|
||||
|
||||
The HTTP Response Format is used to configure this decoding by declaring what the encoding of the response body is.
|
||||
|
||||
Each of the supported formats are explained below.
|
||||
|
||||
### JSON
|
||||
|
||||
Example JSON response body:
|
||||
|
||||
```json
|
||||
{
|
||||
"cod": "200",
|
||||
"message": 0,
|
||||
"cnt": 40,
|
||||
"list": [
|
||||
{
|
||||
"dt": 1728604800,
|
||||
"main": {
|
||||
"temp": 283.51,
|
||||
"feels_like": 283.21,
|
||||
"temp_min": 283.51,
|
||||
"temp_max": 285.11,
|
||||
"pressure": 1014,
|
||||
"sea_level": 1014,
|
||||
"grnd_level": 982,
|
||||
"humidity": 100,
|
||||
"temp_kf": -1.6
|
||||
}
|
||||
},
|
||||
{
|
||||
"dt": 1728615600,
|
||||
"main": {
|
||||
"temp": 283.55,
|
||||
"feels_like": 283.13,
|
||||
"temp_min": 283.55,
|
||||
"temp_max": 283.63,
|
||||
"pressure": 1014,
|
||||
"sea_level": 1014,
|
||||
"grnd_level": 983,
|
||||
"humidity": 95,
|
||||
"temp_kf": -0.08
|
||||
}
|
||||
},
|
||||
...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
This is the most common response format. APIs usually include a `"Content-Type": "application/json"` response header when returning a JSON body. Consult that API's documentation to verify which content type you should expect.
|
||||
|
||||
In this case, no extra decoding needs to happen to convert these responses into JSON because they are already in JSON format.
|
||||
|
||||
### XML
|
||||
|
||||
Example XML response body:
|
||||
|
||||
```xml
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<weatherdata>
|
||||
<location>
|
||||
<name>Lyon</name>
|
||||
<type></type>
|
||||
<country>FR</country>
|
||||
<timezone>7200</timezone>
|
||||
</location>
|
||||
<sun rise="2024-10-11T05:52:02" set="2024-10-11T17:02:14"></sun>
|
||||
<forecast>
|
||||
<time from="2024-10-10T21:00:00" to="2024-10-11T00:00:00">
|
||||
<symbol number="800" name="clear sky" var="01n"></symbol>
|
||||
<precipitation probability="0"></precipitation>
|
||||
<windDirection deg="156" code="SSE" name="South-southeast"></windDirection>
|
||||
<windSpeed mps="0.59" unit="m/s" name="Calm"></windSpeed>
|
||||
<windGust gust="0.73" unit="m/s"></windGust>
|
||||
</time>
|
||||
<time from="2024-10-11T00:00:00" to="2024-10-11T03:00:00">
|
||||
<symbol number="800" name="clear sky" var="01n"></symbol>
|
||||
<precipitation probability="0"></precipitation>
|
||||
<windDirection deg="307" code="NW" name="Northwest"></windDirection>
|
||||
<windSpeed mps="0.77" unit="m/s" name="Calm"></windSpeed>
|
||||
<windGust gust="0.89" unit="m/s"></windGust>
|
||||
</time>
|
||||
...
|
||||
</forecast>
|
||||
</weatherdata>
|
||||
```
|
||||
|
||||
APIs usually include a `"Content-Type": "application/xml"` response header when returning an XML body. Consult that API's documentation to verify which content type you should expect.
|
||||
|
||||
In this case, the XML body is converted into a normalized JSON format by following the patterns described in [this spec](https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html) from xml.com.
|
||||
|
||||
For the above example, the XML response format setting would result in the following normalized JSON output:
|
||||
|
||||
```json
|
||||
{
|
||||
"weatherdata": {
|
||||
"location": {
|
||||
"name": "Lyon",
|
||||
"country": "FR",
|
||||
"timezone": "7200",
|
||||
},
|
||||
"sun": {
|
||||
"@rise": "2024-10-11T05:52:02",
|
||||
"@set": "2024-10-11T17:02:14"
|
||||
},
|
||||
"forecast": {
|
||||
"time": [
|
||||
{
|
||||
"@from": "2024-10-10T21:00:00",
|
||||
"@to": "2024-10-11T00:00:00",
|
||||
"symbol": {
|
||||
"@number": "800",
|
||||
"@name": "clear sky",
|
||||
"@var": "01n"
|
||||
},
|
||||
"precipitation": {
|
||||
"@probability": "0"
|
||||
},
|
||||
"windDirection": {
|
||||
"@deg": "156",
|
||||
"@code": "SSE",
|
||||
"@name": "South-southeast"
|
||||
},
|
||||
"windSpeed": {
|
||||
"@mps": "0.59",
|
||||
"@unit": "m/s",
|
||||
"@name": "Calm"
|
||||
},
|
||||
"windGust": {
|
||||
"@gust": "0.73",
|
||||
"@unit": "m/s"
|
||||
}
|
||||
},
|
||||
{
|
||||
"@from": "2024-10-11T00:00:00",
|
||||
"@to": "2024-10-11T03:00:00",
|
||||
"symbol": {
|
||||
"@number": "800",
|
||||
"@name": "clear sky",
|
||||
"@var": "01n"
|
||||
},
|
||||
"precipitation": {
|
||||
"@probability": "0"
|
||||
},
|
||||
"windDirection": {
|
||||
"@deg": "307",
|
||||
"@code": "NW",
|
||||
"@name": "Northwest"
|
||||
},
|
||||
"windSpeed": {
|
||||
"@mps": "0.77",
|
||||
"@unit": "m/s",
|
||||
"@name": "Calm"
|
||||
},
|
||||
"windGust": {
|
||||
"@gust": "0.89",
|
||||
"@unit": "m/s"
|
||||
}
|
||||
},
|
||||
...
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### JSON Lines
|
||||
|
||||
Example JSON Lines response body:
|
||||
|
||||
```json
|
||||
{"name": "John", "age": 30, "city": "New York"}
|
||||
{"name": "Alice", "age": 25, "city": "Los Angeles"}
|
||||
{"name": "Bob", "age": 50, "city": "Las Vegas"}
|
||||
```
|
||||
|
||||
[JSON Lines](https://jsonlines.org/) is a text format that contains one JSON object per line, with newlines in between.
|
||||
|
||||
There is no standardized `Content-Type` header for API responses containing JSON Lines, so it is common for APIs to just include a `"Content-Type": "text/html"` or `"Content-Type": "text/plain"` response header in this case. Consult that API's documentation to verify which content type you should expect.
|
||||
|
||||
For the above example, the JSON Lines response format setting would result in the following normalized JSON output:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"name": "John",
|
||||
"age": 30,
|
||||
"city": "New York"
|
||||
},
|
||||
{
|
||||
"name": "Alice",
|
||||
"age": 25,
|
||||
"city": "Los Angeles"
|
||||
},
|
||||
{
|
||||
"name": "Bob",
|
||||
"age": 50,
|
||||
"city": "Las Vegas"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Iterable
|
||||
|
||||
Example iterable response body:
|
||||
|
||||
```text
|
||||
2021-04-14 16:52:18 +00:00
|
||||
2021-04-14 16:52:23 +00:00
|
||||
2021-04-14 16:52:21 +00:00
|
||||
2021-04-14 16:52:23 +00:00
|
||||
2021-04-14 16:52:27 +00:00
|
||||
```
|
||||
|
||||
This response format option is used for API response bodies that are text containing strings separated by newlines.
|
||||
|
||||
APIs are likely to include a `"Content-Type": "text/html"` or `"Content-Type": "text/plain"` response header in this case. Consult that API's documentation to verify which content type you should expect.
|
||||
|
||||
By convention, the connector will wrap each of these strings in a JSON object under a `record` key.
|
||||
|
||||
For the above example, the Iterable response format setting would result in the following normalized JSON output:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"record": "2021-04-14 16:52:18 +00:00"
|
||||
},
|
||||
{
|
||||
"record": "2021-04-14 16:52:23 +00:00"
|
||||
},
|
||||
{
|
||||
"record": "2021-04-14 16:52:21 +00:00"
|
||||
},
|
||||
{
|
||||
"record": "2021-04-14 16:52:23 +00:00"
|
||||
},
|
||||
{
|
||||
"record": "2021-04-14 16:52:27 +00:00"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### CSV
|
||||
|
||||
Example CSV (Comma-Separated Values) response body:
|
||||
|
||||
```csv
|
||||
id,name,email,created_at
|
||||
1,John Doe,john.doe@example.com,2023-01-15T09:30:00Z
|
||||
2,Jane Smith,jane.smith@example.com,2023-01-16T14:20:00Z
|
||||
3,Bob Johnson,bob.johnson@example.com,2023-01-17T11:45:00Z
|
||||
```
|
||||
|
||||
This response format is used for API responses that return data in CSV format. APIs typically include a `"Content-Type": "text/csv"` response header when returning CSV data. Consult that API's documentation to verify which content type you should expect.
|
||||
|
||||
The CSV decoder converts each row of the CSV into a JSON object, using the header row to determine field names. CSV files can be delimited with characters other than commas. Set a different delimiter if your file isn't comma-delimited.
|
||||
|
||||
For the preceding example, the CSV response format would result in the following normalized JSON output:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": "1",
|
||||
"name": "John Doe",
|
||||
"email": "john.doe@example.com",
|
||||
"created_at": "2023-01-15T09:30:00Z"
|
||||
},
|
||||
{
|
||||
"id": "2",
|
||||
"name": "Jane Smith",
|
||||
"email": "jane.smith@example.com",
|
||||
"created_at": "2023-01-16T14:20:00Z"
|
||||
},
|
||||
{
|
||||
"id": "3",
|
||||
"name": "Bob Johnson",
|
||||
"email": "bob.johnson@example.com",
|
||||
"created_at": "2023-01-17T11:45:00Z"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### ZIP and gzip
|
||||
|
||||
Some APIs respond with a compressed ZIP or gzip file. This is more common when you're dealing with large, reporting-based datasets. Amplitude's [Export API](https://amplitude.com/docs/apis/analytics/export) is an example. Occasionally, APIs nest compressed zip files within each other, like a zip file that contains gzip files, which themselves contain more files.
|
||||
|
||||
When you set your HTTP response format to ZIP or gzip, you can nest another decoder inside it. You can do this recursively if you need Airbyte to work through a number of layers of compressed files to access the data within.
|
||||
|
||||
## Record Selection
|
||||
|
||||
<iframe
|
||||
width="583"
|
||||
height="393"
|
||||
src="https://www.loom.com/embed/f4a36e769a1d4f87a14e3982f59d1fb2"
|
||||
frameborder="0"
|
||||
webkitallowfullscreen
|
||||
mozallowfullscreen
|
||||
allowfullscreen
|
||||
></iframe>
|
||||
|
||||
After decoding the response into normalized JSON objects (see [Response Decoding](#response-decoding)), the connector must then decide how to extract records from those JSON objects.
|
||||
|
||||
The Record Selector component contains a few different levers to configure this extraction:
|
||||
- Field Path
|
||||
- Record Filter
|
||||
- Cast Record Fields to Schema Types
|
||||
|
||||
These will be explained below.
|
||||
|
||||
### Field Path
|
||||
The Field Path feature lets you define a path into the fields of the response to point to the part of the response which should be treated as the record(s).
|
||||
|
||||
Below are a few different examples of what this can look like depending on the API.
|
||||
|
||||
#### Top-level key pointing to array
|
||||
Very often, the response body contains an array of records along with some supplementary information (for example, metadata for pagination).
|
||||
|
||||
For example the ["Most popular" NY Times API](https://developer.nytimes.com/docs/most-popular-product/1/overview) returns the following response body:
|
||||
|
||||
<pre>
|
||||
{`{
|
||||
"status": "OK",
|
||||
"copyright": "Copyright (c) 2023 The New York Times Company. All Rights Reserved.",
|
||||
"num_results": 20,
|
||||
`}
|
||||
<b>{`"results": [`}</b>
|
||||
{`
|
||||
{
|
||||
"uri": "nyt://article/c15e5227-ed68-54d9-9e5b-acf5a451ec37",
|
||||
"url": "https://www.nytimes.com/2023/04/16/us/science-of-reading-literacy-parents.html",
|
||||
"id": 100000008811231,
|
||||
"asset_id": 100000008811231,
|
||||
"source": "New York Times",
|
||||
// ...
|
||||
},
|
||||
// ..
|
||||
`}
|
||||
<b>{`]`}</b>
|
||||
{`,
|
||||
// ...
|
||||
}`}
|
||||
</pre>
|
||||
|
||||
In this case, **setting the Field Path to `results`** selects the array with the actual records, everything else is discarded.
|
||||
|
||||
#### Nested array
|
||||
|
||||
In some cases the array of actual records is nested multiple levels deep in the response, like for the ["Archive" NY Times API](https://developer.nytimes.com/docs/archive-product/1/overview):
|
||||
|
||||
<pre>
|
||||
{`{
|
||||
"copyright": "Copyright (c) 2020 The New York Times Company. All Rights Reserved.",
|
||||
"response": {
|
||||
`}
|
||||
<b>{`"docs": [`}</b>
|
||||
{`
|
||||
{
|
||||
"abstract": "From the Treaty of Versailles to Prohibition, the events of that year shaped America, and the world, for a century to come. ",
|
||||
"web_url": "https://www.nytimes.com/2018/12/31/opinion/1919-america.html",
|
||||
"snippet": "From the Treaty of Versailles to Prohibition, the events of that year shaped America, and the world, for a century to come. ",
|
||||
// ...
|
||||
},
|
||||
// ...
|
||||
`}
|
||||
<b>{`]`}</b>
|
||||
{`
|
||||
}
|
||||
}`}
|
||||
</pre>
|
||||
|
||||
In this case, **setting the Field Path to `response`,`docs`** selects the nested array.
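
In the exported low-code YAML, the Field Path corresponds roughly to the `field_path` of a `DpathExtractor` inside the record selector, with one entry per level of nesting (sketch):

```yaml
record_selector:
  type: RecordSelector
  extractor:
    type: DpathExtractor
    field_path: ["response", "docs"]   # use [] when the response body itself is the array of records
```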
|
||||
|
||||
#### Root array
|
||||
|
||||
In some cases, the response body itself is an array of records, like in the [CoinAPI API](https://docs.coinapi.io/market-data/rest-api/quotes):
|
||||
|
||||
<pre>
|
||||
<b>{`[`}</b>
|
||||
{`
|
||||
{
|
||||
"symbol_id": "BITSTAMP_SPOT_BTC_USD",
|
||||
"time_exchange": "2013-09-28T22:40:50.0000000Z",
|
||||
"time_coinapi": "2017-03-18T22:42:21.3763342Z",
|
||||
// ...
|
||||
},
|
||||
{
|
||||
"symbol_id": "BITSTAMP_SPOT_BTC_USD",
|
||||
"time_exchange": "2013-09-28T22:40:50.0000000Z",
|
||||
"time_coinapi": "2017-03-18T22:42:21.3763342Z",
|
||||
// ..
|
||||
}
|
||||
// ...
|
||||
`}
|
||||
<b>{`]`}</b>
|
||||
</pre>
|
||||
|
||||
In this case, **the Field Path can be omitted** and the whole response becomes the list of records.
|
||||
|
||||
#### Single object
|
||||
|
||||
Sometimes, there is only one record returned per request from the API. In this case, the field path can also point to an object instead of an array which will be handled as the only record, like in the case of the [Exchange Rates API](https://exchangeratesapi.io/documentation/#historicalrates):
|
||||
|
||||
<pre>
|
||||
{`{
|
||||
"success": true,
|
||||
"historical": true,
|
||||
"date": "2013-12-24",
|
||||
"timestamp": 1387929599,
|
||||
"base": "GBP",
|
||||
`}
|
||||
<b>{`"rates": {`}</b>
|
||||
{`
|
||||
"USD": 1.636492,
|
||||
"EUR": 1.196476,
|
||||
"CAD": 1.739516
|
||||
`}
|
||||
<b>{`}`}</b>
|
||||
{`
|
||||
}`}
|
||||
</pre>
|
||||
|
||||
In this case, **setting the Field Path to `rates`** will yield a single record which contains all the exchange rates in a single object.
|
||||
|
||||
#### Fields nested in arrays
|
||||
|
||||
In some cases, records are located in multiple branches of the response object (for example within each item of an array):
|
||||
|
||||
```
|
||||
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"record": {
|
||||
"id": "1"
|
||||
}
|
||||
},
|
||||
{
|
||||
"record": {
|
||||
"id": "2"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
A Field Path with a placeholder `*` selects all children at the current position in the path, so in this case **setting Field Path to `data`,`*`,`record`** will return the following records:
|
||||
|
||||
```
|
||||
[
|
||||
{
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Record Filter
|
||||
In some cases, certain records should be excluded from the final output of the connector, which can be accomplished through the Record Filter feature within the Record Selector component.
|
||||
|
||||
For example, say your API response looks like this:
|
||||
```
|
||||
[
|
||||
{
|
||||
"id": 1,
|
||||
"status": "pending"
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"status": "active"
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"status": "expired"
|
||||
}
|
||||
]
|
||||
```
|
||||
and you only want to sync records for which the status is not `expired`.
|
||||
|
||||
You can accomplish this by setting the Record Filter to `{{ record.status != 'expired' }}`
|
||||
|
||||
Any records for which this expression evaluates to `true` will be emitted by the connector, and any for which it evaluates to `false` will be excluded from the output.
|
||||
|
||||
Note that Record Filter value must be an [interpolated string](/platform/connector-development/config-based/advanced-topics/string-interpolation) with the filtering condition placed inside double curly braces `{{ }}`.
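
In the exported low-code YAML, the filter corresponds roughly to a `RecordFilter` on the record selector (a sketch; the empty `field_path` matches the root-array response above):

```yaml
record_selector:
  type: RecordSelector
  extractor:
    type: DpathExtractor
    field_path: []
  record_filter:
    type: RecordFilter
    condition: "{{ record.status != 'expired' }}"   # only records matching the condition are emitted
```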
|
||||
|
||||
### Cast Record Fields to Schema Types
|
||||
Sometimes the type of a field in the record is not the desired type. If the existing field type can be simply cast to the desired type, this can be solved by setting the stream's declared schema to the desired type and enabling `Cast Record Fields to Schema Types`.
|
||||
|
||||
For example, say the API response looks like this:
|
||||
```
|
||||
[
|
||||
{
|
||||
"street": "Kulas Light",
|
||||
"city": "Gwenborough",
|
||||
"geo": {
|
||||
"lat": "-37.3159",
|
||||
"lng": "81.1496"
|
||||
}
|
||||
},
|
||||
{
|
||||
"street": "Victor Plains",
|
||||
"city": "Wisokyburgh",
|
||||
"geo": {
|
||||
"lat": "-43.9509",
|
||||
"lng": "-34.4618"
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
Notice that the `lat` and `lng` values are strings despite them all being numeric. If you would rather have these fields contain raw number values in your output records, you can do the following:
|
||||
- In the Declared Schema tab, disable `Automatically import detected schema`
|
||||
- Change the `type` of the `lat` and `lng` fields from `string` to `number`
|
||||
- Enable `Cast Record Fields to Schema Types` in the Record Selector component
|
||||
|
||||
This will cause those fields in the output records to be cast to the type declared in the schema, so the output records will now look like this:
|
||||
```
|
||||
[
|
||||
{
|
||||
"street": "Kulas Light",
|
||||
"city": "Gwenborough",
|
||||
"geo": {
|
||||
"lat": -37.3159,
|
||||
"lng": 81.1496
|
||||
}
|
||||
},
|
||||
{
|
||||
"street": "Victor Plains",
|
||||
"city": "Wisokyburgh",
|
||||
"geo": {
|
||||
"lat": -43.9509,
|
||||
"lng": -34.4618
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
Note that this casting is performed on a best-effort basis; if you tried to set the `city` field's type to `number` in the schema, for example, it would remain unchanged because those string values cannot be cast to numbers.
|
||||
|
||||
|
||||
## Transformations
|
||||
|
||||
It is recommended not to change records during the extraction process the connector is performing, but instead to load them into the downstream warehouse unchanged and perform the necessary transformations there, in order to stay flexible in what data is required. However, there are some reasons that require modifying the fields of records before they are sent to the warehouse:
|
||||
|
||||
- Remove personally identifiable information (PII) to ensure compliance with local legislation
|
||||
- Pseudonymize sensitive fields
|
||||
- Remove large fields that don't contain interesting information and significantly increase load on the system
|
||||
|
||||
The "transformations" feature can be used for these purposes.
|
||||
|
||||
### Removing fields
|
||||
|
||||
<iframe
|
||||
width="640"
|
||||
height="538"
|
||||
src="https://www.loom.com/embed/8dca8feaa54f49848667d3fd5b945372"
|
||||
frameborder="0"
|
||||
webkitallowfullscreen
|
||||
mozallowfullscreen
|
||||
allowfullscreen
|
||||
></iframe>
|
||||
|
||||
To remove a field from a record, add a new transformation in the "Transformations" section of type "remove" and enter the field path. For example in case of the [EmailOctopus API](https://emailoctopus.com/api-documentation/campaigns/get-all), the campaigns records also include the html content of the mailing which takes up a lot of space:
|
||||
|
||||
```
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"id": "00000000-0000-0000-0000-000000000000",
|
||||
"status": "SENT",
|
||||
"name": "Foo",
|
||||
"subject": "Bar",
|
||||
"from": {
|
||||
"name": "John Doe",
|
||||
"email_address": "john.doe@gmail.com"
|
||||
},
|
||||
"content": {
|
||||
"html": "<html>lots of text here...<html>",
|
||||
"plain_text": "Lots of plain text here...."
|
||||
},
|
||||
"created_at": "2023-04-13T15:28:37+00:00",
|
||||
"sent_at": "2023-04-14T15:28:37+00:00"
|
||||
},
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Setting the "Path" of the remove-transformation to `content` removes these fields from the records:
|
||||
|
||||
```
|
||||
{
|
||||
"id": "00000000-0000-0000-0000-000000000000",
|
||||
"status": "SENT",
|
||||
"from": {
|
||||
"name": "John Doe",
|
||||
"email_address": "john.doe@gmail.com"
|
||||
},
|
||||
"name": "Foo",
|
||||
"subject": "Bar",
|
||||
"created_at": "2023-04-13T15:28:37+00:00",
|
||||
"sent_at": "2023-04-14T15:28:37+00:00"
|
||||
}
|
||||
```
|
||||
|
||||
As with the record selector's Field Path, properties of deeply nested objects can be removed as well by specifying the path of properties leading to the target field that should be removed.
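If you inspect the exported low-code YAML, the remove transformation is expressed roughly as follows (a sketch; exact fields may vary between CDK versions). Each entry in `field_pointers` is the path to one field to drop:

```yaml
transformations:
  - type: RemoveFields
    field_pointers:
      - ["content"]          # removes the whole content object
      - ["content", "html"]  # alternatively, remove only the nested html field
```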
|
||||
|
||||
### Removing fields that match a glob pattern
|
||||
|
||||
Imagine that a property should be removed from the data regardless of the level at which it appears. This can be achieved by adding a `**` to the "Path" - for example "`**`, `name`" will remove all "name" fields anywhere in the record:
|
||||
|
||||
```
|
||||
{
|
||||
"id": "00000000-0000-0000-0000-000000000000",
|
||||
"status": "SENT",
|
||||
"from": {
|
||||
"email_address": "john.doe@gmail.com"
|
||||
},
|
||||
"subject": "Bar",
|
||||
"created_at": "2023-04-13T15:28:37+00:00",
|
||||
"sent_at": "2023-04-14T15:28:37+00:00"
|
||||
}
|
||||
```
|
||||
|
||||
The `*` character can also be used as a placeholder to match all fields that start with a certain prefix - the "Path" `s*` will remove all top-level fields that start with the character `s`:
|
||||
|
||||
```
|
||||
{
|
||||
"id": "00000000-0000-0000-0000-000000000000",
|
||||
"from": {
|
||||
"email_address": "john.doe@gmail.com"
|
||||
},
|
||||
"created_at": "2023-04-13T15:28:37+00:00",
|
||||
}
|
||||
```
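In the exported YAML, glob-style paths are written the same way as regular paths (a sketch, assuming the same `RemoveFields` component shown earlier; exact behavior may vary between CDK versions):

```yaml
transformations:
  - type: RemoveFields
    field_pointers:
      - ["**", "name"]  # remove every "name" field at any nesting level
      - ["s*"]          # remove all top-level fields starting with "s"
```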
|
||||
|
||||
### Adding fields
|
||||
|
||||
<iframe
|
||||
width="640"
|
||||
height="409"
|
||||
src="https://www.loom.com/embed/ab3b72cafb734112b645607ab6d1ab1f"
|
||||
frameborder="0"
|
||||
webkitallowfullscreen
|
||||
mozallowfullscreen
|
||||
allowfullscreen
|
||||
></iframe>
|
||||
|
||||
Adding fields can be used to apply a hashing function to an existing field in order to pseudonymize it. To do this, add a new transformation in the "Transformations" section of type "add" and enter the field path and the new value. For example, in the case of the [EmailOctopus API](https://emailoctopus.com/api-documentation/campaigns/get-all), the campaign records include the name of the sender:
|
||||
|
||||
```
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"id": "00000000-0000-0000-0000-000000000000",
|
||||
"status": "SENT",
|
||||
"name": "Foo",
|
||||
"subject": "Bar",
|
||||
"from": {
|
||||
"name": "John Doe",
|
||||
"email_address": "john.doe@gmail.com"
|
||||
},
|
||||
"created_at": "2023-04-13T15:28:37+00:00",
|
||||
"sent_at": "2023-04-14T15:28:37+00:00"
|
||||
},
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
To apply a hash function to it, set the "Path" to "`from`, `name`" to select the name property nested in the from object and set the value to `{{ record['from']['name'] | hash('md5') }}`. This hashes the name in the record:
|
||||
|
||||
```
|
||||
{
|
||||
"id": "00000000-0000-0000-0000-000000000000",
|
||||
"status": "SENT",
|
||||
"name": "Foo",
|
||||
"subject": "Bar",
|
||||
"from": {
|
||||
"name": "4c2a904bafba06591225113ad17b5cec",
|
||||
"email_address": "john.doe@gmail.com"
|
||||
},
|
||||
"created_at": "2023-04-13T15:28:37+00:00",
|
||||
"sent_at": "2023-04-14T15:28:37+00:00"
|
||||
}
|
||||
```
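In the exported low-code YAML, this corresponds roughly to an `AddFields` transformation (a sketch; exact fields may vary between CDK versions):

```yaml
transformations:
  - type: AddFields
    fields:
      - path: ["from", "name"]
        # overwrite the sender name with its MD5 hash
        value: "{{ record['from']['name'] | hash('md5') }}"
```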
|
||||
|
||||
Another common use case of the "add" transformation is the enriching of records with their parent resource - check out the [partitioning documentation](/platform/connector-development/connector-builder-ui/partitioning#adding-the-partition-value-to-the-record) for more details.
|
||||
|
||||
It's not recommended to use this feature for projections (like concatenating first name and last name into a single "name" field) - in most cases it's better to leave these tasks to a later stage in the data pipeline.
|
||||
|
||||
## Meta data
|
||||
|
||||
Besides bringing the records into the right shape, it's important to communicate some metadata about records to the downstream system so they can be handled properly.
|
||||
|
||||
### Primary key
|
||||
|
||||
The "Primary key" field specifies how to uniquely identify a record. This is important for downstream de-duplication of records (e.g. by the [incremental sync - Append + Deduped sync mode](/platform/using-airbyte/core-concepts/sync-modes/incremental-append-deduped)).
|
||||
|
||||
In a lot of cases, like the EmailOctopus example from above, there is a dedicated `id` field that can be used for this purpose. It's important that each value of the `id` field is guaranteed to occur for only a single record.
|
||||
|
||||
In some cases there is no such field, but a combination of multiple fields is guaranteed to be unique. For example, the shipping zone locations of the [Woocommerce API](https://woocommerce.github.io/woocommerce-rest-api-docs/#shipping-zone-locations) do not have an id, but each combination of the `code` and `type` fields is guaranteed to be unique:
|
||||
|
||||
```
|
||||
[
|
||||
{
|
||||
"code": "BR",
|
||||
"type": "country"
|
||||
},
|
||||
{
|
||||
"code": "DE",
|
||||
"type": "country"
|
||||
},
|
||||
]
|
||||
```
|
||||
|
||||
In this case, the "Primary key" can be set to "`code`, `type`" to allow automatic downstream deduplication of records based on the value of these two fields.
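In the exported YAML, a composite primary key is simply a list of field names on the stream (a sketch; the stream name is illustrative):

```yaml
streams:
  - type: DeclarativeStream
    name: shipping_zone_locations
    # records are treated as duplicates only when both values match
    primary_key: ["code", "type"]
```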
|
||||
|
||||
### Declared schema
|
||||
|
||||
Similar to the "Primary key", the "Declared schema" defines how the records will be shaped via a [JSON Schema definition](https://json-schema.org/). It defines which fields and nested fields occur in the records, whether they are always available or sometimes missing and which types they are.
|
||||
|
||||
This information is used by the Airbyte system for different purposes:
|
||||
|
||||
- **Column selection** when configuring a connection - in Airbyte Cloud, the declared schema allows the user to pick which columns/fields are passed to the destination to dynamically reduce the amount of synced data
|
||||
- **Recreating the data structure with the right columns** in the destination - this allows a warehouse destination to create a SQL table whose columns match the fields of the records
|
||||
- **Detecting schema changes** - if the schema of a stream changes for an existing connection, this situation can be handled gracefully by Airbyte instead of causing errors in the destination
|
||||
|
||||
When doing test reads, the connector builder analyzes the test records and shows the derived schema in the "Detected schema" tab. By default, new streams are configured to automatically import the detected schema into the declared schema on every test read.
|
||||
This behavior can be toggled off by disabling the `Automatically import detected schema` switch, in which case the declared schema can be manually edited in the UI and it will no longer be automatically updated when triggering test reads.
|
||||
|
||||
For example the following test records:
|
||||
|
||||
```
|
||||
[
|
||||
{
|
||||
"id": "40205efe-5f94-11ed-aa11-7d1ac831a909",
|
||||
"status": "SENT",
|
||||
"subject": "Hello from Integration Test",
|
||||
"created_at": "2022-11-08T18:36:25+00:00",
|
||||
"sent_at": "2022-11-08T18:36:55+00:00"
|
||||
},
|
||||
{
|
||||
"id": "91546616-5ef0-11ed-90c7-fbeacb2ee1eb",
|
||||
"status": "SENT",
|
||||
"subject": "Hello my first campaign",
|
||||
"created_at": "2022-11-07T23:04:44+00:00",
|
||||
"sent_at": "2022-11-08T12:48:27+00:00"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
result in the following schema:
|
||||
|
||||
```
|
||||
{
|
||||
"$schema": "http://json-schema.org/schema#",
|
||||
"properties": {
|
||||
"created_at": {
|
||||
"type": "string"
|
||||
},
|
||||
"id": {
|
||||
"type": "string"
|
||||
},
|
||||
"sent_at": {
|
||||
"type": "string"
|
||||
},
|
||||
"status": {
|
||||
"type": "string"
|
||||
},
|
||||
"subject": {
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"type": "object"
|
||||
}
|
||||
```
|
||||
|
||||
A stricter declared schema is generally better, but the detected schema is a good default to rely on. See the documentation about [supported data types](https://docs.airbyte.com/understanding-airbyte/supported-data-types/) for JSON schema structures that will be picked up by the Airbyte system.
|
||||
|
||||
If `Automatically import detected schema` is disabled, and the declared schema deviates from the detected schema, the "Detected schema" tab in the testing panel highlights the differences. It's important to note that differences are not necessarily a problem that needs to be fixed - in some cases the currently loaded set of records in the testing panel doesn't feature all possible cases so the detected schema is too strict. However, if the declared schema is incompatible with the detected schema based on the test records, it's very likely there will be errors when running syncs.
|
||||
|
||||
<img
|
||||
src={Diff}
|
||||
width="600"
|
||||
alt="Detected schema with highlighted differences"
|
||||
/>
|
||||
|
||||
In the case of the example above, there are two differences between detected and declared schema. The first difference for the `name` field is not problematic:
|
||||
|
||||
```
|
||||
"name": {
|
||||
- "type": [
|
||||
- "string",
|
||||
- "null"
|
||||
- ]
|
||||
+      "type": "string"
|
||||
},
|
||||
```
|
||||
|
||||
The declared schema allows the `null` value for the name while the detected schema only encountered strings. If it's possible for `name` to be set to `null`, the declared schema is configured correctly.
|
||||
|
||||
The second difference will likely cause problems:
|
||||
|
||||
```
|
||||
"subject": {
|
||||
- "type": "number"
|
||||
+ "type": "string"
|
||||
}
|
||||
```
|
||||
|
||||
The `subject` field was detected as `string`, but is configured to be a `number` in the declared schema. As the API returned string subjects during testing, it's likely this will also happen during syncs which would render the declared schema inaccurate. Depending on the situation this can be fixed in multiple ways:
|
||||
|
||||
- If the API changed and subject is always a string now, the declared schema should be updated to reflect this: `"subject": { "type": "string" }`
|
||||
- If the API sometimes returns the subject as a number and sometimes as a string depending on the record, the declared schema should be updated to allow both data types: `"subject": { "type": ["string","number"] }`
|
||||
|
||||
A common situation is that certain record fields do not have any values in the test read data, so they are set to `null`. In the detected schema, these fields are of type `"null"`, which is most likely not correct for all cases. In these situations, the declared schema should be manually corrected.
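When correcting the declared schema manually, widening a field's type is usually enough. In the exported YAML the declared schema lives in an inline schema loader, roughly like this (a sketch; exact structure may vary between CDK versions):

```yaml
schema_loader:
  type: InlineSchemaLoader
  schema:
    $schema: http://json-schema.org/schema#
    type: object
    properties:
      subject:
        # allow both representations observed from the API
        type: ["string", "number"]
      name:
        # allow null for records where the field is missing
        type: ["string", "null"]
```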
|
||||
|
||||
# Tutorial
|
||||
|
||||
This tutorial will walk you through the creation of an Airbyte connector using the connector builder UI to read and extract data from an HTTP API.
|
||||
|
||||
You will build a connector reading data from the Exchange Rates API, but the steps apply to other HTTP APIs you might be interested in integrating with.
|
||||
|
||||
The API documentation can be found [here](https://apilayer.com/marketplace/exchangerates_data-api).
|
||||
In this tutorial, we will read data from the following endpoints:
|
||||
|
||||
- `Latest Rates Endpoint`
|
||||
- `Historical Rates Endpoint`
|
||||
|
||||
With the end goal of implementing a source connector with a single `Stream` containing exchange rates going from a base currency to many other currencies.
|
||||
The output schema of our stream will look like the following:
|
||||
|
||||
```json
|
||||
{
|
||||
"base": "USD",
|
||||
"date": "2022-07-15",
|
||||
"rates": {
|
||||
"CAD": 1.28,
|
||||
"EUR": 0.98
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Step 1 - Setup
|
||||
|
||||
### Setting up Exchange Rates API key
|
||||
|
||||
Before we get started, you'll need to generate an API access key for the Exchange Rates API.
|
||||
This can be done by signing up for the Free tier plan on [Exchange Rates API](https://apilayer.com/marketplace/exchangerates_data-api):
|
||||
|
||||
1. Visit https://apilayer.com and click Sign In on the top right - either sign into an existing account or sign up for a new account on the free tier.
|
||||
2. Once you're signed in, visit https://apilayer.com/marketplace/exchangerates_data-api
|
||||
3. You should see an API Key displayed with a `Copy API Key` button next to it. This is your API key.
|
||||
|
||||
### Setting up the environment
|
||||
|
||||
Besides an Exchange Rates API key, only a browser and an up-to-date, running Airbyte instance are required. There are two ways to accomplish this:
|
||||
* Sign up on [cloud.airbyte.com](https://cloud.airbyte.com/)
|
||||
* Download and run Airbyte [on your own infrastructure](https://github.com/airbytehq/airbyte#quick-start). Make sure you are running version 0.43.0 or later
|
||||
|
||||
## Step 2 - Basic configuration
|
||||
|
||||
### Creating a connector builder project
|
||||
|
||||
<div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/c21f7b421f954e0a82b931446dda3d51" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
|
||||
|
||||
When developing a connector using the connector builder UI, the current state is saved in a connector builder project. These projects are saved as part of the Airbyte workspace and are separate from your source configurations and connections. In the last step of this tutorial you will publish the connector builder project to make it ready to use in connections to run syncs.
|
||||
|
||||
To get started, follow these steps:
|
||||
* Visit the Airbyte UI in your browser
|
||||
* Go to the connector builder page by clicking the "Builder" item in the left hand navigation bar
|
||||
* Select "Start from scratch" to start a new connector builder project (in case you have already created builder connectors before, click the "New custom connector" button in the top right corner)
|
||||
* Set the connector name to `Exchange Rates (Tutorial)`
|
||||
|
||||
Your connector builder project is now set up. The next steps describe how to configure your connector to extract records from the Exchange Rates API.
|
||||
|
||||
### Setting up global configuration
|
||||
|
||||
<div style={{ position: "relative", paddingBottom: "65.69343065693431%", height: 0 }}><iframe src="https://www.loom.com/embed/86832440d02e45b0b6c45d169b3606a1" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
|
||||
|
||||
On the "global configuration" page, general settings applying to all streams are configured - the base URL that requests are sent to, as well as configuration for how to authenticate with the API server.
|
||||
|
||||
* Set the base URL to `https://api.apilayer.com`
|
||||
* Select the "API Key" authentication method
|
||||
* Set the "Header" to `apikey`
|
||||
|
||||
The actual API Key you copied from apilayer.com will not be part of the connector itself - instead it will be set as part of the source configuration when configuring a connection based on your connector in a later step.
|
||||
|
||||
You can find more information about authentication methods on the [authentication concept page](./authentication).
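Behind the scenes, the API key setup above corresponds roughly to an API key authenticator on the requester in the low-code manifest. The snippet below is a sketch: the `api_key` config field name is an assumption, and exact fields may vary between CDK versions.

```yaml
base_requester:
  type: HttpRequester
  url_base: https://api.apilayer.com
  authenticator:
    type: ApiKeyAuthenticator
    # the key itself comes from the source configuration, not the connector
    api_token: "{{ config['api_key'] }}"
    inject_into:
      type: RequestOption
      inject_into: header
      field_name: apikey
```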
|
||||
|
||||
### Setting up and testing a stream
|
||||
|
||||
<div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/9c31b779dc9e4c6bbaa10f19b00ee893" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
|
||||
|
||||
Now that you've configured how to talk to the API, let's set up the stream of records that will be sent to a destination later on. To do so, click the button with the plus icon next to the "Streams" section in the side menu and fill out the form:
|
||||
* Set the name to "Rates"
|
||||
* Set the "URL path" to `/exchangerates_data/latest`
|
||||
* Submit
|
||||
|
||||
Now the basic stream is configured and can be tested. To send test requests, supply testing values by clicking the "Testing values" button in the top right and pasting your API key into the input.
|
||||
|
||||
This form corresponds to what a user of this connector will need to provide when setting up a connection later on. The testing values are not saved along with the connector project.
|
||||
|
||||
Now, click the "Test" button to trigger a test read to simulate what will happen during a sync. After a little while, you should see a single record that looks like this:
|
||||
```
|
||||
{
|
||||
"base": "EUR",
|
||||
"date": "2023-04-13",
|
||||
"rates": {
|
||||
"AED": 4.053261,
|
||||
"AFN": 95.237669,
|
||||
"ALL": 112.964844,
|
||||
"AMD": 432.048005,
|
||||
// ...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
In a real sync, this record will be passed on to a destination like a warehouse.
|
||||
|
||||
The request/response tabs are helpful during development to see which requests and responses your connector will send and receive from the API.
|
||||
|
||||
The detected schema tab indicates the schema that was detected by analyzing the returned records; this detected schema is automatically set as the declared schema for this stream, which you can see by visiting the Declared schema tab in the center stream configuration view.
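For reference, the stream you just configured corresponds roughly to the following low-code YAML (a sketch; exact structure may vary between CDK versions, and the authenticator from the previous section is omitted):

```yaml
streams:
  - type: DeclarativeStream
    name: Rates
    retriever:
      type: SimpleRetriever
      requester:
        type: HttpRequester
        url_base: https://api.apilayer.com
        path: /exchangerates_data/latest
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          # empty path: the whole response body is treated as one record
          field_path: []
```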
|
||||
|
||||
## Step 3 - Advanced configuration
|
||||
|
||||
### Making the stream configurable
|
||||
|
||||
<div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/9fa4f22914db48a1876413f67cd6e2f0" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
|
||||
|
||||
The exchange rate API supports configuring a different base currency via query parameter - let's make this part of the user inputs that can be controlled by the user of the connector when configuring a source, similar to the API key.
|
||||
|
||||
To do so, follow these steps:
|
||||
* Scroll down to the "Query Parameters" section and add a new query parameter
|
||||
* Set the key to `base`
|
||||
* For the value, click the user icon in the input and select "New user input"
|
||||
* Set the name to "Base"
|
||||
* Click "Create"
|
||||
|
||||
The value input is now set to `{{ config['base'] }}`. When making requests, the connector will replace this placeholder with the user-supplied value. This syntax can be used in all fields that have a user icon on the right side; see the full reference [here](/platform/connector-development/config-based/understanding-the-yaml-file/reference#variables).
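In the underlying YAML, the new query parameter looks roughly like this (a sketch; exact fields may vary between CDK versions):

```yaml
requester:
  type: HttpRequester
  url_base: https://api.apilayer.com
  path: /exchangerates_data/latest
  request_parameters:
    # replaced at request time with the value the user configured
    base: "{{ config['base'] }}"
```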
|
||||
|
||||
|
||||
Now your connector has a second configuration input. To test it, click the "Testing values" button again, set the "Base" to `USD`. Then, click the "Test" button again to issue a new test read.
|
||||
|
||||
The record should update to use USD as the base currency:
|
||||
```
|
||||
{
|
||||
"base": "USD",
|
||||
"date": "2023-04-13",
|
||||
"rates": {
|
||||
"AED": 3.6723,
|
||||
"AFN": 86.286329,
|
||||
"ALL": 102.489617,
|
||||
"AMD": 391.984204,
|
||||
// ...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Adding incremental reads
|
||||
|
||||
<div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/d52259513b664119a842809a4fd13c15" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
|
||||
|
||||
We now have a working implementation of a connector reading the latest exchange rates for a given currency.
|
||||
In this section, we'll update the source to read historical data instead of only reading the latest exchange rates.
|
||||
|
||||
According to the API documentation, we can read the exchange rate for a specific date range by querying the `"/exchangerates_data/{date}"` endpoint instead of `"/exchangerates_data/latest"`.
|
||||
|
||||
To configure your connector to request every day individually, follow these steps:
|
||||
* On top of the form, change the "URL Path" input to `/exchangerates_data/{{ stream_interval.start_time }}` to [inject](/platform/connector-development/config-based/understanding-the-yaml-file/reference#variables) the date to fetch data for into the path of the request
|
||||
* Enable "Incremental sync" for the Rates stream
|
||||
* Set the "Cursor Field" to `date` - this is the property in our records to check what date got synced last
|
||||
* Set the "Cursor Field Datetime Format" to `%Y-%m-%d` to match the format of the date in the record returned from the API
|
||||
* Leave start time to "User input" so the end user can set the desired start time for syncing data
|
||||
* Leave end time to "Now" to always sync exchange rates up to the current date
|
||||
* In a lot of cases the start and end date are injected into the request body or query parameters. However in the case of the exchange rate API it needs to be added to the path of the request, so disable the "Inject start/end time into outgoing HTTP request" options
|
||||
* Open the "Advanced" section and enable "Split up interval" so that the connector will partition the dataset into chunks
|
||||
* Set "Step" to `P1D` to configure the connector to do one separate request per day
|
||||
* Set the "Cursor granularity" to `P1D` to tell the connector the API only supports daily increments
|
||||
* Set a start date (like `2023-06-11`) in the "Testing values" menu
|
||||
* Hit the "Test" button to trigger a new test read
|
||||
|
||||
Now, you should see a dropdown above the records view that lets you step through the daily exchange rates, along with the requests performed to fetch this data. Note that in the connector builder, at most 5 partitions are requested to speed up testing. During a proper sync, the full time range between your configured start date and the current day will be synced.
|
||||
|
||||
When used in a connection, the connector will make sure exchange rates for the same day are not requested multiple times - the date of the last fetched record will be stored and the next scheduled sync will pick up from where the previous one stopped.
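The incremental configuration above corresponds roughly to a datetime-based cursor in the low-code YAML. This is a sketch: the `start_date` config field name is an assumption, and exact fields may vary between CDK versions.

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: date
  datetime_format: "%Y-%m-%d"
  cursor_granularity: P1D
  step: P1D  # one request per day
  start_datetime:
    type: MinMaxDatetime
    datetime: "{{ config['start_date'] }}"
    datetime_format: "%Y-%m-%d"
  end_datetime:
    type: MinMaxDatetime
    datetime: "{{ now_utc().strftime('%Y-%m-%d') }}"
    datetime_format: "%Y-%m-%d"
```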
|
||||
|
||||
You can find more information about incremental syncs on the [incremental sync concept page](./incremental-sync).
|
||||
|
||||
## Step 4 - Publishing and syncing
|
||||
|
||||
<div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/a6896c6aa8f047f0aeefec08d37a1384" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
|
||||
|
||||
So far, the connector is only configured as part of the connector builder project. To make it possible to use it in actual connections, you need to publish the connector. This captures the current state of the configuration and makes the connector available as a custom connector within the current Airbyte workspace.
|
||||
|
||||
To use the connector for a proper sync, follow these steps:
|
||||
* Click the "Publish" button and publish the first version of the "Exchange Rates (Tutorial)" connector
|
||||
* Go to the "Connections" page and create a new connection
|
||||
* As Source type, pick the "Exchange Rates (Tutorial)" connector you just created
|
||||
* Set API Key, base currency and start date for the sync - to avoid a large number of requests, set the start date to one week in the past
|
||||
* Click "Set up source" and wait for the connection check to confirm the provided configuration is valid
|
||||
* Set up a destination - to keep things simple let's choose the "E2E Testing" destination type
|
||||
* Click "Set up destination", keeping the default configurations
|
||||
* Wait for Airbyte to check the record schema, then click "Set up connection" - this will create the connection and kick off the first sync
|
||||
* After a short while, the sync should complete successfully
|
||||
|
||||
Congratulations! You just completed the following steps:
|
||||
* Configured a production-ready connector to extract currency exchange data from an HTTP-based API:
|
||||
* Configurable API key, start date and base currency
|
||||
* Incremental sync to keep the number of requests small
|
||||
* Tested whether the connector works correctly in the builder
|
||||
* Made the working connector available to configure sources in the workspace
|
||||
* Set up a connection using the published connector and synced data from the Exchange Rates API
|
||||
|
||||
## Next steps
|
||||
|
||||
This tutorial didn't go into depth about all features that can be used in the connector builder. Check out the concept pages for more information about certain topics:
|
||||
* [Authentication](/platform/connector-development/connector-builder-ui/authentication/)
|
||||
* [Record processing](/platform/connector-development/connector-builder-ui/record-processing/)
|
||||
* [Pagination](/platform/connector-development/connector-builder-ui/pagination/)
|
||||
* [Incremental sync](/platform/connector-development/connector-builder-ui/incremental-sync/)
|
||||
* [Partitioning](/platform/connector-development/connector-builder-ui/partitioning/)
|
||||
* [Error handling](/platform/connector-development/connector-builder-ui/error-handling/)
|
||||
|
||||
Not every possible API can be consumed by connectors configured in the connector builder. If you need more flexibility, consider using the [Low Code CDK](/platform/connector-development/config-based/low-code-cdk-overview) or the [Python CDK](/platform/connector-development/cdk-python/) to build a connector with more advanced features.
|
||||
|
||||
# Connector Metadata.yaml File
|
||||
|
||||
The `metadata.yaml` file contains crucial information about the connector, including its type, definition ID, Docker image tag, Docker repository, and much more. It plays a key role in the way Airbyte handles connector data and improves the overall organization and accessibility of this data.
|
||||
|
||||
N.B. the `metadata.yaml` file replaces the previous `source_definitions.yaml` and `destinations_definitions.yaml` files.
|
||||
|
||||
## Structure
|
||||
|
||||
Below is an example of a `metadata.yaml` file for the Postgres source:
|
||||
|
||||
```yaml
|
||||
data:
|
||||
allowedHosts:
|
||||
hosts:
|
||||
- ${host}
|
||||
- ${tunnel_method.tunnel_host}
|
||||
connectorSubtype: database
|
||||
connectorType: source
|
||||
definitionId: decd338e-5647-4c0b-adf4-da0e75f5a750
|
||||
dockerImageTag: 2.0.28
|
||||
maxSecondsBetweenMessages: 7200
|
||||
dockerRepository: airbyte/source-postgres
|
||||
githubIssueLabel: source-postgres
|
||||
icon: postgresql.svg
|
||||
license: MIT
|
||||
name: Postgres
|
||||
tags:
|
||||
- language:java
|
||||
- language:python
|
||||
registries:
|
||||
cloud:
|
||||
dockerRepository: airbyte/source-postgres-strict-encrypt
|
||||
enabled: true
|
||||
oss:
|
||||
enabled: true
|
||||
supportLevel: certified
|
||||
documentationUrl: https://docs.airbyte.com/integrations/sources/postgres
|
||||
metadataSpecVersion: "1.0"
|
||||
```
|
||||
|
||||
## The `registries` Section
|
||||
|
||||
The `registries` section within the `metadata.yaml` file plays a vital role in determining the contents of the `oss_registry.json` and `cloud_registry.json` files.
|
||||
|
||||
This section contains two subsections: `cloud` and `oss` (Open Source Software). Each subsection contains details about the specific registry, such as the Docker repository associated with it and whether it's enabled or not.
|
||||
|
||||
### Structure
|
||||
|
||||
Here's how the `registries` section is structured in our previous `metadata.yaml` example:
|
||||
|
||||
```yaml
|
||||
registries:
|
||||
cloud:
|
||||
dockerRepository: airbyte/source-postgres-strict-encrypt
|
||||
enabled: true
|
||||
oss:
|
||||
enabled: true
|
||||
```
|
||||
|
||||
In this example, both `cloud` and `oss` registries are enabled, and the Docker repository for the `cloud` registry is overridden to `airbyte/source-postgres-strict-encrypt`.
|
||||
|
||||
### Updating Registries
|
||||
|
||||
When the `metadata.yaml` file is updated, this data is automatically uploaded to Airbyte's metadata service. This service then generates the publicly available `oss_registry.json` and `cloud_registry.json` registries based on the information provided in the `registries` section.
|
||||
|
||||
For instance, if a connector is listed as `enabled: true` under the `oss` section, it will be included in the `oss_registry.json` file. Similarly, if it's listed as `enabled: true` under the `cloud` section, it will be included in the `cloud_registry.json` file.
|
||||
|
||||
Thus, the `registries` section in the `metadata.yaml` file provides a flexible and organized way to manage which connectors are included in each registry.
|
||||
|
||||
## The `tags` Section
|
||||
|
||||
The `tags` field is an optional part of the `metadata.yaml` file. It is designed to provide additional context about a connector and improve the connector's discoverability. This field can contain any string, making it a flexible tool for adding additional details about a connector.
|
||||
|
||||
In the `metadata.yaml` file, `tags` is a list that may contain any number of string elements. Each element in the list is a separate tag. For instance:
|
||||
|
||||
```yaml
|
||||
tags:
|
||||
- "language:java"
|
||||
- "keyword:database"
|
||||
- "keyword:SQL"
|
||||
```
|
||||
|
||||
In the example above, the connector has three tags. Tags are used for two primary purposes in Airbyte:
|
||||
|
||||
1. **Denoting the Programming Language(s)**: Tags that begin with `language:` are used to specify the programming languages that are utilized by the connector. This information is auto-generated by a script that scans the connector's files for recognized programming languages. In the example above, `language:java` means that the connector uses Java.
|
||||
|
||||
2. **Keywords for Searching**: Tags that begin with `keyword:` are used to make the connector more discoverable by adding searchable terms related to it. In the example above, the tags `keyword:database` and `keyword:SQL` can be used to find this connector when searching for `database` or `SQL`.
|
||||
|
||||
These are just examples of how tags can be used. As a free-form field, the tags list can be customized as required for each connector. This flexibility allows tags to be a powerful tool for managing and discovering connectors.
|
||||
|
||||
## The `icon` Field
|
||||
|
||||
_⚠️ This property is in the process of being refactored to be a file in the connector folder_
|
||||
|
||||
You may notice an `icon.svg` file in the connector's folder.
|
||||
|
||||
This is because we are transitioning away from icons being stored in the `airbyte-platform` repository. Instead, we will be storing them in the connector folder itself. This will allow us to have a single source of truth for all connector-related information.
|
||||
|
||||
This transition is currently in progress. Once it is complete, the `icon` field in the `metadata.yaml` file will be removed, and the `icon.svg` file will be used instead.
|
||||
|
||||
## The `releases` Section
|
||||
|
||||
The `releases` section contains extra information about certain types of releases. The current types of releases are:
|
||||
|
||||
- `breakingChanges`
|
||||
|
||||
### `breakingChanges`
|
||||
|
||||
The `breakingChanges` section of `releases` contains a dictionary of version numbers (usually major versions, i.e. `1.0.0`) and information about
|
||||
their associated breaking changes. Each entry must contain the following parameters:
|
||||
|
||||
- `message`: A description of the breaking change, written in a user-friendly format. This message should briefly describe
|
||||
- What the breaking change is, and which users it affects (e.g. all users of the source, or only those using a certain stream)
|
||||
- Why the change is better for the user (fixed a bug, something got faster, etc)
|
||||
- What the user should do to fix the issue (e.g. a full reset, run a SQL query in the destination, etc.)
|
||||
- `upgradeDeadline`: (`YYYY-MM-DD`) The date by which the user should upgrade to the new version.
|
||||
|
||||
When considering what the `upgradeDeadline` should be, target the amount of time that would be reasonable for the user to make the required changes described in the `message` and upgrade guide. If the required changes are _simple_ (e.g. "do a full reset"), 2 weeks is recommended. Note that you do _not_ want to link the duration of `upgradeDeadline` to an upstream API's deprecation date. While it is true that the older version of a connector will continue to work for that period of time, it means that users who are pinned to the older version of the connector will not benefit from future updates and fixes.
|
||||
|
||||
Without all 3 of these points, the breaking change message is not helpful to users.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```yaml
|
||||
releases:
|
||||
breakingChanges:
|
||||
1.0.0:
|
||||
message: "This version changes the connector’s authentication by removing ApiKey authentication, which is now deprecated by the [upstream source](upsteam-docs-url.com). Users currently using ApiKey auth will need to reauthenticate with OAuth after upgrading to continue syncing."
|
||||
upgradeDeadline: "2023-12-31"
|
||||
```
|
||||
|
||||
#### `scopedImpact`
|
||||
|
||||
The optional `scopedImpact` property allows you to provide a list of scopes for which the change is breaking.
|
||||
This allows you to reduce the scope of the change; it's assumed that any scopes not listed are unaffected by the breaking change.
|
||||
|
||||
For example, consider the following `scopedImpact` definition:
|
||||
|
||||
```yaml
|
||||
releases:
|
||||
breakingChanges:
|
||||
1.0.0:
|
||||
message: "This version changes the cursor for the `Users` stream. After upgrading, please reset the stream."
|
||||
upgradeDeadline: "2023-12-31"
|
||||
scopedImpact:
|
||||
- scopeType: stream
|
||||
impactedScopes: ["users"]
|
||||
```
|
||||
|
||||
This change only breaks the `users` stream - all other streams are unaffected. A user can safely ignore the breaking change
|
||||
if they are not syncing the `users` stream.
|
||||
|
||||
The supported scope types are listed below.
|
||||
|
||||
| Scope Type | Value Type | Value Description |
|
||||
| ---------- | ----------- | -------------------- |
|
||||
| stream | `list[str]` | List of stream names |
|
||||
|
||||
#### `remoteRegistries`
|
||||
|
||||
The optional `remoteRegistries` property allows you to configure how a connector should be published to registries like PyPI.
|
||||
|
||||
**Important note**: Currently no automated publishing will occur.
|
||||
|
||||
```yaml
|
||||
remoteRegistries:
|
||||
pypi:
|
||||
enabled: true
|
||||
packageName: airbyte-source-connector-name
|
||||
```
|
||||
|
||||
The `packageName` property of the `pypi` section is the name of the installable package in the PyPI registry.
|
||||
|
||||
If not specified, all remote registry configurations are disabled by default.
|
||||
|
||||
## The `connectorTestSuitesOptions` section
|
||||
|
||||
The `connectorTestSuitesOptions` section contains a list of test suite options for a connector.
|
||||
The list of declared test suites affects which suite will run in CI.
|
||||
We currently accept three values for the `suite` field:
|
||||
* `unitTests`
|
||||
* `integrationTests`
|
||||
* `acceptanceTests`
|
||||
|
||||
Each list entry can also declare a `testSecrets` object which will enable our CI to fetch connector specific secret credentials which are required to run the `suite`.
|
||||
|
||||
### The `testSecrets` object
|
||||
The `testSecrets` object has three properties:
|
||||
* `name` (required `string`): The name of the secret in the secret store.
|
||||
* `secretStore` (required `secretStore` object): Where the secret is stored (more details on the object structure below).
|
||||
* `fileName` (optional `string`): The name of the file in which our CI will persist the secret (inside the connector's `secrets` directory).
|
||||
|
||||
**If you are a community contributor please note that addition of a new secret to our secret store requires manual intervention from an Airbyter. Please reach out to your PR reviewers if you want to add a test secret to our CI.**
|
||||
|
||||
#### The `secretStore` object
|
||||
This object has two properties:
|
||||
* `type`: Defines the secret store type, only `GSM` (Google Secret Manager) is currently supported
|
||||
* `alias`: The alias of this secret store in our system, which is resolved into an actual secret store address by our CI. We currently have a single alias to store our connector test secrets: `airbyte-connector-testing-secret-store`.
|
||||
|
||||
#### How to enable a test suite
|
||||
We currently support three test suite types:
|
||||
* `unitTests`
|
||||
* `integrationTests`
|
||||
* `acceptanceTests`
|
||||
|
||||
|
||||
To enable a test suite, add the suite name to the `connectorTestSuitesOptions` list:
|
||||
|
||||
```yaml
|
||||
connectorTestSuitesOptions:
|
||||
- suite: unitTests
|
||||
# This will enable acceptanceTests for this connector
|
||||
# It declares that this test suite requires a secret named SECRET_SOURCE-FAKER_CREDS
|
||||
# In our secret store, and that the secret should be stored in the connector secret folder in a file named config.json
|
||||
- suite: acceptanceTests
|
||||
testSecrets:
|
||||
- name: SECRET_SOURCE-FAKER_CREDS
|
||||
fileName: config.json
|
||||
secretStore:
|
||||
type: GSM
|
||||
alias: airbyte-connector-testing-secret-store
|
||||
```
|
||||
|
||||
##### Default paths and conventions
|
||||
|
||||
The [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md) tool will automatically locate specific test types based on established conventions and will automatically store secret files (when needed) in the established secrets directory - which should be already excluded from accidental git commits.
|
||||
|
||||
**Python connectors**
|
||||
Tests are discovered by Pytest and are expected to be located in:
|
||||
* `unit_tests` directory for the `unitTests` suite
|
||||
* `integration_tests` directory for the `integrationTests` suite
|
||||
|
||||
**Java connectors**
|
||||
No specific directory is determined. Which test will run is determined by the Gradle configuration of the connector.
|
||||
`airbyte-ci` runs the `test` Gradle task for the `unitTests` suite and the `integrationTest` Gradle task for the `integrationTests` suite.
|
||||
|
||||
**Acceptance tests**
|
||||
|
||||
They are language agnostic and are configured via the `acceptance-test-config.yml` file in the connector's root directory. More on that [here](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference).
|
||||
|
||||
**Default secret paths**
|
||||
The listed secrets in `testSecrets` with a file name will be mounted to the connector's `secrets` directory. The `fileName` should be relative to this directory.
|
||||
E.g. `fileName: config.json` will be mounted to `<connector-directory>/secrets/config.json`.
|
||||
|
||||
# Connector Specification Reference
|
||||
|
||||
The [connector specification](../understanding-airbyte/airbyte-protocol.md#spec) describes what inputs can be used to configure a connector. Like the rest of the Airbyte Protocol, it uses [JsonSchema](https://json-schema.org), but with some slight modifications.
|
||||
|
||||
## Demoing your specification
|
||||
|
||||
While iterating on your specification, you can preview what it will look like in the UI in real time by following the instructions below.
|
||||
|
||||
1. Open the `ConnectorForm` preview component in our deployed Storybook at: https://storybook.airbyte.dev/?path=/story/connector-connectorform--preview
|
||||
2. Press `raw` on the `connectionSpecification` property, so you will be able to paste a JSON structured string
|
||||
3. Set the string you want to preview the UI for
|
||||
4. When submitting the form you can see a preview of the values in the "Actions" tab
|
||||
|
||||
### Secret obfuscation
|
||||
|
||||
By default, any fields in a connector's specification are visible and can be read in the UI. However, if you want to obfuscate fields in the UI and API \(for example when working with a password\), add the `airbyte_secret` annotation to your connector's `spec.json`, e.g.:
|
||||
|
||||
```text
|
||||
"password": {
|
||||
"type": "string",
|
||||
"examples": ["hunter2"],
|
||||
"airbyte_secret": true
|
||||
},
|
||||
```
|
||||
|
||||
Here is an example of what the password field would look like: 
|
||||
|
||||
### Ordering fields in the UI
|
||||
|
||||
Use the `order` property inside a definition to determine the order in which it will appear relative to other objects on the same level of nesting in the UI.
|
||||
|
||||
For example, using the following spec:
|
||||
|
||||
```
|
||||
{
|
||||
"username": {"type": "string", "order": 1},
|
||||
"password": {"type": "string", "order": 2},
|
||||
"cloud_provider": {
|
||||
"order": 0,
|
||||
"type": "object",
|
||||
"properties" : {
|
||||
"name": {"type": "string", "order": 0},
|
||||
"region": {"type": "string", "order": 1}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
will result in the following configuration on the UI:
|
||||
|
||||

|
||||
|
||||
:::info
|
||||
|
||||
Within an object definition, if some fields have the `order` property defined, and others don't, then the fields without the `order` property defined should be rendered last in the UI. Among those elements (which don't have `order` defined), required fields are ordered before optional fields, and both categories are further ordered alphabetically by their field name.
|
||||
|
||||
Additionally, `order` values cannot be duplicated within the same object or group. See the [Grouping fields](#grouping-fields) section for more info on field groups.
|
||||
|
||||
:::
|
||||
|
||||
### Collapsing optional fields
|
||||
|
||||
By default, all optional fields will be collapsed into an `Optional fields` section which can be expanded or collapsed by the user. This helps streamline the UI for setting up a connector by initially focusing attention on the required fields only. For existing connectors, if their configuration contains a non-empty and non-default value for a collapsed optional field, then that section will be automatically opened when the connector is opened in the UI.
|
||||
|
||||
These `Optional fields` sections are placed at the bottom of a field group, meaning that all required fields in the same group will be placed above it. To interleave optional fields with required fields, set `always_show: true` on the optional field along with an `order`, which will cause the field to no longer be collapsed in an `Optional fields` section and be ordered as normal.
|
||||
|
||||
**Note:** `always_show` also causes fields that are normally hidden by an OAuth button to still be shown.
|
||||
|
||||
Within a collapsed `Optional fields` section, the optional fields' `order` defines their position in the section; those without an `order` will be placed after fields with an `order`, and will themselves be ordered alphabetically by field name.
|
||||
|
||||
For example, using the following spec:
|
||||
|
||||
```
|
||||
{
|
||||
"connectionSpecification": {
|
||||
"type": "object",
|
||||
"required": ["username", "account_id"],
|
||||
"properties": {
|
||||
"username": {
|
||||
"type": "string",
|
||||
"title": "Username",
|
||||
"order": 1
|
||||
},
|
||||
"password": {
|
||||
"type": "string",
|
||||
"title": "Password",
|
||||
"order": 2,
|
||||
"always_show": true
|
||||
},
|
||||
"namespace": {
|
||||
"type": "string",
|
||||
"title": "Namespace",
|
||||
"order": 3
|
||||
},
|
||||
"region": {
|
||||
"type": "string",
|
||||
"title": "Region",
|
||||
"order": 4
|
||||
},
|
||||
"account_id": {
|
||||
"type": "integer",
|
||||
"title": "Account ID",
|
||||
"order": 5
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
will result in the following configuration on the UI (left side shows initial collapsed state, right side shows the optional fields section expanded):
|
||||
|
||||

|
||||
|
||||
### Grouping fields
|
||||
|
||||
Fields in the connector spec can be grouped into cards in the UI by utilizing the `group` property on a field. All fields that share the same `group` value will be grouped into the same card in the UI, and fields without a `group` will be placed into their own group card.
|
||||
|
||||
:::info
|
||||
|
||||
`group` can only be set on top-level properties in the connectionSpecification; it is not allowed on fields of objects nested inside the connectionSpecification.
|
||||
|
||||
Additionally, within a group the `order` values set on each field determine how they are ordered in the UI, and therefore an `order` value cannot be duplicated within a group.
|
||||
|
||||
:::
|
||||
|
||||
Groups themselves can also be ordered and titled by setting the `groups` property on the connectorSpecification. The value of this field is an array containing objects with `id` that matches the `group` values that were set on fields, and optionally a `title` which causes the Airbyte UI to render that title at the top of the group's card.
|
||||
|
||||
The order of entries in this `groups` array decides the order of the cards; `group` IDs that are set on fields which do not appear in this `groups` array will be ordered after those that do appear and will be ordered alphanumerically.
|
||||
|
||||
For example, using the following spec:
|
||||
|
||||
```
|
||||
{
|
||||
"connectionSpecification": {
|
||||
"type": "object",
|
||||
"required": ["username", "namespace", "account_id"],
|
||||
"properties": {
|
||||
"username": {
|
||||
"type": "string",
|
||||
"title": "Username",
|
||||
"order": 1,
|
||||
"group": "auth"
|
||||
},
|
||||
"password": {
|
||||
"type": "string",
|
||||
"title": "Password",
|
||||
"always_show": true,
|
||||
"order": 2,
|
||||
"group": "auth"
|
||||
},
|
||||
"namespace": {
|
||||
"type": "string",
|
||||
"title": "Namespace",
|
||||
"order": 1,
|
||||
"group": "location"
|
||||
},
|
||||
"region": {
|
||||
"type": "string",
|
||||
"title": "Region",
|
||||
"order": 2,
|
||||
"group": "location"
|
||||
},
|
||||
"account_id": {
|
||||
"type": "integer",
|
||||
"title": "Account ID"
|
||||
}
|
||||
},
|
||||
"groups": [
|
||||
{
|
||||
"id": "auth",
|
||||
"title": "Authentication"
|
||||
},
|
||||
{
|
||||
"id": "location"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
will result in the following configuration on the UI:
|
||||

|
||||
|
||||
### Multi-line String inputs
|
||||
|
||||
Sometimes when a user is inputting a string field into a connector, newlines need to be preserved. For example, if we want a connector to use an RSA key which looks like this:
|
||||
|
||||
```text
|
||||
---- BEGIN PRIVATE KEY ----
|
||||
123
|
||||
456
|
||||
789
|
||||
---- END PRIVATE KEY ----
|
||||
```
|
||||
|
||||
we need to preserve the line-breaks. In other words, the string `---- BEGIN PRIVATE KEY ----123456789---- END PRIVATE KEY ----` is not equivalent to the one above since it loses linebreaks.
|
||||
|
||||
By default, string inputs in the UI can lose their linebreaks. In order to accept multi-line strings in the UI, annotate your string field with `multiline: true` e.g:
|
||||
|
||||
```text
|
||||
"private_key": {
|
||||
"type": "string",
|
||||
"description": "RSA private key to use for SSH connection",
|
||||
"airbyte_secret": true,
|
||||
"multiline": true
|
||||
},
|
||||
```
|
||||
|
||||
this will display a multi-line textbox in the UI like the following screenshot: 
|
||||
|
||||
### Hiding inputs in the UI
|
||||
|
||||
In some rare cases, a connector may wish to expose an input that is not available in the UI, but is still potentially configurable when running the connector outside of Airbyte, or via the API. For example, exposing a very technical configuration like the page size of outgoing HTTP requests may only be relevant to power users, and therefore shouldn't be available via the UI but might make sense to expose via the API.
|
||||
|
||||
In this case, use the `"airbyte_hidden": true` keyword to hide that field from the UI. E.g:
|
||||
|
||||
```
|
||||
{
|
||||
"first_name": {
|
||||
"type": "string",
|
||||
"title": "First Name"
|
||||
},
|
||||
"secret_name": {
|
||||
"type": "string",
|
||||
"title": "You can't see me!!!",
|
||||
"airbyte_hidden": true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Results in the following form:
|
||||
|
||||

|
||||
|
||||
## Pattern descriptors
|
||||
|
||||
Setting a `pattern` on a field in a connector spec enforces that the value entered into that input matches the `pattern` regex value. However, this causes the regex pattern to be displayed in the input's error message, which is usually not very helpful for the end-user.
|
||||
|
||||
The `pattern_descriptor` property allows the connector developer to set a human-readable format that should be displayed above the field, and if set in conjunction with a `pattern`, this `pattern_descriptor` will be used in the invalid format error message instead of the raw regex.
|
||||
|
||||
For example, having a field in the spec like:
|
||||
|
||||
```
|
||||
"start_date": {
|
||||
"type": "string",
|
||||
"title": "Start date",
|
||||
"pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
|
||||
"pattern_descriptor": "YYYY-MM-DD"
|
||||
},
|
||||
```
|
||||
|
||||
will result in the following look in the UI (empty state, valid state, and error state):
|
||||
|
||||

|
||||
|
||||
## Airbyte Modifications to `jsonschema`
|
||||
|
||||
### Using `oneOf`
|
||||
|
||||
In some cases, a connector needs to accept one out of many options. For example, a connector might need to know the compression codec of the file it will read, which will render in the Airbyte UI as a list of the available codecs. In JSONSchema, this can be expressed using the [oneOf](https://json-schema.org/understanding-json-schema/reference/combining.html#oneof) keyword.
|
||||
|
||||
:::info
|
||||
|
||||
Some connectors may follow an older format for dropdown lists; we are currently migrating away from that to this standard.
|
||||
|
||||
:::
|
||||
|
||||
In order for the Airbyte UI to correctly render a specification, however, a few extra rules must be followed:
|
||||
|
||||
1. The top-level item containing the `oneOf` must have `type: object`.
|
||||
2. Each item in the `oneOf` array must be a property with `type: object`.
|
||||
3. One `string` field with the same property name must be consistently present throughout each object inside the `oneOf` array. It is required to add a [`const`](https://json-schema.org/understanding-json-schema/reference/generic.html#constant-values) value unique to that `oneOf` option.
|
||||
|
||||
Let's look at the [source-file](/integrations/sources/file) implementation as an example. In this example, we have `provider` as a dropdown list option, which allows the user to select what provider their file is being hosted on. We note that the `oneOf` keyword lives under the `provider` object as follows:
|
||||
|
||||
In each item in the `oneOf` array, the `option_title` string field exists with the aforementioned `const` value unique to that item. This helps the UI and the connector distinguish which option was chosen by the user. This can be seen by adapting the file source spec to this example:
|
||||
|
||||
```javascript
|
||||
{
|
||||
"connection_specification": {
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
"title": "File Source Spec",
|
||||
"type": "object",
|
||||
"required": ["dataset_name", "format", "url", "provider"],
|
||||
"properties": {
|
||||
"dataset_name": {
|
||||
...
|
||||
},
|
||||
"format": {
|
||||
...
|
||||
},
|
||||
"reader_options": {
|
||||
...
|
||||
},
|
||||
"url": {
|
||||
...
|
||||
},
|
||||
"provider": {
|
||||
"type": "object",
|
||||
"oneOf": [
|
||||
{
|
||||
"required": [
|
||||
"option_title"
|
||||
],
|
||||
"properties": {
|
||||
"option_title": {
|
||||
"type": "string",
|
||||
"const": "HTTPS: Public Web",
|
||||
"order": 0
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"required": [
|
||||
"option_title"
|
||||
],
|
||||
"properties": {
|
||||
"option_title": {
|
||||
"type": "string",
|
||||
"const": "GCS: Google Cloud Storage",
|
||||
"order": 0
|
||||
},
|
||||
"service_account_json": {
|
||||
"type": "string",
|
||||
"description": "In order to access private Buckets stored on Google Cloud, this connector would need a service account json credentials with the proper permissions as described <a href=\"https://cloud.google.com/iam/docs/service-accounts\" target=\"_blank\">here</a>. Please generate the credentials.json file and copy/paste its content to this field (expecting JSON formats). If accessing publicly available data, this field is not necessary."
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### oneOf display type
|
||||
|
||||
You can also configure the way that oneOf fields are displayed in the Airbyte UI through the `display_type` property. Valid values for this property are:
|
||||
|
||||
- `dropdown`
|
||||
- Renders a dropdown menu containing the title of each option for the user to select
|
||||
- This is a compact look that works well in most cases
|
||||
- The descriptions of the options can be found in the oneOf field's tooltip
|
||||
- `radio`
|
||||
- Renders radio-button cards side-by-side containing the title and description of each option for the user to select
|
||||
- This choice draws more attention to the field and shows the descriptions of each option at all times, which can be useful for important or complicated fields
|
||||
|
||||
Here is an example of setting the `display_type` of a oneOf field to `dropdown`, along with how it looks in the Airbyte UI:
|
||||
|
||||
```
|
||||
"update_method": {
|
||||
"type": "object",
|
||||
"title": "Update Method",
|
||||
"display_type": "dropdown",
|
||||
"oneOf": [
|
||||
{
|
||||
"title": "Read Changes using Binary Log (CDC)",
|
||||
"description": "<i>Recommended</i> - Incrementally reads new inserts, updates, and deletes using the MySQL <a href=\"https://docs.airbyte.com/integrations/sources/mysql/#change-data-capture-cdc\">binary log</a>. This must be enabled on your database.",
|
||||
"required": ["method"],
|
||||
"properties": {
|
||||
"method": {
|
||||
"type": "string",
|
||||
"const": "CDC",
|
||||
"order": 0
|
||||
},
|
||||
"initial_waiting_seconds": {
|
||||
...
|
||||
},
|
||||
"server_time_zone": {
|
||||
...
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Scan Changes with User Defined Cursor",
|
||||
"description": "Incrementally detects new inserts and updates using the <a href=\"https://docs.airbyte.com/understanding-airbyte/connections/incremental-append/#user-defined-cursor\">cursor column</a> chosen when configuring a connection (e.g. created_at, updated_at).",
|
||||
"required": ["method"],
|
||||
"properties": {
|
||||
"method": {
|
||||
"type": "string",
|
||||
"const": "STANDARD",
|
||||
"order": 0
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||

|
||||
|
||||
And here is how it looks if the `display_type` property is set to `radio` instead:
|
||||

|
||||
|
||||
### Using `enum`
|
||||
|
||||
In regular `jsonschema`, some drafts enforce that `enum` lists must contain distinct values, while others do not. For consistency, Airbyte enforces this restriction.
|
||||
|
||||
For example, this spec is invalid, since `a_format` is listed twice under the enumerated property `format`:
|
||||
|
||||
```javascript
|
||||
{
|
||||
"connection_specification": {
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
"title": "File Source Spec",
|
||||
"type": "object",
|
||||
"required": ["format"],
|
||||
"properties": {
|
||||
"dataset_name": {
|
||||
...
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["a_format", "another_format", "a_format"]
|
||||
},
|
||||
}
|
||||
}
|
||||
}
|
||||
```
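
For comparison, a valid version of the same property lists each allowed value only once:

```javascript
"format": {
  "type": "string",
  "enum": ["a_format", "another_format"]
}
```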
|
||||
|
||||
### Forbidden keys
|
||||
|
||||
In connector specs, the following JSON schema keys are forbidden, as Airbyte does not currently contain logic to interpret them:
|
||||
|
||||
- `not`
|
||||
- `anyOf`
|
||||
- `patternProperties`
|
||||
- `prefixItems`
|
||||
- `allOf`
|
||||
- `if`
|
||||
- `then`
|
||||
- `else`
|
||||
- `dependentSchemas`
|
||||
- `dependentRequired`
|
||||
167
docs/platform/connector-development/debugging-docker.md
Normal file
@@ -0,0 +1,167 @@
|
||||
# Debugging Docker Containers
|
||||
|
||||
This guide covers debugging **JVM docker containers**, whether started via Docker Compose or started by the
worker container, such as a destination container. It assumes the use of [IntelliJ Community Edition](https://www.jetbrains.com/idea/);
however, the steps can be adapted to another IDE or debugger.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
You should have the airbyte repo downloaded and should be able to [run the platform locally](https://docs.airbyte.com/deploying-airbyte/local-deployment).
|
||||
Also, if you're on macOS you will need to follow the installation steps for [Docker Mac Connect](https://github.com/chipmk/docker-mac-net-connect).
|
||||
|
||||
## Connecting your debugger
|
||||
|
||||
This solution utilizes the environment variable `JAVA_TOOL_OPTIONS`, which, when set to a specific value, allows us to connect our debugger.
|
||||
We will also be setting up a **Remote JVM Debug** run configuration in IntelliJ which uses the IP address or hostname to connect.
|
||||
|
||||
> **Note**
|
||||
> The [Docker Mac Connect](https://github.com/chipmk/docker-mac-net-connect) tool is what makes it possible for macOS users to connect to a docker container
|
||||
> by IP address.
|
||||
|
||||
### Docker Compose Extension
|
||||
|
||||
By default, the `docker compose` command will look for a `docker-compose.yaml` file in your directory and execute its instructions. However, you can
|
||||
provide multiple files to the `docker compose` command with the `-f` option. You can read more about how Docker compose combines or overrides values when
|
||||
you provide multiple files [on Docker's Website](https://docs.docker.com/compose/extends/).
|
||||
|
||||
In the Airbyte repo, there is already another file `docker-compose.debug.yaml` which extends the `docker-compose.yaml` file. Our goal is to set the
|
||||
`JAVA_TOOL_OPTIONS` environment variable in the environment of the container we wish to debug. If you look at the `server` configuration under `services`
|
||||
in the `docker-compose.debug.yaml` file, it should look like this:
|
||||
|
||||
```yaml
|
||||
server:
|
||||
environment:
|
||||
- JAVA_TOOL_OPTIONS=${DEBUG_SERVER_JAVA_OPTIONS}
|
||||
```
|
||||
|
||||
This tells Docker Compose: for the `server` service, add an environment variable `JAVA_TOOL_OPTIONS` with the value of the variable `DEBUG_SERVER_JAVA_OPTIONS`.
`DEBUG_SERVER_JAVA_OPTIONS` has no default value, so if we don't provide one, `JAVA_TOOL_OPTIONS` will be blank or empty. When running the `docker compose` command,
Docker looks at your local environment variables to see if you have set a value for `DEBUG_SERVER_JAVA_OPTIONS` and copies that value. To set this value,
you can either `export` the variable in your environment prior to running the `docker compose` command, or prepend the variable to the command. For our debugging purposes,
we want the value to be `-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005`, so to connect our debugger to the `server` container, run the following:
|
||||
|
||||
```bash
|
||||
DEBUG_SERVER_JAVA_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005" VERSION="dev" docker compose -f docker-compose.yaml -f docker-compose.debug.yaml up
|
||||
```
|
||||
|
||||
> **Note**
|
||||
> This command also passes in the `VERSION=dev` environment variable, which is recommended in the comments of the `docker-compose.debug.yaml` file.
|
||||
|
||||
### Connecting the Debugger
|
||||
|
||||
Now we need to connect our debugger. In IntelliJ, open `Edit Configurations...` from the Run menu (or search for `Edit Configurations` in the command palette).
Create a new _Remote JVM Debug_ run configuration. The `host` option defaults to `localhost`, which you can leave unchanged if you're on Linux.
On a Mac, however, you need to find the IP address of your container. **Make sure you've installed and started the [Docker Mac Connect](https://github.com/chipmk/docker-mac-net-connect)
service prior to running the `docker compose` command.** With your containers running, run the following command to easily fetch the IP addresses:
|
||||
|
||||
```bash
|
||||
$ docker inspect $(docker ps -q ) --format='{{ printf "%-50s" .Name}} {{printf "%-50s" .Config.Image}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'
|
||||
/airbyte-proxy airbyte/proxy:dev 172.18.0.10172.19.0.4
|
||||
/airbyte-server airbyte/server:dev 172.18.0.9
|
||||
/airbyte-worker airbyte/worker:dev 172.18.0.8
|
||||
/airbyte-source sha256:5eea76716a190d10fd866f5ac6498c8306382f55c6d910231d37a749ad305960 172.17.0.2
|
||||
/airbyte-connector-builder-server airbyte/connector-builder-server:dev 172.18.0.6
|
||||
/airbyte-webapp airbyte/webapp:dev 172.18.0.7
|
||||
/airbyte-cron airbyte/cron:dev 172.18.0.5
|
||||
/airbyte-temporal airbyte/temporal:dev 172.18.0.2
|
||||
/airbyte-db airbyte/db:dev 172.18.0.4172.19.0.3
|
||||
/airbyte-temporal-ui temporalio/web:1.13.0 172.18.0.3172.19.0.2
|
||||
```
|
||||
|
||||
You should see an entry for `/airbyte-server`, which is the container we've been targeting, so copy its IP address (`172.18.0.9` in the example output above)
|
||||
and replace `localhost` in your IntelliJ Run configuration with the IP address.
|
||||
|
||||
Save your Remote JVM Debug run configuration and run it with the debug option. You should now be able to place breakpoints in any code that is being executed by the
|
||||
`server` container. If you need to debug another container from the original `docker-compose.yaml` file, you could modify the `docker-compose.debug.yaml` file with a similar option.
|
||||
|
||||
### Debugging Containers Launched by the Worker container
|
||||
|
||||
The Airbyte platform launches some containers as needed at runtime, which are not defined in the `docker-compose.yaml` file. These containers are the source or destination
|
||||
tasks, among other things. But if we can't pass environment variables to them through the `docker-compose.debug.yaml` file, then how can we set the
|
||||
`JAVA_TOOL_OPTIONS` environment variable? Well, the answer is that we can _pass it through_ the container which launches the other containers - the `worker` container.
|
||||
|
||||
For this example, let's say that we want to debug something that happens in the `destination-postgres` connector container. To follow along with this example, you will
need to have set up a connection which uses Postgres as a destination; however, a different connector such as `source-postgres` or `destination-bigquery` works just as well.
|
||||
|
||||
In the `docker-compose.debug.yaml` file you should see an entry for the `worker` service which looks like this
|
||||
|
||||
```yaml
|
||||
worker:
|
||||
environment:
|
||||
- DEBUG_CONTAINER_IMAGE=${DEBUG_CONTAINER_IMAGE}
|
||||
- DEBUG_CONTAINER_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005
|
||||
```
|
||||
|
||||
Similar to the previous debugging example, we want to pass an environment variable to the `docker compose` command. This time we're setting the
|
||||
`DEBUG_CONTAINER_IMAGE` environment variable to the name of the container we're targeting. For our example that is `destination-postgres` so run the command:
|
||||
|
||||
```bash
|
||||
DEBUG_CONTAINER_IMAGE="destination-postgres" VERSION="dev" docker compose -f docker-compose.yaml -f docker-compose.debug.yaml up
|
||||
```
|
||||
|
||||
The `worker` container now has an environment variable `DEBUG_CONTAINER_IMAGE` with a value of `destination-postgres`, which it compares against the image name whenever it is
spawning containers. If the container name matches the environment variable, it will set the `JAVA_TOOL_OPTIONS` environment variable in that container to
the value of its `DEBUG_CONTAINER_JAVA_OPTS` environment variable, which is the same value we used in the `server` example.
|
||||
|
||||
#### Connecting the Debugger to a Worker Spawned Container
|
||||
|
||||
To connect your debugger, **the container must be running**. This `destination-postgres` container will only run when we're running one of its tasks,
|
||||
such as when a replication is running. Navigate to a connection in your local Airbyte instance at http://localhost:8000 which uses postgres as a destination.
|
||||
If you ran through the [Postgres to Postgres replication tutorial](https://airbyte.com/tutorials/postgres-replication), you can use this connection.
|
||||
|
||||
On the connection page, trigger a manual sync with the "Sync now" button. Because we set the `suspend` option to `y` in our `JAVA_TOOL_OPTIONS`, the
container will pause all execution until the debugger is connected. This can be very useful for methods which run very quickly, such as the Check method.
However, this could be very detrimental if it were pushed into a production environment. For now, it gives us time to set up a new Remote JVM Debug configuration.
|
||||
|
||||
This container will have a different IP than the `server` Remote JVM Debug run configuration we set up earlier, so let's set up a new one with the IP of
|
||||
the `destination-postgres` container:
|
||||
|
||||
```bash
|
||||
$ docker inspect $(docker ps -q ) --format='{{ printf "%-50s" .Name}} {{printf "%-50s" .Config.Image}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'
|
||||
/destination-postgres-write-52-0-grbsw airbyte/destination-postgres:0.3.26
|
||||
/airbyte-proxy airbyte/proxy:dev 172.18.0.10172.19.0.4
|
||||
/airbyte-worker airbyte/worker:dev 172.18.0.8
|
||||
/airbyte-server airbyte/server:dev 172.18.0.9
|
||||
/airbyte-destination postgres 172.17.0.3
|
||||
/airbyte-source sha256:5eea76716a190d10fd866f5ac6498c8306382f55c6d910231d37a749ad305960 172.17.0.2
|
||||
/airbyte-connector-builder-server airbyte/connector-builder-server:dev 172.18.0.6
|
||||
/airbyte-webapp airbyte/webapp:dev 172.18.0.7
|
||||
/airbyte-cron airbyte/cron:dev 172.18.0.5
|
||||
/airbyte-temporal airbyte/temporal:dev 172.18.0.3
|
||||
/airbyte-db airbyte/db:dev 172.18.0.2172.19.0.3
|
||||
/airbyte-temporal-ui temporalio/web:1.13.0 172.18.0.4172.19.0.2
|
||||
```
|
||||
|
||||
Huh? No IP address, weird. Interestingly enough, all the IPs are sequential but there is one missing, `172.18.0.1`. If we attempt to use that IP in the remote debugger, it works!
|
||||
|
||||
You can now add breakpoints and debug any code which would be executed in the `destination-postgres` container.
|
||||
|
||||
Happy Debugging!
|
||||
|
||||
#### Connecting the Debugger to an Integration Test Spawned Container
|
||||
|
||||
You can also debug code contained in containers spawned in an integration test! This can be used to debug integration tests as well as testing code changes.
|
||||
The steps involved are:
|
||||
|
||||
1. Follow all the steps outlined above to set up the **Remote JVM Debug** run configuration.
|
||||
2. Edit the run configurations associated with the given integration test to include the following environment variables: `DEBUG_CONTAINER_IMAGE=source-postgres;DEBUG_CONTAINER_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005`.
|
||||
Note that you will have to keep repeating this step for every new integration test run configuration you create.
|
||||
3. Run the integration test in debug mode. In the debug tab, open up the Remote JVM Debugger run configuration you just created.
|
||||
4. Keep trying to attach the Remote JVM Debugger. It will likely fail a couple of times before eventually connecting to the test container. If you want a more
deterministic way to connect the debugger, you can set a breakpoint in the `DockerProcessFactory.localDebuggingOptions()` method. Resume the integration test run and
then attempt to attach the Remote JVM Debugger (you still might need a couple of tries).
|
||||
|
||||
## Gotchas
|
||||
|
||||
So now that your debugger is set up, what else is there to know?
|
||||
|
||||
### Code changes
|
||||
|
||||
When you're debugging, you might want to make a code change. Anytime you make a code change, your code will become out of sync with the container which is run by the platform.
|
||||
Essentially this means that after you've made a change you will need to rebuild the docker container you're debugging. Additionally, for the connector containers, you may have to navigate to
|
||||
"Settings" in your local Airbyte Platform's web UI and change the version of the container to `dev`. See you connector's `README` for details on how to rebuild the container image.
|
||||
|
||||
### Ports
|
||||
|
||||
In this tutorial we've been using port `5005` for all debugging; it's the default, so we haven't changed it. If you need to debug _multiple_ containers, however, they will clash on this port,
and you will have to modify your setup so that each debugged container uses a different port that is not in use.
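
For example, to move the `server` debugger off the default port, you could change the `address` value in its agent options and point your Remote JVM Debug run configuration at that port instead (`5006` here is just an arbitrary unused port):

```bash
DEBUG_SERVER_JAVA_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5006" VERSION="dev" docker compose -f docker-compose.yaml -f docker-compose.debug.yaml up
```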
|
||||
@@ -0,0 +1,36 @@
|
||||
# Developing Connectors Locally
|
||||
|
||||
This document outlines the steps to develop a connector locally using the `local-connector-development` script.
|
||||
|
||||
When developing connectors locally, you'll want to install the following tools:
|
||||
|
||||
1. [**Poe the Poet**](#poe-the-poet) - Used as a common task interface for defining and running development tasks.
|
||||
1. [`docker`](#docker) - Used when building and running connector container images.
|
||||
1. [`airbyte-ci`](#airbyte-ci) - Used for a large number of tasks such as building and publishing.
|
||||
1. [`gradle`](#gradle) - Useful when working on Java and Kotlin connectors.
|
||||
|
||||
## Poe the Poet
|
||||
|
||||
Poe the Poet allows you to perform common connector tasks from a single entrypoint.
|
||||
|
||||
To see a list of available tasks, run `poe` from any directory in the `airbyte` repo.
|
||||
|
||||
Notes:
|
||||
|
||||
1. When running `poe` from the root of the repo, you'll have the options `connector`, `source`, and `destination`. These will each pass the tasks you request along to the specified connector's directory.
|
||||
2. When running `poe` from a connector directory, you'll get a specific list of available tasks, like `lint`, `check-all`, etc. The available commands may vary by connector and connector type (java vs python vs manifest-only), so run `poe` on its own to see what commands are available.
|
||||
3. Poe tasks are there to help you, but they are _not_ the only way to run a task. Please feel encouraged to review, copy, paste, or combine steps from the task definitions in the `poe_tasks` directory. And if you find task invocation patterns that are especially helpful, please consider contributing back to those task definition files by creating a new PR.
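
As a small illustration of note 2 above (the connector name is just an example, and the exact task list varies by connector):

```bash
# From inside a connector directory:
cd airbyte-integrations/connectors/source-pokeapi

poe        # list the tasks available for this connector
poe lint   # run one of the listed tasks
```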
|
||||
|
||||
## Docker
|
||||
|
||||
We recommend Docker Desktop, but other container runtimes may also work. A full discussion of how to install and use Docker is outside the scope of this guide.
|
||||
|
||||
See [Debugging Docker](./debugging-docker.md) for common tips and tricks.
|
||||
|
||||
## airbyte-ci
|
||||
|
||||
Airbyte CI (`airbyte-ci`) is a Dagger-based tool for accomplishing specific tasks. See `airbyte-ci --help` for a list of commands you can run.
|
||||
|
||||
## Gradle
|
||||
|
||||
Gradle is used in Java and Kotlin development. A full discussion of how to install and use Gradle is outside the scope of this guide. Similar to running `poe`, you can run `gradle tasks` to view a list of available Gradle development tasks.
|
||||
@@ -0,0 +1,126 @@
|
||||
# Requirements for Airbyte Partner Connectors: Bulk and Publish Destinations
|
||||
|
||||
## Welcome
|
||||
|
||||
**Thank you for contributing and committing to maintain your Airbyte destination connector 🥂**
|
||||
|
||||
This document outlines the minimum expectations for a partner-certified destination. We **strongly** recommend that partners use the relevant CDK, but we also want to support developers who *need* to develop in a different language. This document covers concepts implicitly built into our CDKs for that use case.
|
||||
|
||||
## Definitions
|
||||
**Partner Certified Destination:** A destination which is fully supported by the maintainers of the platform that is being loaded to. These connectors are not guaranteed by Airbyte directly, but instead the maintainers of the connector contribute fixes and improvements to ensure a quality experience for Airbyte users. Partner destinations are noted as such with a special “Partner” badge on the Integrations page, distinguishing them from other community maintained connectors on the Marketplace.
|
||||
|
||||
|
||||
**Bulk Destinations:** A destination which accepts tables and columns as input, files, or otherwise unconstrained content. The majority of bulk destinations are database-like tabular (warehouses, data lakes, databases), but may also include file or blob destinations. The defining characteristic of bulk destinations is that they accept data in the shape of the source (e.g. tables, columns or content doesn’t change much from the representation of the source). These destinations can usually hold large amounts of data, and are the fastest to load.
|
||||
|
||||
**Publish Destinations:** A publish-type destination, often called a “reverse ETL” destination, loads data to an external service or API. These destinations may be “picky”, having specific schema requirements for incoming streams. Common publish-type use cases include: publishing data to a REST API, publishing data to a messaging endpoint (e.g. email, push notifications, etc.), and publishing data to an LLM vector store. Specific examples include: Destination-Pinecone, Destination-Vectara, and Destination-Weaviate. These destinations can usually hold finite amounts of data, and are slower to load.
|
||||
|
||||
## “Partner-Certified” Listing Requirements:
|
||||
|
||||
### Issue Tracking:
|
||||
Create a public GitHub repo/project to be shared with Airbyte and its users.
|
||||
|
||||
### Airbyte Communications:
|
||||
Monitor a Slack channel for communications directly from the Airbyte Support and Development teams.
|
||||
|
||||
### SLAs:
|
||||
Respect a maximum first-response time of 3 business days for customer inquiries or bug reports.
|
||||
|
||||
### Metrics:
|
||||
Maintain >=95% first-sync success and >=95% overall sync success on your destination connector. _Note: config_errors are not counted against this metric._
|
||||
|
||||
### Platform Updates:
|
||||
Adhere to a regular update cadence for the relevant Airbyte-managed CDK, or commit to updating your connector to meet any new platform requirements at least once every 6 months.
|
||||
|
||||
### Connector Updates:
|
||||
Important bugs are audited and major problems are solved within a reasonable timeframe.
|
||||
|
||||
### Security:
|
||||
Validate that the connector is using HTTPS and secure-only access to customer data.
|
||||
|
||||
|
||||
## Functional Requirements of Certified Destinations:
|
||||
|
||||
### Protocol
|
||||
|
||||
We won’t call out every requirement of the Airbyte Protocol (link) but below are important requirements that are specific to Destinations and/or specific to Airbyte 1.0 Destinations.
|
||||
|
||||
* Destinations must capture state messages from sources, and must emit those state messages to STDOUT only after all preceding records have been durably committed to the destination
|
||||
* The Airbyte platform interprets state messages emitted from the destination as a logical checkpoint. Destinations must emit all of the state messages they receive, and only after records have been durably written and/or committed to the destination’s long-term storage.
|
||||
* If a destination emits the source’s state message before preceding records are finalized, this is an error.
|
||||
* _Note: In general, state handling should always be handled by the respective CDK. Destination authors should not attempt to handle this themselves._
|
||||
|
||||
* Destinations must append record counts to the Source’s state message before emitting (New for Airbyte 1.0)
|
||||
* For each state record emitted, the destination should attach to the state message the count of records processed and associated with that state message.
|
||||
* This should always be handled by the Python or Java CDK. Destination authors should not attempt to handle this themselves.
|
||||
|
||||
* State messages should be emitted with no gap longer than 15 minutes
|
||||
  * Checkpointing requires committing records and emitting state at least every 15 minutes. When batching records for efficiency, the destination should also include logic to finalize batches approximately every 10 minutes, or whatever interval is appropriate to meet the minimum 15-minute checkpoint frequency.
|
||||
  * This measure reduces risk for users and improves the efficiency of retries, should an error occur in either the source or the destination.
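
As a rough illustration of the state and record-count requirements in the list above, a stream-scoped state message emitted by the destination might look like the following. The stream name, cursor value, and count are invented for this sketch; the exact message structure is defined by the Airbyte protocol, and in practice the CDK assembles and emits this for you:

```json
{
  "type": "STATE",
  "state": {
    "type": "STREAM",
    "stream": {
      "stream_descriptor": { "name": "users" },
      "stream_state": { "updated_at": "2024-05-01T00:00:00Z" }
    },
    "destinationStats": { "recordCount": 1500.0 }
  }
}
```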
|
||||
|
||||
### Idempotence
|
||||
|
||||
Syncs should always be re-runnable without negative side effects. For instance, if the table is loaded multiple times, the destination should dedupe records according to the provided primary key information if and when available.
|
||||
|
||||
If deduping is disabled, then loads should either fully replace or append to destination tables - according to the user-provided setting in the configured catalog.
|
||||
|
||||
### Exceptions
|
||||
|
||||
**Bulk Destinations** should handle metadata and logging of exceptions in a consistent manner.
|
||||
|
||||
_Note: Because **Publish Destinations** have little control over table structures, these constraints do not apply to Publish or Reverse-ETL Destinations; vector store destinations, for instance, are exempt._
|
||||
|
||||
* Columns should include all top-level field declarations.
|
||||
* Destination tables should have column definitions for each declared top-level property. For instance, if a stream has a “user_id” property, the destination table should contain a “user_id” column.
|
||||
* Casing may be normalized (for instance, all-caps or all-lower-case) according to the norms and/or expectations for the destination systems. (For example, Snowflake only works as expected when you normalize to all-caps.)
|
||||
|
||||
* Tables should always include the following Airbyte metadata columns: `_airbyte_meta`, `_airbyte_extracted_at`, and `_airbyte_raw_id`
|
||||
* These column names should be standard across tabular destinations, including all SQL-type and file-type destinations.
|
||||
|
||||
* Bulk Destinations must utilize `_airbyte_meta.changes[]` to record in-flight fixes or changes
|
||||
* This includes logging information on any fields that had to be nullified due to destination capacity restrictions (e.g. data could not fit), and/or problematic input data (e.g. impossible date or out-of-range date).
|
||||
  * It’s also OK for the destination to make record changes (e.g. property too large to fit) as long as the change doesn’t apply to the PK or cursor, and the change is recorded in `_airbyte_meta.changes[]` as well.
|
||||
|
||||
* Bulk Destinations must accept new columns arriving from the source. (“Schema Evolution”)
|
||||
* Tabular destinations should be consistent in how they handle schema evolutions over the period of a connection’s lifecycle, including gracefully handling expected organic schema evolutions, including the addition of new columns after the initial sync.
|
||||
* A new column arriving in the source data should never be a breaking change. Destinations should be able to detect the arrival of a new column and automatically add it to the destination table. (The platform will throttle this somewhat, according to the users’ preference.)
|
||||
|
||||
### Configuration Requirements
|
||||
|
||||
All destinations are required to adhere to standard configuration practices for connectors. These requirements include, but are not limited to the following:
|
||||
|
||||
* The connector `SPEC` output should include RegEx validation rules for configuration parameters. These will be used in the Airbyte Platform UI to pre-validate user inputs, and provide appropriate guidance to users during setup.
|
||||
* The `CHECK` operation should consider all configuration inputs and produce reasonable error messages for most common configuration errors.
|
||||
* All customer secrets specified in `SPEC` should be properly annotated with `"airbyte_secret" : true` in the config requirements. This informs the Airbyte Platform that values should not be echoed to the screen during user input, and it ensures that secrets are properly handled as such when storing and retrieving settings in the backend.
|
||||
* The connector’s manifest must specify `AllowedHosts` - limiting which APIs/IPs this connector can communicate with.
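
To make the validation and secret-handling requirements in the list above concrete, here is a minimal, hypothetical fragment of a `SPEC` properties block; the property names, pattern, and descriptions are invented for illustration:

```json
{
  "api_key": {
    "type": "string",
    "title": "API Key",
    "airbyte_secret": true,
    "description": "The API key used to authenticate. Never displayed or logged."
  },
  "account_id": {
    "type": "string",
    "title": "Account ID",
    "pattern": "^[0-9]{8}$",
    "description": "Eight-digit numeric account identifier, validated in the UI before the configuration is saved."
  }
}
```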
|
||||
|
||||
### Data Fidelity and Data Types
|
||||
|
||||
**Every attempt should be made to ensure data does not lose fidelity during transit and that syncs do not fail due to data type mapping issues.**
|
||||
|
||||
_Note: Publish-type destinations may be excluded from some or all of the below rules, if they are constrained to use predefined types. In these cases, the destination should aim to fail early so the user can reconfigure their source before causing any data corruption or data inconsistencies from partially-loaded datasets._
|
||||
|
||||
|
||||
* Data types should be _at least_ as large as needed to store incoming data.
|
||||
  * Larger types should be preferred in cases where there is a choice; for instance, given a choice between INT32 and INT64, the latter should be preferred.
|
||||
|
||||
|
||||
* Floats should be handled with the maximum possible size for floating point numbers
|
||||
* Normally this means a `double` precision floating point type.
|
||||
|
||||
|
||||
* Decimals should be handled with the largest-possible precision and scale, generally `DECIMAL(38, 9)`
|
||||
* This allows for very-large integers (for example, Active Directory IDs) as well as very precise small numbers - to the 9th decimal place.
|
||||
* Floating point storage should _not_ be used for decimals and other numeric values unless they are declared as `float` in the source catalog.
|
||||
|
||||
|
||||
* Destinations should always have a “failsafe” type they can use, in case the source type is not known
|
||||
* A classic example of this is receiving a column with the type `anyOf(string, object)`. In the case that a good type cannot be chosen, we should fall back to _either_ string types _or_ variable/variant/json types.
|
||||
* The failsafe type ensures that data loads will not fail, even when there is a failure to recognize or parse the declared data type.
|
||||
|
||||
### Error Handling
|
||||
|
||||
Any errors must be logged by the destination using an approved protocol. Silent errors are not permitted, but we bias towards _not_ failing an entire sync when other valid records are able to be written. If errors cannot be logged using an approved protocol, the destination _must fail_ and should raise the error to the attention of the user and the platform.
|
||||
|
||||
**Bulk Destinations:** Errors should be recorded along with the record data, in the `_airbyte_meta` column, under the `_airbyte_meta.changes` key.
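
As a loose sketch only, a record's `_airbyte_meta` value might then carry an entry like the one below for a field that had to be nullified. The field name and the change/reason labels are illustrative; the exact vocabulary is defined by the Airbyte protocol and the CDKs:

```json
{
  "changes": [
    {
      "field": "description",
      "change": "NULLED",
      "reason": "DESTINATION_FIELD_SIZE_LIMITATION"
    }
  ]
}
```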
|
||||
|
||||
**Publish Destinations:** In the absence of another specific means of communicating to the user that there was an issue, the destination _must fail_ if it is not able to write data to the destination platform. (Additional approved logging protocols may be added in the future for publish-type destinations, for instance dead letter queues, destination-specific state artifacts, and/or other durable storage media that could be configured by the user.)
|
||||
73
docs/platform/connector-development/schema-reference.md
Normal file
@@ -0,0 +1,73 @@
|
||||
# Schema Reference
|
||||
|
||||
:::note
|
||||
You only need this if you're building a connector with Python or Java CDKs.
|
||||
If you're using Connector Builder, you can use [declared schemas](./connector-builder-ui/record-processing#declared-schema) instead.
|
||||
:::
|
||||
|
||||
This document provides instructions on how to create a static schema for your Airbyte stream, which is necessary for integrating data from various sources.
|
||||
You can check out all the supported data types and examples at [this link](../understanding-airbyte/supported-data-types.md).
|
||||
|
||||
For instance, consider the following example record returned by the source:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "hashidstring",
|
||||
"date_created": "2022-11-22T01:23:45",
|
||||
"date_updated": "2023-12-22T01:12:00",
|
||||
"total": 1000,
|
||||
"status": "published",
|
||||
"example_obj": {
|
||||
"steps": "walking",
|
||||
"count_steps": 30
|
||||
},
|
||||
"example_string_array": ["first_string", "second_string"]
|
||||
}
|
||||
```
|
||||
|
||||
The schema is then translated into the following JSON format. Please note that it's essential to include the `$schema`, `type`, and `additionalProperties: true` fields in your schema. Typically, Airbyte schemas allow `null` for each field to keep the stream reliable when a field doesn't receive any data.
|
||||
|
||||
```json
|
||||
{
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
"type": "object",
|
||||
"additionalProperties": true,
|
||||
"properties": {
|
||||
"id": {
|
||||
"type": ["null", "string"]
|
||||
},
|
||||
"date_created": {
|
||||
"format": "date-time",
|
||||
"type": ["null", "string"]
|
||||
},
|
||||
"date_updated": {
|
||||
"format": "date-time",
|
||||
"type": ["null", "string"]
|
||||
},
|
||||
"total": {
|
||||
"type": ["null", "integer"]
|
||||
},
|
||||
"status": {
|
||||
"type": ["string", "null"],
|
||||
"enum": ["published", "draft"]
|
||||
},
|
||||
"example_obj": {
|
||||
"type": ["null", "object"],
|
||||
"additionalProperties": true,
|
||||
"properties": {
|
||||
"steps": {
|
||||
"type": ["null", "string"]
|
||||
},
|
||||
"count_steps": {
|
||||
"type": ["null", "integer"]
|
||||
}
|
||||
}
|
||||
},
|
||||
"example_string_array": {
|
||||
"items": {
|
||||
"type": ["null", "string"]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,58 @@
|
||||
# Testing Connectors
|
||||
|
||||
Multiple test suites compose the Airbyte connector testing pyramid:
|
||||
|
||||
## Tests run by our CI pipeline
|
||||
|
||||
- [Connector QA Checks](https://docs.airbyte.com/contributing-to-airbyte/resources/qa-checks): Static asset checks that validate that a connector is correctly packaged to be successfully released to production.
|
||||
- Unit Tests: Connector-specific tests written by the connector developer which don’t require access to the source/destination.
|
||||
- Integration Tests: Connector-specific tests written by the connector developer which _may_ require access to the source/destination.
|
||||
- [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference/): Connector-agnostic tests that verify that a connector adheres to the [Airbyte protocol](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol). Credentials to a source/destination sandbox account are **required**.
|
||||
- [Regression Tests](https://github.com/airbytehq/airbyte/tree/master/airbyte-ci/connectors/live-tests): Connector-agnostic tests that verify that the behavior of a connector hasn’t changed unexpectedly between connector versions. A sandbox cloud connection is required. Currently only available for API source connectors.
|
||||
|
||||
|
||||
## 🤖 CI
|
||||
|
||||
If you want to run the global test suite, exactly like what is run in CI, you should install [`airbyte-ci` CLI](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md) and use the following command:
|
||||
|
||||
```bash
|
||||
airbyte-ci connectors --name=<connector_name> test
|
||||
```
|
||||
|
||||
CI will run all the tests that are available for a connector. This can include all of the tests listed above, if we have the appropriate credentials. At a minimum, it will include the Connector QA checks and any tests that exist in a connector's `unit_tests` and `integration_tests` directories.
|
||||
To run Connector Acceptance tests locally, you must provide connector configuration as a `config.json` file in a `.secrets` folder in the connector code directory.
|
||||
Regression tests may only be run locally with authorization to our cloud resources.
|
||||
|
||||
Our CI infrastructure runs the connector tests with the [`airbyte-ci` CLI](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md). Connector tests are automatically and remotely triggered according to the changes made in your branch.
|
||||
**Passing tests are required to merge a connector pull request.**
|
||||
|
||||
## Connector specific tests
|
||||
|
||||
### 🐍 Python connectors
|
||||
|
||||
We use `pytest` to run unit and integration tests:
|
||||
|
||||
```bash
|
||||
# From connector directory
|
||||
poetry run pytest
|
||||
```
|
||||
|
||||
### ☕ Java connectors
|
||||
|
||||
:::warning
|
||||
Airbyte is undergoing a major revamp of the shared core Java destinations codebase, with plans to release a new CDK in 2024.
|
||||
We are actively working on improving usability, speed (through asynchronous loading), and implementing [Typing and Deduplication](/platform/using-airbyte/core-concepts/typing-deduping) (Destinations V2).
|
||||
For this reason, Airbyte is not reviewing/accepting new Java connectors for now.
|
||||
:::
|
||||
|
||||
We run Java connector tests with gradle.
|
||||
|
||||
```bash
|
||||
# Unit tests
|
||||
./gradlew :airbyte-integrations:connectors:source-postgres:test
|
||||
# Integration tests
|
||||
./gradlew :airbyte-integrations:connectors:source-postgres:integrationTestJava
|
||||
```
|
||||
|
||||
Please note that according to the test implementation you might have to provide connector configurations as a `config.json` file in a `.secrets` folder in the connector code directory.
|
||||
|
||||
@@ -0,0 +1,529 @@
|
||||
# Acceptance Tests Reference
|
||||
|
||||
To ensure a minimum quality bar, Airbyte runs all connectors against the same set of integration tests. Those tests ensure that each connector adheres to the [Airbyte Specification](../../understanding-airbyte/airbyte-protocol.md) and responds correctly to Airbyte commands when provided valid \(or invalid\) inputs.
|
||||
|
||||
_Note: If you are looking for reference documentation for the deprecated first version of test suites, see_ [_Standard Tests \(Legacy\)_](https://github.com/airbytehq/airbyte/tree/e378d40236b6a34e1c1cb481c8952735ec687d88/docs/contributing-to-airbyte/building-new-connector/legacy-standard-source-tests.md)_._
|
||||
|
||||
## Architecture of standard tests
|
||||
|
||||
The Standard Test Suite runs its tests against the connector's Docker image. It takes as input the configuration file `acceptance-test-config.yml`.
|
||||
|
||||

|
||||
|
||||
The Standard Test Suite uses pytest as a test runner and was built as the pytest plugin `connector-acceptance-test`. This plugin adds a new configuration option, `--acceptance-test-config`; it should point to the folder containing `acceptance-test-config.yml`.
|
||||
|
||||
Each test suite has a timeout and will fail if the limit is exceeded.
|
||||
|
||||
See all the test cases, their description, and inputs described in the sections below.
|
||||
|
||||
## Setting up standard acceptance tests for your connector
|
||||
|
||||
Create `acceptance-test-config.yml`. In most cases, your connector already has this file in its root folder. Here is an example of the minimal `acceptance-test-config.yml`:
|
||||
|
||||
```yaml
|
||||
connector_image: airbyte/source-some-connector:dev
|
||||
acceptance-tests:
|
||||
spec:
|
||||
tests:
|
||||
- spec_path: "some_folder/spec.yaml"
|
||||
```
|
||||
|
||||
_Note: Not all types of tests work for all connectors; only configure the ones that make sense in your scenario. The `spec` and `check` test suites are universal for all sources and destinations, the other test suites are only applicable to sources, and the `incremental` test suite is only applicable if the source connector supports incremental syncs._
|
||||
|
||||
Build your connector image if needed.
|
||||
|
||||
**Option A (Preferred): Building the docker image with `airbyte-ci`**
|
||||
|
||||
This is the preferred method for building and testing connectors.
|
||||
|
||||
If you want to open source your connector we encourage you to use our [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md) tool to build your connector.
|
||||
It will not use a Dockerfile but will build the connector image from our [base image](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/base_images/README.md) and use our internal build logic to build an image from your Python connector code.
|
||||
|
||||
Running `airbyte-ci connectors --name source-<source-name> build` will build your connector image.
|
||||
Once the command is done, you will find your connector image in your local docker host: `airbyte/source-<source-name>:dev`.
|
||||
|
||||
**Option B: Building the docker image with a Dockerfile**
|
||||
|
||||
If you don't want to rely on `airbyte-ci` to build your connector, you can build the docker image using your own Dockerfile. This method is not preferred, and is not supported for certified connectors.
|
||||
|
||||
Create a `Dockerfile` in the root of your connector directory. The `Dockerfile` should look something like this:
|
||||
|
||||
```Dockerfile
|
||||
|
||||
FROM airbyte/python-connector-base:1.1.0
|
||||
|
||||
COPY . ./airbyte/integration_code
|
||||
RUN pip install ./airbyte/integration_code
|
||||
|
||||
# The entrypoint and default env vars are already set in the base image
|
||||
# ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
|
||||
# ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
|
||||
```
|
||||
|
||||
Please use this as an example. This is not optimized.
|
||||
|
||||
Build your image:
|
||||
|
||||
```bash
|
||||
docker build . -t airbyte/source-example-python:dev
|
||||
```
|
||||
|
||||
Then test via one of the two following options:
|
||||
|
||||
### Option 1 (Preferred): Run against the Airbyte CI test suite
|
||||
|
||||
Learn how to use and install [`airbyte-ci` here](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md). Once installed, the `airbyte-ci connectors test` command will run unit, integration, and acceptance tests against your connector. Pass `--name <your_connector_name>` to test just one connector.
|
||||
|
||||
```bash
|
||||
airbyte-ci connectors --name=<name-of-your-connector> test
|
||||
```
|
||||
|
||||
### Option 2 (Debugging): Run against the acceptance tests on your branch
|
||||
|
||||
This will run the acceptance test suite directly with pytest, allowing you to set breakpoints and debug your connector locally.
|
||||
|
||||
The only prerequisite is that you have [Poetry](https://python-poetry.org/docs/#installation) installed.
|
||||
|
||||
Afterwards, run the following from the root of the `airbyte` repo:
|
||||
|
||||
```bash
|
||||
cd airbyte-integrations/bases/connector-acceptance-test/
|
||||
poetry install
|
||||
poetry run pytest -p connector_acceptance_test.plugin --acceptance-test-config=../../connectors/<your-connector> --pdb
|
||||
```
|
||||
|
||||
See other useful pytest options [here](https://docs.pytest.org/en/stable/usage.html)
|
||||
See a more comprehensive guide in our README [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/bases/connector-acceptance-test/README.md)
|
||||
|
||||
## Dynamically managing inputs & resources used in standard tests
|
||||
|
||||
Since the inputs to standard tests are often static, the file-based runner is sufficient for most connectors. However, in some cases, you may need to run pre or post hooks to dynamically create or destroy resources for use in standard tests. For example, if we need to spin up a Redshift cluster to use in the test then tear it down afterwards, we need the ability to run code before and after the tests, as well as customize the Redshift cluster URL we pass to the standard tests. If you have need for this use case, please reach out to us via [Github](https://github.com/airbytehq/airbyte) or [Slack](https://slack.airbyte.io). We currently support it for Java & Python, and other languages can be made available upon request.
|
||||
|
||||
### Python
|
||||
|
||||
Create a pytest yield fixture with your custom setup/teardown code and place it in `integration_tests/acceptance.py`. Here is an example of a fixture that starts a Docker container before the tests and stops it on exit:
|
||||
|
||||
```python
|
||||
import docker
import pytest


@pytest.fixture(scope="session", autouse=True)
def connector_setup():
    """This fixture is a placeholder for external resources that acceptance tests might require."""
    # Start the external resource (for example, a Docker container) before the test session...
    client = docker.from_env()
    container = client.containers.run("your/docker-image", detach=True)
    yield
    # ...and stop it once the test session is over.
    container.stop()
|
||||
```
|
||||
|
||||
These tests are configurable via `acceptance-test-config.yml`. Each test has a number of inputs; you can provide multiple sets of inputs, which will cause the same test to run multiple times, once for each set of inputs.
|
||||
|
||||
Example of `acceptance-test-config.yml`:
|
||||
|
||||
```yaml
|
||||
connector_image: string # Docker image to test, for example 'airbyte/source-pokeapi:0.1.0'
|
||||
base_path: string # Base path for all relative paths, optional, default - ./
|
||||
custom_environment_variables:
|
||||
foo: bar
|
||||
acceptance_tests: # Tests configuration
|
||||
spec: # list of the test inputs
|
||||
bypass_reason: "Explain why you skipped this test"
|
||||
connection: # list of the test inputs
|
||||
tests:
|
||||
- config_path: string # set #1 of inputs
|
||||
status: string
|
||||
- config_path: string # set #2 of inputs
|
||||
status: string
|
||||
# discovery: # skip this test
|
||||
incremental:
|
||||
bypass_reason: "Incremental sync are not supported on this connector"
|
||||
```
|
||||
|
||||
## Test Spec
|
||||
|
||||
Verify that a `spec` operation issued to the connector returns a valid connector specification.
|
||||
Additional tests validate the backward compatibility of the current specification against the specification of the previous connector version. If no previous connector version is found (by default the test looks for a docker image with the same name but with the `latest` tag), this test is skipped.
|
||||
These backward compatibility tests can be bypassed by changing the value of the `backward_compatibility_tests_config.disable_for_version` input in `acceptance-test-config.yml` (see below).
|
||||
Another test validates that the specification does not contain exposed secrets. This means fields that could potentially hold a secret value should be explicitly marked with `"airbyte_secret": true`. If an input field like `api_key` / `password` / `client_secret` / etc. is exposed, the test will fail.
|
||||
The inputs in the table are set under the `acceptance_tests.spec.tests` key.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :--------------------------------------------------------------- | :------ | :------------------ | :-------------------------------------------------------------------------------------------------------------------- |
|
||||
| `spec_path` | string | `secrets/spec.json` | Path to a YAML or JSON file representing the spec expected to be output by this connector |
|
||||
| `backward_compatibility_tests_config.previous_connector_version` | string | `latest` | Previous connector version to use for backward compatibility tests (expects a version following semantic versioning). |
|
||||
| `backward_compatibility_tests_config.disable_for_version` | string | None | Disable the backward compatibility test for a specific version (expects a version following semantic versioning). |
|
||||
| `timeout_seconds` | int | 10 | Test execution timeout in seconds |
|
||||
| `auth_default_method`                                             | object  | None                | Ensure that OAuth is the default method, if the source uses OAuth                                                      |
|
||||
| `auth_default_method.oauth`                                       | boolean | True                | Validate that OAuth is the default method if set to True                                                               |
|
||||
| `auth_default_method.bypass_reason`                               | string  |                     | Reason why OAuth is not the default method                                                                             |
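
For example, a `spec` test entry that skips the backward compatibility check for a single version might look like this (the spec path and version are illustrative):

```yaml
acceptance_tests:
  spec:
    tests:
      - spec_path: "source_example/spec.yaml"
        backward_compatibility_tests_config:
          disable_for_version: "0.2.0" # skip the backward compatibility check for this version only
```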
|
||||
|
||||
## Test Connection
|
||||
|
||||
Verify that a check operation issued to the connector with the input config file returns a successful response.
|
||||
The inputs in the table are set under the `acceptance_tests.connection.tests` key.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :---------------- | :----------------------------- | :-------------------- | :----------------------------------------------------------------- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `status` | `succeed` `failed` `exception` | | Indicate if connection check should succeed with provided config |
|
||||
| `timeout_seconds` | int | 30 | Test execution timeout in seconds |
|
||||
|
||||
## Test Discovery
|
||||
|
||||
Verifies that when a `discover` operation is run on the connector using the given config file, the connector produces a valid catalog.
|
||||
Additional tests validate the backward compatibility of the discovered catalog against the catalog of the previous connector version. If no previous connector version is found (by default the test looks for a docker image with the same name but with the `latest` tag), this test is skipped.
|
||||
These backward compatibility tests can be bypassed by changing the value of the `backward_compatibility_tests_config.disable_for_version` input in `acceptance-test-config.yml` (see below).
|
||||
The inputs in the table are set under the `acceptance_tests.discovery.tests` key.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
|:-----------------------------------------------------------------|:--------|:--------------------------------------------|:----------------------------------------------------------------------------------------------------------------------|
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `timeout_seconds` | int | 30 | Test execution timeout in seconds |
|
||||
| `backward_compatibility_tests_config.previous_connector_version` | string | `latest` | Previous connector version to use for backward compatibility tests (expects a version following semantic versioning). |
|
||||
| `backward_compatibility_tests_config.disable_for_version` | string | None | Disable the backward compatibility test for a specific version (expects a version following semantic versioning). |
|
||||
| `validate_primary_keys_data_type` | boolean | True | Verify that primary keys data types are correct |
|
||||
|
||||
## Test Basic Read
|
||||
|
||||
This test configures all streams in the input catalog to full refresh mode and verifies that a read operation produces some RECORD messages. Each stream should have some data; if you can't guarantee this for particular streams, add them to the `empty_streams` list.
|
||||
Set `validate_data_points=True` if possible. This validation will be enabled by default and won't be configurable in future releases.
|
||||
The inputs in the table are set under the `acceptance_tests.basic_read.tests` key.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
|:------------------------------------------------|:-----------------|:--------------------------------------------|:-------------------------------------------------------------------------------------------------------------|
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `empty_streams` | array of objects | \[\] | List of streams that might be empty with a `bypass_reason` |
|
||||
| `empty_streams[0].name` | string | | Name of the empty stream |
|
||||
| `empty_streams[0].bypass_reason` | string | None | Reason why this stream is empty |
|
||||
| `ignored_fields[stream][0].name` | string | | Name of the ignored field |
|
||||
| `ignored_fields[stream][0].bypass_reason` | string | None | Reason why this field is ignored |
|
||||
| `validate_schema` | boolean | True | Verify that structure and types of records matches the schema from discovery command |
|
||||
| `validate_stream_statuses` | boolean | False | Ensure that all streams emit status messages |
|
||||
| `validate_state_messages` | boolean | True | Ensure that state messages emitted as expected |
|
||||
| `validate_primary_keys_data_type` | boolean | True | Verify that primary keys data types are correct |
|
||||
| `fail_on_extra_columns` | boolean | True | Fail schema validation if undeclared columns are found in records. Only relevant when `validate_schema=True` |
|
||||
| `validate_data_points` | boolean | False | Validate that all fields in all streams contained at least one data point |
|
||||
| `timeout_seconds` | int | 5\*60 | Test execution timeout in seconds |
|
||||
| `expect_trace_message_on_failure` | boolean | True | Ensure that a trace message is emitted when the connector crashes |
|
||||
| `expect_records` | object | None | Compare produced records with expected records, see details below |
|
||||
| `expect_records.path` | string | | File with expected records |
|
||||
| `expect_records.bypass_reason` | string | | Explain why this test is bypassed |
|
||||
| `expect_records.exact_order` | boolean | False | Ensure that records produced in exact same order |
|
||||
| `file_types` | object | None | Configure file-based connectors specific tests |
|
||||
| `file_types.skip_test` | boolean | False | Skip file-based connectors specific tests for the current config with a `bypass_reason` |
|
||||
| `file_types.bypass_reason` | string | None | Reason why file-based connectors specific tests are skipped |
|
||||
| `file_types.unsupported_types` | array of objects | None | Configure file types which are not supported by a source |
|
||||
| `file_types.unsupported_types[0].extension` | string | | File type in `.csv` format which cannot be added to a test account |
|
||||
| `file_types.unsupported_types[0].bypass_reason` | string | None | Reason why this file type cannot be added to a test account |
|
||||
|
||||
`expect_records` is a nested configuration; if omitted, the part of the test responsible for record matching will be skipped.
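
As a sketch of how a few of these inputs fit together in `acceptance-test-config.yml` (the stream name, bypass reason, and paths are illustrative):

```yaml
acceptance_tests:
  basic_read:
    tests:
      - config_path: "secrets/config.json"
        empty_streams:
          - name: "audit_logs"
            bypass_reason: "The sandbox account has no audit events"
        expect_records:
          path: "integration_tests/expected_records.jsonl"
          exact_order: false
```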
|
||||
|
||||
### Schema format checking
|
||||
|
||||
If a field has a [format](https://json-schema.org/understanding-json-schema/reference/string.html#format) attribute specified in its catalog JSON schema, the Connector Acceptance Test framework checks values against that format. It supports all [built-in](https://json-schema.org/understanding-json-schema/reference/string.html#built-in-formats) JSON Schema formats from the draft 7 specification: email, hostnames, IP addresses, time, date, and date-time.
|
||||
|
||||
Note: For date-time we do not check compliance with ISO 8601 (or RFC 3339 as a subset of it). Since the specified format is used to set the database column type during the normalization stage, values should be compatible with the BigQuery [timestamp](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp_type) type and the SQL "timestamp with time zone" format.
|
||||
|
||||
### Example of `expected_records.jsonl`:
|
||||
|
||||
In general, `expected_records.jsonl` should contain a subset of the records output by the particular stream you need to test. The required fields are `stream`, `data`, and `emitted_at`.
|
||||
|
||||
```javascript
|
||||
{"stream": "my_stream", "data": {"field_1": "value0", "field_2": "value0", "field_3": null, "field_4": {"is_true": true}, "field_5": 123}, "emitted_at": 1626172757000}
|
||||
{"stream": "my_stream", "data": {"field_1": "value1", "field_2": "value1", "field_3": null, "field_4": {"is_true": false}, "field_5": 456}, "emitted_at": 1626172757000}
|
||||
{"stream": "my_stream", "data": {"field_1": "value2", "field_2": "value2", "field_3": null, "field_4": {"is_true": true}, "field_5": 678}, "emitted_at": 1626172757000}
|
||||
{"stream": "my_stream", "data": {"field_1": "value3", "field_2": "value3", "field_3": null, "field_4": {"is_true": false}, "field_5": 91011}, "emitted_at": 1626172757000}
|
||||
```
|
||||
|
||||
## Test Full Refresh sync
|
||||
|
||||
The inputs in the tables below are set under the `acceptance_tests.full_refresh.tests` key.
|
||||
|
||||
### TestSequentialReads
|
||||
|
||||
This test performs two read operations on all streams which support full refresh syncs. It then verifies that the RECORD messages output from both were identical or the former is a strict subset of the latter.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :---------------------------------------- | :----- | :------------------------------------------ | :--------------------------------------------------------------------- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `timeout_seconds` | int | 20\*60 | Test execution timeout in seconds |
|
||||
| `ignored_fields` | dict | None | For each stream, list of fields path ignoring in sequential reads test |
|
||||
| `ignored_fields[stream][0].name` | string | | Name of the ignored field |
|
||||
| `ignored_fields[stream][0].bypass_reason` | string | None | Reason why this field is ignored |
|
||||
|
||||
## Test Incremental sync
|
||||
|
||||
The inputs in the tables below are set under the `acceptance_tests.incremental.tests` key.
|
||||
|
||||
### TestTwoSequentialReads
|
||||
|
||||
This test verifies that all streams in the input catalog which support incremental sync can do so correctly. It does this by running two read operations: the first takes the configured catalog and config provided to this test as input. It then verifies that the sync produced a non-zero number of `RECORD` and `STATE` messages. The second read takes the same catalog and config used in the first test, plus the last `STATE` message output by the first read operation as the input state file. It verifies that either no records are produced (since we read all records in the first sync) or all records that are produced have a cursor value greater than or equal to the cursor value from the `STATE` message. This test is performed only for streams that support incremental. Streams that do not support incremental sync are ignored. If no streams in the input catalog support incremental sync, this test is skipped.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :------------------------ | :----- | :------------------------------------------ | :----------------------------------------------------------------- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `timeout_seconds` | int | 20\*60 | Test execution timeout in seconds |
|
||||
|
||||
### TestReadSequentialSlices
|
||||
|
||||
This test offers more comprehensive verification that all streams in the input catalog which support incremental syncs perform the sync correctly. It does so in two phases. The first phase uses the configured catalog and config provided to this test as input to make a request to the partner API and assemble the complete set of messages to be synced. It then verifies that the sync produced a non-zero number of `RECORD` and `STATE` messages. This set of messages is partitioned into batches of a `STATE` message followed by zero or more `RECORD` messages. For each batch of messages, the initial `STATE` message is used as input for a read operation to get records with respect to the cursor. The test then verifies that all of the `RECORD` messages retrieved have a cursor value greater than or equal to the cursor from the current `STATE` message. This test is performed only for streams that support incremental sync; streams that do not are ignored. If no streams in the input catalog support incremental sync, this test is skipped.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :------------------------------------- | :----- | :------------------------------------------ | :----------------------------------------------------------------------------------------------------------------- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog |
|
||||
| `timeout_seconds` | int | 20\*60 | Test execution timeout in seconds |
|
||||
| `skip_comprehensive_incremental_tests` | bool | false | For non-GA and in-development connectors, control whether the more comprehensive incremental tests will be skipped |
|
||||
|
||||
**Note that this test samples a fraction of stream slices across an incremental sync in order to reduce test duration and avoid spamming partner APIs.**
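
As a sketch, an `incremental` test entry for an in-development connector that skips the comprehensive test could look like this (paths are illustrative):

```yaml
acceptance_tests:
  incremental:
    tests:
      - config_path: "secrets/config.json"
        configured_catalog_path: "integration_tests/configured_catalog.json"
        skip_comprehensive_incremental_tests: true
```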
|
||||
|
||||
### TestStateWithAbnormallyLargeValues
|
||||
|
||||
This test verifies that a sync produces no records when run with a `STATE` containing abnormally large cursor values.
|
||||
|
||||
| Input                     | Type   | Default                                     | Note                                                                |
| :------------------------ | :----- | :------------------------------------------ | :------------------------------------------------------------------ |
| `config_path`             | string | `secrets/config.json`                       | Path to a JSON object representing a valid connector configuration |
| `configured_catalog_path` | string | `integration_tests/configured_catalog.json` | Path to configured catalog                                          |
| `future_state_path`       | string | None                                        | Path to the state file with abnormally large cursor values         |
| `timeout_seconds`         | int    | 20\*60                                      | Test execution timeout in seconds                                   |
| `bypass_reason`           | string | None                                        | Explain why this test is bypassed                                   |
|
||||
|
||||
## Test Connector Attributes
|
||||
|
||||
The inputs in the tables below are set under the `acceptance_tests.connector_attributes.tests` key.
|
||||
|
||||
Verifies that certain properties of the connector and its streams guarantee a higher level of usability standards for certified connectors.
|
||||
Examples of the checks covered include verifying that streams define primary keys, that the OAuth spec is configured correctly, and that the connector emits the correct stream status during a read.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :------------------------------------------ | :-------------------------- | :-------------------- | :--------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `streams_without_primary_key` | array of objects | None | List of streams that do not support a primary key like reports streams |
|
||||
| `streams_without_primary_key.name` | string | None | Name of the stream missing the PK |
|
||||
| `streams_without_primary_key.bypass_reason` | string | None | The reason the stream doesn't have the PK |
|
||||
| `allowed_hosts.bypass_reason` | object with `bypass_reason` | None | Defines the `bypass_reason` description about why the `allowedHosts` check for the certified connector should be skipped |
|
||||
| `suggested_streams.bypass_reason` | object with `bypass_reason` | None | Defines the `bypass_reason` description about why the `suggestedStreams` check for the certified connector should be skipped |
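
For example, a `connector_attributes` test entry documenting streams without a primary key could look like this sketch (the stream name and reason are illustrative):

```yaml
acceptance_tests:
  connector_attributes:
    tests:
      - config_path: "secrets/config.json"
        streams_without_primary_key:
          - name: "report_stream"
            bypass_reason: "Reports are aggregated and have no natural primary key"
```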
|
||||
|
||||
## Test Connector Documentation
|
||||
|
||||
The inputs in the tables below are set under the `acceptance_tests.connector_documentation.tests` key.
|
||||
|
||||
Verifies that the connector's documentation follows our standard template, uses the correct order of headings, does not have missing headings, and includes all required fields in the Prerequisites section.
|
||||
|
||||
| Input | Type | Default | Note |
|
||||
| :---------------- | :----- | :-------------------- | :----------------------------------------------------------------- |
|
||||
| `config_path` | string | `secrets/config.json` | Path to a JSON object representing a valid connector configuration |
|
||||
| `timeout_seconds` | int | 20\*60 | Test execution timeout in seconds |
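
A minimal `connector_documentation` test entry looks like this:

```yaml
acceptance_tests:
  connector_documentation:
    tests:
      - config_path: "secrets/config.json"
        timeout_seconds: 1200
```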
|
||||
|
||||
## Strictness level
|
||||
|
||||
To enforce maximal coverage of acceptance tests, we expose a `test_strictness_level` field at the root of the `acceptance-test-config.yml` configuration.
|
||||
The default `test_strictness_level` is `low`, but for generally available connectors it is expected to be eventually set to `high`.
|
||||
|
||||
_Note: For now, the strictness level can only be applied to sources, not to destination connectors._
|
||||
|
||||
### Test enforcements in `high` test strictness level
|
||||
|
||||
#### All acceptance tests are declared, a `bypass_reason` is filled if a test can't run
|
||||
|
||||
In `high` test strictness level we expect all acceptance tests to be declared:
|
||||
|
||||
- `spec`
|
||||
- `connection`
|
||||
- `discovery`
|
||||
- `basic_read`
|
||||
- `full_refresh`
|
||||
- `incremental`
|
||||
|
||||
If a test can't be run for a valid technical or organizational reason a `bypass_reason` can be declared to skip this test.
|
||||
E.g. `source-pokeapi` does not support incremental syncs, so we can skip this test when `test_strictness_level` is `high` by setting a `bypass_reason` under `incremental`.
|
||||
|
||||
```yaml
|
||||
connector_image: "airbyte/source-pokeapi"
|
||||
test_strictness_level: high
|
||||
acceptance_tests:
|
||||
spec:
|
||||
tests:
|
||||
- spec_path: "source_pokeapi/spec.json"
|
||||
connection:
|
||||
tests:
|
||||
- config_path: "integration_tests/config.json"
|
||||
status: "succeed"
|
||||
discovery:
|
||||
tests:
|
||||
- config_path: "integration_tests/config.json"
|
||||
basic_read:
|
||||
tests:
|
||||
- config_path: "integration_tests/config.json"
|
||||
full_refresh:
|
||||
tests:
|
||||
- config_path: "integration_tests/config.json"
|
||||
configured_catalog_path: "integration_tests/configured_catalog.json"
|
||||
incremental:
|
||||
bypass_reason: "Incremental syncs are not supported on this connector."
|
||||
```
|
||||
|
||||
#### Basic read: no empty streams are allowed without a `bypass_reason`
|
||||
|
||||
In `high` test strictness level we expect all streams declared in `empty_streams` to have a `bypass_reason` filled in.
|
||||
|
||||
E.g. two streams from `source-recharge` can't be seeded with test data, so they are declared as `empty_streams` with an explicit bypass reason.
|
||||
|
||||
```yaml
|
||||
connector_image: airbyte/source-recharge:dev
|
||||
test_strictness_level: high
|
||||
acceptance_tests:
|
||||
basic_read:
|
||||
tests:
|
||||
- config_path: secrets/config.json
|
||||
empty_streams:
|
||||
- name: collections
|
||||
bypass_reason: "This stream can't be seeded in our sandbox account"
|
||||
- name: discounts
|
||||
bypass_reason: "This stream can't be seeded in our sandbox account"
|
||||
timeout_seconds: 1200
|
||||
```
|
||||
|
||||
#### Basic read: `expect_records` must be set
|
||||
|
||||
In `high` test strictness level we expect the `expect_records` subtest to be set.
|
||||
If you can't create an `expected_records.jsonl` with all the existing streams, you need to declare the missing streams in the `empty_streams` section.
|
||||
If you can't get an `expected_records.jsonl` file at all, you must fill in a `bypass_reason`.
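
As a sketch, a `basic_read` test entry with `expect_records` set could look like this (the exact sub-fields may vary with your Connector Acceptance Test version):

```yaml
acceptance_tests:
  basic_read:
    tests:
      - config_path: "secrets/config.json"
        expect_records:
          path: "integration_tests/expected_records.jsonl"
```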
|
||||
|
||||
#### Basic read: no `configured_catalog_path` can be set
|
||||
|
||||
In `high` test strictness level we want to run the `basic_read` test on a configured catalog created from the discovered catalog from which we remove declared empty streams. Declaring `configured_catalog_path` in the test configuration is not allowed.
|
||||
|
||||
```yaml
|
||||
connector_image: airbyte/source-recharge:dev
|
||||
test_strictness_level: high
|
||||
acceptance_tests:
|
||||
basic_read:
|
||||
tests:
|
||||
- config_path: secrets/config.json
|
||||
empty_streams:
|
||||
- name: collections
|
||||
bypass_reason: "This stream can't be seeded in our sandbox account"
|
||||
- name: discounts
|
||||
bypass_reason: "This stream can't be seeded in our sandbox account"
|
||||
timeout_seconds: 1200
|
||||
```
|
||||
|
||||
#### Incremental: `future_state` must be set
|
||||
|
||||
In `high` test strictness level we expect the `future_state` configuration to be set.
|
||||
The future state JSON file (usually `abnormal_states.json`) must contain one state for each stream declared in the configured catalog.
|
||||
`missing_streams` can be set to ignore a subset of the streams with a valid bypass reason. E.g.:
|
||||
|
||||
```yaml
|
||||
test_strictness_level: high
|
||||
connector_image: airbyte/source-my-connector:dev
|
||||
acceptance_tests:
|
||||
...
|
||||
incremental:
|
||||
tests:
|
||||
- config_path: secrets/config.json
|
||||
configured_catalog_path: integration_tests/configured_catalog.json
|
||||
...
|
||||
future_state:
|
||||
future_state_path: integration_tests/abnormal_state.json
|
||||
missing_streams:
|
||||
- name: my_missing_stream
|
||||
bypass_reason: "Please fill a good reason"
|
||||
```
|
||||
|
||||
## Caching
|
||||
|
||||
We cache discovered catalogs by default for performance and reuse the same discovered catalog through all tests.
|
||||
You can disable this behavior by setting `cached_discovered_catalog: False` at the root of the configuration.
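
For example (the connector image name is illustrative):

```yaml
connector_image: "airbyte/source-my-connector:dev"
cached_discovered_catalog: False
acceptance_tests:
  ...
```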
|
||||
|
||||
## Breaking Changes and Backwards Compatibility
|
||||
|
||||
Breaking changes are modifications that make previous versions of the connector incompatible, requiring a major version bump. Here are the various types of changes that we consider breaking:
|
||||
|
||||
1. **Changes to Stream Schema**
|
||||
|
||||
- **Removing a Field**: If a field is removed from the stream's schema, it's a breaking change. Clients expecting the field may fail when it's absent.
|
||||
- **Changing Field Type**: If the data type of a field is changed, it could break clients expecting the original type. For instance, changing a field from string to integer would be a breaking change.
|
||||
- **Renaming a Field**: If a field is renamed, it can break existing clients that expect the field by its original name.
|
||||
|
||||
2. **Changes to Stream Behaviour**
|
||||
|
||||
- **Changing the Cursor**: Changing the cursor field for incremental streams can cause data discrepancies or synchronization issues. Therefore, it's considered a breaking change.
|
||||
- **Renaming a Stream**: If a stream is renamed, it could cause failures for clients expecting the stream with its original name. Hence, this is a breaking change.
|
||||
- **Changing Sync Mechanism**: If a stream's sync mechanism changes, such as switching from full refresh sync to incremental sync (or vice versa), it's a breaking change. Existing workflows may fail or behave unexpectedly due to this change.
|
||||
|
||||
3. **Changes to Configuration Options**
|
||||
|
||||
- **Removing or Renaming Options**: If configuration options are removed or renamed, it could break clients using those options, hence, is considered a breaking change.
|
||||
- **Changing Default Values or Behaviours**: Altering default values or behaviours of configuration options can break existing clients that rely on previous defaults.
|
||||
|
||||
4. **Changes to Authentication Mechanism**
|
||||
|
||||
- Any change to the connector's authentication mechanism that isn't backwards compatible is a breaking change. For example, switching from API key authentication to OAuth without supporting both is a breaking change.
|
||||
|
||||
5. **Changes to Error Handling**
|
||||
|
||||
- Altering the way errors are handled can be a breaking change. For example, if a certain type of error was previously ignored and now causes the connector to fail, it could break users' existing workflows.
|
||||
|
||||
6. **Changes That Require User Intervention**
|
||||
- If a change requires user intervention, such as manually updating settings or reconfiguring workflows, it would be considered a breaking change.
|
||||
|
||||
Please note that this list is illustrative rather than exhaustive. Other changes could be considered breaking if they disrupt the functionality of the connector or alter user expectations in a significant way.
|
||||
|
||||
## Additional Checks
|
||||
|
||||
While not necessarily related to Connector Acceptance Testing, Airbyte employs a number of additional checks that run on connector pull requests and verify the following items:
|
||||
|
||||
### Strictness Level
|
||||
|
||||
Generally Available Connectors must enable high-strictness testing for the Connector Acceptance Test suite. This ensures that these connectors have implemented the most robust collection of tests.
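
In practice, this means the connector's `acceptance-test-config.yml` declares the following at its root:

```yaml
test_strictness_level: high
```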
|
||||
|
||||
### Allowed Hosts
|
||||
|
||||
GA and Beta connectors are required to provide an entry for Allowed Hosts in the [metadata.yaml](../connector-metadata-file.md) for the connector. You can provide:
|
||||
|
||||
A list of static hostnames or IP addresses. Wildcards are valid.
|
||||
|
||||
```yaml
|
||||
data:
|
||||
# ...
|
||||
allowedHosts:
|
||||
hosts:
|
||||
- "api.github.com"
|
||||
- "*.hubspot.com"
|
||||
```
|
||||
|
||||
A list of dynamic hostnames or IP addresses which reference values from the connector's configuration. The variable names need to match the connector's config exactly. In this example, `subdomain` is a required option defined by the connector's SPEC response. It is also possible to reference sub-fields with dot notation, e.g. `networking_options.tunnel_host`.
|
||||
|
||||
```yaml
|
||||
data:
|
||||
# ...
|
||||
allowedHosts:
|
||||
hosts:
|
||||
- "${subdomain}.vendor.com"
|
||||
- "${networking_options.tunnel_host}"
|
||||
```
|
||||
|
||||
Or prevent network access for this connector entirely:
|
||||
|
||||
```yaml
|
||||
data:
|
||||
# ...
|
||||
allowedHosts:
|
||||
hosts: []
|
||||
```
|
||||
|
||||
## Custom environment variable
|
||||
|
||||
The connector under test can be run with custom environment variables:
|
||||
|
||||
```yaml
|
||||
connector_image: "airbyte/source-pokeapi"
|
||||
custom_environment_variables:
|
||||
my_custom_environment_variable: value
|
||||
```
|
||||
|
||||
# Building a Java Destination
|
||||
|
||||
:::warning
|
||||
The template for building a Java Destination connector is currently unavailable. The Airbyte team is working on revamping the Java CDK.
|
||||
:::
|
||||
|
||||
## Summary
|
||||
|
||||
This article provides a checklist for how to create a Java destination. Each step in the checklist
|
||||
has a link to a more detailed explanation below.
|
||||
|
||||
## Requirements
|
||||
|
||||
Docker and Java with the versions listed in the
|
||||
[tech stack section](../../understanding-airbyte/tech-stack.md).
|
||||
|
||||
## Checklist
|
||||
|
||||
### Creating a destination
|
||||
|
||||
- Step 1: Create the destination using one of the other connectors as an example
|
||||
- Step 2: Build the newly generated destination
|
||||
- Step 3: Implement `spec` to define the configuration required to run the connector
|
||||
- Step 4: Implement `check` to provide a way to validate configurations provided to the connector
|
||||
- Step 5: Implement `write` to write data to the destination
|
||||
- Step 6: Set up Acceptance Tests
|
||||
- Step 7: Write unit tests or integration tests
|
||||
- Step 8: Update the docs \(in `docs/integrations/destinations/<destination-name>.md`\)
|
||||
|
||||
:::info
|
||||
|
||||
All `./gradlew` commands must be run from the root of the airbyte project.
|
||||
|
||||
:::
|
||||
|
||||
:::info
|
||||
|
||||
If you need help with any step of the process, feel free to submit a PR with your progress and any
|
||||
questions you have, or ask us on [slack](https://slack.airbyte.io).
|
||||
|
||||
:::
|
||||
|
||||
## Explaining Each Step
|
||||
|
||||
### Step 1: Create the destination
|
||||
|
||||
Use `destination-s3` as an example and copy over the relevant build system pieces.
|
||||
|
||||
### Step 2: Build the newly generated destination
|
||||
|
||||
You can build the destination by running:
|
||||
|
||||
```bash
|
||||
# Must be run from the Airbyte project root
|
||||
./gradlew :airbyte-integrations:connectors:destination-<name>:build
|
||||
```
|
||||
|
||||
This compiles the Java code for your destination and builds a Docker image with the connector. At
|
||||
this point, we haven't implemented anything of value yet, but once we do, you'll use this command to
|
||||
compile your code and Docker image.
|
||||
|
||||
:::info
|
||||
|
||||
Airbyte uses Gradle to manage Java dependencies. To add dependencies for your connector, manage them
|
||||
in the `build.gradle` file inside your connector's directory.
|
||||
|
||||
:::
|
||||
|
||||
#### Iterating on your implementation
|
||||
|
||||
We recommend the following ways of iterating on your connector as you're making changes:
|
||||
|
||||
- Test-driven development \(TDD\) in Java
|
||||
- Test-driven development \(TDD\) using Airbyte's Acceptance Tests
|
||||
- Directly running the docker image
|
||||
|
||||
#### Test-driven development in Java
|
||||
|
||||
This should feel like a standard flow for a Java developer: you make some code changes, then run Java tests against them. You can do this directly in your IDE, but you can also run all unit tests via Gradle by running the command to build the connector:
|
||||
|
||||
```text
|
||||
./gradlew :airbyte-integrations:connectors:destination-<name>:build
|
||||
```
|
||||
|
||||
This will build the code and run any unit tests. This approach is great when you are testing local
|
||||
behaviors and writing unit tests.
|
||||
|
||||
#### TDD using acceptance tests & integration tests
|
||||
|
||||
Airbyte provides a standard test suite \(dubbed "Acceptance Tests"\) that runs against every
|
||||
destination connector. They are "free" baseline tests to ensure the basic functionality of the
|
||||
destination. When developing a connector, you can simply run the tests between each change and use
|
||||
the feedback to guide your development.
|
||||
|
||||
If you want to try out this approach, check out Step 6, which describes what you need to do to set up the Acceptance Tests for your destination.
|
||||
|
||||
The nice thing about this approach is that you are running your destination exactly as Airbyte will
|
||||
run it in the CI. The downside is that the tests do not run very quickly. As such, we recommend this
|
||||
iteration approach only once you've implemented most of your connector and are in the finishing
|
||||
stages of implementation. Note that Acceptance Tests are required for every connector supported by
|
||||
Airbyte, so you should make sure to run them a couple of times while iterating to make sure your
|
||||
connector is compatible with Airbyte.
|
||||
|
||||
#### Directly running the destination using Docker
|
||||
|
||||
If you want to run your destination exactly as it will be run by Airbyte \(i.e. within a docker
|
||||
container\), you can use the following commands from the connector module directory
|
||||
\(`airbyte-integrations/connectors/destination-<name>`\):
|
||||
|
||||
```text
|
||||
# First build the container
|
||||
./gradlew :airbyte-integrations:connectors:destination-<name>:build
|
||||
|
||||
# Then use the following commands to run it
|
||||
# Runs the "spec" command, used to find out what configurations are needed to run a connector
|
||||
docker run --rm airbyte/destination-<name>:dev spec
|
||||
|
||||
# Runs the "check" command, used to validate if the input configurations are valid
|
||||
docker run --rm -v $(pwd)/secrets:/secrets airbyte/destination-<name>:dev check --config /secrets/config.json
|
||||
|
||||
# Runs the "write" command which reads records from stdin and writes them to the underlying destination
|
||||
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/sample_files:/sample_files airbyte/destination-<name>:dev write --config /secrets/config.json --catalog /sample_files/configured_catalog.json
|
||||
```
|
||||
|
||||
Note: Each time you make a change to your implementation you need to re-build the connector image
|
||||
via `./gradlew :airbyte-integrations:connectors:destination-<name>:build`.
|
||||
|
||||
The nice thing about this approach is that you are running your destination exactly as it will be
|
||||
run by Airbyte. The tradeoff is that iteration is slightly slower, because you need to re-build the
|
||||
connector between each change.
|
||||
|
||||
#### Handling Exceptions
|
||||
|
||||
In order to best propagate user-friendly error messages and log error information to the platform,
|
||||
the [Airbyte Protocol](../../understanding-airbyte/airbyte-protocol.md#the-airbyte-protocol)
|
||||
implements AirbyteTraceMessage.
|
||||
|
||||
We recommend using AirbyteTraceMessages for known errors, as in these cases you can likely offer the
|
||||
user a helpful message as to what went wrong and suggest how they can resolve it.
|
||||
|
||||
Airbyte provides a static utility class, `io.airbyte.integrations.base.AirbyteTraceMessageUtility`,
|
||||
to give you a clear and straight-forward way to emit these AirbyteTraceMessages. Example usage:
|
||||
|
||||
```java
|
||||
try {
  // some connector code responsible for doing X
} catch (ExceptionIndicatingIncorrectCredentials credErr) {
  AirbyteTraceMessageUtility.emitConfigErrorTrace(
    credErr, "Connector failed due to incorrect credentials while doing X. Please check your connection is using valid credentials.");
  throw credErr;
} catch (ExceptionIndicatingKnownErrorY knownErr) {
  AirbyteTraceMessageUtility.emitSystemErrorTrace(
    knownErr, "Connector failed because of reason Y while doing X. Please check/do/make ... to resolve this.");
  throw knownErr;
} catch (Exception e) {
  AirbyteTraceMessageUtility.emitSystemErrorTrace(
    e, "Connector failed while doing X. Possible reasons for this could be ...");
  throw e;
}
|
||||
```
|
||||
|
||||
Note the two different error trace methods.
|
||||
|
||||
- Where possible `emitConfigErrorTrace` should be used when we are certain the issue arises from a
|
||||
problem with the user's input configuration, e.g. invalid credentials.
|
||||
- For everything else or if unsure, use `emitSystemErrorTrace`.
|
||||
|
||||
### Step 3: Implement `spec`
|
||||
|
||||
Each destination contains a specification written in JsonSchema that describes its inputs. Defining
|
||||
the specification is a good place to start when developing your destination. Check out the
|
||||
documentation [here](https://json-schema.org/) to learn the syntax. Here's
|
||||
[an example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-postgres/src/main/resources/spec.json)
|
||||
of what the `spec.json` looks like for the postgres destination.
|
||||
|
||||
Your generated template should have the spec file in
|
||||
`airbyte-integrations/connectors/destination-<name>/src/main/resources/spec.json`. The generated
|
||||
connector will take care of reading this file and converting it to the correct output. Edit it and
|
||||
you should be done with this step.
|
||||
|
||||
For more details on what the spec is, you can read about the Airbyte Protocol
|
||||
[here](../../understanding-airbyte/airbyte-protocol.md).
|
||||
|
||||
See the `spec` operation in action:
|
||||
|
||||
```bash
|
||||
# First build the connector
|
||||
./gradlew :airbyte-integrations:connectors:destination-<name>:build
|
||||
|
||||
# Run the spec operation
|
||||
docker run --rm airbyte/destination-<name>:dev spec
|
||||
```
|
||||
|
||||
### Step 4: Implement `check`
|
||||
|
||||
The check operation accepts a JSON object conforming to the `spec.json`. In other words if the
|
||||
`spec.json` said that the destination requires a `username` and `password` the config object might
|
||||
be `{ "username": "airbyte", "password": "password123" }`. It returns a json object that reports,
|
||||
given the credentials in the config, whether we were able to connect to the destination.
|
||||
|
||||
While developing, we recommend storing any credentials in `secrets/config.json`. Any `secrets`
|
||||
directory in the Airbyte repo is gitignored by default.
|
||||
|
||||
Implement the `check` method in the generated file `<Name>Destination.java`. Here's an
|
||||
[example implementation](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java#L94)
|
||||
from the BigQuery destination.
|
||||
|
||||
Verify that the method is working by placing your config in `secrets/config.json` then running:
|
||||
|
||||
```text
|
||||
# First build the connector
|
||||
./gradlew :airbyte-integrations:connectors:destination-<name>:build
|
||||
|
||||
# Run the check method
|
||||
docker run -v $(pwd)/secrets:/secrets --rm airbyte/destination-<name>:dev check --config /secrets/config.json
|
||||
```
|
||||
|
||||
### Step 5: Implement `write`
|
||||
|
||||
The `write` operation is the main workhorse of a destination connector: it reads input data from the
|
||||
source and writes it to the underlying destination. It takes as input the config file used to run
|
||||
the connector as well as the configured catalog: the file used to describe the schema of the
|
||||
incoming data and how it should be written to the destination. Its "output" is two things:
|
||||
|
||||
1. Data written to the underlying destination
|
||||
2. `AirbyteMessage`s of type `AirbyteStateMessage`, written to stdout to indicate which records have
|
||||
been written so far during a sync. It's important to output these messages when possible in order
|
||||
to avoid re-extracting messages from the source. See the
|
||||
[write operation protocol reference](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol#write)
|
||||
for more information.
|
||||
|
||||
To implement the `write` Airbyte operation, implement the `getConsumer` method in your generated
|
||||
`<Name>Destination.java` file. Here are some example implementations from different destination
|
||||
connectors:
|
||||
|
||||
- [BigQuery](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java#L188)
|
||||
- [Google Pubsub](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-pubsub/src/main/java/io/airbyte/integrations/destination/pubsub/PubsubDestination.java#L98)
|
||||
- [Local CSV](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/java/io/airbyte/integrations/destination/csv/CsvDestination.java#L90)
|
||||
- [Postgres](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-postgres/src/main/java/io/airbyte/integrations/destination/postgres/PostgresDestination.java)
|
||||
|
||||
:::info
|
||||
|
||||
The Postgres destination leverages the `AbstractJdbcDestination` superclass which makes it extremely
|
||||
easy to create a destination for a database or data warehouse if it has a compatible JDBC driver. If
|
||||
the destination you are implementing has a JDBC driver, be sure to check out
|
||||
`AbstractJdbcDestination`.
|
||||
|
||||
:::
|
||||
|
||||
For a brief overview on the Airbyte catalog check out
|
||||
[the Beginner's Guide to the Airbyte Catalog](../../understanding-airbyte/beginners-guide-to-catalog.md).
|
||||
|
||||
### Step 6: Set up Acceptance Tests
|
||||
|
||||
The Acceptance Tests are a set of tests that run against all destinations. These tests are run in
|
||||
the Airbyte CI to prevent regressions and verify a baseline of functionality. The test cases are
|
||||
contained and documented in the
|
||||
[following file](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/bases/standard-destination-test/src/main/java/io/airbyte/integrations/standardtest/destination/DestinationAcceptanceTest.java).
|
||||
|
||||
To set up Acceptance Tests for your connector, follow the `TODO`s in the generated file
`<name>DestinationAcceptanceTest.java`. Once set up, you can run the tests using
|
||||
`./gradlew :airbyte-integrations:connectors:destination-<name>:integrationTest`. Make sure to run
|
||||
this command from the Airbyte repository root.
|
||||
|
||||
### Step 7: Write unit tests and/or integration tests
|
||||
|
||||
The Acceptance Tests are meant to cover the basic functionality of a destination. Think of it as the
|
||||
bare minimum required for us to add a destination to Airbyte. You should probably add some unit
|
||||
testing or custom integration testing in case you need to test additional functionality of your
|
||||
destination.
|
||||
|
||||
### Step 8: Update the docs
|
||||
|
||||
Each connector has its own documentation page. By convention, that page should have the following
|
||||
path: `docs/integrations/destinations/<destination-name>.md`. For the documentation to get
|
||||
packaged with the docs, make sure to add a link to it in `docs/SUMMARY.md`. You can pattern match
|
||||
doing that from existing connectors.
|
||||
|
||||
## Wrapping up
|
||||
|
||||
Well done on making it this far! If you'd like your connector to ship with Airbyte by default,
|
||||
create a PR against the Airbyte repo and we'll work with you to get it across the finish line.
|
||||
|
||||
# Getting started
|
||||
|
||||
This tutorial will walk you through the creation of a custom Airbyte connector implemented with the
|
||||
Python CDK. It assumes you're already familiar with Airbyte concepts and you've already built a
|
||||
connector using the [Connector Builder](../../connector-builder-ui/tutorial.mdx).
|
||||
|
||||
:::tip
|
||||
We highly recommend using the Connector Builder for most use cases;
|
||||
while the Python CDK is more flexible, it requires a deeper understanding of underlying connector logic,
|
||||
as well as some experience programming in Python.
|
||||
:::
|
||||
|
||||
We'll build a connector for the Survey Monkey API, focusing on the `surveys` and `survey responses`
|
||||
endpoints.
|
||||
|
||||
You can find the documentation for the API
|
||||
[here](https://api.surveymonkey.com/v3/docs?shell#getting-started).
|
||||
|
||||
As a first step, follow the getting started instructions from the docs to register a draft app to
|
||||
your account.
|
||||
|
||||
Next, we'll inspect the API docs to understand how the endpoints work.
|
||||
|
||||
## Surveys endpoint
|
||||
|
||||
The [surveys endpoint doc](https://api.surveymonkey.com/v3/docs?shell#api-endpoints-get-surveys)
|
||||
shows that the endpoint URL is https://api.surveymonkey.com/v3/surveys and that the data is nested
|
||||
in the response's "data" field.
|
||||
|
||||
It also shows there are two ways to iterate through the record pages. We could either keep a page
|
||||
counter and increment it on every request, or use the link sent as part of the response in "links"
|
||||
-> "next".
|
||||
|
||||
The two approaches are equivalent for the Survey Monkey API, but as a rule of thumb, it is preferable to use the links provided by the API, if available, instead of reverse engineering the mechanism. This way, we don't need to modify the connector if the API changes its pagination mechanism, for instance, if it decides to implement server-side pagination.
|
||||
|
||||
:::info When available, server-side pagination should be preferred over client-side pagination
|
||||
because it has lower risks of missing records if the collection is modified while the connector
|
||||
iterates.
|
||||
|
||||
:::
|
||||
|
||||
The "Optional Query Strings for GET" section shows that the `perPage` parameter is important because
|
||||
it’ll define how many records we can fetch with a single request. The maximum page size isn't
|
||||
explicit from the docs. We'll use 1000 as a limit. When unsure, we recommend finding the limit
|
||||
experimentally by trying multiple values.
|
||||
|
||||
Also note that we'll need to add the `include` query parameter to fetch all the properties, such as
|
||||
`date_modified`, which we'll use as our cursor value.
|
||||
|
||||
The section also shows how to filter the data based on the record's timestamp, which will allow the
|
||||
connector to read records incrementally. We'll use the `start_modified_at` and `end_modified_at` to
|
||||
scope our requests.
|
||||
|
||||
We won't worry about the other query params as we won't filter by title or folder.
|
||||
|
||||
:::info
|
||||
|
||||
As a rule of thumb, it's preferable to fetch all the available data rather than ask the user to
|
||||
specify which folder IDs they care about.
|
||||
|
||||
:::
|
||||
|
||||
## Survey responses
|
||||
|
||||
Next, we'll take a look at the
|
||||
[documentation for the survey responses endpoint](https://api.surveymonkey.com/v3/docs?shell#api-endpoints-get-surveys-id-responses).
|
||||
It shows that this endpoint depends on the `surveys` endpoint, since we'll need to first fetch the
|
||||
surveys to fetch the responses.
|
||||
|
||||
It shows that the records are also nested in a "data" field. It's unclear from the examples if the
|
||||
responses include a link to the next page. I already confirmed that's the case for you, but I'd
|
||||
recommend validating this kind of assumption for any connector you plan on running in production.
|
||||
|
||||
We’re not going to worry about the custom variables because we want to pull all the data.
|
||||
|
||||
It’s worth noting that this stream won’t support incremental mode because there’s no timestamp to
|
||||
filter on.
|
||||
|
||||
## Authentication
|
||||
|
||||
The [authentication section](https://api.surveymonkey.com/v3/docs?shell#authentication) describes how to authenticate with the API. Follow the instructions to obtain an access key. We'll then be able to authenticate by passing an HTTP header in the format `Authorization: bearer YOUR_ACCESS_TOKEN`.
|
||||
|
||||
## Rate limits
|
||||
|
||||
The
|
||||
[request and responses section](https://api.surveymonkey.com/v3/docs?shell#request-and-response-limits)
|
||||
shows there’s a limit of 120 requests per minute and 500 requests per day.
|
||||
|
||||
We’ll handle the 120 requests per minute by throttling, but we’ll let the sync fail if it hits the
|
||||
daily limit because we don’t want to let the sync spin for up to 24 hours without any reason.
|
||||
|
||||
We won’t worry about increasing the rate limits.
|
||||
|
||||
## Error codes
|
||||
|
||||
The [Error Codes](https://api.surveymonkey.com/v3/docs?shell#error-codes) section shows the error
|
||||
codes 1010-1018 represent authentication failures. These failures should be handled by the end-user,
|
||||
and aren't indicative of a system failure. We'll therefore handle them explicitly so users know how
|
||||
to resolve them should they occur.
|
||||
|
||||
## Putting it all together
|
||||
|
||||
We now know enough about how the API works:
|
||||
|
||||
| Stream | URL | authentication | path to data | pagination | cursor value | time based filters | query params | rate limits | user errors |
|
||||
| ---------------- | ------------------------------------------------------------- | --------------------------------------------- | ------------ | ------------------------- | ------------- | -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | ---------------------- | -------------------- |
|
||||
| surveys          | https://api.surveymonkey.com/v3/surveys                        | Authorization: bearer YOUR_ACCESS_TOKEN       | data         | response -> links -> next | date_modified | start_modified_at and end_modified_at query params | include: response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats  | 120 requests per minute | error codes 1010-1018 |
| survey responses | https://api.surveymonkey.com/v3/surveys/{survey_id}/responses  | Authorization: bearer YOUR_ACCESS_TOKEN       | data         | response -> links -> next | None          | None                                               | None                                                                                                           | 120 requests per minute | error codes 1010-1018 |
|
||||
|
||||
In the [next section](./1-environment-setup.md), we'll set up our development environment.
|
||||
|
||||
# Environment setup
|
||||
|
||||
Let's start by cloning the repository, optionally forking it first:
|
||||
|
||||
```bash
|
||||
git clone git@github.com:airbytehq/airbyte.git
|
||||
cd airbyte
|
||||
```
|
||||
|
||||
Next, you will want to create a new connector.
|
||||
|
||||
## Initialize connector project
|
||||
|
||||
```bash
|
||||
# Make a directory for the new connector and navigate to it
mkdir airbyte-integrations/connectors/source-survey-monkey-demo
cd airbyte-integrations/connectors/source-survey-monkey-demo

# Initialize a project, follow Poetry prompts, and then add airbyte-cdk as a dependency.
poetry init
poetry add airbyte-cdk
|
||||
```
|
||||
|
||||
For this walkthrough, we'll refer to our source as `source-survey-monkey-demo`.
|
||||
|
||||
## Add Connector Metadata file
|
||||
|
||||
Each Airbyte connector needs to have a valid `metadata.yaml` file in the root of the connector directory. [Here is metadata.yaml format documentation](../../../connector-development/connector-metadata-file.md).
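
As a rough sketch, a minimal `metadata.yaml` contains a `data` block describing the connector; refer to the format documentation linked above for the authoritative list of fields (all values below are placeholders):

```yaml
data:
  name: Survey Monkey Demo
  connectorType: source
  dockerRepository: airbyte/source-survey-monkey-demo
  dockerImageTag: 0.1.0
  documentationUrl: https://docs.airbyte.com/integrations/sources/survey-monkey-demo
```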
|
||||
|
||||
## Implement connector entrypoint scripts
|
||||
|
||||
Airbyte connectors are expected to be able to run the `spec`, `check`, `discover`, and `read` commands. You can use the `run.py` file of an existing Airbyte connector as an example of how to implement them.
|
||||
|
||||
## Running operations
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo check --config secrets/config.json
|
||||
```
|
||||
|
||||
It should return a failed connection status
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "CONNECTION_STATUS",
|
||||
"connectionStatus": {
|
||||
"status": "FAILED",
|
||||
"message": "Config validation error: 'TODO' is a required property"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The discover operation should also fail as expected
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo discover --config secrets/config.json
|
||||
```
|
||||
|
||||
It should fail because `'TODO' is a required property`
|
||||
|
||||
The read operation should also fail as expected
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json
|
||||
```
|
||||
|
||||
It should fail because `'TODO' is a required property`
|
||||
|
||||
We're ready to start development. In the [next section](./2-reading-a-page.md), we'll read a page of
|
||||
records from the surveys endpoint.
|
||||
|
||||
# Reading a page
|
||||
|
||||
In this section, we'll read a single page of records from the surveys endpoint.
|
||||
|
||||
## Write a failing test that reads a single page
|
||||
|
||||
We'll start by writing a failing integration test.
|
||||
|
||||
Create a file `unit_tests/integration/test_surveys.py`
|
||||
|
||||
```bash
|
||||
mkdir unit_tests/integration
|
||||
touch unit_tests/integration/test_surveys.py
|
||||
code .
|
||||
```
|
||||
|
||||
Copy this template to
|
||||
`airbyte-integrations/connectors/source-survey-monkey-demo/unit_tests/integration/test_surveys.py`
|
||||
|
||||
```python
|
||||
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
|
||||
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from typing import Any, Dict, Mapping, Optional
|
||||
from unittest import TestCase
|
||||
|
||||
import freezegun
|
||||
from airbyte_cdk.sources.source import TState
|
||||
from airbyte_cdk.test.catalog_builder import CatalogBuilder
|
||||
from airbyte_cdk.test.entrypoint_wrapper import EntrypointOutput, read
|
||||
from airbyte_cdk.test.mock_http import HttpMocker, HttpRequest, HttpResponse
|
||||
from airbyte_protocol.models import ConfiguredAirbyteCatalog, SyncMode
|
||||
from source_survey_monkey_demo import SourceSurveyMonkeyDemo
|
||||
|
||||
_A_CONFIG = {
|
||||
<TODO>
|
||||
}
|
||||
_NOW = <TODO>
|
||||
|
||||
@freezegun.freeze_time(_NOW.isoformat())
|
||||
class FullRefreshTest(TestCase):
|
||||
|
||||
@HttpMocker()
|
||||
def test_read_a_single_page(self, http_mocker: HttpMocker) -> None:
|
||||
|
||||
http_mocker.get(
|
||||
HttpRequest(url=),
|
||||
HttpResponse(body=, status_code=)
|
||||
)
|
||||
|
||||
output = self._read(_A_CONFIG, _configured_catalog(<TODO>, SyncMode.full_refresh))
|
||||
|
||||
assert len(output.records) == 2
|
||||
|
||||
def _read(self, config: Mapping[str, Any], configured_catalog: ConfiguredAirbyteCatalog, expecting_exception: bool = False) -> EntrypointOutput:
|
||||
return _read(config, configured_catalog=configured_catalog, expecting_exception=expecting_exception)
|
||||
|
||||
def _read(
|
||||
config: Mapping[str, Any],
|
||||
configured_catalog: ConfiguredAirbyteCatalog,
|
||||
state: Optional[Dict[str, Any]] = None,
|
||||
expecting_exception: bool = False
|
||||
) -> EntrypointOutput:
|
||||
return read(_source(configured_catalog, config, state), config, configured_catalog, state, expecting_exception)
|
||||
|
||||
|
||||
def _configured_catalog(stream_name: str, sync_mode: SyncMode) -> ConfiguredAirbyteCatalog:
|
||||
return CatalogBuilder().with_stream(stream_name, sync_mode).build()
|
||||
|
||||
|
||||
def _source(catalog: ConfiguredAirbyteCatalog, config: Dict[str, Any], state: Optional[TState]) -> SourceSurveyMonkeyDemo:
|
||||
return SourceSurveyMonkeyDemo()
|
||||
```
|
||||
|
||||
Most of this code is boilerplate. The most interesting section is the test
|
||||
|
||||
```python
|
||||
@HttpMocker()
|
||||
def test_read_a_single_page(self, http_mocker: HttpMocker) -> None:
|
||||
|
||||
http_mocker.get(
|
||||
HttpRequest(url=),
|
||||
HttpResponse(body=, status_code=)
|
||||
)
|
||||
|
||||
output = self._read(_A_CONFIG, _configured_catalog(<TODO>, SyncMode.full_refresh))
|
||||
|
||||
assert len(output.records) == 2
|
||||
```
|
||||
|
||||
`http_mocker.get` is used to register mocked requests and responses. You can specify the URL, query params, and request headers the connector is expected to send, and mock the response the server should return. This lets you write fast integration tests that verify the connector's behavior without reaching the API, which keeps the tests fast and reproducible.
|
||||
|
||||
Now, we'll implement a first test verifying that the connector sends a request to the right endpoint with the right parameters, and that records are extracted from the `data` field of the response.
|
||||
|
||||
```python
|
||||
_A_CONFIG = {
|
||||
"access_token": "access_token"
|
||||
}
|
||||
_NOW = datetime.now(timezone.utc)
|
||||
|
||||
@freezegun.freeze_time(_NOW.isoformat())
|
||||
class FullRefreshTest(TestCase):
|
||||
|
||||
@HttpMocker()
|
||||
def test_read_a_single_page(self, http_mocker: HttpMocker) -> None:
|
||||
|
||||
http_mocker.get(
|
||||
HttpRequest(url="https://api.surveymonkey.com/v3/surveys?include=response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats"),
|
||||
HttpResponse(body="""
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"id": "1234",
|
||||
"title": "My Survey",
|
||||
"nickname": "",
|
||||
"href": "https://api.surveymonkey.com/v3/surveys/1234"
|
||||
},
|
||||
{
|
||||
"id": "1234",
|
||||
"title": "My Survey",
|
||||
"nickname": "",
|
||||
"href": "https://api.surveymonkey.com/v3/surveys/1234"
|
||||
}
|
||||
],
|
||||
"per_page": 50,
|
||||
"page": 1,
|
||||
"total": 2,
|
||||
"links": {
|
||||
"self": "https://api.surveymonkey.com/v3/surveys?page=1&per_page=50"
|
||||
}
|
||||
}
|
||||
""", status_code=200)
|
||||
)
|
||||
|
||||
output = self._read(_A_CONFIG, _configured_catalog("surveys", SyncMode.full_refresh))
|
||||
|
||||
assert len(output.records) == 2
|
||||
```
|
||||
|
||||
Note that the test also required adding the "access_token" field to the config. We'll use this field
|
||||
to store the API key obtained in the first section of the tutorial.
|
||||
|
||||
The test should fail because the expected request was not sent
|
||||
|
||||
```bash
|
||||
poetry run pytest unit_tests/integration
|
||||
```
|
||||
|
||||
> ValueError: Invalid number of matches for
|
||||
> `HttpRequestMatcher(request_to_match=ParseResult(scheme='https', netloc='api.surveymonkey.com', path='/v3/surveys', params='', query='include=response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats', fragment='') with headers {} and body None), minimum_number_of_expected_match=1, actual_number_of_matches=0)`
|
||||
|
||||
We'll now remove the unit test files. Writing unit tests is left as an exercise for the reader, but it is highly recommended for any productionized connector.
|
||||
|
||||
```
|
||||
rm unit_tests/test_incremental_streams.py unit_tests/test_source.py unit_tests/test_streams.py
|
||||
```
|
||||
|
||||
Replace the content of
|
||||
`airbyte-integrations/connectors/source-survey-monkey-demo/source_survey_monkey_demo/source.py` with
|
||||
the following template:
|
||||
|
||||
```python
|
||||
#
|
||||
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
|
||||
#
|
||||
|
||||
|
||||
from abc import ABC
|
||||
from typing import Any, Iterable, List, Mapping, MutableMapping, Optional, Tuple, Union
|
||||
|
||||
import requests
|
||||
from airbyte_cdk.sources import AbstractSource
|
||||
from airbyte_cdk.sources.streams import Stream
|
||||
from airbyte_cdk.sources.streams.http import HttpStream
|
||||
from airbyte_cdk.sources.streams.http.requests_native_auth import Oauth2Authenticator, TokenAuthenticator
|
||||
|
||||
|
||||
class SurveyMonkeyBaseStream(HttpStream, ABC):
|
||||
def __init__(self, name: str, path: str, primary_key: Union[str, List[str]], data_field: str, **kwargs: Any) -> None:
|
||||
self._name = name
|
||||
self._path = path
|
||||
self._primary_key = primary_key
|
||||
self._data_field = data_field
|
||||
super().__init__(**kwargs)
|
||||
|
||||
url_base = <TODO>
|
||||
|
||||
def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
|
||||
return None
|
||||
|
||||
def request_params(
|
||||
self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, any] = None, next_page_token: Mapping[str, Any] = None
|
||||
) -> MutableMapping[str, Any]:
|
||||
return {"include": "response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats"}
|
||||
|
||||
def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
|
||||
response_json = response.json()
|
||||
if self._data_field:
|
||||
yield from response_json.get(self._data_field, [])
|
||||
else:
|
||||
yield from response_json
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
return self._name
|
||||
|
||||
def path(
|
||||
self,
|
||||
*,
|
||||
stream_state: Optional[Mapping[str, Any]] = None,
|
||||
stream_slice: Optional[Mapping[str, Any]] = None,
|
||||
next_page_token: Optional[Mapping[str, Any]] = None,
|
||||
) -> str:
|
||||
return self._path
|
||||
|
||||
@property
|
||||
def primary_key(self) -> Optional[Union[str, List[str], List[List[str]]]]:
|
||||
return self._primary_key
|
||||
|
||||
|
||||
# Source
|
||||
class SourceSurveyMonkeyDemo(AbstractSource):
|
||||
def check_connection(self, logger, config) -> Tuple[bool, any]:
|
||||
return True, None
|
||||
|
||||
def streams(self, config: Mapping[str, Any]) -> List[Stream]:
|
||||
auth = <TODO>
|
||||
return [SurveyMonkeyBaseStream(name=<TODO>, path=<TODO>, primary_key=<TODO>, data_field=<TODO>, authenticator=auth)]
|
||||
```
|
||||
|
||||
:::info This template restructures the code so it's easier to extend. Specifically, it provides a base class that can be extended through composition instead of inheritance, which is generally less error-prone.
|
||||
|
||||
:::
|
||||
|
||||
Then set the URL base
|
||||
|
||||
```python
|
||||
url_base = "https://api.surveymonkey.com"
|
||||
```
|
||||
|
||||
Set the query parameters:
|
||||
|
||||
```python
|
||||
def request_params(
|
||||
self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, any] = None, next_page_token: Mapping[str, Any] = None
|
||||
) -> MutableMapping[str, Any]:
|
||||
return {"include": "response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats"}
|
||||
```
|
||||
|
||||
and configure the authenticator, the name, the path, and the primary key
|
||||
|
||||
```python
|
||||
def streams(self, config: Mapping[str, Any]) -> List[Stream]:
|
||||
auth = TokenAuthenticator(token=config["access_token"])
|
||||
return [SurveyMonkeyBaseStream(name="surveys", path="v3/surveys", primary_key="id", data_field="data", authenticator=auth)]
|
||||
```
|
||||
|
||||
We'll now update the
|
||||
[connector specification](../../../understanding-airbyte/airbyte-protocol.md#actor-specification).
|
||||
We'll add the access_token as a required property, making sure to flag it as an `airbyte_secret` to
|
||||
ensure the value isn't accidentally leaked, and we'll specify its `order` should be 0 so it shows up
|
||||
first in the Source setup page.
|
||||
|
||||
```yaml
|
||||
documentationUrl: https://docsurl.com
|
||||
connectionSpecification:
|
||||
$schema: http://json-schema.org/draft-07/schema#
|
||||
title: Survey Monkey Demo Spec
|
||||
type: object
|
||||
required:
|
||||
- access_token
|
||||
properties:
|
||||
access_token:
|
||||
type: string
|
||||
description: "Access token for Survey Monkey API"
|
||||
order: 0
|
||||
airbyte_secret: true
|
||||
```
|
||||
|
||||
Let's now rename one of the mocked schema files to `surveys.json` so it's used by our new stream, and remove the second one as it isn't needed.
|
||||
|
||||
```
|
||||
mv source_survey_monkey_demo/schemas/customers.json source_survey_monkey_demo/schemas/surveys.json
|
||||
rm source_survey_monkey_demo/schemas/employees.json
|
||||
```
|
||||
|
||||
The test should now pass
|
||||
|
||||
```
|
||||
poetry run pytest unit_tests/
|
||||
```
|
||||
|
||||
Now fill in the `secrets/config.json` file with your API access token
|
||||
|
||||
```json
|
||||
{
|
||||
"access_token": "<TODO>"
|
||||
}
|
||||
```
|
||||
|
||||
and update the configured catalog so it knows about the newly created stream:
|
||||
|
||||
```json
|
||||
{
|
||||
"streams": [
|
||||
{
|
||||
"stream": {
|
||||
"name": "surveys",
|
||||
"json_schema": {},
|
||||
"supported_sync_modes": ["full_refresh"]
|
||||
},
|
||||
"sync_mode": "full_refresh",
|
||||
"destination_sync_mode": "overwrite"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
We can now run a read command to pull data from the endpoint:
|
||||
|
||||
```
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json
|
||||
```
|
||||
|
||||
The connector should've successfully read records.
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "LOG",
|
||||
"log": { "level": "INFO", "message": "Read 14 records from surveys stream" }
|
||||
}
|
||||
```
|
||||
|
||||
You can also pass in the `--debug` flag to see the real requests and responses sent and received. It's also recommended to use these real requests as templates for the integration tests, as they can be more accurate than the examples from the API documentation.
|
||||
|
||||
In the [next section](./3-reading-multiple-pages.md), we'll implement pagination to read all surveys from the endpoint.
|
||||
|
||||
# Read multiple pages
|
||||
|
||||
In this section, we'll implement pagination to read all the records available in the surveys
|
||||
endpoint.
|
||||
|
||||
Again, we'll start by writing a failing test for fetching multiple pages of records
|
||||
|
||||
```python
|
||||
@HttpMocker()
|
||||
def test_read_multiple_pages(self, http_mocker: HttpMocker) -> None:
|
||||
|
||||
http_mocker.get(
|
||||
HttpRequest(url="https://api.surveymonkey.com/v3/surveys?include=response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats&per_page=1000"),
|
||||
HttpResponse(body="""
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"id": "1234",
|
||||
"title": "My Survey",
|
||||
"nickname": "",
|
||||
"href": "https://api.surveymonkey.com/v3/surveys/1234"
|
||||
}
|
||||
],
|
||||
"per_page": 50,
|
||||
"page": 1,
|
||||
"total": 2,
|
||||
"links": {
|
||||
"self": "https://api.surveymonkey.com/v3/surveys?page=1&per_page=50",
|
||||
"next": "https://api.surveymonkey.com/v3/surveys?include=response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats&per_page=1000&page=2"
|
||||
}
|
||||
}
|
||||
""", status_code=200)
|
||||
)
|
||||
http_mocker.get(
|
||||
HttpRequest(url="https://api.surveymonkey.com/v3/surveys?include=response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats&per_page=1000&page=2"),
|
||||
HttpResponse(body="""
|
||||
{
|
||||
"data": [
|
||||
{
|
||||
"id": "5678",
|
||||
"title": "My Survey",
|
||||
"nickname": "",
|
||||
"href": "https://api.surveymonkey.com/v3/surveys/1234"
|
||||
}
|
||||
],
|
||||
"per_page": 50,
|
||||
"page": 1,
|
||||
"total": 2,
|
||||
"links": {
|
||||
"self": "https://api.surveymonkey.com/v3/surveys?page=1&per_page=50"
|
||||
}
|
||||
}
|
||||
""", status_code=200)
|
||||
)
|
||||
|
||||
output = self._read(_A_CONFIG, _configured_catalog("surveys", SyncMode.full_refresh))
|
||||
|
||||
assert len(output.records) == 2
|
||||
```
|
||||
|
||||
These tests now have a lot of duplication because we keep pasting the same response templates. You can look at the [source-stripe connector for an example of how this can be DRY'd](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/unit_tests/integration/test_cards.py).
|
||||
|
||||
The test should fail because the request wasn't matched:
|
||||
|
||||
```bash
|
||||
poetry run pytest unit_tests
|
||||
```
|
||||
|
||||
> ValueError: Invalid number of matches for
|
||||
> `HttpRequestMatcher(request_to_match=ParseResult(scheme='https', netloc='api.surveymonkey.com',
|
||||
> path='/v3/surveys', params='', query='page=2&per_page=100', fragment='')
|
||||
|
||||
First, we'll update the request parameters so that the default parameters are only set when this is not a paginated request. When submitting a paginated request, we'll use the query parameters from the next page URL provided in the response.
|
||||
|
||||
```python
|
||||
# add next library to import section
|
||||
from urllib.parse import urlparse
|
||||
```
|
||||
|
||||
```python
|
||||
# Create a pagination constant
|
||||
_PAGE_SIZE: int = 1000
|
||||
```
|
||||
|
||||
```python
|
||||
def request_params(
|
||||
self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, any] = None, next_page_token: Mapping[str, Any] = None
|
||||
) -> MutableMapping[str, Any]:
|
||||
if next_page_token:
|
||||
return urlparse(next_page_token["next_url"]).query
|
||||
else:
|
||||
return {"include": "response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats",
|
||||
"per_page": _PAGE_SIZE}
|
||||
```
|
||||
|
||||
Then we'll extract the next_page_token from the response
|
||||
|
||||
```python
|
||||
def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
|
||||
links = response.json().get("links", {})
|
||||
if "next" in links:
|
||||
return {"next_url": links["next"]}
|
||||
else:
|
||||
return {}
|
||||
```
|
||||
|
||||
The test should now pass. We won't write more integration tests in this tutorial, but they are
strongly recommended for any connector used in production. The change to the request params will
cause `test_read_a_single_page` to fail; fixing that unit test is left as an exercise for the reader.
|
||||
|
||||
```bash
|
||||
poetry run pytest unit_tests
|
||||
```
|
||||
|
||||
We'll try reading:
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json
|
||||
```
|
||||
|
||||
There might not be enough records in your account to trigger the pagination.
|
||||
|
||||
It might be easier to test pagination by forcing the connector to only fetch one record per page:
|
||||
|
||||
```python
|
||||
_PAGE_SIZE: int = 1
|
||||
```
|
||||
|
||||
and reading again
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json
|
||||
```
|
||||
|
||||
All records should be read now.
|
||||
|
||||
Change the `_PAGE_SIZE` back to 1000:
|
||||
|
||||
```python
|
||||
_PAGE_SIZE: int = 1000
|
||||
```
|
||||
|
||||
In the [next section](./4-check-and-error-handling.md), we'll implement the check operation, and
|
||||
improve the error handling.
|
||||
@@ -0,0 +1,92 @@
|
||||
# Check and error handling
|
||||
|
||||
In this section, we'll implement the check operation, and implement error handling to surface the
|
||||
user-friendly messages when failing due to authentication errors.
|
||||
|
||||
Let's first implement the check operation.
|
||||
|
||||
This operation verifies that the input configuration supplied by the user can be used to connect to
|
||||
the underlying data source.
|
||||
|
||||
Use the following command to run the check operation:
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo check --config secrets/config.json
|
||||
```
|
||||
|
||||
The command succeeds, but it'll succeed even if the config is invalid. We should modify the check so
it fails if the connector is unable to pull any record from a stream.
|
||||
|
||||
We'll do this by trying to read a single record from the stream, and failing if the connector could
not read any.
|
||||
|
||||
```python
|
||||
# import the following libraries
|
||||
from airbyte_cdk.models import AirbyteMessage, SyncMode
|
||||
```
|
||||
|
||||
```python
|
||||
def check_connection(self, logger, config) -> Tuple[bool, any]:
|
||||
first_stream = next(iter(self.streams(config)))
|
||||
|
||||
stream_slice = next(iter(first_stream.stream_slices(sync_mode=SyncMode.full_refresh)))
|
||||
|
||||
try:
|
||||
read_stream = first_stream.read_records(sync_mode=SyncMode.full_refresh, stream_slice=stream_slice)
|
||||
first_record = None
|
||||
while not first_record:
|
||||
first_record = next(read_stream)
|
||||
if isinstance(first_record, AirbyteMessage):
|
||||
if first_record.type == "RECORD":
|
||||
first_record = first_record.record
|
||||
return True, None
|
||||
else:
|
||||
first_record = None
|
||||
return True, None
|
||||
except Exception as e:
|
||||
return False, f"Unable to connect to the API with the provided credentials - {str(e)}"
|
||||
```
|
||||
|
||||
Next, we'll improve the error handling.
|
||||
|
||||
First, we'll disable the availability strategy. Availability strategies are a legacy concept used to
|
||||
filter out streams that might not be available given a user's permissions.
|
||||
```python
|
||||
# import this library
|
||||
from airbyte_cdk.sources.streams.availability_strategy import AvailabilityStrategy
|
||||
```
|
||||
|
||||
```python
|
||||
@property
|
||||
def availability_strategy(self) -> Optional[AvailabilityStrategy]:
|
||||
return None
|
||||
|
||||
```
|
||||
|
||||
Instead of using an availability strategy, we'll raise a config error if we're unable to
|
||||
authenticate:
|
||||
```python
|
||||
# import the following library
|
||||
from airbyte_cdk.utils.traced_exception import AirbyteTracedException, FailureType
|
||||
```
|
||||
|
||||
```python
|
||||
def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
|
||||
response_json = response.json()
|
||||
# https://api.surveymonkey.com/v3/docs?shell#error-codes
|
||||
if response_json.get("error") in (1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018):
|
||||
internal_message = "Unauthorized credentials. Response: {response_json}"
|
||||
external_message = "Can not get metadata with unauthorized credentials. Try to re-authenticate in source settings."
|
||||
raise AirbyteTracedException(
|
||||
message=external_message, internal_message=internal_message, failure_type=FailureType.config_error
|
||||
)
|
||||
elif self._data_field:
|
||||
yield from response_json[self._data_field]
|
||||
else:
|
||||
yield from response_json
|
||||
```
|
||||
|
||||
The `external_message` will be displayed to the end-user, while the `internal_message` will be
|
||||
logged for troubleshooting purposes.
|
||||
|
||||
In the [next section](./5-discover.md), we'll implement the discover operation.
|
||||
@@ -0,0 +1,124 @@
|
||||
# Discover
|
||||
|
||||
The discover method of the Airbyte Protocol returns an AirbyteCatalog: an object which declares all
|
||||
the streams output by a connector and their schemas. It also declares the sync modes supported by
|
||||
the stream (full refresh or incremental). See the
|
||||
[beginner's guide to the catalog](../../../understanding-airbyte/beginners-guide-to-catalog.md) for
|
||||
more information.
|
||||
|
||||
Run a discover command:
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo discover --config secrets/config.json
|
||||
```
|
||||
|
||||
The command should succeed, but the schema will be wrong:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "CATALOG",
|
||||
"catalog": {
|
||||
"streams": [
|
||||
{
|
||||
"name": "surveys",
|
||||
"json_schema": {
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"id": { "type": ["null", "string"] },
|
||||
"name": { "type": ["null", "string"] },
|
||||
"signup_date": { "type": ["null", "string"], "format": "date-time" }
|
||||
}
|
||||
},
|
||||
"supported_sync_modes": ["full_refresh"],
|
||||
"source_defined_primary_key": [["id"]]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
We'll need to replace the schema `surveys.json` with a json_schema representation of the records yielded by the
|
||||
stream.
|
||||
|
||||
The easiest way to extract the schema from a HTTP response is to use the Connector Builder. You can
|
||||
also paste the schema below, which was generated by the Connector Builder:
|
||||
|
||||
```json
|
||||
{
|
||||
"$schema": "http://json-schema.org/schema#",
|
||||
"properties": {
|
||||
"analyze_url": {
|
||||
"type": "string"
|
||||
},
|
||||
"collect_stats": {
|
||||
"properties": {
|
||||
"status": {
|
||||
"properties": {
|
||||
"open": {
|
||||
"type": "number"
|
||||
}
|
||||
},
|
||||
"type": "object"
|
||||
},
|
||||
"total_count": {
|
||||
"type": "number"
|
||||
},
|
||||
"type": {
|
||||
"properties": {
|
||||
"weblink": {
|
||||
"type": "number"
|
||||
}
|
||||
},
|
||||
"type": "object"
|
||||
}
|
||||
},
|
||||
"type": "object"
|
||||
},
|
||||
"date_created": {
|
||||
"type": "string"
|
||||
},
|
||||
"date_modified": {
|
||||
"type": "string"
|
||||
},
|
||||
"href": {
|
||||
"type": "string"
|
||||
},
|
||||
"id": {
|
||||
"type": "string"
|
||||
},
|
||||
"language": {
|
||||
"type": "string"
|
||||
},
|
||||
"nickname": {
|
||||
"type": "string"
|
||||
},
|
||||
"preview": {
|
||||
"type": "string"
|
||||
},
|
||||
"question_count": {
|
||||
"type": "number"
|
||||
},
|
||||
"response_count": {
|
||||
"type": "number"
|
||||
},
|
||||
"title": {
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"type": "object"
|
||||
}
|
||||
```
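
If you'd rather generate a draft schema yourself from a sample record, a schema-inference library
can get you most of the way there. As an aside (and an assumption on our part, since it is not a
dependency of this tutorial), the third-party `genson` package works like this; you would still want
to hand-tune the result, for example to make fields nullable:

```python
import json

from genson import SchemaBuilder  # third-party package: pip install genson

sample_record = {"id": "1234", "title": "My Survey", "response_count": 10}

builder = SchemaBuilder()
builder.add_object(sample_record)
print(json.dumps(builder.to_schema(), indent=2))
```
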
|
||||
|
||||
:::info
|
||||
|
||||
If the connector you're building has a dynamic schema, you'll need to override
`AbstractSource.streams`.
|
||||
|
||||
:::
|
||||
|
||||
---
|
||||
|
||||
The three connector operations work as expected. You can now
[upload your connector to your Airbyte instance](https://docs.airbyte.com/operator-guides/using-custom-connectors).
In the [next section](6-incremental-reads.md), we'll add support for incremental reads.
|
||||
@@ -0,0 +1,182 @@
|
||||
# Incremental reads
|
||||
|
||||
In this section, we'll add support to read data incrementally. While this is optional, you should
|
||||
implement it whenever possible because reading in incremental mode allows users to save time and
|
||||
money by only reading new data.
|
||||
|
||||
We'll first need to implement three new methods on the base stream class.
|
||||
|
||||
The `cursor_field` property indicates that records produced by the stream have a cursor that can be
used to position them in the timeline.
|
||||
|
||||
```python
|
||||
@property
|
||||
def cursor_field(self) -> Optional[str]:
|
||||
return self._cursor_field
|
||||
```
|
||||
|
||||
The `get_updated_state` method is used to update the stream's state. We'll set its value to the
|
||||
maximum between the current state's value and the value extracted from the record.
|
||||
```python
|
||||
# import the following library
|
||||
import datetime
|
||||
```
|
||||
|
||||
```python
|
||||
def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, Any]:
|
||||
state_value = max(current_stream_state.get(self.cursor_field, 0), datetime.datetime.strptime(latest_record.get(self._cursor_field, ""), _INCOMING_DATETIME_FORMAT).timestamp())
|
||||
return {self._cursor_field: state_value}
|
||||
```
|
||||
|
||||
Note that we're converting the datetimes to unix epoch. We could've also chosen to persist it as an
|
||||
ISO date. You can use any format that works best for you. Integers are easy to work with so that's
|
||||
what we'll do for this tutorial.
|
||||
|
||||
Then we'll implement the `stream_slices` method, which will be used to partition the stream into
|
||||
time windows. While this isn't mandatory since we could omit the `end_modified_at` parameter from
|
||||
our requests and try to read all new records at once, it is preferable to partition the stream
|
||||
because it enables checkpointing.
|
||||
|
||||
This might mean the connector will make more requests than necessary during the initial sync, and
|
||||
this is most visible when working with a sandbox or an account that does not have many records. The
|
||||
upsides are worth the tradeoff because the additional cost is negligible for accounts that have many
|
||||
records, and the time cost will be entirely mitigated in a follow up section when we fetch
|
||||
partitions concurrently.
|
||||
|
||||
```python
|
||||
|
||||
def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, any]]]:
|
||||
start_ts = stream_state.get(self._cursor_field, _START_DATE) if stream_state else _START_DATE
|
||||
now_ts = datetime.datetime.now().timestamp()
|
||||
if start_ts >= now_ts:
|
||||
yield from []
|
||||
return
|
||||
for start, end in self.chunk_dates(start_ts, now_ts):
|
||||
yield {"start_date": start, "end_date": end}
|
||||
|
||||
def chunk_dates(self, start_date_ts: int, end_date_ts: int) -> Iterable[Tuple[int, int]]:
|
||||
step = int(_SLICE_RANGE * 24 * 60 * 60)
|
||||
after_ts = start_date_ts
|
||||
while after_ts < end_date_ts:
|
||||
before_ts = min(end_date_ts, after_ts + step)
|
||||
yield after_ts, before_ts
|
||||
after_ts = before_ts + 1
|
||||
```
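
As a quick sanity check of the chunking logic, here's the same windowing applied to a two-day range
with one-day slices (a standalone demonstration, not connector code):

```python
import datetime

step = int(1 * 24 * 60 * 60)  # one-day slices, i.e. _SLICE_RANGE = 1
after_ts = datetime.datetime(2024, 1, 1).timestamp()
end_ts = datetime.datetime(2024, 1, 3).timestamp()

while after_ts < end_ts:
    before_ts = min(end_ts, after_ts + step)
    print(datetime.datetime.fromtimestamp(after_ts), "->", datetime.datetime.fromtimestamp(before_ts))
    after_ts = before_ts + 1
```

Each window closes at `before_ts` and the next one starts one second later, so no timestamp is
covered twice.
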
|
||||
|
||||
Note that we're introducing the concept of a start date. You might have to fiddle to find the
|
||||
earliest start date that can be queried. You can also choose to make the start date configurable by
|
||||
the end user. This will make your life simpler, at the cost of pushing the complexity to the
|
||||
end-user.
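
If you do decide to make the start date configurable, a minimal sketch could fall back to the
hard-coded default when the user leaves it out. The `start_date` config field below is hypothetical
and not part of this tutorial's spec:

```python
import datetime
from typing import Any, Mapping

_START_DATE = datetime.datetime(2020, 1, 1, 0, 0, 0).timestamp()


def _start_date_from_config(config: Mapping[str, Any]) -> float:
    """Return the user-provided ISO start date as a timestamp, or fall back to the default."""
    if config.get("start_date"):
        return datetime.datetime.fromisoformat(config["start_date"]).timestamp()
    return _START_DATE
```
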
|
||||
|
||||
We'll now update the query params. In addition to passing the page size and the include field,
|
||||
we'll pass in the `start_modified_at` and `end_modified_at` which can be extracted from the
|
||||
`stream_slice` parameter.
|
||||
|
||||
```python
|
||||
def request_params(
|
||||
self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, any] = None, next_page_token: Mapping[str, Any] = None
|
||||
) -> MutableMapping[str, Any]:
|
||||
if next_page_token:
|
||||
return urlparse(next_page_token["next_url"]).query
|
||||
else:
|
||||
return {
|
||||
"per_page": _PAGE_SIZE, "include": "response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats",
|
||||
"start_modified_at": datetime.datetime.strftime(datetime.datetime.fromtimestamp(stream_slice["start_date"]), _OUTGOING_DATETIME_FORMAT),
|
||||
"end_modified_at": datetime.datetime.strftime(datetime.datetime.fromtimestamp(stream_slice["end_date"]), _OUTGOING_DATETIME_FORMAT)
|
||||
}
|
||||
```
|
||||
|
||||
And add the following constants to the source.py file
|
||||
|
||||
```python
|
||||
_START_DATE = datetime.datetime(2020,1,1, 0,0,0).timestamp()
|
||||
_SLICE_RANGE = 365
|
||||
_OUTGOING_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
|
||||
_INCOMING_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"
|
||||
```
|
||||
|
||||
Notice the outgoing and incoming date formats are different!
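
To make the difference concrete, here's the same instant rendered in both formats (a throwaway
demonstration, not connector code):

```python
import datetime

_OUTGOING_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
_INCOMING_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

ts = datetime.datetime(2021, 6, 10, 18, 7).timestamp()

# what the connector sends in start_modified_at / end_modified_at
print(datetime.datetime.strftime(datetime.datetime.fromtimestamp(ts), _OUTGOING_DATETIME_FORMAT))
# -> 2021-06-10T18:07:00Z

# what the API returns in date_modified, parsed back into a timestamp for the state
print(datetime.datetime.strptime("2021-06-10T18:07:00", _INCOMING_DATETIME_FORMAT).timestamp())
```
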
|
||||
|
||||
Now, update the stream constructor so it accepts a cursor_field parameter.
|
||||
|
||||
```python
|
||||
class SurveyMonkeyBaseStream(HttpStream, ABC):
|
||||
def __init__(self, name: str, path: str, primary_key: Union[str, List[str]], data_field: Optional[str], cursor_field: Optional[str],
|
||||
**kwargs: Any) -> None:
|
||||
self._name = name
|
||||
self._path = path
|
||||
self._primary_key = primary_key
|
||||
self._data_field = data_field
|
||||
self._cursor_field = cursor_field
|
||||
super().__init__(**kwargs)
|
||||
```
|
||||
|
||||
And update the stream's creation:
|
||||
|
||||
```python
|
||||
return [SurveyMonkeyBaseStream(name="surveys", path="/v3/surveys", primary_key="id", data_field="data", cursor_field="date_modified", authenticator=auth)]
|
||||
```
|
||||
|
||||
Finally, modify the configured catalog to run the stream in incremental mode:
|
||||
|
||||
```json
|
||||
{
|
||||
"streams": [
|
||||
{
|
||||
"stream": {
|
||||
"name": "surveys",
|
||||
"json_schema": {},
|
||||
"supported_sync_modes": ["full_refresh", "incremental"]
|
||||
},
|
||||
"sync_mode": "incremental",
|
||||
"destination_sync_mode": "overwrite"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Run another read operation. The state messages should include the cursor:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "STATE",
|
||||
"state": {
|
||||
"type": "STREAM",
|
||||
"stream": {
|
||||
"stream_descriptor": { "name": "surveys", "namespace": null },
|
||||
"stream_state": { "date_modified": 1623348420.0 }
|
||||
},
|
||||
"sourceStats": { "recordCount": 0.0 }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Next, update the sample state to a timestamp that is more recent than some of your records. Fewer
records should be read:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"type": "STREAM",
|
||||
"stream": {
|
||||
"stream_descriptor": {
|
||||
"name": "surveys"
|
||||
},
|
||||
"stream_state": {
|
||||
"date_modified": 1711753326
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
Run another read command, passing the `--state` flag:
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json --state integration_tests/sample_state.json
|
||||
```
|
||||
|
||||
Only more recent records should be read.
|
||||
|
||||
In the [next section](7-reading-from-a-subresource.md), we'll implement the survey responses stream,
|
||||
which depends on the surveys stream.
|
||||
@@ -0,0 +1,197 @@
|
||||
# Reading from a subresource
|
||||
|
||||
In this section, we'll implement the survey responses stream. Its structure is a little different
because it depends on the surveys stream.
|
||||
|
||||
Start by creating a new base class for substreams:
|
||||
|
||||
```python
|
||||
class SurveyMonkeySubstream(HttpStream, ABC):
|
||||
|
||||
def __init__(self, name: str, path: str, primary_key: Union[str, List[str]], parent_stream: Stream, **kwargs: Any) -> None:
|
||||
self._name = name
|
||||
self._path = path
|
||||
self._primary_key = primary_key
|
||||
self._parent_stream = parent_stream
|
||||
super().__init__(**kwargs)
|
||||
|
||||
url_base = "https://api.surveymonkey.com"
|
||||
|
||||
def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
|
||||
links = response.json().get("links", {})
|
||||
if "next" in links:
|
||||
return {"next_url": links["next"]}
|
||||
else:
|
||||
return {}
|
||||
|
||||
def request_params(
|
||||
self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, any] = None, next_page_token: Mapping[str, Any] = None
|
||||
) -> MutableMapping[str, Any]:
|
||||
if next_page_token:
|
||||
return urlparse(next_page_token["next_url"]).query
|
||||
else:
|
||||
return {"per_page": _PAGE_SIZE}
|
||||
|
||||
def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
|
||||
yield from response.json().get("data", [])
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
return self._name
|
||||
|
||||
def path(
|
||||
self,
|
||||
*,
|
||||
stream_state: Optional[Mapping[str, Any]] = None,
|
||||
stream_slice: Optional[Mapping[str, Any]] = None,
|
||||
next_page_token: Optional[Mapping[str, Any]] = None,
|
||||
) -> str:
|
||||
        return self._path.format(stream_slice=stream_slice)
|
||||
|
||||
@property
|
||||
def primary_key(self) -> Optional[Union[str, List[str], List[List[str]]]]:
|
||||
return self._primary_key
|
||||
|
||||
def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, any]]]:
|
||||
for _slice in self._parent_stream.stream_slices():
|
||||
for parent_record in self._parent_stream.read_records(sync_mode=SyncMode.full_refresh, stream_slice=_slice):
|
||||
yield parent_record
|
||||
```
|
||||
|
||||
This class is similar to the base class, but it does not support incremental reads, and its stream
|
||||
slices are generated by reading records from a parent stream. This is how we'll ensure we always
|
||||
read all survey responses.
|
||||
|
||||
Note that using this approach, the connector will checkpoint after reading responses for each
|
||||
survey.
|
||||
|
||||
Don't forget to update the `streams` method to also instantiate the survey responses stream:
|
||||
|
||||
```python
|
||||
def streams(self, config: Mapping[str, Any]) -> List[Stream]:
|
||||
auth = TokenAuthenticator(token=config["access_token"])
|
||||
surveys = SurveyMonkeyBaseStream(name="surveys", path="/v3/surveys", primary_key="id", data_field="data", cursor_field="date_modified", authenticator=auth)
|
||||
survey_responses = SurveyMonkeySubstream(name="survey_responses", path="/v3/surveys/{stream_slice[id]}/responses/", primary_key="id", authenticator=auth, parent_stream=surveys)
|
||||
return [
|
||||
surveys,
|
||||
survey_responses
|
||||
]
|
||||
```
|
||||
|
||||
Before moving on, we'll enable request caching on the surveys stream to avoid fetching the records
|
||||
both for the surveys stream and for the survey responses stream. You can do this by setting the
|
||||
`use_cache` property to true on the `SurveyMonkeyBaseStream` class.
|
||||
|
||||
```python
|
||||
@property
|
||||
def use_cache(self) -> bool:
|
||||
return True
|
||||
```
|
||||
|
||||
Now add the stream to the configured catalog:
|
||||
|
||||
```json
|
||||
{
|
||||
"streams": [
|
||||
{
|
||||
"stream": {
|
||||
"name": "surveys",
|
||||
"json_schema": {},
|
||||
"supported_sync_modes": ["full_refresh", "incremental"]
|
||||
},
|
||||
"sync_mode": "incremental",
|
||||
"destination_sync_mode": "overwrite"
|
||||
},
|
||||
{
|
||||
"stream": {
|
||||
"name": "survey_responses",
|
||||
"json_schema": {},
|
||||
"supported_sync_modes": ["full_refresh"]
|
||||
},
|
||||
"sync_mode": "full_refresh",
|
||||
"destination_sync_mode": "overwrite"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
and create a new schema file in `source_survey_monkey_demo/schemas/survey_responses.json`. You can
|
||||
use the connector builder to generate the schema, or paste the one provided below:
|
||||
|
||||
```json
|
||||
{
|
||||
"$schema": "http://json-schema.org/schema#",
|
||||
"properties": {
|
||||
"analyze_url": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"collect_stats": {
|
||||
"properties": {
|
||||
"status": {
|
||||
"properties": {
|
||||
"open": {
|
||||
"type": ["number", "null"]
|
||||
}
|
||||
},
|
||||
"type": ["object", "null"]
|
||||
},
|
||||
"total_count": {
|
||||
"type": ["number", "null"]
|
||||
},
|
||||
"type": {
|
||||
"properties": {
|
||||
"weblink": {
|
||||
"type": ["number", "null"]
|
||||
}
|
||||
},
|
||||
"type": ["object", "null"]
|
||||
}
|
||||
},
|
||||
"type": ["object", "null"]
|
||||
},
|
||||
"date_created": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"date_modified": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"href": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"id": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"language": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"nickname": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"preview": {
|
||||
"type": ["string", "null"]
|
||||
},
|
||||
"question_count": {
|
||||
"type": ["number", "null"]
|
||||
},
|
||||
"response_count": {
|
||||
"type": ["number", "null"]
|
||||
},
|
||||
"title": {
|
||||
"type": ["string", "null"]
|
||||
}
|
||||
},
|
||||
"type": "object"
|
||||
}
|
||||
```
|
||||
|
||||
You should now be able to read your survey responses:
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json
|
||||
```
|
||||
|
||||
In the [next section](8-concurrency.md) we'll update the connector so it reads stream slices
|
||||
concurrently.
|
||||
@@ -0,0 +1,234 @@
|
||||
# Concurrent
|
||||
|
||||
In this section, we'll improve the connector performance by reading multiple stream slices in
|
||||
parallel.
|
||||
|
||||
Let's update the source. The bulk of the change is changing its parent class to
|
||||
`ConcurrentSourceAdapter`, and updating its `__init__` method so it's properly initialized. This
|
||||
requires a little bit of boilerplate:
|
||||
|
||||
```python
|
||||
# import the following libraries
|
||||
import logging
|
||||
import pendulum
|
||||
from airbyte_cdk.logger import AirbyteLogFormatter
|
||||
from airbyte_cdk.models import Level
|
||||
from airbyte_cdk.sources.concurrent_source.concurrent_source_adapter import ConcurrentSourceAdapter, ConcurrentSource
|
||||
from airbyte_cdk.sources.connector_state_manager import ConnectorStateManager
|
||||
from airbyte_cdk.sources.message.repository import InMemoryMessageRepository
|
||||
```
|
||||
|
||||
```python
|
||||
class SourceSurveyMonkeyDemo(ConcurrentSourceAdapter):
|
||||
message_repository = InMemoryMessageRepository(Level(AirbyteLogFormatter.level_mapping[_logger.level]))
|
||||
|
||||
def __init__(self, config: Optional[Mapping[str, Any]], state: Optional[Mapping[str, Any]]):
|
||||
if config:
|
||||
concurrency_level = min(config.get("num_workers", _DEFAULT_CONCURRENCY), _MAX_CONCURRENCY)
|
||||
else:
|
||||
concurrency_level = _DEFAULT_CONCURRENCY
|
||||
_logger.info(f"Using concurrent cdk with concurrency level {concurrency_level}")
|
||||
concurrent_source = ConcurrentSource.create(
|
||||
concurrency_level, concurrency_level // 2, _logger, self._slice_logger, self.message_repository
|
||||
)
|
||||
super().__init__(concurrent_source)
|
||||
self._config = config
|
||||
self._state = state
|
||||
|
||||
def _get_slice_boundary_fields(self, stream: Stream, state_manager: ConnectorStateManager) -> Optional[Tuple[str, str]]:
|
||||
return ("start_date", "end_date")
|
||||
```
|
||||
|
||||
We'll also need to update the `streams` method to wrap the streams in an adapter class to enable
|
||||
concurrency.
|
||||
```python
|
||||
# import the following libraries
|
||||
from airbyte_cdk.sources.streams.concurrent.adapters import StreamFacade
|
||||
from airbyte_cdk.sources.streams.concurrent.cursor import CursorField, ConcurrentCursor, FinalStateCursor
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
def streams(self, config: Mapping[str, Any]) -> List[Stream]:
|
||||
auth = TokenAuthenticator(config["access_token"])
|
||||
|
||||
survey_stream = SurveyMonkeyBaseStream(name="surveys", path="/v3/surveys", primary_key="id", data_field="data", authenticator=auth, cursor_field="date_modified")
|
||||
synchronous_streams = [
|
||||
survey_stream,
|
||||
SurveyMonkeySubstream(name="survey_responses", path="/v3/surveys/{stream_slice[id]}/responses/", primary_key="id", authenticator=auth, parent_stream=survey_stream)
|
||||
]
|
||||
state_manager = ConnectorStateManager(stream_instance_map={s.name: s for s in synchronous_streams}, state=self._state)
|
||||
|
||||
configured_streams = []
|
||||
|
||||
for stream in synchronous_streams:
|
||||
|
||||
            # look up any existing state up front so it is available for both cursor types below
            legacy_state = state_manager.get_stream_state(stream.name, stream.namespace)
            if stream.cursor_field:
                cursor_field = CursorField(stream.cursor_field)
|
||||
cursor = ConcurrentCursor(
|
||||
stream.name,
|
||||
stream.namespace,
|
||||
legacy_state,
|
||||
self.message_repository,
|
||||
state_manager,
|
||||
stream.state_converter,
|
||||
cursor_field,
|
||||
self._get_slice_boundary_fields(stream, state_manager),
|
||||
pendulum.from_timestamp(_START_DATE),
|
||||
EpochValueConcurrentStreamStateConverter.get_end_provider()
|
||||
)
|
||||
else:
|
||||
cursor = FinalStateCursor(stream.name, stream.namespace, self.message_repository)
|
||||
            configured_streams.append(
|
||||
StreamFacade.create_from_stream(stream,
|
||||
self,
|
||||
_logger,
|
||||
legacy_state,
|
||||
cursor)
|
||||
)
|
||||
return configured_streams
|
||||
```
|
||||
|
||||
The most interesting piece from this block is the use of `ConcurrentCursor` to support concurrent
|
||||
state management.
|
||||
|
||||
The survey responses stream does not support incremental reads, so it's using a `FinalStateCursor`
|
||||
instead. The rest of the code change is mostly boilerplate.
|
||||
|
||||
We'll also add a state converter to the `SurveyMonkeyBaseStream` to describe how the state cursor is
|
||||
formatted. We'll use the `EpochValueConcurrentStreamStateConverter` since the `get_updated_state`
|
||||
method returns the cursor as a timestamp.
|
||||
|
||||
```python
|
||||
# import the following library
|
||||
from airbyte_cdk.sources.streams.concurrent.state_converters.datetime_stream_state_converter import EpochValueConcurrentStreamStateConverter
|
||||
```
|
||||
|
||||
```python
|
||||
state_converter = EpochValueConcurrentStreamStateConverter()
|
||||
```
|
||||
|
||||
Next we'll add a few missing constants:
|
||||
|
||||
```python
|
||||
_DEFAULT_CONCURRENCY = 10
|
||||
_MAX_CONCURRENCY = 10
|
||||
_RATE_LIMIT_PER_MINUTE = 120
|
||||
_logger = logging.getLogger("airbyte")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
:::info
|
||||
|
||||
The substream isn't entirely concurrent because its stream_slices definition reads records from the
parent stream sequentially:
|
||||
|
||||
```python
|
||||
def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, any]]]:
|
||||
for _slice in self._parent_stream.stream_slices():
|
||||
for parent_record in self._parent_stream.read_records(sync_mode=SyncMode.full_refresh, stream_slice=_slice):
|
||||
yield parent_record
|
||||
```
|
||||
|
||||
This can be solved by implementing the connector using constructs from the concurrent CDK directly
|
||||
instead of wrapping synchronous streams in an adapter. This is left outside of the scope of this
|
||||
tutorial because no production connectors currently implement this.
|
||||
|
||||
:::
|
||||
|
||||
We'll now enable throttling to avoid going over the API rate limit. You can do this by configuring a
|
||||
moving window rate limit policy for the `SurveyMonkeyBaseStream` class:
|
||||
|
||||
```python
|
||||
# import the following libraries
|
||||
from airbyte_cdk.sources.streams.call_rate import MovingWindowCallRatePolicy, HttpAPIBudget, Rate
|
||||
```
|
||||
|
||||
```python
|
||||
class SurveyMonkeyBaseStream(HttpStream, ABC):
|
||||
def __init__(self, name: str, path: str, primary_key: Union[str, List[str]], data_field: Optional[str], cursor_field: Optional[str],
|
||||
**kwargs: Any) -> None:
|
||||
self._name = name
|
||||
self._path = path
|
||||
self._primary_key = primary_key
|
||||
self._data_field = data_field
|
||||
        self._cursor_field = cursor_field
|
||||
|
||||
policies = [
|
||||
MovingWindowCallRatePolicy(
|
||||
rates=[Rate(limit=_RATE_LIMIT_PER_MINUTE, interval=datetime.timedelta(minutes=1))],
|
||||
matchers=[],
|
||||
),
|
||||
]
|
||||
api_budget = HttpAPIBudget(policies=policies)
|
||||
super().__init__(api_budget=api_budget, **kwargs)
|
||||
```
|
||||
|
||||
Finally, update the `run.py` file to properly instantiate the class. Most of this code is
|
||||
boilerplate code and isn't specific to the Survey Monkey connector.
|
||||
|
||||
```python
|
||||
#
|
||||
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
|
||||
#
|
||||
|
||||
import sys
|
||||
import traceback
|
||||
from datetime import datetime
|
||||
from typing import List
|
||||
|
||||
from airbyte_cdk.entrypoint import AirbyteEntrypoint, launch
|
||||
from airbyte_cdk.models import AirbyteErrorTraceMessage, AirbyteMessage, AirbyteTraceMessage, TraceType, Type
|
||||
|
||||
from .source import SourceSurveyMonkeyDemo
|
||||
|
||||
def _get_source(args: List[str]):
|
||||
config_path = AirbyteEntrypoint.extract_config(args)
|
||||
state_path = AirbyteEntrypoint.extract_state(args)
|
||||
try:
|
||||
return SourceSurveyMonkeyDemo(
|
||||
SourceSurveyMonkeyDemo.read_config(config_path) if config_path else None,
|
||||
SourceSurveyMonkeyDemo.read_state(state_path) if state_path else None,
|
||||
)
|
||||
except Exception as error:
|
||||
print(
|
||||
AirbyteMessage(
|
||||
type=Type.TRACE,
|
||||
trace=AirbyteTraceMessage(
|
||||
type=TraceType.ERROR,
|
||||
emitted_at=int(datetime.now().timestamp() * 1000),
|
||||
error=AirbyteErrorTraceMessage(
|
||||
message=f"Error starting the sync. This could be due to an invalid configuration or catalog. Please contact Support for assistance. Error: {error}",
|
||||
stack_trace=traceback.format_exc(),
|
||||
),
|
||||
),
|
||||
).json()
|
||||
)
|
||||
return None
|
||||
|
||||
|
||||
|
||||
def run():
|
||||
args = sys.argv[1:]
|
||||
source = _get_source(args)
|
||||
launch(source, args)
|
||||
```
|
||||
|
||||
You can now run a read operation again. The connector will read multiple partitions concurrently
|
||||
instead of looping through all of them sequentially.
|
||||
|
||||
```bash
|
||||
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json
|
||||
```
|
||||
|
||||
We're now done! We implemented a Python connector covering many features:
|
||||
|
||||
- Fast and reproducible integration tests
|
||||
- Authentication errors are detected and labeled as such
|
||||
- One stream supports incremental reads
|
||||
- One stream depends on another stream
|
||||
|
||||
The final code can be found [here](https://github.com/girarda/airbyte/tree/survey_monkey_demo).
|
||||
283
docs/platform/connector-development/ux-handbook.md
Normal file
@@ -0,0 +1,283 @@
|
||||
# UX Handbook
|
||||
|
||||
## Connector Development UX Handbook
|
||||
|
||||

|
||||
|
||||
### Overview
|
||||
|
||||
The goal of this handbook is to allow scaling high quality decision making when developing connectors.
|
||||
|
||||
The Handbook is a living document, meant to be continuously updated. It is the best snapshot we can produce of the lessons learned from building and studying hundreds of connectors. While helpful, this snapshot is never perfect. Therefore, this Handbook is not a replacement for good judgment, but rather learnings that should help guide your work.
|
||||
|
||||
### How to use this handbook
|
||||
|
||||
1. When thinking about a UX-impacting decision regarding connectors, consult this Handbook.
|
||||
2. If the Handbook does not answer your question, then consider proposing an update to the Handbook if you believe your question will be applicable to more cases.
|
||||
|
||||
### Definition of UX-impacting changes
|
||||
|
||||
UX-impacting changes are ones which impact how the user directly interacts with, consumes, or perceives the product.
|
||||
|
||||
**Examples**:
|
||||
|
||||
1. Public-facing documentation
|
||||
2. Input configuration
|
||||
3. Output schema
|
||||
4. Prerequisite configuration by the user (e.g: you need to link an instagram account to your Facebook page for this connector to work properly)
|
||||
5. Consolidating two connectors into one, or splitting one connector into two
|
||||
6. Wait time for human-at-keyboard
|
||||
7. Anything that negatively impacts the runtime of the connector (e.g: a change that makes the runtime go from 10 minutes to 20 minutes on the same data size)
|
||||
8. Any other change which you deem UX-impacting
|
||||
1. The guide can’t cover everything, so this is an escape hatch based on the developer’s judgment.
|
||||
|
||||
**Examples of UX-impacting changes**:
|
||||
|
||||
1. Adding or removing an input field to/from spec.json
|
||||
2. Adding or removing fields from the output schema
|
||||
3. Adding a new stream or category of stream (e.g: supporting views in databases)
|
||||
4. Adding OAuth support
|
||||
|
||||
**Examples of non-UX-impacting changes**:
|
||||
|
||||
1. Refactoring without changing functionality
|
||||
2. Bugfix (e.g: pagination doesn’t work correctly)
|
||||
|
||||
### Guiding Principles
|
||||
|
||||
Would you trust AWS or Docker if it only worked 70, 80, or 90% of the time or if it leaked your business secrets? Yeah, me neither. You would only build on a tool if it worked at least 99% of the time. Infrastructure should give you back your time, rather than become a debugging timesink.
|
||||
|
||||
The same is true with Airbyte: if it worked less than 99% of the time, many users will stop using it. Airbyte is an infrastructure component within a user’s data pipeline. Our users’ goal is to move data; Airbyte is an implementation detail. In that sense, it is much closer to Terraform, Docker, or AWS than an end application.
|
||||
|
||||
#### Trust & Reliability are the top concerns
|
||||
|
||||
Our users have the following hierarchy of needs: 
|
||||
|
||||

|
||||
|
||||
**Security**
|
||||
|
||||
Users often move very confidential data like revenue numbers, salaries, or confidential documents through Airbyte. A user therefore must trust that their data is secure. This means no leaking credentials in logs or plain text, no vulnerabilities in the product, no frivolous sharing of credentials or data over internal slack channels, video calls, or other communications etc.
|
||||
|
||||
**Data integrity**
|
||||
|
||||
Data replicated by Airbyte must be correct and complete. If a user moves data with Airbyte, then all of the data must be present, and it must all be correct - no corruption, incorrect values, or wrongly formatted data.
|
||||
|
||||
Some tricky examples which can break data integrity if not handled correctly:
|
||||
|
||||
- Zipcodes for the US east coast should not lose their leading zeros because of being detected as integer
|
||||
- Database timezones could affect the value of timestamps
|
||||
- Esoteric text values (e.g: weird UTF characters)
|
||||
|
||||
**Reliability**
|
||||
|
||||
A connector needs to be reliable. Otherwise, a user will need to spend a lot of time debugging, and at that point, they’re better off using a competing product. The connector should be able to handle large inputs, weirdly formatted inputs, all data types, and basically anything a user should throw at it.
|
||||
|
||||
In other words, a connector should work 100% of the time, but 99.9% is occasionally acceptable.
|
||||
|
||||
#### Speed
|
||||
|
||||
Sync speed minimizes the time needed for deriving value from data. It also provides a better user experience as it allows quick iteration on connector configurations without suffering through long wait times. 
|
||||
|
||||
**Ease of use**
|
||||
|
||||
People love and trust a product that is easy to use. This means that it works as quickly as possible, introduces no friction, and uses sensible defaults that are good enough for 95% of users.
|
||||
|
||||
An important component of usability is predictability. That is, as much as possible, a user should know before running a connector what its output will be and what the different tables will mean.
|
||||
|
||||
Ideally, they would even see an ERD describing the output schema they can expect to find in the destination. (This particular feature is tracked [here](https://github.com/airbytehq/airbyte/issues/3731)).
|
||||
|
||||
**Feature Set**
|
||||
|
||||
Our connectors should cover as many use cases as is feasible. While it may not always work like that given our incremental delivery preference, we should always strive to provide the most featureful connectors which cover as much of the underlying API or database surface as possible.
|
||||
|
||||
There is also a tension between featureset and ease of use. The more features are available, the more thought it takes to make the product easy and intuitive to use. We’ll elaborate on this later.
|
||||
|
||||
### Airbyte's Target Personas
|
||||
|
||||
Without repeating too many details mentioned elsewhere, the important thing to know is Airbyte serves all the following personas:
|
||||
|
||||
| **Persona** | **Level of technical knowledge** |
|
||||
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| Data Analyst | <p>Proficient with:<br/><br/>Data manipulation tools like Excel or SQL<br/>Dashboard tools like Looker<br/><br/>Not very familiar with reading API docs and doesn't know what a curl request is. But might be able to generate an API key if you tell them exactly how.</p> |
|
||||
| Analytics Engineer | <p>Proficient with:<br/><br/>SQL & DBT<br/>Git<br/>A scripting language like Python<br/>Shallow familiarity with infra tools like Docker<br/><br/>Much more technical than a data analyst, but not as much as a data engineer</p> |
|
||||
| Data Engineer      | <p>Proficient with:<br/><br/>SQL & DBT<br/>Git<br/>2 or more programming languages<br/>Infra tools like Docker or Kubernetes<br/>Cloud technologies like AWS or GCP<br/>Building or consuming APIs<br/>orchestration tools like Airflow<br/><br/>The most technical persona we serve. Think of them like an engineer on your team</p> |
|
||||
|
||||
Keep in mind that the distribution of served personas will differ per connector. Data analysts are highly unlikely to form the majority of users for a very technical connector like say, Kafka.
|
||||
|
||||
## Specific Guidelines
|
||||
|
||||
### Input Configuration
|
||||
|
||||
_aka spec.json_
|
||||
|
||||
**Avoid configuration completely when possible**
|
||||
|
||||
Configuration means more work for the user and more chances for confusion, friction, or misconfiguration. If I could wave a magic wand, a user wouldn’t have to configure anything at all. Unfortunately, this is not reality, and some configuration is strictly required. When this is the case, follow the guidelines below.
|
||||
|
||||
**Avoid exposing implementation details in configuration**
|
||||
|
||||
If a configuration controls an implementation detail (like how many retries a connector should make before failing), then there should be almost no reason to expose this. If you feel a need to expose it, consider it might be a smell that the connector implementation is not robust.
|
||||
|
||||
Put another way, if a configuration tells the user how to do its job of syncing data rather than what job to achieve, it’s a code smell.
|
||||
|
||||
For example, the memory requirements for a database connector which syncs a table with very wide rows (50mb rows) can be very different than when syncing a table with very narrow rows (10kb per row). In this case, it may be acceptable to ask the user for some sort of “hint”/tuning parameter in configuration (hidden behind advanced configuration) to ensure the connector performs reliably or quickly. But even then, this option would strictly be a necessary evil/escape hatch. It is much more preferable for the connector to auto-detect what this setting should be and never need to bother the user with it.
|
||||
|
||||
**Minimize required configurations by setting defaults whenever possible**
|
||||
|
||||
In many cases, a configuration can be avoided by setting a default value for it but still making it possible to set other values. Whenever possible, follow this pattern.
|
||||
|
||||
**Hide technical or niche parameters under an “Advanced” section**
|
||||
|
||||
Sometimes, it’s inevitable that we need to expose some advanced or technical configuration. For example, the option to upload a TLS certificate to connect to a database, or the option to configure the number of retries done by an API connector: while these may be useful to some small percentage of users, it’s not worth making all users think or get confused about them.
|
||||
|
||||
Note: this is currently blocked by this [issue](https://github.com/airbytehq/airbyte/issues/3681).
|
||||
|
||||
**Add a “title” and “description” property for every input parameter**
|
||||
|
||||
This displays this information to the user in a polished way and gives less technical users (e.g: analysts) confidence that they can use this product. Be specific and unambiguous in the wording, explaining more than just the field name alone provides.
|
||||
|
||||
For example, the following spec:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"user_name": {
|
||||
"type": "string"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
produces the following input field in the UI: 
|
||||
|
||||

|
||||
|
||||
Whereas the following specification:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"user_name": {
|
||||
"type": "string",
|
||||
"description": "The username you use to login to the database",
|
||||
"title": "Username"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
produces the following UI:
|
||||
|
||||

|
||||
|
||||
The title should use Pascal Case “with spaces” e.g: “Attribution Lookback Window”, “Host URL”, etc...
|
||||
|
||||
**Clearly document the meaning and impact of all parameters**
|
||||
|
||||
All configurations must have an unmistakable explanation describing their purpose and impact, even the obvious ones. Remember, something that is obvious to an analyst may not be obvious to an engineer, and vice-versa.
|
||||
|
||||
For example, in some Ads APIs like Facebook, the user’s data may continue to be updated up to 28 days after it is created. This happens because a user may take action because of an ad (like buying a product) many days after they see the ad. In this case, the user may want to configure a “lookback” window for attributing.
|
||||
|
||||
Adding a parameter “attribution_lookback_window” with no explanation might confuse the user more than it helps them. Instead, we should add a clear title and description which describes what this parameter is and how different values will impact the data output by the connector.
|
||||
|
||||
**Document how users can obtain configuration parameters**
|
||||
|
||||
If a user needs to obtain an API key or host name, tell them exactly where to find it. Ideally you would show them screenshots, though include a date and API version in those if possible, so it’s clear when they’ve aged out of date.
|
||||
|
||||
**Links should point to page anchors where applicable**.
|
||||
|
||||
Often, you are trying to redirect the user to a specific part of the page. For example, if you wanted to point someone to the "Input Configuration" section of this doc, it is better to point them to `https://docs.airbyte.com/connector-development/ux-handbook#input-configuration` instead of `https://docs.airbyte.com/connector-development/ux-handbook`.
|
||||
|
||||
**Fail fast & actionably**
|
||||
|
||||
A user should not be able to configure something that will not work. If a user’s configuration is invalid, we should inform them as precisely as possible about what they need to do to fix the issue.
|
||||
|
||||
One helpful aid is to use the json-schema “pattern” keyword to accept inputs which adhere to the correct input shape.
|
||||
|
||||
### Output Data & Schemas
|
||||
|
||||
#### Strongly Favor ELT over ETL
|
||||
|
||||
Extract-Load-Transform (ELT) means extracting and loading the data into a destination while leaving its format/schema as unchanged as possible, and making transformation the responsibility of the consumer. By contrast, ETL means transforming data before it is sent to the destination, for example changing its schema to make it easier to consume in the destination.
|
||||
|
||||
When extracting data, strongly prefer ELT to ETL for the following reasons:
|
||||
|
||||
**Removes Airbyte as a development bottleneck**
|
||||
|
||||
If we get into the habit of structuring the output of each source according to how some users want to use it, then we will invite more feature requests from users asking us to transform data in a particular way. This introduces Airbyte’s dev team as an unnecessary bottleneck for these users.
|
||||
|
||||
Instead, we should set the standard that a user should be responsible for transformations once they’ve loaded data in a destination.
|
||||
|
||||
**Will always be backwards compatible**
|
||||
|
||||
APIs already follow strong conventions to maintain backwards compatibility. By transforming data, we break this guarantee, which means we may break downstream flows for our users.
|
||||
|
||||
**Future proof**
|
||||
|
||||
We may have a vision of what a user needs today. But if our persona evolves next year, then we’ll probably also need to adapt our transformation logic, which would require significant dev and data migration efforts.
|
||||
|
||||
**More flexible**
|
||||
|
||||
Current users have different needs from data. By being opinionated on how they should consume data, we are effectively favoring one user persona over the other. While there might be some cases where this is warranted, it should be done with extreme intentionality.
|
||||
|
||||
**More efficient**
|
||||
|
||||
With ETL, if the “T” ever needs to change, then we need to re-extract all data for all users. This is computationally and financially expensive and will place a lot of pressure on the source systems as we re-extract all data.
|
||||
|
||||
#### Describe output schemas as completely and reliably as possible
|
||||
|
||||
Our most popular destinations are strongly typed like Postgres, BigQuery, or Parquet & Avro.
|
||||
|
||||
Being strongly typed enables optimizations and syntactic sugar to make it very easy & performant for the user to query data.
|
||||
|
||||
To provide the best UX when moving data to these destinations, Airbyte source connectors should describe their schema in as much detail as correctness allows.
|
||||
|
||||
In some cases, describing schemas is impossible to do reliably. For example, MongoDB doesn’t have any schemas. To infer the schema, one needs to read all the records in a particular table. And even then, once new records are added, they also must all be read in order to update the inferred schema. At the time of writing, this is infeasible to do performantly in Airbyte since we do not have an intermediate staging area to do this. In this case, we should do the “best we can” to describe the schema, keeping in mind that reliability of the described schema is more important than expressiveness.
|
||||
|
||||
That is, we would rather not describe a schema at all than describe it incorrectly, as incorrect descriptions **will** lead to failures downstream.
|
||||
|
||||
To keep schema descriptions reliable, [automate schema generation](https://docs.airbyte.com/connector-development/cdk-python/schemas#generating-schemas-from-openapi-definitions) whenever possible.
|
||||
|
||||
#### Be very cautious about breaking changes to output schemas
|
||||
|
||||
Assuming we follow ELT over ETL, and automate generation of output schemas, this should come up very rarely. However, it is still important enough to warrant mention.
|
||||
|
||||
If for any reason we need to change the output schema declared by a connector in a backwards breaking way, consider it a necessary evil that should be avoided if possible. Basically, the only reasons for a backwards breaking change should be:
|
||||
|
||||
- a connector previously had an incorrect schema, or
|
||||
- It was not following ELT principles and is now being changed to follow them
|
||||
|
||||
Other breaking changes should probably be escalated for approval.
|
||||
|
||||
### Prerequisite Configurations & assumptions
|
||||
|
||||
**Document all assumptions**
|
||||
|
||||
If a connector makes assumptions about the underlying data source, then these assumptions must be documented. For example: for some features of the Facebook Pages connector to work, a user must have an Instagram Business account linked to an Instagram page linked to their Facebook Page. In this case, the externally facing documentation page for the connector must be very clear about this.
|
||||
|
||||
**Provide how-tos for prerequisite configuration**
|
||||
|
||||
If a user needs to set up their data source in a particular way to pull data, then we must provide documentation for how they should do it.
|
||||
|
||||
For example, to set up CDC for databases, a user must create logical replication slots and do a few other things. These steps should be documented with examples or screenshots wherever possible (e.g: here are the SQL statements you need to run, with the following permissions, on the following screen, etc.)
|
||||
|
||||
### External Documentation
|
||||
|
||||
This section is concerned with the external-facing documentation of a connector that goes in [https://docs.airbyte.com](https://docs.airbyte.com) e.g: [this one](https://docs.airbyte.com/integrations/sources/amazon-seller-partner)
|
||||
|
||||
**Documentation should communicate persona-impacting behaviors**
|
||||
|
||||
When writing documentation ask: who is the intended target persona for a piece of documentation, and what information do they need to understand how this connector impacts their workflows?
|
||||
|
||||
For example, data analysts & analytics engineers probably don’t care if we use Debezium for database replication. To them, the important thing is that we provide Change Data Capture (CDC) -- Debezium is an implementation detail. Therefore, when communicating information about our database replication logic, we should emphasize the end behaviors, rather than implementation details.
|
||||
|
||||
**Example**: Don’t say: “Debezium cannot process UTF-16 character set“.
|
||||
|
||||
Instead, say: “When using CDC, UTF-16 characters are not currently supported”
|
||||
|
||||
A user who doesn’t already know what Debezium is might be left confused by the first phrasing, so we should use the second phrasing.
|
||||
|
||||
\*: _this is a fake example. AFAIK there is no such limitation in Debezi-- I mean, the Postgres connector._
|
||||
201
docs/platform/connector-development/writing-connector-docs.md
Normal file
@@ -0,0 +1,201 @@
|
||||
# Writing Connector Documentation
|
||||
|
||||
This topic guides you through writing documentation for Airbyte connectors. The systems and practices described in [Updating Documentation](../contributing-to-airbyte/writing-docs.md) apply here as well. However, there are several features and restrictions that only apply to connectors.
|
||||
|
||||
## QA checks
|
||||
|
||||
If you're writing docs for a new connector, your docs must pass our [QA checks](../contributing-to-airbyte/resources/qa-checks).
|
||||
|
||||
## Custom Markdown extensions for connector docs
|
||||
|
||||
Airbyte's connector documentation must gracefully support different contexts in a way platform documentation doesn't.
|
||||
|
||||
- https://docs.airbyte.com
|
||||
- In-app documentation in self-managed versions of Airbyte
|
||||
- In-app documentation in the Cloud version of Airbyte
|
||||
|
||||
Key details about setting up a connection may differ between Cloud and Self-Managed. We created custom Markdown extensions that can show or hide different pieces of content based on the reader's environment. This is a rudimentary form of a concept called single-sourcing: writing content once, but using it in multiple contexts. This prevents us from having to maintain multiple highly similar pools of content.
|
||||
|
||||
The following features for single-sourcing are available. You can combine them to produce a more meaningful result.
|
||||
|
||||
### Hide content from the Airbyte UI
|
||||
|
||||
Some content is important to document, but unhelpful in Airbyte's UI. This could be:
|
||||
|
||||
- Background information that helps people understand a connector but doesn't affect the configuration process
|
||||
- Edge cases with complex solutions
|
||||
- Context about each environment, which doesn't need to be seen if you're in that environment
|
||||
|
||||
Wrapping content in `<HideInUI>...</HideInUI>` tags prevents Airbyte from rendering that content in-app, but https://docs.airbyte.com renders it normally.
|
||||
|
||||
### Hide content from Cloud
|
||||
|
||||
You can hide content from Airbyte Cloud, while still rendering it in Self-Managed and https://docs.airbyte.com.
|
||||
|
||||
```md
|
||||
<!-- env:oss -->
|
||||
|
||||
Only Self-Managed builds of the Airbyte UI will render this content.
|
||||
|
||||
<!-- /env:oss -->
|
||||
```
|
||||
|
||||
### Hide content from Self-Managed
|
||||
|
||||
You can hide content from Airbyte Self-Managed, while still rendering it in Cloud and https://docs.airbyte.com.
|
||||
|
||||
```md
|
||||
<!-- env:cloud -->
|
||||
|
||||
Only Cloud builds of the Airbyte UI will render this content.
|
||||
|
||||
<!-- /env:cloud -->
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
Here's an example where the configuration steps are different in Cloud and Self-Managed. In this case, you want to render everything on https://docs.airbyte.com, but you want in-app content to reflect only the environment the user is running.
|
||||
|
||||
```markdown title="connector-doc.md"
|
||||
# My connector
|
||||
|
||||
This content is rendered everywhere.
|
||||
|
||||
<!-- env:oss -->
|
||||
|
||||
<HideInUI>
|
||||
|
||||
## For open source:
|
||||
|
||||
</HideInUI>
|
||||
|
||||
Only self-managed builds of the Airbyte UI will render this content.
|
||||
|
||||
<!-- /env:oss -->
|
||||
|
||||
<!-- env:cloud -->
|
||||
<HideInUI>
|
||||
|
||||
## For Airbyte Cloud:
|
||||
|
||||
</HideInUI>
|
||||
|
||||
Only Cloud builds of the Airbyte UI will render this content.
|
||||
|
||||
<!-- /env:cloud -->
|
||||
```
|
||||
|
||||
### Testing your content
To test in-app content in [a local Airbyte build](https://docs.airbyte.com/contributing-to-airbyte/developing-locally/#develop-on-airbyte-webapp), make sure the `airbyte` git repository is checked out to the same branch and located in the same directory as the Airbyte platform repository. Development builds fetch connector documentation from your local filesystem, so you can edit connector docs and view the rendered output in Airbyte.

To test https://docs.airbyte.com content, [build Docusaurus locally](../contributing-to-airbyte/writing-docs.md#set-up-your-environment).
## Map the UI to associated content
Sometimes a field requires more explanation than can be provided in a tooltip. In these cases, use the `<FieldAnchor>` tag to link documentation to a specific UI component.

When a user selects that field in the UI, the in-app documentation panel automatically scrolls to the related documentation, highlighting all content contained inside the `<FieldAnchor></FieldAnchor>` tag.

The `FieldAnchor` syntax accepts a modified version of `jsonpath`, without the conventional `$.` prefix. It looks like this:

```md title="example-a.md"
## Configuring Widgets

<FieldAnchor field="widget_option">

...config-related instructions here...

</FieldAnchor>
```

Taking a more complex example, you can access deeper-nested fields using `jsonpath` expression syntax:

```md title="example-b.md"
## Configuring Unstructured Streams

<FieldAnchor field="streams.0.format[unstructured],streams.1.format[unstructured],streams.2.format[unstructured]">

...config-related instructions here...

</FieldAnchor>
```

:::note
The `FieldAnchor` tag only affects in-app content for sources and destinations. It has no effect on https://docs.airbyte.com or any platform content.
:::

How it works:

- There must be blank lines between a custom tag like `FieldAnchor` and the content it wraps.
- The `field` attribute must be a valid `jsonpath` expression to one of the properties nested under `connectionSpecification.properties` in that connector's `spec.json` or `spec.yaml` file. For example, if the connector spec contains `connectionSpecification.properties.replication_method.replication_slot`, you would mark the start of the related documentation section with `<FieldAnchor field="replication_method.replication_slot">` and its end with `</FieldAnchor>`.
- Highlight the same section for multiple fields by separating them with commas, like this: `<FieldAnchor field="replication_method.replication_slot,replication_method.queue_size">`.
- To highlight a section after the user picks an option from a `oneOf`, use a `field` prop like `path.to.field[value-of-selection-key]`, where `value-of-selection-key` is the value of a `const` field nested inside that `oneOf`.

For example, if the specification of the `oneOf` field is:

```json
"replication_method": {
  "type": "object",
  "title": "Update Method",
  "oneOf": [
    {
      "title": "Read Changes using Binary Log (CDC)",
      "required": ["method"],
      "properties": {
        "method": {
          "type": "string",
          <!-- highlight-next-line -->
          "const": "CDC",
          "order": 0
        },
        "initial_waiting_seconds": {
          "type": "integer",
          "title": "Initial Waiting Time in Seconds (Advanced)"
        }
      }
    },
    {
      "title": "Scan Changes with User Defined Cursor",
      "required": ["method"],
      "properties": {
        "method": {
          "type": "string",
          <!-- highlight-next-line -->
          "const": "STANDARD",
          "order": 0
        }
      }
    }
  ]
}
```

The selection keys are `CDC` and `STANDARD`. Wrap a specific replication method's documentation section with a `<FieldAnchor field="replication_method[CDC]">...</FieldAnchor>` tag to highlight it if the user selects CDC replication in the UI.
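A corresponding documentation layout might look like the following sketch, where the headings and placeholder text are illustrative rather than taken from a real connector doc:

```md title="example-c.md"
## Read Changes using Binary Log (CDC)

<FieldAnchor field="replication_method[CDC]">

...instructions that apply only when CDC is selected...

</FieldAnchor>

## Scan Changes with User Defined Cursor

<FieldAnchor field="replication_method[STANDARD]">

...instructions that apply only when the user-defined cursor is selected...

</FieldAnchor>
```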
### Documenting PyAirbyte usage
PyAirbyte is a Python library that allows you to run syncs within a Python script for a subset of Airbyte's connectors. Documentation for PyAirbyte connectors is automatically generated from the connector's JSON schema spec. There are a few approaches to combine full control over the documentation with automatic generation for common cases:

- If a connector:

  1. Is PyAirbyte-enabled (`remoteRegistries.pypi.enabled` is set in the connector's `metadata.yaml` file), and
  2. Has no second-level heading `Usage with PyAirbyte` in its documentation,

  then the documentation is automatically generated and placed above the `Changelog` section.

- Manually specifying a `Usage with PyAirbyte` section disables this automatic generation. The following is a good starting point for this section:

```md
<HideInUI>

## Usage with PyAirbyte

<PyAirbyteExample connector="source-google-sheets" />

<SpecSchema connector="source-google-sheets" />

</HideInUI>
```
The `PyAirbyteExample` component generates a code example that can be run with PyAirbyte, including an auto-generated sample configuration based on the configuration schema. The `SpecSchema` component generates a reference table with the connector's JSON schema spec, like a non-interactive version of the connector form in the UI. It can be used on any docs page.
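For instance, to embed just the spec reference table on another docs page, a minimal sketch (reusing the connector name from the example above) is:

```md
<SpecSchema connector="source-google-sheets" />
```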