[tools] prettier rules for .md + formatting cleanup
@@ -33,7 +33,7 @@ offers helpers specific for creating Airbyte source connectors for:

This document is a general introduction to the CDK. Readers should have basic familiarity with the
[Airbyte Specification](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol/) before proceeding.

If you have any issues with troubleshooting or want to learn more about the CDK from the Airbyte team, head to
[the Connector Development section of our Airbyte Forum](https://github.com/airbytehq/airbyte/discussions) to
inquire further!

@@ -52,8 +52,8 @@ As the code examples show, the `AbstractSource` delegates to the set of `Stream`

A summary of what we've covered so far on how to use the Airbyte CDK:

- A concrete implementation of the `AbstractSource` object is required.
- This involves:
  1. Implementing the `check_connection` function.
  2. Creating the appropriate `Stream` classes and returning them in the `streams` function.
  3. Placing the above-mentioned `spec.yaml` file in the right place.

A minimal sketch of such a source is shown below.

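As a quick illustration, a minimal sketch of such a source might look like the following (the `Customers` stream, its stub data, and the `api_key` config field are hypothetical, not part of any real connector):

```python
from typing import Any, Iterable, List, Mapping, Optional, Tuple

from airbyte_cdk.sources import AbstractSource
from airbyte_cdk.sources.streams import Stream


class Customers(Stream):
    """Hypothetical stream returning stub data, standing in for a real implementation."""

    primary_key = "id"

    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
        yield {"id": 1, "name": "Example customer"}


class ExampleSource(AbstractSource):
    def check_connection(self, logger, config: Mapping[str, Any]) -> Tuple[bool, Optional[Any]]:
        # Validate the configuration described by spec.yaml before a sync starts.
        if not config.get("api_key"):
            return False, "api_key is required"
        return True, None

    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
        # Return every Stream this connector exposes.
        return [Customers()]
```
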
@@ -61,4 +61,3 @@ A summary of what we've covered so far on how to use the Airbyte CDK:

## HTTP Streams

We've covered how the `AbstractSource` works with the `Stream` interface in order to fulfill the Airbyte Specification. Although developers are welcome to implement their own object, the CDK saves developers the hassle of doing so in the case of HTTP APIs with the [`HTTPStream`](http-streams.md) object.

@@ -45,4 +45,4 @@ We highly recommend implementing Incremental when feasible. See the [incremental

Another alternative to Incremental and Full Refresh streams is [resumable full refresh](resumable-full-refresh-stream.md). This is a stream that uses API
endpoints that cannot reliably retrieve data in an incremental fashion. However, it can offer improved resilience
against errors by checkpointing the stream's page number or cursor.

@@ -2,10 +2,10 @@

The CDK offers base classes that greatly simplify writing HTTP API-based connectors. Some of the most useful features include helper functionality for:

- Authentication \(basic auth, OAuth2, or any custom auth method\)
- Pagination
- Handling rate limiting with static or dynamic backoff timing
- Caching

All these features have sane off-the-shelf defaults but are completely customizable depending on your use case. They can also be combined with other stream features described in the [full refresh streams](full-refresh-stream.md) and [incremental streams](incremental-stream.md) sections.

@@ -35,7 +35,7 @@ Using either authenticator is as simple as passing the created authenticator int
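For instance, a sketch of creating a token authenticator and handing it to a stream (the `api_key` config field and the `Customers` stream are assumptions for illustration):

```python
from airbyte_cdk.sources.streams.http.requests_native_auth import TokenAuthenticator

config = {"api_key": "..."}  # hypothetical connector configuration
authenticator = TokenAuthenticator(token=config["api_key"])

# The authenticator is passed into the HttpStream constructor, which then attaches the
# appropriate Authorization header to every outgoing request, e.g.:
# stream = Customers(authenticator=authenticator)
```
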
## Pagination

Most APIs, when facing a large call, tend to return the results in pages. The CDK accommodates paging via the `next_page_token` function. This function is meant to extract the next page "token" from the latest response. The contents of a "token" are completely up to the developer: it can be an ID, a page number, a partial URL, etc. The CDK will continue making requests as long as the `next_page_token` continues returning non-`None` results. This can then be used in the `request_params` and other methods in `HttpStream` to page through API responses. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/streams.py#L34) from the Stripe API.

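A minimal sketch of cursor-based pagination (the `next_cursor` field, endpoint, and response shape are assumptions, not any particular API):

```python
from typing import Any, Iterable, Mapping, MutableMapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class Customers(HttpStream):
    url_base = "https://api.example.com/v1/"
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "customers"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        # Stop paging once the API no longer returns a cursor.
        cursor = response.json().get("next_cursor")
        return {"cursor": cursor} if cursor else None

    def request_params(self, stream_state, stream_slice=None, next_page_token=None) -> MutableMapping[str, Any]:
        # Inject the token produced by next_page_token into the next request.
        return dict(next_page_token or {})

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json().get("data", [])
```
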
## Rate Limiting
@@ -50,7 +50,8 @@ Note that Airbyte will always attempt to make as many requests as possible and o

When implementing [stream slicing](incremental-stream.md#streamstream_slices) in an `HTTPStream`, each slice is equivalent to an HTTP request; the stream will make one request per element returned by the `stream_slices` function. The current slice being read is passed into every other method in `HttpStream`, e.g. `request_params`, `request_headers`, `path`, etc., to be injected into a request. This allows you to dynamically determine the output of the `request_params`, `path`, and other functions to read the input slice and return the appropriate value.

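A minimal sketch of this request-per-slice behaviour (the endpoint, hard-coded dates, and `date` parameter are assumptions):

```python
from typing import Any, Iterable, Mapping, MutableMapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class DailyReports(HttpStream):
    url_base = "https://api.example.com/v1/"
    primary_key = "date"

    def path(self, **kwargs) -> str:
        return "reports"

    def stream_slices(self, **kwargs) -> Iterable[Optional[Mapping[str, Any]]]:
        # One slice, and therefore one request, per day.
        for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
            yield {"date": day}

    def request_params(self, stream_state, stream_slice=None, next_page_token=None) -> MutableMapping[str, Any]:
        # The current slice is injected into the request as a query parameter.
        return {"date": stream_slice["date"]}

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json().get("data", [])
```
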
## Nested Streams & Caching

It's possible to cache data from a stream onto a temporary file on disk.

This is especially useful when dealing with streams that depend on the results of another stream, e.g. `/employees/{id}/details`. In this case, we can use caching to write the data of the parent stream to a file to use this data when the child stream synchronizes, rather than performing a full HTTP request again.

@@ -61,10 +62,12 @@ Caching can be enabled by overriding the `use_cache` property of the `HttpStream

The caching mechanism is related to parent streams. For child streams, there is an `HttpSubStream` class inheriting from `HttpStream` and overriding the `stream_slices` method that returns a generator of all parent entries.

To use caching in the parent/child relationship, perform the following steps:

1. Turn on parent stream caching by overriding the `use_cache` property.
2. Inherit the child stream class from the `HttpSubStream` class.

#### Example

```python
class Employees(HttpStream):
    ...
```

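Filling that example out, a minimal sketch of a cached parent stream and its child (the `example.com` endpoints and field names are assumptions):

```python
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream, HttpSubStream


class Employees(HttpStream):
    url_base = "https://api.example.com/v1/"
    primary_key = "id"
    use_cache = True  # step 1: cache parent responses on disk

    def path(self, **kwargs) -> str:
        return "employees"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json().get("data", [])


class EmployeeDetails(HttpSubStream):  # step 2: inherit from HttpSubStream
    url_base = "https://api.example.com/v1/"
    primary_key = "id"

    def path(self, stream_slice: Mapping[str, Any] = None, **kwargs) -> str:
        # HttpSubStream.stream_slices yields one slice per parent record, under the "parent" key.
        return f"employees/{stream_slice['parent']['id']}/details"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield response.json()


# The child is constructed with its parent instance, e.g. EmployeeDetails(parent=Employees())
```
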
@@ -4,10 +4,10 @@ An incremental Stream is a stream which reads data incrementally. That is, it on

Several new pieces are essential to understand how incrementality works with the CDK:

- `AirbyteStateMessage`
- cursor fields
- `IncrementalMixin`
- `Stream.get_updated_state` (deprecated)

as well as a few other optional concepts.

@@ -28,7 +28,7 @@ In the context of the CDK, setting the `Stream.cursor_field` property to any tru

This mixin class adds a `state` property with an abstract setter and getter.
The `state` attribute helps the CDK figure out the current state of sync at any moment (in contrast to the deprecated `Stream.get_updated_state` method).
The setter typically deserializes state saved by the CDK and initializes the internal state of the stream.
The getter should serialize the internal state of the stream.

```python
@property
def state(self) -> Mapping[str, Any]: ...
```

@@ -42,6 +42,7 @@ def state(self, value: Mapping[str, Any]):

The actual logic of updating state during reading is implemented somewhere else, usually as part of the `read_records` method, right after the latest record that matches the new state has been returned.
Therefore, the state represents the latest checkpoint successfully achieved, and all following records should match the next state after that one.

```python
def read_records(self, ...):
    ...
    yield record
    # checkpoint: the new state matches the record that was just emitted
    self.state = {self.cursor_field: record[self.cursor_field]}
```

@@ -56,6 +57,7 @@ def read_records(self, ...):

### `Stream.get_updated_state`

(deprecated since 1.48.0, see `IncrementalMixin`)

This function helps the stream keep track of the latest state by inspecting every record output by the stream \(as returned by the `Stream.read_records` method\) and comparing it against the most recent state object. This allows sync to resume from where the previous sync last stopped, regardless of success or failure. This function typically compares the state object's and the latest record's cursor field, picking the latest one.

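A typical implementation keeps whichever cursor value is more recent: the one already in the state object or the one on the latest record. A sketch, assuming a string-comparable `updated_at` cursor such as an ISO-8601 timestamp (the rest of the stream is elided):

```python
from typing import Any, Mapping, MutableMapping

from airbyte_cdk.sources.streams.http import HttpStream


class Orders(HttpStream):
    ...
    cursor_field = "updated_at"

    def get_updated_state(
        self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]
    ) -> Mapping[str, Any]:
        # Keep the greater of the saved cursor and the latest record's cursor.
        latest_cursor = latest_record.get(self.cursor_field, "")
        state_cursor = (current_stream_state or {}).get(self.cursor_field, "")
        return {self.cursor_field: max(latest_cursor, state_cursor)}
```
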
@@ -76,7 +78,7 @@ While this is very simple, **it requires that records are output in ascending or

Interval based checkpointing can be implemented by setting the `Stream.state_checkpoint_interval` property, e.g.:

```text
class MyAmazingStream(Stream):
    # Save the state every 100 records
    state_checkpoint_interval = 100
```

@@ -97,7 +99,6 @@ For a more in-depth description of stream slicing, see the [Stream Slices guide]

In summary, an incremental stream requires:

- the `cursor_field` property
- to be inherited from `IncrementalMixin` and state methods implemented
- Optionally, the `stream_slices` function

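Putting these pieces together, a minimal sketch of an incremental HTTP stream (the `updated_at` cursor, endpoint, and `updated_since` parameter are assumptions):

```python
from typing import Any, Iterable, Mapping, MutableMapping, Optional

import requests
from airbyte_cdk.sources.streams import IncrementalMixin
from airbyte_cdk.sources.streams.http import HttpStream


class Invoices(HttpStream, IncrementalMixin):
    url_base = "https://api.example.com/v1/"
    primary_key = "id"
    cursor_field = "updated_at"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._cursor_value = ""

    @property
    def state(self) -> Mapping[str, Any]:
        # Serialize the stream's internal state for the platform.
        return {self.cursor_field: self._cursor_value}

    @state.setter
    def state(self, value: Mapping[str, Any]):
        # Deserialize state saved by the CDK into internal state.
        self._cursor_value = value.get(self.cursor_field, "")

    def path(self, **kwargs) -> str:
        return "invoices"

    def request_params(self, stream_state, stream_slice=None, next_page_token=None) -> MutableMapping[str, Any]:
        # Only request records newer than the last checkpoint.
        return {"updated_since": self._cursor_value} if self._cursor_value else {}

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json().get("data", [])

    def read_records(self, *args, **kwargs) -> Iterable[Mapping[str, Any]]:
        for record in super().read_records(*args, **kwargs):
            yield record
            # Advance the checkpoint once the record has been emitted.
            self._cursor_value = max(self._cursor_value, record.get(self.cursor_field, ""))
```
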
@@ -56,4 +56,3 @@ class Pilot(Employee):

Generators are basically iterators over arbitrary source data. They are handy because their syntax is extremely concise and they feel just like any other list or collection when working with them in code.

If you see `yield` anywhere in the code -- that's a generator at work.

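For example, the following function is a generator that lazily produces records one at a time:

```python
def employees():
    # Values are produced lazily, one per iteration, instead of building a full list up front.
    for record in ({"id": 1}, {"id": 2}):
        yield record


for employee in employees():
    print(employee)
```
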
@@ -19,7 +19,7 @@ values used to checkpoint state in between resumable full refresh sync attempts

## Criteria for Resumable Full Refresh

:::warning
Resumable full refresh in the Python CDK does not currently support substreams. This work is currently in progress.
:::

@@ -42,7 +42,7 @@ is retried.

This mixin class adds a `state` property with an abstract setter and getter.
The `state` attribute helps the CDK figure out the current state of sync at any moment.
The setter typically deserializes state saved by the CDK and initializes the internal state of the stream.
The getter should serialize the internal state of the stream.

```python
@property
def state(self) -> Mapping[str, Any]: ...
```

@@ -88,5 +88,5 @@ in between sync attempts, but deleted at the beginning of new sync jobs.

In summary, a resumable full refresh stream requires:

- to be inherited from `StateMixin` and state methods implemented
- implementing `Stream.read_records()` to get the Stream's current state, request a single page of records, and update the Stream's state with the next page to fetch or `{}`.

@@ -16,7 +16,7 @@ Important note: any objects referenced via `$ref` should be placed in the `share

If you are implementing a connector to pull data from an API which publishes an [OpenAPI/Swagger spec](https://swagger.io/specification/), you can use a tool we've provided for generating JSON schemas from the OpenAPI definition file. Detailed information can be found [here](https://github.com/airbytehq/airbyte/tree/master/tools/openapi2jsonschema/).

### Generating schemas using the output of your connector's read command

We also provide a tool for generating schemas using a connector's `read` command output. Detailed information can be found [here](https://github.com/airbytehq/airbyte/tree/master/tools/schema_generator/).

@@ -43,7 +43,7 @@ def get_json_schema(self):

It is important to ensure output data conforms to the declared json schema. This is because the destination receiving this data to load into tables may strictly enforce schema \(e.g. when data is stored in a SQL database, you can't put CHAR type into INTEGER column\). In the case of changes to API output \(which is almost guaranteed to happen over time\) or a minor mistake in the jsonschema definition, data syncs could thus break because of mismatched datatype schemas.

To remain robust in operation, the CDK provides a transformation ability to perform automatic object mutation to align with the desired schema before outputting to the destination. All streams inherited from the `airbyte_cdk.sources.streams.core.Stream` class have this transform configuration available. It is _disabled_ by default and can be configured per stream within a source connector.

### Default type transformation
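For example, a stream can opt in to the built-in normalization like this (a sketch; the stream name is hypothetical and the rest of the class is elided):

```python
from airbyte_cdk.sources.streams.core import Stream
from airbyte_cdk.sources.utils.transform import TransformConfig, TypeTransformer


class Users(Stream):
    # Normalize every record against the declared json schema before it is emitted.
    transformer = TypeTransformer(TransformConfig.DefaultSchemaNormalization)
    ...
```
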
@@ -81,7 +81,7 @@ And objects inside array of referenced by $ref attribute.

If the value cannot be cast \(e.g. string "asdf" cannot be cast to integer\), the field would retain its original value. Schema type transformation supports any jsonschema types, nested objects/arrays and reference types. Types described as an array of more than one type \(except "null"\) and types under the oneOf/anyOf keyword won't be transformed.

_Note:_ This transformation is done by the source, not the stream itself. I.e. if you have overridden the "read_records" method in your stream it won't affect object transformation. All transformations are done in place by modifying the output object before passing it to the "get_updated_state" method, so "get_updated_state" would receive the transformed object.

### Custom schema type transformation

@@ -99,13 +99,13 @@ class MyStream(Stream):

```python
        return transformed_value
```

Where `original_value` is the initial field value and `field_schema` is the part of the jsonschema describing the field type. For the schema

```javascript
{"type": "object", "properties": {"value": {"type": "string", "format": "date-time"}}}
```

the `field_schema` variable would be equal to

```javascript
{"type": "string", "format": "date-time"}
```

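A sketch of how a stream can register such a custom transform (the string-truncation logic is purely illustrative):

```python
from typing import Any, Dict

from airbyte_cdk.sources.streams.core import Stream
from airbyte_cdk.sources.utils.transform import TransformConfig, TypeTransformer


class MyStream(Stream):
    transformer = TypeTransformer(TransformConfig.CustomSchemaNormalization)

    @transformer.registerCustomTransform
    def transform_function(original_value: Any, field_schema: Dict[str, Any]) -> Any:
        # Illustrative transform: truncate long strings so they fit the destination column.
        if field_schema.get("type") == "string" and isinstance(original_value, str):
            return original_value[:255]
        return original_value
```
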
@@ -145,7 +145,7 @@ class MyStream(Stream):

Transforming each object on the fly adds some time to each object's processing. This time depends on the object/schema complexity and on the hardware configuration.

There are some performance benchmarks we've done with the ads_insights Facebook schema \(it is a complex schema with objects nested inside arrays of objects and a lot of references\) and an example object. Here is the average transform time per single object, in seconds:

```text
regular transform:
```

@@ -162,4 +162,3 @@ just traverse/validate through json schema and object fields:
On my PC \(AMD Ryzen 7 5800X\) it took 0.8 milliseconds per object. As you can see, most of the time \(~75%\) is taken by the jsonschema traverse/validation routine and very little \(less than 10%\) by the actual conversion. Processing time can be reduced by skipping the jsonschema type checking, but then there would be no warnings about possible object/jsonschema inconsistency.

@@ -25,4 +25,3 @@ Slack is a chat platform for businesses. Collectively, a company can easily post

This is a great use case for stream slicing. The `messages` stream, which outputs one record per chat message, can slice records by time, e.g. hourly. It implements this by specifying the beginning and end timestamp of each hour that it wants to pull data from. Then, after all the records in a given hour \(i.e. slice\) have been read, the connector outputs a STATE message to indicate that state should be saved. This way, if the connector ever fails during a sync \(for example if the API goes down\), then, at most, it will re-read only one hour's worth of messages.

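A stand-alone sketch of the hourly-window idea (not the actual Slack connector code):

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, Mapping


def hourly_slices(start: datetime, end: datetime) -> Iterable[Mapping[str, Any]]:
    """Yield one {start, end} window per hour between `start` and `end`."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(hours=1), end)
        yield {"start": cursor.isoformat(), "end": window_end.isoformat()}
        cursor = window_end


# Example: three hourly slices
for stream_slice in hourly_slices(
    datetime(2024, 1, 1, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 3, tzinfo=timezone.utc),
):
    print(stream_slice)
```
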
See the implementation of the Slack connector [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py).