162 lines
5.3 KiB
Markdown
162 lines
5.3 KiB
Markdown
# Retrieving Records Spread Across Partitions
|
|
|
|
In some cases, the data you are replicating is spread across multiple partitions. You can specify a set of parameters to be iterated over and used while requesting all of your data. On each iteration, using the current element being iterated upon, the connector will perform a cycle of requesting data from your source.
|
|
|
|
`PartitionRouter`s gives you the ability to specify either a static or dynamic set of elements that will be iterated over one at a time. This in turn is used to route requests to a partition of your data according to the elements iterated over.
|
|
|
|
The most common use case for the `PartitionRouter` component is the retrieval of data from an API endpoint that requires extra request inputs to indicate which partition of data to fetch.
|
|
|
|
Schema:
|
|
|
|
```yaml
|
|
partition_router:
|
|
default: []
|
|
anyOf:
|
|
- "$ref": "#/definitions/CustomPartitionRouter"
|
|
- "$ref": "#/definitions/ListPartitionRouter"
|
|
- "$ref": "#/definitions/SubstreamPartitionRouter"
|
|
- type: array
|
|
items:
|
|
anyOf:
|
|
- "$ref": "#/definitions/CustomPartitionRouter"
|
|
- "$ref": "#/definitions/ListPartitionRouter"
|
|
- "$ref": "#/definitions/SubstreamPartitionRouter"
|
|
```
|
|
|
|
Notice that you can specify one or more `PartitionRouter`s on a Retriever. When multiple are defined, the result will be Cartesian product of all partitions and a request cycle will be performed for each permutation.
|
|
|
|
## ListPartitionRouter
|
|
|
|
`ListPartitionRouter` iterates over values from a given list. It is defined by
|
|
|
|
- The partition values, which are the valid values for the cursor field
|
|
- The cursor field on a record
|
|
- request_option: optional request option to set on outgoing request parameters
|
|
|
|
Schema:
|
|
|
|
```yaml
|
|
ListPartitionRouter:
|
|
description: Partition router that is used to retrieve records that have been partitioned according to a list of values
|
|
type: object
|
|
required:
|
|
- type
|
|
- cursor_field
|
|
- slice_values
|
|
properties:
|
|
type:
|
|
type: string
|
|
enum: [ListPartitionRouter]
|
|
cursor_field:
|
|
type: string
|
|
partition_values:
|
|
anyOf:
|
|
- type: string
|
|
- type: array
|
|
items:
|
|
type: string
|
|
request_option:
|
|
"$ref": "#/definitions/RequestOption"
|
|
$parameters:
|
|
type: object
|
|
additionalProperties: true
|
|
```
|
|
|
|
As an example, this partition router will iterate over the 2 repositories ("airbyte" and "airbyte-secret") and will set a request_parameter on outgoing HTTP requests.
|
|
|
|
```yaml
|
|
partition_router:
|
|
type: ListPartitionRouter
|
|
values:
|
|
- "airbyte"
|
|
- "airbyte-secret"
|
|
cursor_field: "repository"
|
|
request_option:
|
|
type: RequestOption
|
|
field_name: "repository"
|
|
inject_into: "request_parameter"
|
|
```
|
|
|
|
## SubstreamPartitionRouter
|
|
|
|
Substreams are streams that depend on the records of another stream
|
|
|
|
We might for instance want to read all the commits for a given repository (parent stream).
|
|
|
|
Substreams are implemented by defining their partition router as a `SubstreamPartitionRouter`.
|
|
|
|
`SubstreamPartitionRouter` is used to route requests to fetch data that has been partitioned according to a parent stream's records . We might for instance want to read all the commits for a given repository (parent resource).
|
|
|
|
- what the parent stream is
|
|
- what is the key of the records in the parent stream
|
|
- what is the attribute on the parent record that is being used to partition the substream data
|
|
- how to specify that attribute on an outgoing HTTP request to retrieve that set of records
|
|
|
|
Schema:
|
|
|
|
```yaml
|
|
SubstreamPartitionRouter:
|
|
description: Partition router that is used to retrieve records that have been partitioned according to records from the specified parent streams
|
|
type: object
|
|
required:
|
|
- type
|
|
- parent_stream_configs
|
|
properties:
|
|
type:
|
|
type: string
|
|
enum: [SubstreamPartitionRouter]
|
|
parent_stream_configs:
|
|
type: array
|
|
items:
|
|
"$ref": "#/definitions/ParentStreamConfig"
|
|
$parameters:
|
|
type: object
|
|
additionalProperties: true
|
|
```
|
|
|
|
Example:
|
|
|
|
```yaml
|
|
partition_router:
|
|
type: SubstreamPartitionRouter
|
|
parent_streams_configs:
|
|
- stream: "#/repositories_stream"
|
|
parent_key: "id"
|
|
partition_field: "repository"
|
|
request_option:
|
|
type: RequestOption
|
|
field_name: "repository"
|
|
inject_into: "request_parameter"
|
|
```
|
|
|
|
REST APIs often nest sub-resources in the URL path.
|
|
If the URL to fetch commits was "/repositories/:id/commits", then the `Requester`'s path would need to refer to the stream slice's value and no `request_option` would be set:
|
|
|
|
Example:
|
|
|
|
```yaml
|
|
retriever:
|
|
<...>
|
|
requester:
|
|
<...>
|
|
path: "/respositories/{{ stream_slice.repository }}/commits"
|
|
partition_router:
|
|
type: SubstreamPartitionRouter
|
|
parent_streams_configs:
|
|
- stream: "#/repositories_stream"
|
|
parent_key: "id"
|
|
partition_field: "repository"
|
|
incremental_dependency: true
|
|
```
|
|
|
|
## Nested streams
|
|
|
|
Nested streams, subresources, or streams that depend on other streams can be implemented using a [`SubstreamPartitionRouter`](#SubstreamPartitionRouter)
|
|
|
|
## More readings
|
|
|
|
- [Incremental streams](../../cdk-python/incremental-stream.md)
|
|
- [Stream slices](../../cdk-python/stream-slices.md)
|
|
|
|
[^1] This is a slight oversimplification. See [update cursor section](#cursor-update) for more details on how the cursor is updated.
|