1
0
mirror of synced 2025-12-29 09:03:46 -05:00

feat(airbyte-cdk): Add Per Partition with Global fallback Cursor (#45125)

This commit is contained in:
Anatolii Yatsuk
2024-10-28 14:03:41 +02:00
committed by GitHub
parent 917ddae846
commit fb769bd0a0
17 changed files with 1388 additions and 336 deletions

View File

@@ -175,22 +175,25 @@ incremental_sync:
Nested streams, subresources, or streams that depend on other streams can be implemented using a [`SubstreamPartitionRouter`](#SubstreamPartitionRouter)
The default state format is **per partition**, but there are options to enhance efficiency depending on your use case: **incremental_dependency** and **global_substream_cursor**. Here's when and how to use each option, with examples:
#### Per Partition (Default)
- **Description**: This is the default state format, where each partition has its own cursor.
- **Limitation**: The per partition state has a limit of 10,000 partitions. When this limit is exceeded, the oldest partitions are deleted. During the next sync, deleted partitions will be read in full refresh, which can be inefficient.
- **When to Use**: Use this option if the number of partitions is manageable (under 10,000).
The default state format is **per partition with fallback to global**, but there are options to enhance efficiency depending on your use case: **incremental_dependency** and **global_substream_cursor**. Here's when and how to use each option, with examples:
#### Per Partition with Fallback to Global (Default)
- **Description**: This is the default state format, where each partition has its own cursor. However, when the number of records in the parent sync exceeds two times the set limit, the cursor automatically falls back to a global state to manage efficiency and scalability.
- **Limitation**: The per partition state has a limit of 10,000 partitions. Once this limit is exceeded, the global cursor takes over, aggregating the state across partitions to avoid inefficiencies.
- **When to Use**: Use this as the default option for most cases. It provides the flexibility of managing partitions while preventing performance degradation when large numbers of records are involved.
- **Example State**:
```json
[
{ "partition": "A", "timestamp": "2024-08-01T00:00:00" },
{ "partition": "B", "timestamp": "2024-08-01T01:00:00" },
{ "partition": "C", "timestamp": "2024-08-01T02:00:00" }
]
{
"states": [
{"partition_key": "partition_1", "cursor_field": "2021-01-15"},
{"partition_key": "partition_2", "cursor_field": "2021-02-14"}
],
"state": {
"cursor_field": "2021-02-15"
},
"use_global_cursor": false
}
```
#### Incremental Dependency
- **Description**: This option allows the parent stream to be read incrementally, ensuring that only new data is synced.
- **Requirement**: The API must ensure that the parent record's cursor is updated whenever child records are added or updated. If this requirement is not met, child records added to older parent records will be lost.
@@ -201,7 +204,7 @@ The default state format is **per partition**, but there are options to enhance
"parent_state": {
"parent_stream": { "timestamp": "2024-08-01T00:00:00" }
},
"child_state": [
"states": [
{ "partition": "A", "timestamp": "2024-08-01T00:00:00" },
{ "partition": "B", "timestamp": "2024-08-01T01:00:00" }
]
@@ -209,9 +212,9 @@ The default state format is **per partition**, but there are options to enhance
```
#### Global Substream Cursor
- **Description**: This option uses a single global cursor for all partitions, significantly reducing the state size. It enforces a minimal lookback window for substream based on the duration of the previous sync to avoid losing records. This lookback ensures that any records added or updated during the sync are captured in subsequent syncs.
- **When to Use**: Use this option if the number of partitions in the parent stream is significantly higher than the 10,000 partition limit (e.g., millions of records per sync). This prevents the inefficiency of reading most partitions in full refresh and avoids duplicates during the next sync.
- **Operational Detail**: The global cursor's value is updated only at the end of the sync. If the sync fails, only the parent state is updated if the incremental dependency is enabled.
- **Description**: This option uses a single global cursor for all partitions, significantly reducing the state size. It enforces a minimal lookback window based on the previous sync's duration to avoid losing records added or updated during the sync. Since the global cursor is already part of the per partition with fallback to global approach, it should only be used cautiously for custom connectors with exceptionally large parent streams to avoid managing state per partition.
- **When to Use**: Use this option cautiously for custom connectors where the number of partitions in the parent stream is extremely high (e.g., millions of records per sync). The global cursor avoids the inefficiency of managing state per partition but sacrifices some granularity, which may not be suitable for every use case.
- **Operational Detail**: The global cursor is updated only at the end of the sync. If the sync fails, only the parent state is updated, provided that the incremental dependency is enabled. The global cursor ensures that records are captured through a lookback window, even if they were added during the sync.
- **Example State**:
```json
[
@@ -220,9 +223,10 @@ The default state format is **per partition**, but there are options to enhance
```
### Summary
- **Per Partition**: Default, use for manageable partitions (\<10k).
- **Incremental Dependency**: Use for incremental parent streams with a dependent child cursor. Ensure API updates parent cursor with child records.
- **Global Substream Cursor**: Ideal for large-scale parent streams with many partitions to optimize performance.
Summary
- **Per Partition with Fallback to Global (Default)**: Best for managing scalability and optimizing state size. Starts with per partition cursors, and automatically falls back to a global cursor if the number of records in the parent sync exceeds two times the partition limit.
- **Incremental Dependency**: Use for incremental parent streams with a dependent child cursor. Ensure the API updates the parent cursor when child records are added or updated.
- **Global Substream Cursor**: Use cautiously for custom connectors with very large parent streams. Avoids per partition state management but sacrifices some granularity.
Choose the option that best fits your data structure and sync requirements to optimize performance and data integrity.