
Google Sheets Destination flushes on every state message and flushes when the buffer gets too large (#14751)

* Google Sheets Destination flushes on every state message

* add PR number to readme

* suggestions

* reverted clean change

* added memory allocation check for records_buffer(stream)

* Update airbyte-integrations/connectors/destination-google-sheets/destination_google_sheets/writer.py

Co-authored-by: Sherif A. Nada <snadalive@gmail.com>

* auto-bump connector version

Co-authored-by: Oleksandr Bazarnov <oleksandr.bazarnov@globallogic.com>
Co-authored-by: Sherif A. Nada <snadalive@gmail.com>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Evan Tahler committed on 2022-07-15 16:15:39 -07:00 (committed by GitHub)
parent 540210763c
commit b0f559dfca
8 changed files with 44 additions and 35 deletions

View File

@@ -303,7 +303,7 @@
 - name: Google Sheets
   destinationDefinitionId: a4cbd2d1-8dbe-4818-b8bc-b90ad782d12a
   dockerRepository: airbyte/destination-google-sheets
-  dockerImageTag: 0.1.0
+  dockerImageTag: 0.1.1
   documentationUrl: https://docs.airbyte.io/integrations/destinations/google-sheets
   icon: google-sheets.svg
   releaseStage: alpha

View File

@@ -5063,7 +5063,7 @@
   supported_destination_sync_modes:
   - "overwrite"
   - "append"
-- dockerImage: "airbyte/destination-google-sheets:0.1.0"
+- dockerImage: "airbyte/destination-google-sheets:0.1.1"
   spec:
     documentationUrl: "https://docs.airbyte.io/integrations/destinations/google-sheets"
     connectionSpecification:

View File

@@ -13,5 +13,5 @@ RUN pip install .
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
 
-LABEL io.airbyte.version=0.1.0
+LABEL io.airbyte.version=0.1.1
 LABEL io.airbyte.name=airbyte/destination-google-sheets

View File

@@ -13,8 +13,9 @@ class WriteBufferMixin:
     # Default instance of AirbyteLogger
     logger = AirbyteLogger()
-    # interval after which the records_buffer should be cleaned up for selected stream
-    flush_interval = 1000
+    # intervals after which the records_buffer should be cleaned up for the selected stream
+    flush_interval = 500  # records count
+    flush_interval_size_in_kb = 10**8 / 1024  # 10**8 bytes is ~97,656 KB (~95 MB)
 
     def __init__(self):
         # Buffer for input records

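The two class attributes above set a dual flush policy: flush once a fixed number of records is buffered, or earlier once the buffer grows too large in memory. Note that in Python `10 ^ 8` is bitwise XOR (which evaluates to 2), so the byte threshold must be spelled `10**8`. A minimal sketch of how such a policy behaves; the names and the per-record size estimate here are illustrative assumptions, not the connector's actual accounting:

```python
# Sketch only: a hypothetical dual-threshold flush policy.
FLUSH_INTERVAL = 500             # flush after this many buffered records
FLUSH_SIZE_IN_KB = 10**8 / 1024  # 10**8 bytes is ~97,656 KB (~95 MB)

def should_flush(buffered_records: list) -> bool:
    """True when either the record-count or the size threshold is reached."""
    # rough size estimate via serialized length (an assumption for this sketch)
    size_in_kb = sum(len(str(r)) for r in buffered_records) / 1024
    return len(buffered_records) >= FLUSH_INTERVAL or size_in_kb > FLUSH_SIZE_IN_KB
```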
View File

@@ -68,6 +68,8 @@ class DestinationGoogleSheets(Destination):
                 writer.add_to_buffer(record.stream, record.data)
                 writer.queue_write_operation(record.stream)
             elif message.type == Type.STATE:
+                # yielding a state message indicates that all preceding records have been persisted to the destination
+                writer.write_whats_left()
                 yield message
             else:
                 continue

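This hunk is the core of the change: in the Airbyte protocol a state message is a checkpoint, so the writer must persist everything buffered so far before the state message is passed along. A minimal sketch of that contract, with simplified stand-in types in place of the real protocol classes and `write_whats_left`:

```python
from typing import Iterable, Iterator

# Simplified stand-ins for the Airbyte protocol message and the connector's writer.
class Message:
    def __init__(self, type_: str, stream: str = "", data: dict | None = None):
        self.type, self.stream, self.data = type_, stream, data

class Writer:
    def __init__(self) -> None:
        self.buffer: list[tuple[str, dict]] = []

    def add(self, stream: str, data: dict) -> None:
        self.buffer.append((stream, data))

    def flush(self) -> None:
        # persist every buffered record to the destination, then reset the buffer
        self.buffer.clear()

def write(writer: Writer, messages: Iterable[Message]) -> Iterator[Message]:
    for message in messages:
        if message.type == "RECORD":
            writer.add(message.stream, message.data)
        elif message.type == "STATE":
            writer.flush()   # all preceding records are persisted first...
            yield message    # ...only then is the checkpoint acknowledged
```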
View File

@@ -34,12 +34,14 @@ class GoogleSheetsWriter(WriteBufferMixin):
         """
         Mimics `batch_write` operation using records_buffer.
 
-        1) gets data from the records_buffer
+        1) gets data from the records_buffer, with respect to the size of the records_buffer (records count or size in KB)
         2) writes it to the target worksheet
         3) cleans up the records_buffer belonging to the input stream
         """
-        if len(self.records_buffer[stream_name]) == self.flush_interval:
+        # get the size of the records_buffer for the target stream in KB
+        # TODO: unit test the flush triggers
+        records_buffer_size_in_kb = self.records_buffer[stream_name].__sizeof__() / 1024
+        if len(self.records_buffer[stream_name]) == self.flush_interval or records_buffer_size_in_kb > self.flush_interval_size_in_kb:
             self.write_from_queue(stream_name)
             self.clear_buffer(stream_name)

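One caveat on the size check above: `list.__sizeof__()` measures only the list object itself, essentially its array of element pointers, not the records it references, so the computed value understates the buffer's real memory footprint. A short standalone illustration; per-element `sys.getsizeof` gives a larger, though still partial, estimate:

```python
import sys

records = [{"id": i, "payload": "x" * 1000} for i in range(500)]

# list.__sizeof__() counts only the list's own allocation (~8 bytes per slot):
shallow_kb = records.__sizeof__() / 1024

# adding each element's container gives a larger but still partial estimate;
# nested values such as the payload strings are still not counted:
deeper_kb = (sys.getsizeof(records) + sum(sys.getsizeof(r) for r in records)) / 1024

print(f"shallow: {shallow_kb:.1f} KB, deeper: {deeper_kb:.1f} KB")
```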
View File

@@ -1,13 +1,12 @@
 # Google Sheets
 
-The Google Sheets Destination is configured to push data to a single Google Sheets spreadsheet with multiple Worksheets as streams. To replicate data to multiple spreadsheets, you can create multiple instances of the Google Sheets Destination in your Airbyte instance.
+The Google Sheets Destination is configured to push data to a single Google Sheets spreadsheet with multiple Worksheets as streams. To replicate data to multiple spreadsheets, you can create multiple instances of the Google Sheets Destination in your Airbyte instance.
 
 This page guides you through the process of setting up the Google Sheets destination connector.
 
 ## Prerequisites
 
-* Google Account
-* Google Spreadsheet URL
+- Google Account
+- Google Spreadsheet URL
 
 ## Step 1: Set up Google Sheets
@@ -19,9 +18,9 @@ Visit the [Google Support](https://support.google.com/accounts/answer/27441?hl=e
 ### Google Sheets (Google Spreadsheets)
 
-1. Once you acquire your Google Account, open the [Google Support](https://support.google.com/docs/answer/6000292?hl=en&co=GENIE.Platform%3DDesktop) guide to create a fresh, empty Google Sheet to be used as the destination for your data replication, or if you already have one, follow the next step.
+1. Once you acquire your Google Account, open the [Google Support](https://support.google.com/docs/answer/6000292?hl=en&co=GENIE.Platform%3DDesktop) guide to create a fresh, empty Google Sheet to be used as the destination for your data replication, or if you already have one, follow the next step.
 2. You will need the link of the Spreadsheet you'd like to sync. To get it, click the Share button in the top-right corner of the Google Sheets interface, then click Copy Link in the dialog that pops up.
 
-These two steps are highlighted in the screenshot below:
+These two steps are highlighted in the screenshot below:
 
 ![](../../.gitbook/assets/google_spreadsheet_url.png)
@@ -29,9 +28,9 @@ These two steps are highlighted in the screenshot below:
 **For Airbyte Cloud:**
 
-1. [Log into your Airbyte Cloud](https://cloud.airbyte.io/workspaces) account.
-2. In the left navigation bar, click **Destinations**. In the top-right corner, click **+ new destination**.
-3. On the destination setup page, select **Google Sheets** from the Destination type dropdown and enter a name for this connector.
+1. [Log into your Airbyte Cloud](https://cloud.airbyte.io/workspaces) account.
+2. In the left navigation bar, click **Destinations**. In the top-right corner, click **+ new destination**.
+3. On the destination setup page, select **Google Sheets** from the Destination type dropdown and enter a name for this connector.
 4. Select `Sign in with Google`.
 5. Log in, authorize access to the Google account, and click `Set up destination`.
@@ -46,10 +45,13 @@ Each worksheet in the selected spreadsheet will be the output as a separate sour
 Airbyte only supports replicating `Grid Sheets`, which means only raw text data can be replicated to the target spreadsheet. See the [Google Sheets API docs](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/sheets#SheetType) for more info on all available sheet types.
 
 #### Note:
 
-* The output columns are ordered alphabetically. The output columns should not be reordered manually after the sync, as this could cause data corruption for all subsequent syncs.
-* The underlying process of record normalization is applied to avoid data corruption during the write process. This handles two scenarios:
+- The output columns are ordered alphabetically. The output columns should not be reordered manually after the sync, as this could cause data corruption for all subsequent syncs.
+- The underlying process of record normalization is applied to avoid data corruption during the write process. This handles two scenarios:
+
 1. UnderSetting - when a record has fewer keys (columns) than the catalog declares
 2. OverSetting - when a record has more keys (columns) than the catalog declares
 
 ```
 EXAMPLE:
@@ -86,17 +88,17 @@ EXAMPLE:
 ### Data type mapping
 
 | Integration Type | Airbyte Type |
-| :--- | :--- |
-| Any Type | `string` |
+| :--------------- | :----------- |
+| Any Type         | `string`     |
 
 ### Features & Supported sync modes
 
-| Feature | Supported?\(Yes/No\) |
-| :--- | :--- |
-| Ful-Refresh Overwrite | Yes |
-| Ful-Refresh Append | Yes |
-| Incremental Append | Yes |
-| Incremental Append-Deduplicate | Yes |
+| Feature                        | Supported?\(Yes/No\) |
+| :----------------------------- | :------------------- |
+| Full Refresh Overwrite         | Yes                  |
+| Full Refresh Append            | Yes                  |
+| Incremental Append             | Yes                  |
+| Incremental Append-Deduplicate | Yes                  |
 
 ### Rate Limiting & Performance Considerations
@@ -106,25 +108,27 @@ Please be aware of the [Google Spreadsheet limitations](#limitations) before you
 ### <a name="limitations"></a>Google Sheets Limitations
 
 During the upload process, and from the data storage perspective, there are some limitations that should be considered beforehand:
 
-* **Maximum of 5 Million Cells**
+- **Maximum of 5 Million Cells**
 
 A Google Sheets document can have a maximum of 5 million cells. These can be in a single worksheet or in multiple sheets.
 If the 5 million cell limit is already reached with fewer columns, you will not be able to add more columns (and vice versa: if the limit is reached with a certain number of rows, no more rows can be added).
 
-* **Maximum of 18,278 Columns**
+- **Maximum of 18,278 Columns**
 
 At most, you can have 18,278 columns per worksheet in Google Sheets.
 
-* **Up to 200 Worksheets in a Spreadsheet**
+- **Up to 200 Worksheets in a Spreadsheet**
 
 You cannot create more than 200 worksheets within a single spreadsheet.
 
 #### Future improvements:
 
 - Handle multiple spreadsheets to split large amounts of data into parts once the main spreadsheet is full and cannot be extended further, due to the [limitations](#limitations) above.
 
 ## Changelog
 
-| Version | Date       | Pull Request                                             | Subject         |
-|---------|------------|----------------------------------------------------------|-----------------|
-| 0.1.0   | 2022-04-26 | [12135](https://github.com/airbytehq/airbyte/pull/12135) | Initial Release |
+| Version | Date       | Pull Request                                             | Subject                             |
+| ------- | ---------- | -------------------------------------------------------- | ----------------------------------- |
+| 0.1.1   | 2022-06-15 | [14751](https://github.com/airbytehq/airbyte/pull/14751) | Yield state only when records saved |
+| 0.1.0   | 2022-04-26 | [12135](https://github.com/airbytehq/airbyte/pull/12135) | Initial Release                     |

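The normalization note in the docs diff above (UnderSetting and OverSetting) amounts to aligning every record to the catalog's declared columns before writing a row. A minimal sketch of that idea, assuming alphabetically ordered output columns as the docs state; `normalize` is a hypothetical helper, not the connector's implementation:

```python
def normalize(record: dict, catalog_columns: list[str]) -> dict:
    """Align a record to the declared columns: fill missing keys, drop extras."""
    header = sorted(catalog_columns)  # output columns are ordered alphabetically
    return {col: record.get(col, "") for col in header}

# UnderSetting: the record has fewer keys than the catalog declares
print(normalize({"id": 1}, ["id", "name"]))              # {'id': 1, 'name': ''}
# OverSetting: the record has more keys than the catalog declares
print(normalize({"id": 1, "extra": 9}, ["id", "name"]))  # {'id': 1, 'name': ''}
```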
View File

@@ -170,7 +170,7 @@ def get_docker_label_to_connector_directory(base_directory: str, connector_modul
         # parse the dockerfile label if the dockerfile exists
         dockerfile_path = pathlib.Path(base_directory, connector, "Dockerfile")
         if os.path.isfile(dockerfile_path):
-            print(f"Reading f{dockerfile_path}")
+            print(f"Reading {dockerfile_path}")
             with open(dockerfile_path, "r") as file:
                 dockerfile_contents = file.read()
             label = parse_dockerfile_repository_label(dockerfile_contents)