
Google Sheets Destination flushes on every state message and flushes when the buffer gets too large (#14751)

* Google Sheets Destination flushes on every state message

* add PR number to readme

* suggestions

* reverted clean change

* added memory allocation check for records_buffer(stream)

* Update airbyte-integrations/connectors/destination-google-sheets/destination_google_sheets/writer.py

Co-authored-by: Sherif A. Nada <snadalive@gmail.com>

* auto-bump connector version

Co-authored-by: Oleksandr Bazarnov <oleksandr.bazarnov@globallogic.com>
Co-authored-by: Sherif A. Nada <snadalive@gmail.com>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Evan Tahler committed on 2022-07-15 16:15:39 -07:00 (committed by GitHub)
parent 540210763c
commit b0f559dfca
8 changed files with 44 additions and 35 deletions

View File

@@ -303,7 +303,7 @@
 - name: Google Sheets
   destinationDefinitionId: a4cbd2d1-8dbe-4818-b8bc-b90ad782d12a
   dockerRepository: airbyte/destination-google-sheets
-  dockerImageTag: 0.1.0
+  dockerImageTag: 0.1.1
   documentationUrl: https://docs.airbyte.io/integrations/destinations/google-sheets
   icon: google-sheets.svg
   releaseStage: alpha

View File

@@ -5063,7 +5063,7 @@
   supported_destination_sync_modes:
   - "overwrite"
   - "append"
-- dockerImage: "airbyte/destination-google-sheets:0.1.0"
+- dockerImage: "airbyte/destination-google-sheets:0.1.1"
   spec:
     documentationUrl: "https://docs.airbyte.io/integrations/destinations/google-sheets"
     connectionSpecification:

View File

@@ -13,5 +13,5 @@ RUN pip install .
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
 
-LABEL io.airbyte.version=0.1.0
+LABEL io.airbyte.version=0.1.1
 LABEL io.airbyte.name=airbyte/destination-google-sheets

View File

@@ -13,8 +13,9 @@ class WriteBufferMixin:
     # Default instance of AirbyteLogger
     logger = AirbyteLogger()
-    # interval after which the records_buffer should be cleaned up for selected stream
-    flush_interval = 1000
+    # intervals after which the records_buffer should be cleaned up for the selected stream
+    flush_interval = 500  # records count
+    flush_interval_size_in_kb = 10**8 / 1024  # 10**8 bytes is ~97,656 KB (~95 MB)
 
     def __init__(self):
         # Buffer for input records

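The two class attributes above set a dual flush policy: flush once a fixed number of records is buffered, or earlier once the buffer grows too large in memory. Note that in Python `10 ^ 8` is bitwise XOR (which evaluates to 2), so the byte threshold must be spelled `10**8`. A minimal sketch of how such a policy behaves; the names and the per-record size estimate here are illustrative assumptions, not the connector's actual accounting:

```python
# Sketch only: a hypothetical dual-threshold flush policy.
FLUSH_INTERVAL = 500             # flush after this many buffered records
FLUSH_SIZE_IN_KB = 10**8 / 1024  # 10**8 bytes is ~97,656 KB (~95 MB)

def should_flush(buffered_records: list) -> bool:
    """True when either the record-count or the size threshold is reached."""
    # rough size estimate via serialized length (an assumption for this sketch)
    size_in_kb = sum(len(str(r)) for r in buffered_records) / 1024
    return len(buffered_records) >= FLUSH_INTERVAL or size_in_kb > FLUSH_SIZE_IN_KB
```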
View File

@@ -68,6 +68,8 @@ class DestinationGoogleSheets(Destination):
                 writer.add_to_buffer(record.stream, record.data)
                 writer.queue_write_operation(record.stream)
             elif message.type == Type.STATE:
+                # yielding a state message indicates that all preceding records have been persisted to the destination
+                writer.write_whats_left()
                 yield message
             else:
                 continue

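This hunk is the core of the change: in the Airbyte protocol a state message is a checkpoint, so the writer must persist everything buffered so far before the state message is passed along. A minimal sketch of that contract, with simplified stand-in types in place of the real protocol classes and `write_whats_left`:

```python
from typing import Iterable, Iterator

# Simplified stand-ins for the Airbyte protocol message and the connector's writer.
class Message:
    def __init__(self, type_: str, stream: str = "", data: dict | None = None):
        self.type, self.stream, self.data = type_, stream, data

class Writer:
    def __init__(self) -> None:
        self.buffer: list[tuple[str, dict]] = []

    def add(self, stream: str, data: dict) -> None:
        self.buffer.append((stream, data))

    def flush(self) -> None:
        # persist every buffered record to the destination, then reset the buffer
        self.buffer.clear()

def write(writer: Writer, messages: Iterable[Message]) -> Iterator[Message]:
    for message in messages:
        if message.type == "RECORD":
            writer.add(message.stream, message.data)
        elif message.type == "STATE":
            writer.flush()   # all preceding records are persisted first...
            yield message    # ...only then is the checkpoint acknowledged
```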
View File

@@ -34,12 +34,14 @@ class GoogleSheetsWriter(WriteBufferMixin):
         """
         Mimics `batch_write` operation using records_buffer.
 
-        1) gets data from the records_buffer
+        1) gets data from the records_buffer, with respect to the size of the records_buffer (records count or size in KB)
         2) writes it to the target worksheet
         3) cleans up the records_buffer belonging to the input stream
         """
-        if len(self.records_buffer[stream_name]) == self.flush_interval:
+        # get the size of the records_buffer for the target stream in KB
+        # TODO: unit test the flush triggers
+        records_buffer_size_in_kb = self.records_buffer[stream_name].__sizeof__() / 1024
+        if len(self.records_buffer[stream_name]) == self.flush_interval or records_buffer_size_in_kb > self.flush_interval_size_in_kb:
             self.write_from_queue(stream_name)
             self.clear_buffer(stream_name)

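One caveat on the size check above: `list.__sizeof__()` measures only the list object itself, essentially its array of element pointers, not the records it references, so the computed value understates the buffer's real memory footprint. A short standalone illustration; per-element `sys.getsizeof` gives a larger, though still partial, estimate:

```python
import sys

records = [{"id": i, "payload": "x" * 1000} for i in range(500)]

# list.__sizeof__() counts only the list's own allocation (~8 bytes per slot):
shallow_kb = records.__sizeof__() / 1024

# adding each element's container gives a larger but still partial estimate;
# nested values such as the payload strings are still not counted:
deeper_kb = (sys.getsizeof(records) + sum(sys.getsizeof(r) for r in records)) / 1024

print(f"shallow: {shallow_kb:.1f} KB, deeper: {deeper_kb:.1f} KB")
```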
View File

@@ -1,13 +1,12 @@
 # Google Sheets
 
-The Google Sheets Destination is configured to push data to a single Google Sheets spreadsheet with multiple Worksheets as streams. To replicate data to multiple spreadsheets, you can create multiple instances of the Google Sheets Destination in your Airbyte instance.
+The Google Sheets Destination is configured to push data to a single Google Sheets spreadsheet with multiple Worksheets as streams. To replicate data to multiple spreadsheets, you can create multiple instances of the Google Sheets Destination in your Airbyte instance.
 
 This page guides you through the process of setting up the Google Sheets destination connector.
 
 ## Prerequisites
 
-* Google Account
-* Google Spreadsheet URL
+- Google Account
+- Google Spreadsheet URL
 
 ## Step 1: Set up Google Sheets
@@ -19,9 +18,9 @@ Visit the [Google Support](https://support.google.com/accounts/answer/27441?hl=e
 ### Google Sheets (Google Spreadsheets)
 
-1. Once you acquire your Google Account, open the [Google Support](https://support.google.com/docs/answer/6000292?hl=en&co=GENIE.Platform%3DDesktop) guide to create a fresh, empty Google Sheet to be used as the destination for your data replication, or if you already have one, follow the next step.
+1. Once you acquire your Google Account, open the [Google Support](https://support.google.com/docs/answer/6000292?hl=en&co=GENIE.Platform%3DDesktop) guide to create a fresh, empty Google Sheet to be used as the destination for your data replication, or if you already have one, follow the next step.
 2. You will need the link of the Spreadsheet you'd like to sync. To get it, click the Share button in the top-right corner of the Google Sheets interface, then click Copy Link in the dialog that pops up.
 
-These two steps are highlighted in the screenshot below:
+These two steps are highlighted in the screenshot below:
 
 ![](../../.gitbook/assets/google_spreadsheet_url.png)
@@ -29,9 +28,9 @@ These two steps are highlighted in the screenshot below:
 **For Airbyte Cloud:**
 
-1. [Log into your Airbyte Cloud](https://cloud.airbyte.io/workspaces) account.
-2. In the left navigation bar, click **Destinations**. In the top-right corner, click **+ new destination**.
-3. On the destination setup page, select **Google Sheets** from the Destination type dropdown and enter a name for this connector.
+1. [Log into your Airbyte Cloud](https://cloud.airbyte.io/workspaces) account.
+2. In the left navigation bar, click **Destinations**. In the top-right corner, click **+ new destination**.
+3. On the destination setup page, select **Google Sheets** from the Destination type dropdown and enter a name for this connector.
 4. Select `Sign in with Google`.
 5. Log in, authorize access to the Google account, and click `Set up destination`.
@@ -46,10 +45,13 @@ Each worksheet in the selected spreadsheet will be the output as a separate sour
 Airbyte only supports replicating `Grid Sheets`, which means only raw text data can be replicated to the target spreadsheet. See the [Google Sheets API docs](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/sheets#SheetType) for more info on all available sheet types.
 
 #### Note:
 
-* The output columns are ordered alphabetically. The output columns should not be reordered manually after the sync, as this could cause data corruption for all subsequent syncs.
-* The underlying process of record normalization is applied to avoid data corruption during the write process. This handles two scenarios:
+- The output columns are ordered alphabetically. The output columns should not be reordered manually after the sync, as this could cause data corruption for all subsequent syncs.
+- The underlying process of record normalization is applied to avoid data corruption during the write process. This handles two scenarios:
+
 1. UnderSetting - when a record has fewer keys (columns) than the catalog declares
 2. OverSetting - when a record has more keys (columns) than the catalog declares
 
 ```
 EXAMPLE:
@@ -86,17 +88,17 @@ EXAMPLE:
 ### Data type mapping
 
 | Integration Type | Airbyte Type |
-| :--- | :--- |
-| Any Type | `string` |
+| :--------------- | :----------- |
+| Any Type         | `string`     |
 
 ### Features & Supported sync modes
 
-| Feature | Supported?\(Yes/No\) |
-| :--- | :--- |
-| Ful-Refresh Overwrite | Yes |
-| Ful-Refresh Append | Yes |
-| Incremental Append | Yes |
-| Incremental Append-Deduplicate | Yes |
+| Feature                        | Supported?\(Yes/No\) |
+| :----------------------------- | :------------------- |
+| Full Refresh Overwrite         | Yes                  |
+| Full Refresh Append            | Yes                  |
+| Incremental Append             | Yes                  |
+| Incremental Append-Deduplicate | Yes                  |
 
 ### Rate Limiting & Performance Considerations
@@ -106,25 +108,27 @@ Please be aware of the [Google Spreadsheet limitations](#limitations) before you
 ### <a name="limitations"></a>Google Sheets Limitations
 
 During the upload process, and from the data storage perspective, there are some limitations that should be considered beforehand:
 
-* **Maximum of 5 Million Cells**
+- **Maximum of 5 Million Cells**
 
 A Google Sheets document can have a maximum of 5 million cells. These can be in a single worksheet or in multiple sheets.
 If the 5 million cell limit is already reached with fewer columns, you will not be able to add more columns (and vice versa: if the limit is reached with a certain number of rows, no more rows can be added).
 
-* **Maximum of 18,278 Columns**
+- **Maximum of 18,278 Columns**
 
 At most, you can have 18,278 columns per worksheet in Google Sheets.
 
-* **Up to 200 Worksheets in a Spreadsheet**
+- **Up to 200 Worksheets in a Spreadsheet**
 
 You cannot create more than 200 worksheets within a single spreadsheet.
 
 #### Future improvements:
 
 - Handle multiple spreadsheets to split large amounts of data into parts once the main spreadsheet is full and cannot be extended further, due to the [limitations](#limitations) above.
 
 ## Changelog
 
-| Version | Date       | Pull Request                                             | Subject         |
-|---------|------------|----------------------------------------------------------|-----------------|
-| 0.1.0   | 2022-04-26 | [12135](https://github.com/airbytehq/airbyte/pull/12135) | Initial Release |
+| Version | Date       | Pull Request                                             | Subject                             |
+| ------- | ---------- | -------------------------------------------------------- | ----------------------------------- |
+| 0.1.1   | 2022-06-15 | [14751](https://github.com/airbytehq/airbyte/pull/14751) | Yield state only when records saved |
+| 0.1.0   | 2022-04-26 | [12135](https://github.com/airbytehq/airbyte/pull/12135) | Initial Release                     |

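The normalization note in the docs diff above (UnderSetting and OverSetting) amounts to aligning every record to the catalog's declared columns before writing a row. A minimal sketch of that idea, assuming alphabetically ordered output columns as the docs state; `normalize` is a hypothetical helper, not the connector's implementation:

```python
def normalize(record: dict, catalog_columns: list[str]) -> dict:
    """Align a record to the declared columns: fill missing keys, drop extras."""
    header = sorted(catalog_columns)  # output columns are ordered alphabetically
    return {col: record.get(col, "") for col in header}

# UnderSetting: the record has fewer keys than the catalog declares
print(normalize({"id": 1}, ["id", "name"]))              # {'id': 1, 'name': ''}
# OverSetting: the record has more keys than the catalog declares
print(normalize({"id": 1, "extra": 9}, ["id", "name"]))  # {'id': 1, 'name': ''}
```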
View File

@@ -170,7 +170,7 @@ def get_docker_label_to_connector_directory(base_directory: str, connector_modul
         # parse the dockerfile label if the dockerfile exists
         dockerfile_path = pathlib.Path(base_directory, connector, "Dockerfile")
         if os.path.isfile(dockerfile_path):
-            print(f"Reading f{dockerfile_path}")
+            print(f"Reading {dockerfile_path}")
             with open(dockerfile_path, "r") as file:
                 dockerfile_contents = file.read()
             label = parse_dockerfile_repository_label(dockerfile_contents)