
S3

This page guides you through the process of setting up the S3 destination connector.

Prerequisites

  1. Allow connections from Airbyte to your AWS S3/Minio S3 cluster (if they exist in separate VPCs).

  2. Enforce encryption of data in transit.

  3. An S3 bucket with credentials, a Role ARN, or an instance profile with read/write permissions configured for the host (EC2, EKS).

    • These fields are always required:

      • S3 Bucket Name
      • S3 Bucket Path
      • S3 Bucket Region
    • If you are using STS Assume Role, you must provide:

      • Role ARN
    • If you are using AWS credentials, you must provide:

      • Access Key ID
      • Secret Access Key
    • If you are using an Instance Profile, you may omit the Access Key ID, Secret Access Key, and Role ARN.

Setup guide

Step 1: Set up S3

Sign in to your AWS account.

Prepare the S3 bucket that will be used as the destination; see this to create an S3 bucket.

NOTE: If the S3 cluster is not configured to use TLS, the connection to Amazon S3 silently reverts to an unencrypted connection. Airbyte recommends that all connections be configured to use TLS/SSL, in support of AWS's shared responsibility model.

Create a Bucket Policy

  1. Open the IAM console.
  2. In the IAM dashboard, select Policies, then click Create Policy.
  3. Select the JSON tab, then paste the following JSON into the Policy editor (be sure to substitute in your bucket name):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME/*",
        "arn:aws:s3:::YOUR_BUCKET_NAME"
      ]
    }
  ]
}

:::note At this time, object-level permissions alone are not sufficient to successfully authenticate the connection. Please ensure you include the bucket-level permissions as provided in the example above. :::

  4. Give your policy a descriptive name, then click Create policy.
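
If you prefer to script this step rather than use the console, the same policy can be created with the AWS SDK. Below is a minimal sketch using Python and boto3; the policy name is a placeholder, and YOUR_BUCKET_NAME must be substituted just as in the JSON above.

```python
import json

import boto3

BUCKET = "YOUR_BUCKET_NAME"  # placeholder: substitute your bucket name

# Same policy document as shown above.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObjectAcl",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
            ],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}/*",
                f"arn:aws:s3:::{BUCKET}",
            ],
        }
    ],
}

iam = boto3.client("iam")
response = iam.create_policy(
    PolicyName="airbyte-s3-destination-policy",  # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
print(response["Policy"]["Arn"])  # note the ARN so you can attach it later
```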

Authentication Option 1: Using an IAM Role (Most secure)

:::note S3 authentication using an IAM role must be enabled by a member of the Airbyte team. If you'd like to use this feature, please contact the Sales team for more information. :::

  1. In the IAM dashboard, click Roles, then Create role.
  2. Choose the appropriate trust entity and attach the policy you created.
  3. Set up a trust relationship for the role. For example, for an AWS account trusted entity, use the default AWS account on your instance (it will be used to assume the role). To use an External ID, set it as an environment variable with export AWS_ASSUME_ROLE_EXTERNAL_ID="{your-external-id}". Edit the trust relationship policy to reflect this:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::{your-aws-account-id}:user/{your-username}"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "{your-external-id}"
                }
            }
        }
    ]
}
  1. Choose the AWS account trusted entity type.
  2. Set up a trust relationship for the role. This allows the Airbyte instance's AWS account to assume this role. You will also need to specify an external ID, which is a secret key that the trusting service (Airbyte) and the trusted role (the role you're creating) both know. This ID is used to prevent the "confused deputy" problem. The External ID should be your Airbyte workspace ID, which can be found in the URL of your workspace page. Edit the trust relationship policy to include the external ID:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::094410056844:user/delegated_access_user"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "{your-airbyte-workspace-id}"
                }
            }
        }
    ]
}
  1. Complete the role creation and note the Role ARN.
  2. Select Attach policies directly, then find and check the box for your new policy. Click Next, then Add permissions.
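
For reference, role-based access works by exchanging the Role ARN and External ID for temporary credentials through STS. The following is a hedged sketch of that exchange using Python and boto3, which can help verify the trust policy outside of Airbyte; the role ARN, external ID, and bucket values are placeholders.

```python
import boto3

sts = boto3.client("sts")

# Exchange the role ARN + external ID for temporary credentials.
# Placeholders: use the Role ARN noted above and, for Airbyte Cloud,
# your workspace ID as the external ID.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/airbyte-s3-destination-role",
    RoleSessionName="airbyte-s3-check",
    ExternalId="{your-external-id}",
)["Credentials"]

# Use the temporary credentials to list the configured bucket path.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
result = s3.list_objects_v2(Bucket="YOUR_BUCKET_NAME", Prefix="airbyte-data/")
print(result.get("KeyCount", 0))

```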
Authentication Option 2: Using an IAM User

Use an existing Access Key ID and Secret Access Key, or create new ones.

  1. In the IAM dashboard, click Users. Select an existing IAM user or create a new one by clicking Add users.
  2. If you are using an existing IAM user, click the Add permissions dropdown menu and select Add permissions. If you are creating a new user, you will be taken to the Permissions screen after selecting a name.
  3. Select Attach policies directly, then find and check the box for your new policy. Click Next, then Add permissions.
  4. After successfully creating your user, select the Security credentials tab and click Create access key. You will be prompted to select a use case and add optional tags to your access key. Click Create access key to generate the keys.
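
Before entering the new access key into Airbyte, you may want to confirm it can read from and write to the bucket. A minimal verification sketch using Python and boto3 is shown below; the key values, bucket name, and test object key are placeholders.

```python
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
)

bucket = "YOUR_BUCKET_NAME"                     # placeholder
key = "airbyte-data/_airbyte_connection_check"  # hypothetical test object

# Exercise the permissions the policy above grants: put, get, list, delete.
s3.put_object(Bucket=bucket, Key=key, Body=b"check")
s3.get_object(Bucket=bucket, Key=key)
s3.list_objects_v2(Bucket=bucket, Prefix="airbyte-data/")
s3.delete_object(Bucket=bucket, Key=key)
print("read/write check passed")
```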

Step 2: Set up the S3 destination connector in Airbyte

For Airbyte Cloud:

  1. Log into your Airbyte Cloud account.
  2. In the left navigation bar, click Destinations. In the top-right corner, click + new destination.
  3. On the destination setup page, select S3 from the Destination type dropdown and enter a name for this connector.
  4. Configure fields:
    • Access Key Id
      • See this on how to generate an access key.
      • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the bucket.
    • Secret Access Key
      • Corresponding key to the above key id.
    • Role ARN
      • See this on how to create a role.
    • S3 Bucket Name
      • See this to create an S3 bucket.
    • S3 Bucket Path
      • Subdirectory under the bucket to sync the data into. Note: this defaults to airbyte-data.
    • S3 Bucket Region:
      • See here for all region codes.
    • S3 Path Format
      • Additional string format on how to store data under S3 Bucket Path. Default value is ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_.
    • S3 Endpoint
      • Leave empty if using AWS S3; fill in the S3 URL if using Minio S3.
    • S3 Filename pattern
      • The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please don't use spaces or unsupported placeholders, as they won't be recognized.
  5. Click Set up destination.

For Airbyte Open Source:

  1. Go to the local Airbyte page.

  2. In the left navigation bar, click Destinations. In the top-right corner, click + new destination.

  3. On the destination setup page, select S3 from the Destination type dropdown and enter a name for this connector.

  4. Configure fields:

    • Access Key Id
      • See this on how to generate an access key.
      • See this on how to create an instance profile.
      • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the staging bucket.
      • If the Access Key and Secret Access Key are not provided, the authentication will rely either on the Role ARN using STS Assume Role or on the instance profile.
    • Secret Access Key
      • Corresponding key to the above key id.
      • Make sure your S3 bucket is accessible from the machine running Airbyte.
      • This depends on your networking setup.
      • You can check the AWS S3 documentation, which includes a tutorial on how to properly configure access to your S3 bucket, here.
      • If you use instance profile authentication, make sure the role has permission to read/write on the bucket.
      • The easiest way to verify if Airbyte is able to connect to your S3 bucket is via the check connection tool in the UI.
    • S3 Bucket Name
      • See this to create an S3 bucket.
    • S3 Bucket Path
      • Subdirectory under the above bucket to sync the data into.
    • S3 Bucket Region
      • See here for all region codes.
    • S3 Path Format
      • Additional string format on how to store data under S3 Bucket Path. Default value is ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_.
    • S3 Endpoint
      • Leave empty if using AWS S3; fill in the S3 URL if using Minio S3.
    • S3 Filename pattern
      • The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}.
      • Please don't use spaces or unsupported placeholders, as they won't be recognized.
  5. Click Set up destination.

The full path of the output data with the default S3 Path Format ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_ is:

<bucket-name>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<epoch>_<partition-id>.<format-extension>

For example:

testing_bucket/data_output_path/public/users/2021_01_01_1234567890_0.csv.gz
↑              ↑                ↑      ↑     ↑          ↑          ↑ ↑
|              |                |      |     |          |          | format extension
|              |                |      |     |          |          unique incremental part id
|              |                |      |     |          milliseconds since epoch
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

The rationales behind this naming pattern are:

  1. Each stream has its own directory.
  2. The data output files can be sorted by upload time.
  3. The upload time is composed of a date part and a millis part so that it is both readable and unique.

However, it is possible to customize the path further by using the available variables to format the bucket path:

  • ${NAMESPACE}: Namespace where the stream comes from, or the namespace configured by the connection namespace fields.
  • ${STREAM_NAME}: Name of the stream.
  • ${YEAR}: Year in which the sync wrote the output data.
  • ${MONTH}: Month in which the sync wrote the output data.
  • ${DAY}: Day on which the sync wrote the output data.
  • ${HOUR}: Hour in which the sync wrote the output data.
  • ${MINUTE}: Minute in which the sync wrote the output data.
  • ${SECOND}: Second in which the sync wrote the output data.
  • ${MILLISECOND}: Millisecond in which the sync wrote the output data.
  • ${EPOCH}: Milliseconds since epoch at which the sync wrote the output data.
  • ${UUID}: A random UUID string.

Note:

  • Multiple / characters in the S3 path are collapsed into a single / character.
  • If the output bucket contains too many files, the part ID variable uses a UUID instead of a sequential ID.

Please note that the stream name may contain a prefix if one is configured on the connection. A data sync may create multiple files, as the output files can be partitioned by size (targeting a size of 200 MB compressed or lower).
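
To make the interpolation concrete, here is an illustrative sketch (plain Python, not connector code) of how the path format variables resolve into an object key prefix; the values are examples only:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

# Illustrative values; the connector supplies these per sync.
variables = {
    "NAMESPACE": "public",
    "STREAM_NAME": "users",
    "YEAR": f"{now.year:04d}",
    "MONTH": f"{now.month:02d}",
    "DAY": f"{now.day:02d}",
    "EPOCH": str(int(now.timestamp() * 1000)),  # milliseconds since epoch
}

path_format = "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_"

prefix = path_format
for name, value in variables.items():
    prefix = prefix.replace("${" + name + "}", value)

# Collapse any accidental double slashes, mirroring the note above.
while "//" in prefix:
    prefix = prefix.replace("//", "/")

print(prefix)  # e.g. public/users/2021_01_01_1234567890_
```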

Supported sync modes

| Feature | Support | Notes |
| :--- | :---: | :--- |
| Full Refresh - Overwrite Sync | | Warning: this mode deletes all previously synced data in the configured bucket path. |
| Incremental - Append Sync | | Warning: Airbyte provides at-least-once delivery. Depending on your source, you may see duplicated data. Learn more here. |
| Incremental - Append + Deduped | | |
| Namespaces | | Setting a specific bucket path is equivalent to having separate namespaces. |

The Airbyte S3 destination allows you to sync data to AWS S3 or Minio S3. Each stream is written to its own directory under the bucket.

⚠️ Please be aware that in "Full Refresh Overwrite Sync" mode, data from the same generation is retained while all previous data is deleted upon a successful sync. In case of failures between different generations, data from multiple generations may persist until a subsequent successful sync. Each S3 object is tagged with x-amz-meta-ab-generation-id to identify its generation. We recommend provisioning a dedicated S3 resource for this sync to avoid accidental data deletion due to misconfiguration. ⚠️
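
If you need to check which generation an object belongs to, the generation ID is exposed as S3 user metadata on each object. A small sketch using Python and boto3 (bucket and object key are placeholders; boto3 strips the x-amz-meta- prefix when returning metadata):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder object key; use one of the files written by the connector.
head = s3.head_object(
    Bucket="YOUR_BUCKET_NAME",
    Key="airbyte-data/public/users/2021_01_01_1234567890_0.csv.gz",
)

# x-amz-meta-ab-generation-id appears under Metadata with the prefix stripped.
print(head["Metadata"].get("ab-generation-id"))
```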

Supported Output schema

Each stream will be output to its own dedicated directory according to the configuration. The complete datastore of each stream includes all the output files under that directory. You can think of the directory as the equivalent of a table in the database world.

  • Under Full Refresh Sync mode, old output files will be purged before new files are created.
  • Under Incremental - Append Sync mode, new output files will be added that only contain the new data.

Avro

Apache Avro serializes data in a compact binary format. Currently, the Airbyte S3 Avro connector always uses the binary encoding, and assumes that all data records follow the same schema.

Configuration

Here are the available compression codecs:

  • No compression
  • deflate
    • Compression level
      • Range [0, 9]. Defaults to 0.
      • Level 0: no compression & fastest.
      • Level 9: best compression & slowest.
  • bzip2
  • xz
    • Compression level
      • Range [0, 9]. Defaults to 6.
      • Level 0-3 are fast with medium compression.
      • Level 4-6 are fairly slow with high compression.
      • Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use presets 7, 8, or 9, respectively.
  • zstandard
    • Compression level
      • Range [-5, 22]. Defaults to 3.
      • Negative levels are 'fast' modes akin to lz4 or snappy.
      • Levels above 9 are generally for archival purposes.
      • Levels above 18 use a lot of memory.
    • Include checksum
      • If set to true, a checksum will be included in each data block.
  • snappy
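
To spot-check Avro output locally, any Avro reader that supports these codecs will do. Below is a minimal sketch using the fastavro Python library, assuming a file has already been downloaded from the bucket (the filename is a placeholder; snappy and zstandard additionally require the python-snappy and zstandard packages):

```python
from fastavro import reader

# Placeholder local path to a file downloaded from the destination bucket.
with open("2021_01_01_1234567890_0.avro", "rb") as f:
    avro_reader = reader(f)
    # The Avro schema derived from the stream's JSON schema.
    print(avro_reader.writer_schema)
    for record in avro_reader:
        # Records include _airbyte_raw_id, _airbyte_extracted_at, etc.
        print(record)
        break
```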

Data schema

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then the JSON object is converted to an Avro record. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

CSV

Like most other Airbyte destination connectors, the output contains a set of Airbyte metadata columns alongside the data blob. With the CSV output, it is possible to normalize (flatten) the data blob into multiple columns.

| Column | Condition | Description |
| :--- | :--- | :--- |
| _airbyte_raw_id | Always exists. | A UUID assigned by Airbyte to each processed record. |
| _airbyte_extracted_at | Always exists. | A timestamp representing when the event was extracted from the data source. |
| _airbyte_generation_id | Always exists. | An integer ID that increases with each new refresh. |
| _airbyte_meta | Always exists. | A structured object containing metadata about the record. |
| _airbyte_data | When no normalization (flattening) is needed. | All data resides under this column as a JSON blob. |
| root level fields | When root level normalization (flattening) is selected. | The root level fields are expanded. |

The schema for _airbyte_meta is:

| Field Name | Type | Description |
| :--- | :--- | :--- |
| changes | list | A list of structured change objects. |
| sync_id | integer | An integer identifier for the sync job. |

The schema for a change object is:

| Field Name | Type | Description |
| :--- | :--- | :--- |
| field | string | The name of the field that changed. |
| change | string | The type of change (e.g., NULLED, TRUNCATED). |
| reason | string | The reason for the change, including its system of origin (i.e., whether it was a source, destination, or platform error). |

For example, given the following JSON object from a source:

{
  "user_id": 123,
  "name": {
    "first": "John",
    "last": "Doe"
  }
}

With no normalization, the output CSV is:

| _airbyte_raw_id | _airbyte_extracted_at | _airbyte_generation_id | _airbyte_meta | _airbyte_data |
| :--- | :--- | :--- | :--- | :--- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | 11 | {"changes":[], "sync_id": 10111 } | { "user_id": 123, "name": { "first": "John", "last": "Doe" } } |

With root level normalization, the output CSV is:

| _airbyte_raw_id | _airbyte_extracted_at | _airbyte_generation_id | _airbyte_meta | user_id | name.first | name.last |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | 11 | {"changes":[], "sync_id": 10111 } | 123 | John | Doe |
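
For illustration only (this is not the connector's implementation), the transformation shown above can be sketched in a few lines of Python: nested fields are expanded into dotted column names.

```python
def flatten(record, parent_key=""):
    """Flatten nested objects into dotted column names, as in the example above."""
    columns = {}
    for key, value in record.items():
        column = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            columns.update(flatten(value, column))
        else:
            columns[column] = value
    return columns

source_record = {"user_id": 123, "name": {"first": "John", "last": "Doe"}}
print(flatten(source_record))
# {'user_id': 123, 'name.first': 'John', 'name.last': 'Doe'}
```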

Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .csv.gz).

JSON Lines (JSONL)

JSON Lines is a text format with one JSON object per line. Each line has the following structure:

{
  "_airbyte_raw_id": "<uuid>",
  "_airbyte_extracted_at": "<timestamp>",
  "_airbyte_generation_id": "<generation-id>",
  "_airbyte_meta": "<json-meta>",
  "_airbyte_data": "<json-data-from-source>"
}

For example, given the following two JSON objects from a source:

[
  {
    "user_id": 123,
    "name": {
      "first": "John",
      "last": "Doe"
    }
  },
  {
    "user_id": 456,
    "name": {
      "first": "Jane",
      "last": "Roe"
    }
  }
]

They will look like this in the output file:

{ "_airbyte_raw_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_extracted_at": "1622135805000", "_airbyte_generation_id": "11", "_airbyte_meta": { "changes": [], "sync_id": 10111 }, "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_extracted_at": "1631948170000", "_airbyte_generation_id": "12", "_airbyte_meta": { "changes": [], "sync_id": 10112 }, "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }

Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .jsonl.gz).
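
To inspect a compressed JSONL output file locally, the Python standard library is sufficient. A minimal sketch, assuming the file has been downloaded from the bucket (the filename is a placeholder):

```python
import gzip
import json

# Placeholder local path to a downloaded output file.
with gzip.open("2021_01_01_1234567890_0.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["_airbyte_raw_id"], record["_airbyte_data"])
```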

Parquet

Configuration

The following parameters are available to configure the Parquet output:

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| compression_codec | enum | UNCOMPRESSED | Compression algorithm. Available candidates are: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. |
| block_size_mb | integer | 128 (MB) | Block size (row group size) in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
| max_padding_size_mb | integer | 8 (MB) | Max padding size in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
| page_size_kb | integer | 1024 (KB) | Page size in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
| dictionary_page_size_kb | integer | 1024 (KB) | Dictionary page size in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionaries. |
| dictionary_encoding | boolean | true | Dictionary encoding. This parameter controls whether dictionary encoding is turned on. |

These parameters are related to the ParquetOutputFormat. See the Java doc for more details. Also see Parquet documentation for their recommended configurations (512 - 1024 MB block size, 8 KB page size).
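
The block size and page size parameters map onto Parquet row groups and pages, which can be inspected on a produced file. Here is a hedged sketch using the pyarrow Python library, assuming the file has been downloaded locally (the filename is a placeholder):

```python
import pyarrow.parquet as pq

# Placeholder local path to a downloaded output file.
pf = pq.ParquetFile("2021_01_01_1234567890_0.parquet")

meta = pf.metadata
print("row groups:", meta.num_row_groups)
print("rows:", meta.num_rows)
# Compression codec of the first column chunk in the first row group.
print("compression:", meta.row_group(0).column(0).compression)

# Read the data itself, including the _airbyte_* columns.
table = pf.read()
print(table.column_names)
```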

Data schema

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then the JSON object is converted to an Avro record, and finally the Avro record is outputted to the Parquet format. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

In order for everything to work correctly, it is also necessary that the user whose "S3 Key Id" and "S3 Access Key" are used has access to both the bucket and its contents. An example policy to use:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME/*",
        "arn:aws:s3:::YOUR_BUCKET_NAME"
      ]
    }
  ]
}

Limitations & Troubleshooting

To see connector limitations or to troubleshoot your S3 connector, see our S3 troubleshooting guide.

Changelog

Expand to review
Version Date Pull Request Subject
1.9.5 2025-11-01 69134 Upgrade to Bulk CDK 0.1.61.
1.9.4 2025-10-21 67153 Implement new proto schema implementation
1.9.3 2025-10-05 67078 Remove memory limit for sync jobs to improve performance and resource utilization.
1.9.2 2025-09-18 66523 Fix Minio compatibility by forcing path-style bucket access.
1.9.1 2025-07-31 64138 Promoting release candidate 1.9.1-rc.1 to a main version.
1.9.1-rc.1 2025-07-28 64101 Prepare release candidate with speed mode fixes.
1.9.0 2025-07-24 63759 Promoting release candidate 1.9.0-rc.4 to a main version.
1.9.0-rc.4 2025-07-17 63350 Release candidate 4
1.9.0-rc.3 2025-07-10 62890 Release candidate 3
1.9.0-rc.2 2025-07-07 62829 Release candidate 2 with bug fixes
1.9.0-rc.1 2025-07-02 62068 Prepare to enable speed improvements
1.8.8 2025-06-27 62127 Identical to 1.8.6. Image was published from PR 62071.
1.8.7 2025-06-13 61588 Publish version to account for possible duplicate publishing in pipeline. Noop change. WARNING: THIS HAS A BUG. DO NOT USE.
1.8.6 2025-05-30 60327 IPC Metadata, internal refactors.
1.8.5 2025-05-16 60327 Fixes file partitioning out of bounds error.
1.8.4 2025-05-14 60313 Re-wires up time window trigger. Files path releases memory properly.
1.8.3 2025-05-14 60287 Update spec.
1.8.2 2025-05-07 59721 Legacy File Transfer Uses New CDK Interface
1.8.1 2025-05-07 59710 CDK backpressure bugfix
1.8.0 2025-04-30 59168 Promoting release candidate 1.8.0-rc.1 to a main version.
1.8.0-rc.1 2025-04-29 59148 Upgrade to latest CDK with files+records support
1.7.4 2025-04-21 58146 Upgrade to latest CDK
1.7.3 2025-04-18 58140 Upgrade to latest CDK
1.7.2 2025-04-07 56391 Internal code refactor
1.7.1 2025-04-02 56974 Nonfunctional change: pin cdk version
1.7.0 2025-04-02 56974 Release 1.7.0 release candidate
1.7.0-rc.1 2025-03-31 56935 Internal performance refactor
1.6.0 2025-03-28 56458 Do not drop trailing .0 from decimals in CSV/JSONL data
1.5.8 2025-03-25 56398 Internal CDK change to mark old Avro and Parquet formats as deprecated
1.5.7 2025-03-24 56355 Upgrade to airbyte/java-connector-base:2.0.1 to be M4 compatible.
1.5.6 2025-03-24 55849 Internal refactoring
1.5.5 2025-03-20 55875 Bugfix: Sync Can Hang on OOM
1.5.4 2025-03-05 54695 Nonfunctional changes to support performance testing
1.5.3 2025-03-04 54661 Nonfunctional changes to support performance testing
1.5.2 2025-02-25 54661 Nonfunctional cleanup; dropped unused staging code
1.5.1 2025-02-11 53636 Nonfunctional CDK version pin.
1.5.0 2025-02-11 53632 Promoting release candidate 1.5.0-rc.20 to a main version.
1.5.0-rc.20 2025-02-04 53173 Tweak spec wording
1.5.0-rc.19 2025-02-04 53163 Various fixes to truncate sync
1.5.0-rc.18 2025-01-29 52703 Force list call evaluation before making head calls
1.5.0-rc.17 2025-01-29 52610 Pin CDK 0.296
1.5.0-rc.16 2025-01-29 52610 Fix assume role behavior
1.5.0-rc.15 2025-01-23 52103 Make the connector use our non root base image.
1.5.0-rc.14 2025-01-24 51600 Internal refactor
1.5.0-rc.13 2025-01-22 52076 Test improvements.
1.5.0-rc.12 2025-01-22 52072 Bug fix: Configure OpenStreamTask concurrency to handle connection to reduce http connection errors.
1.5.0-rc.11 2025-01-17 51051 Input fully read before end-of-stream now correctly marked transient
1.5.0-rc.10 2025-01-15 50960 Bug fixes: tolerate repeated path variables; avro meta field schema matches old cdk
1.5.0-rc.9 2025-01-10 50960 Bug fixes: variables respected in bucket path; sync does not hang on streams w/o state
1.5.0-rc.8 2025-01-08 50960 Use airbyte/java-connector-base base image.
1.5.0-rc.7 2025-01-09 51021 Bug fix: Use CRT HTTP client to avoid OkHttp idle connection handling errors
1.5.0-rc.6 2025-01-06 50954 Bug fix: transient failure due to bug in generation tracker
1.5.0-rc.5 2025-01-06 50954 Bug fix: transient failure due to bug in filename clash prevention
1.5.0-rc.4 2025-01-06 50954 Bug fix: StreamLoader::close dispatched multiple times per stream
1.5.0-rc.3 2025-01-06 50949 Bug fix: parquet types/values nested in union of objects do not convert properly
1.5.0-rc.2 2025-01-02 50857 Migrate to Bulk Load CDK: cost reduction, perf increase, bug fix for filename clashes
1.4.0 2024-10-23 46302 add support for file transfer
1.3.0 2024-09-30 46281 fix tests
1.2.1 2024-09-20 45700 Improve resiliency to jsonschema fields
1.2.0 2024-09-18 45402 fix exception with columnless streams
1.1.0 2024-09-18 45436 upgrade all dependencies
1.0.5 2024-09-05 45143 don't overwrite (and delete) existing files, skip indexes instead
1.0.4 2024-08-30 44933 Fix: Avro/Parquet: handle empty schemas in nested objects/lists
1.0.3 2024-08-20 44476 Increase message parsing limit to 100mb
1.0.2 2024-08-19 44401 Fix: S3 Avro/Parquet: handle nullable top-level schema
1.0.1 2024-08-14 42579 OVERWRITE MODE: Deletes deferred until successful sync.
1.0.0 2024-08-08 42409 Major breaking changes: new destination schema, change capture, Avro/Parquet improvements, bugfixes
0.1.15 2024-12-18 49879 Use a base image: airbyte/java-connector-base:1.0.0
0.6.7 2024-08-11 43713 Decreased memory ratio (0.7 -> 0.5) and thread allocation (5 -> 2) for async S3 uploads.
0.6.6 2024-08-06 43343 Use Kotlin 2.0.0
0.6.5 2024-08-01 42405 S3 parallelizes workloads, checkpoints, submits counts, support for generationId in metadata for refreshes.
0.6.4 2024-04-16 42006 remove unnecessary zookeeper dependency
0.6.3 2024-04-15 38204 convert all production code to kotlin
0.6.2 2024-04-15 38204 add assume role auth
0.6.1 2024-04-08 37546 Adapt to CDK 0.30.8;
0.6.0 2024-04-08 36869 Adapt to CDK 0.29.8; Kotlin converted code.
0.5.9 2024-02-22 35569 Fix logging bug.
0.5.8 2024-01-03 #33924 Add new ap-southeast-3 AWS region
0.5.7 2023-12-28 #33788 Thread-safe fix for file part names
0.5.6 2023-12-08 #33263 (incorrect filename format, do not use) Adopt java CDK version 0.7.0.
0.5.5 2023-12-08 #33264 Update UI options with common defaults.
0.5.4 2023-11-06 #32193 (incorrect filename format, do not use) Adopt java CDK version 0.4.1.
0.5.3 2023-11-03 #32050 (incorrect filename format, do not use) Adopt java CDK version 0.4.0. This updates filenames to include a UUID.
0.5.1 2023-06-26 #27786 Fix build
0.5.0 2023-06-26 #27725 License Update: Elv2
0.4.2 2023-06-21 #27555 Reduce image size
0.4.1 2023-05-18 #26284 Fix: reenable LZO compression for Parquet output
0.4.0 2023-04-28 #25570 Fix: all integer schemas should be converted to Avro longs
0.3.25 2023-04-27 #25346 Internal code cleanup
0.3.23 2023-03-30 #24736 Improve behavior when throttled by AWS API
0.3.22 2023-03-17 #23788 S3-Parquet: added handler to process null values in arrays
0.3.21 2023-03-10 #23466 Changed S3 Avro type from Int to Long
0.3.20 2023-02-23 #21355 Add root level flattening option to JSONL output.
0.3.19 2023-01-18 #21087 Wrap Authentication Errors as Config Exceptions
0.3.18 2022-12-15 #20088 New data type support v0/v1
0.3.17 2022-10-15 #18031 Fix integration tests to use bucket path
0.3.16 2022-10-03 #17340 Enforced encrypted only traffic to S3 buckets and check logic
0.3.15 2022-09-01 #16243 Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields).
0.3.14 2022-08-24 #15207 Fix S3 bucket path to be used for check.
0.3.13 2022-08-09 #15394 Added LZO compression support to Parquet format
0.3.12 2022-08-05 #14801 Fix multiple log bindings
0.3.11 2022-07-15 #14494 Make S3 output filename configurable.
0.3.10 2022-06-30 #14332 Change INSTANCE_PROFILE to use AWSDefaultProfileCredential, which supports more authentications on AWS
0.3.9 2022-06-24 #14114 Remove "additionalProperties": false from specs for connectors with staging
0.3.8 2022-06-17 #13753 Deprecate and remove PART_SIZE_MB fields from connectors based on StreamTransferManager
0.3.7 2022-06-14 #13483 Added support for int, long, float data types to Avro/Parquet formats.
0.3.6 2022-05-19 #13043 Destination S3: Remove configurable part size.
0.3.5 2022-05-12 #12797 Update spec to replace markdown.
0.3.4 2022-05-04 #12578 In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging.
0.3.3 2022-04-20 #12167 Add gzip compression option for CSV and JSONL formats.
0.3.2 2022-04-22 #11795 Fix the connection check to verify the provided bucket path.
0.3.1 2022-04-05 #11728 Properly clean-up bucket when running OVERWRITE sync mode
0.3.0 2022-04-04 #11666 0.2.12 actually has breaking changes since files are compressed by default, this PR also fixes the naming to be more compatible with older versions.
0.2.13 2022-03-29 #11496 Fix S3 bucket path to be included with S3 bucket format
0.2.12 2022-03-28 #11294 Change to serialized buffering strategy to reduce memory consumption
0.2.11 2022-03-23 #11173 Added support for AWS Glue crawler
0.2.10 2022-03-07 #10856 check method now tests for listObjects permissions on the target bucket
0.2.7 2022-02-14 #10318 Prevented double slashes in S3 destination path
0.2.6 2022-02-14 10256 Add -XX:+ExitOnOutOfMemoryError JVM option
0.2.5 2022-01-13 #9399 Use instance profile authentication if credentials are not provided
0.2.4 2022-01-12 #9415 BigQuery Destination : Fix GCS processing of Facebook data
0.2.3 2022-01-11 #9367 Avro & Parquet: support array field with unknown item type; default any improperly typed field to string.
0.2.2 2021-12-21 #8574 Added namespace to Avro and Parquet record types
0.2.1 2021-12-20 #8974 Release a new version to ensure there is no excessive logging.
0.2.0 2021-12-15 #8607 Change the output filename for CSV files - it's now bucketPath/namespace/streamName/timestamp_epochMillis_randomUuid.csv
0.1.16 2021-12-10 #8562 Swap dependencies with destination-jdbc.
0.1.15 2021-12-03 #8501 Remove excessive logging for Avro and Parquet invalid date strings.
0.1.14 2021-11-09 #7732 Support timestamp in Avro and Parquet
0.1.13 2021-11-03 #7288 Support Json additionalProperties.
0.1.12 2021-09-13 #5720 Added configurable block size for stream. Each stream is limited to 10,000 by S3
0.1.11 2021-09-10 #5729 For field names that start with a digit, a _ will be appended at the beginning for the Parquet and Avro formats.
0.1.10 2021-08-17 #4699 Added json config validator
0.1.9 2021-07-12 #4666 Fix MinIO output for Parquet format.
0.1.8 2021-07-07 #4613 Patched schema converter to support combined restrictions.
0.1.7 2021-06-23 #4227 Added Avro and JSONL output.
0.1.6 2021-06-16 #4130 Patched the check to verify prefix access instead of full-bucket access.
0.1.5 2021-06-14 #3908 Fixed default max_padding_size_mb in spec.json.
0.1.4 2021-06-14 #3908 Added Parquet output.
0.1.3 2021-06-13 #4038 Added support for alternative S3.
0.1.2 2021-06-10 #4029 Fixed _airbyte_emitted_at field to be a UTC timestamp instead of a local timestamp for consistency.
0.1.1 2021-06-09 #3973 Added AIRBYTE_ENTRYPOINT in base Docker image for Kubernetes support.
0.1.0 2021-06-03 #3672 Initial release with CSV output.