jprdonnelly/airbyte

Fork 0

mirror of synced 2025-12-21 02:51:29 -05:00

Files

Johnny Schmidt e9a37d45ac S3 Dest emits V2 fields and captures failures (#42409 )

2024-08-14 09:41:29 -07:00

21 KiB

Raw Permalink Blame History

S3 Migration Guide

Upgrading to 1.0.0

This version introduces changes to the schema of data written to S3, which make it isomorphic to our V2 certified database destinations, as well as various improvements to format conversion.

Conversion failures are captured: values not matching the client schema will no longer break syncs using Avro or Parquet formats.
Changes introduced by the source or platform, such as NULLing bad values or truncating long ones, are visible in the metadata.
There is improved handling of various types in Avro and Parquet formats, including simplification of date and time types, more human-readable schemaless objects and arrays, and better support for unions in Parquet.
Sync and generation ids are made available, providing more information for debugging.

Schema Changes

The schema changes are as follows:

Old field name	New field name	JSON Type	Avro Type	Parquet Type	Description
`_airbyte_ab_id`	`_airbyte_raw_id`	string (UUID)	`{ "type": "string", "logicalType": "uuid" }`	`String`	Airbyte-added unique identifier.
`_airbyte_emitted_at`	`_airbyte_extracted_at`	integer (epoch ms)	`{ "type": "string", "logicalType": "timestamp-millis" }`	`Int64(logicalType=Timestamp(Milliseconds))`	Time at which the data was extracted from the source.
[NONEXISTENT]	`_airbyte_generation_id`	integer	`{ "type": "long" }`	`Int64`	Monotonically-increasing refresh id, if applicable.
[NONEXISTENT]	`_airbyte_meta`	object (see below)	Record (see below)	Record (see below)	Additional metadata, including change data capture info and sync id.
`_airbyte_data`	[UNCHANGED]	optional object	(from client schema)	(from client schema)	Data payload when flattening is disabled.

The _airbyte_meta field is an object that currently has one field:

Field Name	JSON Type	Description

The changes field is a list of objects, each of which represents a change to the data. Each object has the following fields:

Field Name	JSON Type*	Description
`field`	string	The name of the field that was affected
`change`	string	The type of change (currently only `NULLED` or `TRUNCATED`)
`reason`	string	The reason for the change, including its origin (source, platform, or destination)

These schemas are subject to change, however any change that is not backward compatible (ie, additive) will be accompanied by a breaking change notice.

Data Changes

This version introduces changes to the data types used when writing to S3 in Avro or Parquet. The changes are as follows:

Primitive Types

All primitive types are unchanged:

Airbyte JSON Schema Type	Old Avro Type	Old Parquet Type	New Avro Type	New Parquet Type
string	`["null", "string"]`	`Optional String`	[UNCHANGED]	[UNCHANGED]
boolean	`["null", "boolean"]`	`Optional Boolean`	[UNCHANGED]	[UNCHANGED]
integer	`["null", "long"]`	`Optional Int64`	[UNCHANGED]	[UNCHANGED]
number	`["null", "double"]`	`Optional Double`	[UNCHANGED]	[UNCHANGED]

Date and Time Types

This change introduces simplification of the handling of dates, times, and timestamps in Parquet and Avro. Date and time types were stored in Avro as unions of integral logical types and strings, which in Parquet were represented as disjoint records. In practice, the date and time types were always converted, making the unions redundant. The resulting disjoint records were confusing.

Now all time types are converted to integral logical types. Values that cannot be converted will be nulled and tracked in _airbyte_meta.changes[].

Resulting Avro Changes

Avro users should only see changes at the schema level, the resulting data will appear the same.

Airbyte Time Type	Old Avro Type	New Avro Type
date	`["null", { "type": "int", "logicalType": "date" }, "string"]`	`[null, int(logicalType="date")]`
time without timezone	`["null", { "type": "long", "logicalType": "time-micros" }, "string"]`	`[null, long(logicalType="time-micros")]`
time with timezone	`["null", { "type": "long", "logicalType": "time-micros" }, "string"]`	`[null, long(logicalType="time-micros")]`
timestamp without timezone	`["null", { "type": "long", "logicalType": "timestamp-micros" }, "string"]`	`[null, long(logicalType="timestamp-micros")]`
timestamp with timezone	`["null", { "type": "long", "logicalType": "timestamp-micros" }, "string"]`	`[null, long(logicalType="timestamp-micros")]`

Resulting Parquet Changes

Airbyte Time Type	Old Parquet Type	New Parquet Type
date	`Optional Record { member0: Optional Int32(logical_type=Date), member1: Optional String }`	`Optional Int32(logical_type=Date)`
time without timezone	`Optional Record { member0: Optional Int64(logical_type=Time(Microseconds)), member1: Optional String }`	`Optional Int64(logical_type=Time(Microseconds))`
time with timezone	`Optional Record { member0: Optional Int64(logical_type=Time(Microseconds)), member1: Optional String }`	`Optional Int64(logical_type=Time(Microseconds))`
timestamp without timezone	`Optional Record { member0: Optional Int64(logical_type=Timestamp(Microseconds)), member1: Optional String }`	`Optional Int64(logical_type=Timestamp(Microseconds))`
timestamp with timezone	`Optional Record { member0: Optional Int64(logical_type=Timestamp(Microseconds)), member1: Optional String }`	`Optional Int64(logical_type=Timestamp(Microseconds))`

Note: the times and timestamps with timezones are converted to UTC. In Parquet, the field metadata is_adjusted_to_utc will always be true.

Time of Day Conversion Bugfix

Formerly, for the Airbyte type time_of_day_with_timezone, timezones were not respected when converting. This has been fixed.

Input Data	Old Output Data	New Output Data
`"12:00:00-01:00"`	`43200000000` (12:00:00 UTC)	`46800000000` (13:00:00 UTC)
`"04:00:00+02:00"`	`14400000000` (04:00:00 UTC)	`720000000` (02:00:00 UTC)
`"04:00:00+05:30"`	`14400000000` (04:00:00 UTC)	`81000000000` (22:30:00 UTC)

Object Types

Formerly, in both Avro and Parquet formats, objects without schemas ("type": "object" without properties), and properties not listed in the properties field, were accumulated in _airbyte_additional_properties. The new behavior is

undocumented fields in object with schemas are silently dropped
objects without schemas are serialized into a JSON string
_airbyte_additional_properties is dropped entirely

For example, the following input schemas and data will result in the following outputs:

Input Schema	Input Data	Output Schema	Output Data
`{ "type": "object", "properties": { "id": { "type": "integer" } } }`	`{ "id": 1, "name": "Alice" }`	`{ "type": "object", "properties": { "id": { "type": "integer" } } }`	`{ "id": 1 }`
`{ "type": "object" }`	`{ "id": 1, "name": "Alice" }`	`{ "type": "string" }`	`"{\"id\": 1, \"name\": \"Alice\"}"`

Note: dropped fields will not appear in _airbyte_meta.changes[]. Additionally, n object with null or empty properties ("properties": {}) will be treated as a schemaless object. This is because this usually indicates an upstream source is failing to report its schema properly, and the data is not actually extraneous.

Array Types

Formerly, arrays without types ("type": "array" with no items field) were converted to native arrays of string serializations of the underlying types. The same behavior was applied to arrays of union types ({ "type": "array", "items": { "oneOf": [ /* various types */ ] }).

Now:

Arrays without types are serialized into JSON array strings.
Arrays of unions are treated as arrays of mixed types.

For example, the following input schemas and data formerly resulted in:

Input Schema	Input Data	Old Output Schema	Old Output Data
`{ "type": "array", "items": { "type": "integer" } }`	`[1, "Alice"]`	`{ "type": "array", "items": ["null", "integer"] }`	[SYNC FAILED]
`{ "type": "array" }`	`[1, "Alice"]`	`{ "type": "array", "items": [ "null", "string" ] }`	`["1", "Alice"]`
`{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }`	`[1, "Alice"]`	`{ "type": "array", "items": [ "null", "string" ] }`	`["1", "Alice"]`
`{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }`	`[1, false]`	`{ "type": "array", "items": [ "null", "string" ] }`	`["1", "false"]`

Now:

Input Schema	Input Data	New Output Schema	New Output Data
`{ "type": "array", "items": { "type": "integer" } }`	`[1, "Alice"]`	`{ "type": "array", "items": ["null", "integer"] }`	`[1, null*]`
`{ "type": "array" }`	`[1, "Alice"]`	`{ "type": "string" }`	`"[1, \"Alice\"]"`
`{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }`	`[1, "Alice"]`	`{ "type": "array", "items": [ "null", "integer", "string" ] }`	`[1, "Alice"]`
`{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }`	`[1, false]`	`{ "type": "array", "items": [ "null", "integer", "string" ] }`	`[1, null*]`

*The nulled fields represent conversion failures and will appear in _airbyte_meta.changes[].

This behavior will be applied to both Avro and Parquet formats.

Union Types (Parquet Only)

Disjoint Record Improvements

Formerly, unions in Parquet were represented as anonymous disjoint records. For example:

Airbyte Union Type	Old Parquet Type
`{"oneOf": [ {"type": "integer"}, {"type": "string"} ] }`	`Optional Record { member0: Optional Int32, member1: Optional String }`
`{"oneOf": [ {"type": "boolean"}, {"type": "object", "properties": { "id": { "type": "integer" }, "name": { "type": string" } } } ] }`	`Optional Record { member0: Optional Boolean, member1: Optional Record { id: Optional Uint32, name: Optional String } }`

Now unions will be represented as typed disjoint records with named fields. For example:

Airbyte Union Type	New Parquet Type
`{"oneOf": [ {"type": "integer"}, {"type": "string"} ] }`	`Optional Record { type: String, integer: Optional Int32, string: Optional String }`
`{"oneOf": [ {"type": "boolean"}, {"type": "object", "properties": { "id": { "type": "integer" }, "name": { "type": string" } } } ] }`	`Optional Record { type: String, boolean: Optional Boolean, object: Optional Record { id: Optional Uint32, name: Optional String } }`

Where

Only one field will always be set.
The type field will always be set and always be equal to the name of the field that was set.

Following the examples above:

Input Data	New Output Data
`1`	`{ "type": "integer", "integer": 1 }`
`{"id": 10, "name": "Alice"}`	`{ "type": "object", "object": { "id": 10, "name": "Alice" } }`

Disjoint Record Type Names

The following type and field names will be used for each Airbyte type:

Airbyte Type	Parquet Type	Field Name
`{"type": "integer" }`	`Optional Int64`	`integer`
`{"type": "string" }`	`Optional String`	`string`
`{"type": "boolean" }`	`Optional Boolean`	`boolean`
`{"type": "object" }`	`Optional Record`	`object`
`{"type": "array" }`	`Optional List`	`array`
`{"type": "string", "format": "date" }`	`Optional Int32(logicalType=Date)`	`date`
`{"type": "string", "format": "time", "airbyte_type": "time_with_timezone" }`	`Optional Int64(Optional Int64(logical_type=Time(Microseconds))`	`time_with_timezone`
`{"type": "string", "format": "time", "airbyte_type": "time_without_timezone" }`	`Optional Int64(Optional Int64(logical_type=Time(Microseconds))`	`time_without_timezone`
`{"type": "string", "format": "date-time", "airbyte_type": "timestamp_with_timezone" }`	`Optional Int64(logical_type=Timestamp(Microseconds)))`	`timestamp_with_timezone`
`{"type": "string", "format": "date-time", "airbyte_type": "timestamp_without_timezone" }`	`Optional Int64(Optional Int64(logical_type=Timestamp(Microseconds)))`	`timestamp_without_timezone`

Merging Like Options in Unions

The above behavior applies only to unions of distinct types. Unions of the same type will continue to be merged. If the result is a single option, the union will be demoted to a field of that type. For example:

Airbyte Union Type	Avro Type	Parquet Type
`{"oneOf": [ {"type": "integer"}, {"type": "integer"} ] }`	`["null", "integer"]`	`Optional Int64`
`{"oneOf": [ {"type": "integer"}, {"type": "string"}, {"type": "integer" } ] }`	`["null", "integer", "string"]`	`Optional Record { type: Optional String, integer: Optional Int64, string: Optional String }`

Unions of objects will continue to be merged into a single object with a union schema. If this is not possible due to field type conflicts, an exception will be thrown and the sync will fail.

Object 1 Properties	Object 2 Properties	Merged Object
`{id: integer}`	`{name: string}`	`{id: integer, name: string}`
`{id: integer, name: string}`	`{id: integer, birthday: date}`	`{id: integer, name: string, birthday: date}`
`{id: integer}`	`{id: string}`	[SYNC FAILED]

Unions of arrays with different item schemas continue not to be supported.

21 KiB Raw Permalink Blame History