112 lines
6.4 KiB
Markdown
112 lines
6.4 KiB
Markdown
# Fauna Airbyte Source
|
||
|
||
This is a source for Airbyte, a tool which allows users to export data from Fauna into a number of destinations, such as
|
||
Snowflake, Big Query, Redshift, and more. Airbyte allows this source to export to any destination, as Airbyte provides a
|
||
common data format that all destinations must use. This means that this connector can support all Airbyte destinations,
|
||
listed here.
|
||
|
||
The Source can be configured to export a single collection. To export another collection, you must setup multiple
|
||
sources within Airbyte. When I say “Fauna documents,” I really mean the documents in the collection.
|
||
|
||
# Sync Modes
|
||
|
||
The Sync Mode is how data should be extracted from the Fauna Source and how it should be written to the destination.
|
||
|
||
The Fauna Source provides support for 2 export modes, Full Sync and Incremental Sync. Destinations provide support for 3
|
||
import modes, Overwrite, Append, and Append Deduped. Only specific combinations of these modes are valid, and are listed
|
||
below.
|
||
|
||
## Full Sync - Overwrite
|
||
|
||
This imports all data from fauna, and clears the destination. This is the simplest mode, and is the slowest.
|
||
|
||
This is useful when you want to keep the destination synced with fauna at all times. This doesn’t provide any method of
|
||
finding the state of Fauna in the past.
|
||
|
||
## Full Sync - Append
|
||
|
||
This imports all data from fauna, and appends it to the destination. This is slightly more complex, and stores the most
|
||
data.
|
||
|
||
This allows for easy queries to lookup the state of the whole Fauna database at a specific date.
|
||
|
||
## Incremental Sync - Append
|
||
|
||
This pulls all new records from fauna, and appends them to the destination. This provides a list of all documents over
|
||
all time, but doesn’t have any notion of when an old document has been replaced.
|
||
|
||
For example, this is useful when you want to export a list of logs, and you only care about the new ones each day.
|
||
|
||
## Incremental Sync - Append Deduped
|
||
|
||
This pulls all new records from fauna, and appends them to the destination. This also uses a primary key, so it knows
|
||
which documents have become out of date. This allows for a query which can lookup the state of the whole Fauna database
|
||
at a specific date, and stores only the new data each sync.
|
||
|
||
This mode is slower to query, but stores the least about of data, and is the most useful.
|
||
|
||
# Record Format
|
||
|
||
Each document in Fauna is converted to an Airbyte record. Records are essentially rows, and they have a pre-defined list
|
||
of columns. Because Fauna doesn’t support a specific shape of data, we rely on the user to specify their data format
|
||
before any data can be exported.
|
||
|
||
## Required Columns
|
||
|
||
The resulting record will always have at least 2 columns, named ref and ts. The ref is a string, which is the document
|
||
ID. This can be used as a primary key in the destination, as it is a unique identifier for each document. The ts is an
|
||
integer, which is the time since the document was last modified, stored in microseconds since the Unix Epoch.
|
||
|
||
The record has 1 optional column, named data. If this is enabled by the user, then the resulting record will contain all
|
||
of the fauna document data within a column. This is most useful when you simply wish to dump all of your data in the
|
||
destination, and you don’t need to worry about re-shaping it.
|
||
|
||
## Additional Columns
|
||
|
||
The remaining columns are all user-configurable. When the user is setting up the connector, they can specify any number
|
||
of “Additional Columns.” Each column has a name, a type, and a path. All of these fields are specified by the user.
|
||
|
||
Additional columns are implemented to provide an easy way to flatten fauna data into columns. This is because Airbyte
|
||
doesn’t have another easy-to-use method of reshaping records, so we implemented this as part of the Fauna Source.
|
||
|
||
The name of the additional column is the name that it will have in the destination. Additional columns must have unique
|
||
names, and cannot be named ref, ts, or data, as that would conflict with the required columns.
|
||
|
||
The type of the additional column is the type in the destination. This is used so that destinations like Snowflake can
|
||
know the type of the column before any data is sent.
|
||
|
||
The path of the additional column is the path within each Fauna Document for this data. This allows you to pick out a
|
||
single field, even if it is nested in fauna, and store it in a column in the destination.
|
||
|
||
# Deleted Documents
|
||
|
||
If a document is deleted in Fauna, some users would like a record that within their destination. However, in the
|
||
destination, they would like to know that it existed for some time, and then was removed at a specific date.
|
||
|
||
To support this, we allow for an optional deleted_at column. This column will be null for all present documents, and is
|
||
set to a date after a document is deleted.
|
||
|
||
This deleted_at column is only supported in incremental syncs. If you combine this with the incremental append deduped
|
||
mode, you can easily query for documents that are present at a certain time.
|
||
|
||
# Data Serialization
|
||
|
||
Fauna documents have a lot of extra types. These types need to be converted into the Airbyte JSON format. Below is an
|
||
exhaustive list of how all fauna documents are converted.
|
||
|
||
| Fauna Type | Format | Note |
|
||
| ------------- | ------------------------------------------------------------------- | -------------------------------------------------- |
|
||
| Document Ref | `{ id: "id", "collection": "collection-name", "type": "document" }` | |
|
||
| Other Ref | `{ id: "id", "type": "ref-type" }` | This includes collection refs, database refs, etc. |
|
||
| Byte Array | base64 url formatted string | |
|
||
| Timestamp | date-time, or an iso-format timestamp | |
|
||
| Query, SetRef | a string containing the wire protocol of this value | The wire protocol is not documented. |
|
||
|
||
## Ref Types
|
||
|
||
Every ref is serialized as a JSON object with 2 or 3 fields, as listed above. The type field will be a string, which is
|
||
the type of the reference. For example, a document ref would have the type document, and a collection reference would
|
||
have the type collection.
|
||
|
||
For all other refs (for example if you stored the result of Collections()), the type will be "unknown".
|