# Iceberg
This page guides you through the process of setting up the Iceberg destination connector.
## Sync overview
### Output schema
This connector maps an incoming Airbyte `stream` to an Iceberg `table` and an Airbyte `namespace` to an
Iceberg `database`. Fields in the Airbyte message become separate columns in the Iceberg table. Each table
contains the following columns:
- `_airbyte_ab_id`: A randomly generated UUID.
- `_airbyte_emitted_at`: A timestamp representing when the event was received from the data source.
- `_airbyte_data`: A JSON text blob representing the extracted data.
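For illustration, the sketch below shows what this raw table layout might look like when expressed with Iceberg's Java API. The field IDs and required/optional flags are assumptions for the example, not values taken from the connector source:

```java
// Minimal sketch of the raw Airbyte table schema, built with Iceberg's Java API.
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class AirbyteRawSchemaSketch {
  // Field IDs (1, 2, 3) are illustrative; Iceberg tracks field IDs internally.
  public static final Schema AIRBYTE_RAW_SCHEMA = new Schema(
      Types.NestedField.required(1, "_airbyte_ab_id", Types.StringType.get()),              // random UUID
      Types.NestedField.required(2, "_airbyte_emitted_at", Types.TimestampType.withZone()), // receive time
      Types.NestedField.required(3, "_airbyte_data", Types.StringType.get())                // JSON payload as text
  );
}
```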
### Features
The Iceberg destination supports the following features:
| Feature | Supported? (Yes/No) | Notes |
| :---------------------------- | :------------------ | :---- |
| Full Refresh Sync | ✅ | |
| Incremental Sync | ✅ | |
| Replicate Incremental Deletes | ❌ | |
| SSH Tunnel Support | ❌ | |
### Performance considerations
Every ten thousand records of an incoming stream (we call this a batch) produce one data file
(Parquet/Avro) in the Iceberg table. The batch size can be configured through the `Data file flushing batch size`
property.
As the number of Iceberg data files grows, metadata overhead accumulates and queries become less efficient due to
file-open costs.
Iceberg provides a data file compaction action to mitigate this; you can read more about
compaction [here](https://iceberg.apache.org/docs/latest/maintenance/#compact-data-files).
This connector can also compact data files automatically when a stream closes, controlled by the `Auto compact data files`
property, and you can specify the target size of the compacted Iceberg data files.
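Compaction can also be run manually outside the connector. Below is a minimal sketch using Iceberg's Spark actions API; the 512 MiB target file size is only an illustrative value, not a connector default:

```java
// Hedged sketch: manual data file compaction with Iceberg's Spark actions API,
// independent of the connector's built-in auto-compact option.
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CompactionSketch {
  public static void compact(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        // Target size is illustrative; choose a value suited to your query engine.
        .option("target-file-size-bytes", Long.toString(512L * 1024 * 1024)) // 512 MiB
        .execute();
  }
}
```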
## Getting started
### Requirements
- **Iceberg catalog**: Iceberg uses a `catalog` to manage tables. This connector currently supports:
  - [HiveCatalog](https://iceberg.apache.org/docs/latest/hive/#global-hive-catalog) connects to a **Hive metastore**
    to keep track of Iceberg tables.
  - [HadoopCatalog](https://iceberg.apache.org/docs/latest/java-api-quickstart/#using-a-hadoop-catalog) doesn't need
    to connect to a Hive metastore, but can only be used with **HDFS or similar file systems** that support atomic
    rename. For `HadoopCatalog`, this connector uses the **Storage Config** (S3 or HDFS) to manage Iceberg tables
    (see the sketch after this list).
  - [JdbcCatalog](https://iceberg.apache.org/docs/latest/jdbc/) uses a table in a relational database to manage
    Iceberg tables through JDBC. So far, this connector supports **PostgreSQL** only.
- **Storage medium**: where the Iceberg data files are stored. So far, this connector supports **S3/S3N/S3A**
  object storage only.
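To illustrate the metastore-free option above, the sketch below shows how a `HadoopCatalog` locates Iceberg tables directly in object storage. The bucket name, credential keys, and table identifier are placeholders for the example and are not values used by this connector:

```java
// Minimal sketch of a HadoopCatalog tracking tables directly in S3-style storage.
// All configuration values here are illustrative placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class HadoopCatalogSketch {
  public static Table loadTable() {
    Configuration conf = new Configuration();
    // S3A credentials/endpoint would normally come from the connector's Storage Config.
    conf.set("fs.s3a.access.key", "<access-key>");
    conf.set("fs.s3a.secret.key", "<secret-key>");

    // The warehouse path is a placeholder; the connector derives it from its S3/HDFS settings.
    HadoopCatalog catalog = new HadoopCatalog(conf, "s3a://my-bucket/warehouse");
    return catalog.loadTable(TableIdentifier.of("my_namespace", "my_stream"));
  }
}
```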
## Changelog
| Version | Date | Pull Request | Subject |
| :------ | :--------- | :------------------------------------------------------- | :------------- |
| 0.1.0 | 2022-11-01 | [18836](https://github.com/airbytehq/airbyte/pull/18836) | Initial Commit |