diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index ec07b321674..4d87af6e4bc 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -100,7 +100,7 @@
   * [Data Loading](faq/data-loading.md)
   * [Transformation and Schemas](faq/transformation-and-schemas.md)
   * [Security & Data Audits](faq/security-and-data-audits.md)
-  * [Differences with...](faq/differences-with.../README.md)
+  * [Differences with](faq/differences-with.../README.md)
    * [Fivetran vs Airbyte](faq/differences-with.../fivetran-vs-airbyte.md)
    * [StitchData vs Airbyte](faq/differences-with.../stitchdata-vs-airbyte.md)
    * [Singer vs Airbyte](faq/differences-with.../singer-vs-airbyte.md)
diff --git a/docs/architecture/incremental.md b/docs/architecture/incremental.md
index 4aeb06c28ad..722302967e4 100644
--- a/docs/architecture/incremental.md
+++ b/docs/architecture/incremental.md
@@ -114,7 +114,8 @@
 You can find more relevant SQL transformations you might need to do on your data
 
 Note that in **Incremental Append**, the size of the data in your warehouse increases monotonically since an updated record in the source is appended to the destination rather than updated in-place. If you only care about having the latest snapshot of your data, you may want to periodically run cleanup jobs which retain only the latest instance of each record, deduping by primary key.
 
 ## Inclusive Cursors
-When replicating data incrementally, Airbyte provides an at-least-once delivery guarantee. This means that it is acceptable for sources to re-send some data when run replicating incrementally. One case where this is particularly relevant is when a source's cursor is not very granular. For example, if a cursor field has the granularity of a day (but not hours, seconds, etc), then if that source is run twice in the same day, there is no way for the source to know which records that are that date were already replicated earlier that day. By convention, sources should prefer resending data if the cursor field is ambiguous.
+
+When replicating data incrementally, Airbyte provides an at-least-once delivery guarantee. This means that it is acceptable for sources to re-send some data when replicating incrementally. One case where this is particularly relevant is when a source's cursor is not very granular. For example, if a cursor field has the granularity of a day \(but not hours, seconds, etc.\), then if that source is run twice in the same day, there is no way for the source to know which records with that date were already replicated earlier that day. By convention, sources should prefer re-sending data if the cursor field is ambiguous.
 
 ## Known Limitations
diff --git a/docs/career-and-open-positions/README.md b/docs/career-and-open-positions/README.md
index aa51a090934..c083bd25a23 100644
--- a/docs/career-and-open-positions/README.md
+++ b/docs/career-and-open-positions/README.md
@@ -33,7 +33,7 @@
 Here are the values we deeply believe in:
 
 ## **Our Investors**
 
-We are just closing a $5M seed round with Accel, YCombinator, 8VC, and a few leaders in the data industry (including the co-founder of Segment, the founder of Liveramp, the former GM of Cloudera).
+We are just closing a $5M seed round with Accel, YCombinator, 8VC, and a few leaders in the data industry \(including the co-founder of Segment, the founder of Liveramp, and the former GM of Cloudera\).
 
 We have a lot of capital, but [we're a lean, strong team](../company-handbook/team.md) - so you've got the opportunity to have a huge impact.
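To make the inclusive-cursor convention from the `incremental.md` hunk above concrete, here is a minimal, hypothetical sketch of the comparison a source would perform. The `should_emit` helper and its signature are illustrative, not code from any Airbyte connector:

```python
from datetime import date
from typing import Optional

def should_emit(record_date: date, cursor: Optional[date]) -> bool:
    """Decide whether to emit a record under an inclusive cursor.

    With a day-granularity cursor, a strict '>' comparison would skip
    records created later on the same day as the previous sync, so the
    convention is to compare with '>=' and accept that boundary records
    may be re-sent (at-least-once delivery).
    """
    return cursor is None or record_date >= cursor

# Records dated on the cursor day are re-sent rather than risk being missed.
assert should_emit(date(2020, 2, 1), date(2020, 2, 1)) is True
assert should_emit(date(2020, 1, 31), date(2020, 2, 1)) is False
```

Downstream deduplication, such as the cleanup jobs mentioned for Incremental Append, is what turns this at-least-once behavior into a clean final dataset.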
diff --git a/docs/faq/differences-with.../README.md b/docs/faq/differences-with.../README.md
index 1ad1eb1ad51..d020cfd1db3 100644
--- a/docs/faq/differences-with.../README.md
+++ b/docs/faq/differences-with.../README.md
@@ -1,2 +1,2 @@
-# Differences with...
+# Differences with
diff --git a/docs/tutorials/adding-incremental-sync.md b/docs/tutorials/adding-incremental-sync.md
index e5115051172..ccd4732512d 100644
--- a/docs/tutorials/adding-incremental-sync.md
+++ b/docs/tutorials/adding-incremental-sync.md
@@ -1,12 +1,15 @@
-# How to add Incremental Replication to a Source
+# Adding Incremental to a Source
 
 ## Overview
-This tutorial will assume that you already have a working source. If you do not, feel free to refer to the [Building a Toy Connector](./toy-connector.md) tutorial. This tutorial will build directly off the example from that article. We will also assume that you have a basic understanding of how Airbyte's Incremental-Append replication strategy works. We have a brief explanation of it [here](../architecture/incremental.md).
+
+This tutorial assumes that you already have a working source. If you do not, feel free to refer to the [Building a Toy Connector](toy-connector.md) tutorial. This tutorial builds directly off the example from that article. We will also assume that you have a basic understanding of how Airbyte's Incremental-Append replication strategy works. We have a brief explanation of it [here](../architecture/incremental.md).
 
 ## Update Catalog in `discover`
+
 First we need to identify a given stream in the Source as supporting incremental. This information is declared in the catalog that the `discover` method returns. You will notice that each stream object contains a field called `supported_sync_modes`. If we are adding incremental to an existing stream, we just need to add `"incremental"` to that array. This tells Airbyte that this stream can be synced in an incremental fashion. In practice, this means that in the UI, a user will have the ability to configure this type of sync. In the example we used in the Toy Connector tutorial, the `discover` method would now look like this. Note that `"incremental"` has been added to the `supported_sync_modes` array. We also set `source_defined_cursor` to `True` to declare that the Source knows what field to use for the cursor, in this case the date field, and does not require user input. Nothing else has changed.
+
 ```python
 def discover():
     catalog = {
@@ -34,12 +37,14 @@ def discover():
 ```
 
 ## Update `read`
+
 Next we will adapt the `read` method that we wrote previously. We need to change three things. First, we need to pass it information about what data was replicated in the previous sync. In Airbyte this is called a `state` object. The structure of the state object is determined by the Source. This means that each Source can construct a state object that makes sense to it and does not need to worry about adhering to any other convention. That being said, a pretty typical structure for a state object is a map of stream name to the last value in the cursor field for that stream. In this case we might choose something like this:
 
-```json
+
+```javascript
 {
   "stock_prices": "2020-02-01"
 }
@@ -112,3 +117,4 @@ def to_datetime(date):
 ```
 
 That's all you need to do to add incremental functionality to the stock ticker Source. Incremental definitely requires more configurability than full refresh, so your implementation may deviate slightly depending on whether your cursor field is source-defined or user-defined.
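The hunks above elide the full body of the updated `read` method, so here is a condensed, hypothetical sketch of the flow they describe: filter on the cursor from the incoming `state` object, emit records, then emit the new state. `get_stock_prices` is a stand-in for the toy connector's API call, and the RECORD/STATE message shapes follow the pattern the tutorial discusses rather than being copied from it:

```python
import json
import time

def read(config, catalog, state=None):
    # Cursor value saved by the previous sync; None means replicate everything.
    cursor = (state or {}).get("stock_prices")

    for record in get_stock_prices(config):  # hypothetical helper
        # Skip records strictly older than the cursor. Keeping records equal
        # to the cursor re-sends boundary rows, consistent with the
        # at-least-once delivery guarantee described above.
        if cursor is not None and record["date"] < cursor:
            continue
        print(json.dumps({
            "type": "RECORD",
            "record": {
                "stream": "stock_prices",
                "data": record,
                "emitted_at": int(time.time() * 1000),
            },
        }))
        cursor = record["date"] if cursor is None else max(cursor, record["date"])

    # Emit the new high-water mark so the next sync can resume from it.
    print(json.dumps({"type": "STATE", "state": {"data": {"stock_prices": cursor}}}))
```

Because the state is only emitted after the records, a failed sync simply replays from the old cursor on the next run. Whether the cursor is source-defined or user-defined mainly changes where `cursor` comes from in a sketch like this.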
 If you think you are running into one of those cases, check out our [incremental](../architecture/incremental.md) documentation for more information on different types of configuration.
+
diff --git a/docs/tutorials/build-a-slack-activity-dashboard.md b/docs/tutorials/build-a-slack-activity-dashboard.md
index 5fe0219d35e..58cbd112772 100644
--- a/docs/tutorials/build-a-slack-activity-dashboard.md
+++ b/docs/tutorials/build-a-slack-activity-dashboard.md
@@ -26,7 +26,7 @@
 Got it? Now let’s get started.
 
 ### a. Deploying Airbyte
 
-There are several easy ways to deploy Airbyte, as listed [here](https://docs.airbyte.io/). For this tutorial, I will just use the [Docker Compose method](https://docs.airbyte.io/deploying-airbyte/on-your-workstation) from my workstation:
+There are several easy ways to deploy Airbyte, as listed [here](https://docs.airbyte.io/). For this tutorial, I will just use the [Docker Compose method](https://docs.airbyte.io/deploying-airbyte/on-your-workstation) from my workstation:
 
 ```text
 # In your workstation terminal
 git clone https://github.com/airbytehq/airbyte.git
 cd airbyte
 docker-compose up
 ```
 
@@ -35,15 +35,15 @@
-The above command will make the Airbyte app available on `localhost:8000`. Visit the URL on your favorite browser, and you should see Airbyte’s dashboard \(if this is your first time, you will be prompted to enter your email to get started\).
+The above command will make the Airbyte app available on `localhost:8000`. Visit the URL on your favorite browser, and you should see Airbyte’s dashboard \(if this is your first time, you will be prompted to enter your email to get started\).
 
-If you haven’t set Docker up, follow the [instructions here](https://docs.docker.com/desktop/) to set it up on your machine.
+If you haven’t set Docker up, follow the [instructions here](https://docs.docker.com/desktop/) to set it up on your machine.
 
 ### b. Setting Up Airbyte’s Slack Source Connector
 
-Airbyte’s Slack connector will give us access to the data. So, we are going to kick things off by setting this connector to be our data source in Airbyte’s web app. I am assuming you already have Airbyte and Docker set up on your local machine. We will be using Docker to create our PostgreSQL database container later on.
-
-Now, let’s proceed. If you already went through the onboarding, click on the “new source” button at the top right of the Sources section. If you're going through the onboarding, then follow the instructions.
+Airbyte’s Slack connector will give us access to the data. So, we are going to kick things off by setting this connector to be our data source in Airbyte’s web app. I am assuming you already have Airbyte and Docker set up on your local machine. We will be using Docker to create our PostgreSQL database container later on.
+
+Now, let’s proceed. If you already went through the onboarding, click on the “new source” button at the top right of the Sources section. If you're going through the onboarding, then follow the instructions.
 
 You will be requested to enter a name for the source you are about to create. You can call it “slack-source”. Then, in the Source Type combo box, look for “Slack,” and then select it. Airbyte will then present the configuration fields needed for the Slack connector.
 So you should be seeing something like this on the Airbyte App:
 
@@ -89,11 +89,11 @@ Slack will prompt you that your app is requesting permission to access your work
 ![](../.gitbook/assets/10.png)
 
-After the app has been successfully installed, you will be navigated to Slack’s dashboard, where you will see the Bot User OAuth Access Token.
+After the app has been successfully installed, you will be taken to Slack’s dashboard, where you will see the Bot User OAuth Access Token.
 
 This is the token you will provide back on the Airbyte page where we left off. So make sure to copy it and keep it in a safe place.
 
-Now that we are done with obtaining a Slack token, let’s go back to the Airbyte page we dropped off and add the token in there.
+Now that we are done with obtaining a Slack token, let’s go back to the Airbyte page where we left off and add the token there.
 
 We will also need to provide Airbyte with `start_date`. This is the date from which we want Airbyte to start replicating data from the Slack API, and we define that in the format: `YYYY-MM-DDT00:00:00Z`.
 
@@ -103,7 +103,7 @@ We will specify ours as `2020-09-01T00:00:00Z`. We will also tell Airbyte to exc
 Finally, click on the **Set up source** button for Airbyte to set the Slack source up.
 
-If the source was set up correctly, you will be taken to the destination section of Airbyte’s dashboard, where you will tell Airbyte where to store the replicated data.
+If the source was set up correctly, you will be taken to the destination section of Airbyte’s dashboard, where you will tell Airbyte where to store the replicated data.
 
 ### c. Setting Up Airbyte’s Postgres Destination Connector
 
@@ -125,7 +125,7 @@ Since we already have Docker installed, we can spin off a Postgres container wit
 docker run --rm --name slack-db -e POSTGRES_PASSWORD=password -p 2000:5432 -d postgres
 ```
 
-\(Note that the Docker compose file for Superset ships with a Postgres database, as you can see [here](https://github.com/apache/superset/blob/master/docker-compose.yml#L40)\).
+\(Note that the Docker Compose file for Superset ships with a Postgres database, as you can see [here](https://github.com/apache/superset/blob/master/docker-compose.yml#L40)\).
 
 The above command will do the following:
 
@@ -138,7 +138,7 @@ With this, we can go back to the Airbyte screen and supply the information neede
 ![](../.gitbook/assets/14.png)
 
-Then click on the **Set up destination** button.
+Then click on the **Set up destination** button.
 
 ### d. Setting Up the Replication
 
@@ -148,7 +148,7 @@ You should now see the following screen:
 Airbyte will then fetch the schema for the data coming from the Slack API for your workspace. You should leave all boxes checked and then choose the sync frequency - this is the interval in which Airbyte will sync the data coming from your workspace. Let’s set the sync interval to every 24 hours.
 
-Then click on the **Set up connection** button.
+Then click on the **Set up connection** button.
 
 Airbyte will now take you to the destination dashboard, where you will see the destination you just set up. Click on it to see more details about this destination.
 
@@ -226,7 +226,7 @@ Then run:
 docker-compose up
 ```
 
-This will download the Docker images Superset needs and build containers and start services Superset needs to run locally on your machine.
+This will download the Docker images Superset needs, build the containers, and start the services required to run Superset locally on your machine.
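Before wiring Superset to the database in the next steps, you can optionally sanity-check that Airbyte's sync actually landed the Slack tables in Postgres. This is a minimal sketch, not part of the original tutorial: it assumes `psycopg2-binary` is installed and reuses the credentials and port mapping from the `docker run` command earlier \(from the host machine, the container is reachable on `localhost:2000`\):

```python
import psycopg2  # pip install psycopg2-binary

# Settings mirror `docker run ... -e POSTGRES_PASSWORD=password -p 2000:5432`.
conn = psycopg2.connect(
    host="localhost",
    port=2000,
    dbname="postgres",
    user="postgres",
    password="password",
)

with conn, conn.cursor() as cur:
    # List the tables Airbyte created in the public schema
    # (expect entries like messages and threads, among others).
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    for (table,) in cur.fetchall():
        print(table)

conn.close()
```

If the Slack streams you selected show up here, Superset will be able to see them as datasets later on.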
 Once Superset is up, you should be able to access it in your browser by visiting [`http://localhost:8088`](http://localhost:8088), and you should be presented with the Superset login screen.
 
@@ -254,7 +254,7 @@ Let’s call our Database `slack_db`, and then add the following URI as the conn
 postgresql://postgres:password@docker.for.mac.localhost:2000/postgres
 ```
 
- If you are on a Windows Machine, yours will be:
+If you are on a Windows machine, yours will be:
 
 ```text
 postgresql://postgres:password@docker.for.win.localhost:2000/postgres
 ```
 
@@ -280,23 +280,23 @@ Now that you’ve added the database, you will need to hover over the data menu
 ![](../.gitbook/assets/25.png)
 
-Then, you will be taken to the datasets page:
+Then, you will be taken to the datasets page:
 
 ![](../.gitbook/assets/26.png)
 
-We want to only see the datasets that are in our `slack_db` database, so in the Database that is currently showing All, select `slack_db` and you will see that we don’t have any datasets at the moment.
+We only want to see the datasets that are in our `slack_db` database, so in the Database filter that is currently showing All, select `slack_db`, and you will see that we don’t have any datasets at the moment.
 
 ![](../.gitbook/assets/27.png)
 
 ![](../.gitbook/assets/28.png)
 
-You can fix this by clicking on the **+ DATASET** button and adding the following datasets.
+You can fix this by clicking on the **+ DATASET** button and adding the following datasets.
 
 Note: Make sure you select the public schema under the Schema dropdown.
 
 ![](../.gitbook/assets/29.png)
 
-Now that we have set up Superset and given it our Slack data, let’s proceed to creating the visualizations we need.
+Now that we have set up Superset and given it our Slack data, let’s proceed to create the visualizations we need.
 
 Still remember them? Here they are again:
 
@@ -323,8 +323,8 @@ Now change the **Visualization Type** to **Big Number,** remove the **Time Range
 ![](../.gitbook/assets/32.png)
 
-Then, click on the **RUN QUERY** button, and you should now see the total number of members.
-
+Then, click on the **RUN QUERY** button, and you should now see the total number of members.
+
 Pretty cool, right? Now let’s save this chart by clicking on the **SAVE** button.
 
 ![](../.gitbook/assets/33.png)
 
@@ -337,11 +337,11 @@ Great! We have successfully created our first Chart, and we also created the Das
 Before we proceed with the rest of the charts for our dashboard, if you inspect the **ts** column on either the **messages** table or the **threads** table, you will see it’s of the type `VARCHAR`. We can’t really use this for our charts, so we have to cast the **ts** column on both the **messages** and **threads** tables as `TIMESTAMP`. Then, we can create our charts from the results of those queries. Let’s do this.
 
-First, navigate to the **Data** menu, and click on the **Datasets** link. In the list of datasets, click the **Edit** button for the **messages** table.
+First, navigate to the **Data** menu, and click on the **Datasets** link. In the list of datasets, click the **Edit** button for the **messages** table.
 
 ![](../.gitbook/assets/34.png)
 
-You’re now in the Edit Dataset view. Click the **Lock** button to enable editing of the dataset. Then, navigate to the **Columns** tab, expand the **ts** dropdown, and then tick the **Is Temporal** box.
+You’re now in the Edit Dataset view. Click the **Lock** button to enable editing of the dataset. Then, navigate to the **Columns** tab, expand the **ts** dropdown, and then tick the **Is Temporal** box.
 ![](../.gitbook/assets/35.png)
 
@@ -371,7 +371,7 @@ Now, we are finished with creating the message chart. Let's go over to the threa
 ### f. Evolution of messages per channel
 
-For this visualization, we will need a more complex SQL query. Here’s the query we used \(as you can see in the screenshot below\):
+For this visualization, we will need a more complex SQL query. Here’s the query we used \(as you can see in the screenshot below\):
 
 ```text
 SELECT CAST(m.ts as TIMESTAMP), c.name, m.text