mirror of synced 2025-12-19 18:14:56 -05:00
airbyte/docs/integrations/sources/apify-dataset.md
octavia-bot-hoard[bot] 55f64c19c3 🐙 source-apify-dataset: run up-to-date pipeline [2025-12-16] (#70793)
Co-authored-by: octavia-bot-hoard[bot] <230633153+octavia-bot-hoard[bot]@users.noreply.github.com>
2025-12-18 14:35:45 -08:00

Web scraping and automation platform.

Apify Dataset

Overview

Apify is a web scraping and web automation platform providing both ready-made and custom solutions, an open-source JavaScript SDK and Python SDK for web scraping, proxies, and many other tools to help you build and run web automation jobs at scale.

The results of a scraping job are usually stored in an [Apify Dataset](https://docs.apify.com/storage/dataset). This Airbyte connector provides streams for working with datasets, including syncing their content to a destination of your choice.

To sync data from a dataset, all you need to know is your API token and dataset ID.

You can find your personal API token in the Apify Console under [Settings -> Integrations](https://console.apify.com/account/integrations), and the dataset ID under [Storage -> Datasets](https://console.apify.com/storage/datasets).

Running Airbyte sync from Apify webhook

When your Apify job (also known as an Actor run) finishes, it can trigger an Airbyte sync by calling the Airbyte API's manual connection trigger (POST /v1/connections/sync). You can make this call from an Apify webhook, which is executed when the Actor run finishes.
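
As a rough illustration of the trigger described above, the sketch below builds the HTTP request that a webhook handler or post-run script could send. Only the path `/v1/connections/sync` and the POST method come from this page. The base URL, the bearer-token header, the `connectionId` body field, and the function name `build_sync_request` are assumptions about a typical deployment, so check your own Airbyte instance's API reference before relying on them.

```python
import json
import urllib.request
from typing import Optional


def build_sync_request(
    base_url: str,
    connection_id: str,
    api_token: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks Airbyte to sync a connection.

    base_url, connection_id, and api_token are placeholders for your own deployment.
    """
    url = base_url.rstrip("/") + "/v1/connections/sync"
    # Assumed body shape: a JSON object naming the connection to sync.
    body = json.dumps({"connectionId": connection_id}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_token:
        # Assumed auth scheme; some deployments use basic auth or none at all.
        headers["Authorization"] = f"Bearer {api_token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical values for illustration only.
request = build_sync_request("http://localhost:8000/api", "my-connection-id")
# To actually trigger the sync, send it: urllib.request.urlopen(request)
```

The request is built separately from sending so that you can inspect the URL and payload before pointing it at a live instance.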

Features

| Feature | Supported? |
| --- | --- |
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |

Performance considerations

The Apify Dataset connector uses the Apify Python Client under the hood and should handle any API limitations under normal usage.

Streams

dataset_collection

  • Calls https://api.apify.com/v2/datasets (docs)
  • Properties:
    • Apify Personal API token (find it in the Apify Console under Settings -> Integrations)

dataset

  • Calls https://api.apify.com/v2/datasets/{datasetId} (docs)
  • Properties:
    • Apify Personal API token (find it in the Apify Console under Settings -> Integrations)
    • Dataset ID (check the docs)

item_collection

  • Calls https://api.apify.com/v2/datasets/{datasetId}/items (docs)
  • Properties:
    • Apify Personal API token (find it in the Apify Console under Settings -> Integrations)
    • Dataset ID (check the docs)
  • Limitations:
    • The stream uses a dynamic schema: each item is stored under the "data" key, so it should work with any Apify dataset, regardless of which Actor produced it.
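
To make the dynamic-schema note concrete, here is a minimal sketch of how an arbitrary dataset item could be wrapped under a single "data" key. The function name `to_record` and the sample items are hypothetical. The connector's actual transformation may differ in detail.

```python
from typing import Any, Dict, List


def to_record(item: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a raw dataset item under the "data" key so every record shares one top-level shape."""
    return {"data": item}


# Made-up sample items with unrelated fields, as two different Actors might produce.
items: List[Dict[str, Any]] = [
    {"url": "https://example.com", "title": "Example"},
    {"sku": "A-1", "price": 9.5, "tags": ["new"]},
]

records = [to_record(item) for item in items]
```

Because every record has the same single top-level key, the stream schema does not need to know the fields inside an item in advance.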

item_collection_website_content_crawler

  • Calls the same endpoint and uses the same properties as the item_collection stream.
  • Limitations:
    • The stream uses a static schema that matches the datasets produced by the Website Content Crawler Actor, so only datasets produced by this Actor are supported.
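
The endpoints listed for the streams above can be sketched as plain URL builders. This is an illustration only: the helper names are hypothetical, and sending the token as an `Authorization: Bearer` header is an assumption about how you might call the Apify API yourself, not a description of the connector's internals.

```python
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def dataset_collection_url() -> str:
    """Endpoint behind the dataset_collection stream."""
    return f"{API_BASE}/datasets"


def dataset_url(dataset_id: str) -> str:
    """Endpoint behind the dataset stream; the ID is percent-encoded for safety."""
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}"


def item_collection_url(dataset_id: str) -> str:
    """Endpoint shared by item_collection and item_collection_website_content_crawler."""
    return f"{dataset_url(dataset_id)}/items"


def authorized_request(url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the API token as a bearer header."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical dataset ID and token for illustration only.
example = authorized_request(item_collection_url("my-dataset-id"), "my-api-token")
```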

Changelog

| Version | Date | Pull Request | Subject |
| --- | --- | --- | --- |
| 2.2.35 | 2025-12-16 | 70793 | Update dependencies |
| 2.2.34 | 2025-11-25 | 69867 | Update dependencies |
| 2.2.33 | 2025-11-18 | 69519 | Update dependencies |
| 2.2.32 | 2025-11-04 | 68843 | Update dependencies |
| 2.2.31 | 2025-10-21 | 68364 | Update dependencies |
| 2.2.30 | 2025-10-14 | 67997 | Update dependencies |
| 2.2.29 | 2025-10-07 | 67163 | Update dependencies |
| 2.2.28 | 2025-09-30 | 66273 | Update dependencies |
| 2.2.27 | 2025-08-09 | 64655 | Update dependencies |
| 2.2.26 | 2025-08-02 | 64432 | Update dependencies |
| 2.2.25 | 2025-07-26 | 63802 | Update dependencies |
| 2.2.24 | 2025-07-05 | 62540 | Update dependencies |
| 2.2.23 | 2025-06-28 | 62139 | Update dependencies |
| 2.2.22 | 2025-06-15 | 61108 | Update dependencies |
| 2.2.21 | 2025-05-17 | 60677 | Update dependencies |
| 2.2.20 | 2025-05-10 | 59857 | Update dependencies |
| 2.2.19 | 2025-05-03 | 59312 | Update dependencies |
| 2.2.18 | 2025-04-26 | 58251 | Update dependencies |
| 2.2.17 | 2025-04-12 | 57599 | Update dependencies |
| 2.2.16 | 2025-04-05 | 57134 | Update dependencies |
| 2.2.15 | 2025-03-29 | 56579 | Update dependencies |
| 2.2.14 | 2025-03-22 | 56107 | Update dependencies |
| 2.2.13 | 2025-03-08 | 55423 | Update dependencies |
| 2.2.12 | 2025-03-01 | 54885 | Update dependencies |
| 2.2.11 | 2025-02-22 | 54235 | Update dependencies |
| 2.2.10 | 2025-02-15 | 53872 | Update dependencies |
| 2.2.9 | 2025-02-08 | 53440 | Update dependencies |
| 2.2.8 | 2025-02-01 | 52904 | Update dependencies |
| 2.2.7 | 2025-01-25 | 52208 | Update dependencies |
| 2.2.6 | 2025-01-18 | 51740 | Update dependencies |
| 2.2.5 | 2025-01-11 | 51257 | Update dependencies |
| 2.2.4 | 2024-12-28 | 50468 | Update dependencies |
| 2.2.3 | 2024-12-21 | 50217 | Update dependencies |
| 2.2.2 | 2024-12-14 | 49553 | Update dependencies |
| 2.2.1 | 2024-12-12 | 48216 | Update dependencies |
| 2.2.0 | 2024-10-29 | 47286 | Migrate to manifest only format |
| 2.1.27 | 2024-10-29 | 47068 | Update dependencies |
| 2.1.26 | 2024-10-12 | 46837 | Update dependencies |
| 2.1.25 | 2024-10-01 | 46373 | add user-agent header to be able to track Airbyte integration on Apify |
| 2.1.24 | 2024-10-05 | 46430 | Update dependencies |
| 2.1.23 | 2024-09-28 | 46146 | Update dependencies |
| 2.1.22 | 2024-09-21 | 45820 | Update dependencies |
| 2.1.21 | 2024-09-14 | 45479 | Update dependencies |
| 2.1.20 | 2024-09-07 | 45252 | Update dependencies |
| 2.1.19 | 2024-08-31 | 44962 | Update dependencies |
| 2.1.18 | 2024-08-24 | 44734 | Update dependencies |
| 2.1.17 | 2024-08-17 | 44204 | Update dependencies |
| 2.1.16 | 2024-08-10 | 43607 | Update dependencies |
| 2.1.15 | 2024-08-03 | 43071 | Update dependencies |
| 2.1.14 | 2024-07-27 | 42627 | Update dependencies |
| 2.1.13 | 2024-07-20 | 42364 | Update dependencies |
| 2.1.12 | 2024-07-13 | 41893 | Update dependencies |
| 2.1.11 | 2024-07-10 | 41344 | Update dependencies |
| 2.1.10 | 2024-07-09 | 41189 | Update dependencies |
| 2.1.9 | 2024-07-06 | 40813 | Update dependencies |
| 2.1.8 | 2024-06-25 | 40411 | Update dependencies |
| 2.1.7 | 2024-06-22 | 40187 | Update dependencies |
| 2.1.6 | 2024-06-04 | 39010 | [autopull] Upgrade base image to v1.2.1 |
| 2.1.5 | 2024-04-19 | 37115 | Updating to 0.80.0 CDK |
| 2.1.4 | 2024-04-18 | 37115 | Manage dependencies with Poetry. |
| 2.1.3 | 2024-04-15 | 37115 | Base image migration: remove Dockerfile and use the python-connector-base image |
| 2.1.2 | 2024-04-12 | 37115 | schema descriptions |
| 2.1.1 | 2023-12-14 | 33414 | Prepare for airbyte-lib |
| 2.1.0 | 2023-10-13 | 31333 | Add stream for arbitrary datasets |
| 2.0.0 | 2023-09-18 | 30428 | Fix broken stream, manifest refactor |
| 1.0.0 | 2023-08-25 | 29859 | Migrate to lowcode |
| 0.2.0 | 2022-06-20 | 28290 | Make connector work with platform changes not syncing empty stream schemas. |
| 0.1.11 | 2022-04-27 | 12397 | No changes. Used connector to test publish workflow changes. |
| 0.1.9 | 2022-04-05 | PR#11712 | No changes from 0.1.4. Used connector to test publish workflow changes. |
| 0.1.4 | 2021-12-23 | PR#8434 | Update fields in source-connectors specifications |
| 0.1.2 | 2021-11-08 | PR#7499 | Remove base-python dependencies |
| 0.1.0 | 2021-07-29 | PR#5069 | Initial version of the connector |