# Incremental reads

In this section, we'll add support for reading data incrementally. While this is optional, you should implement it whenever possible, because reading in incremental mode allows users to save time and money by only reading new data.

We'll first need to implement three new methods on the base stream class.

The `cursor_field` property indicates that records produced by the stream have a cursor field that can be used to identify their position in the timeline.

```python
@property
def cursor_field(self) -> Optional[str]:
    return self._cursor_field
```

The `get_updated_state` method is used to update the stream's state. We'll set its value to the maximum of the current state's value and the value extracted from the latest record.

```python
# import the following library
import datetime
```

```python
def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, Any]:
    state_value = max(
        current_stream_state.get(self.cursor_field, 0),
        datetime.datetime.strptime(latest_record.get(self._cursor_field, ""), _INCOMING_DATETIME_FORMAT).timestamp(),
    )
    return {self._cursor_field: state_value}
```

Note that we're converting the datetimes to Unix timestamps. We could also have chosen to persist the cursor as an ISO date; use whichever format works best for you. Integers are easy to work with, so that's what we'll do for this tutorial.

Then we'll implement the `stream_slices` method, which will be used to partition the stream into time windows. This isn't strictly mandatory, since we could omit the `end_modified_at` parameter from our requests and try to read all new records at once, but partitioning the stream is preferable because it enables checkpointing. It might mean the connector makes more requests than necessary during the initial sync, which is most visible when working with a sandbox or an account that does not have many records. The upsides are worth the tradeoff: the additional cost is negligible for accounts that have many records, and the time cost will be entirely mitigated in a follow-up section when we fetch partitions concurrently.

```python
def stream_slices(
    self, stream_state: Mapping[str, Any] = None, **kwargs
) -> Iterable[Optional[Mapping[str, any]]]:
    start_ts = stream_state.get(self._cursor_field, _START_DATE) if stream_state else _START_DATE
    now_ts = datetime.datetime.now().timestamp()
    if start_ts >= now_ts:
        yield from []
        return
    for start, end in self.chunk_dates(start_ts, now_ts):
        yield {"start_date": start, "end_date": end}

def chunk_dates(self, start_date_ts: int, end_date_ts: int) -> Iterable[Tuple[int, int]]:
    step = int(_SLICE_RANGE * 24 * 60 * 60)
    after_ts = start_date_ts
    while after_ts < end_date_ts:
        before_ts = min(end_date_ts, after_ts + step)
        yield after_ts, before_ts
        after_ts = before_ts + 1
```

Note that we're introducing the concept of a start date. You might have to experiment to find the earliest start date that can be queried. You can also choose to make the start date configurable by the end user. This will make your life simpler, at the cost of pushing the complexity onto the end user.
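If you want to see how the partitioning behaves before wiring it into the connector, here is a minimal standalone sketch of the same chunking logic. It copies `chunk_dates` outside the stream class and uses an arbitrary date range purely for illustration; the `_SLICE_RANGE` value mirrors the constant we'll define later in this section.

```python
import datetime
from typing import Iterable, Tuple

_SLICE_RANGE = 365  # days per slice, matching the constant defined later in this section


def chunk_dates(start_date_ts: int, end_date_ts: int) -> Iterable[Tuple[int, int]]:
    # Split [start_date_ts, end_date_ts] into windows of at most _SLICE_RANGE days.
    step = int(_SLICE_RANGE * 24 * 60 * 60)
    after_ts = start_date_ts
    while after_ts < end_date_ts:
        before_ts = min(end_date_ts, after_ts + step)
        yield after_ts, before_ts
        after_ts = before_ts + 1


# Arbitrary example range: roughly two and a half years, split into one-year windows.
start = int(datetime.datetime(2020, 1, 1).timestamp())
end = int(datetime.datetime(2022, 6, 1).timestamp())
for start_ts, end_ts in chunk_dates(start, end):
    # Each pair becomes one stream slice, i.e. one {"start_date": ..., "end_date": ...} mapping.
    print(datetime.datetime.fromtimestamp(start_ts), "->", datetime.datetime.fromtimestamp(end_ts))
```

Each printed window corresponds to one slice, and the checkpointing benefit comes from the fact that state can be persisted after each slice completes.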
We'll now update the query params. In addition to passing the page size and the `include` field, we'll pass the `start_modified_at` and `end_modified_at` parameters, whose values can be extracted from the `stream_slice` argument.

```python
def request_params(
    self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, any] = None, next_page_token: Mapping[str, Any] = None
) -> MutableMapping[str, Any]:
    if next_page_token:
        return urlparse(next_page_token["next_url"]).query
    else:
        return {
            "per_page": _PAGE_SIZE,
            "include": "response_count,date_created,date_modified,language,question_count,analyze_url,preview,collect_stats",
            "start_modified_at": datetime.datetime.strftime(datetime.datetime.fromtimestamp(stream_slice["start_date"]), _OUTGOING_DATETIME_FORMAT),
            "end_modified_at": datetime.datetime.strftime(datetime.datetime.fromtimestamp(stream_slice["end_date"]), _OUTGOING_DATETIME_FORMAT),
        }
```

And add the following constants to the `source.py` file:

```python
_START_DATE = datetime.datetime(2020, 1, 1, 0, 0, 0).timestamp()
_SLICE_RANGE = 365
_OUTGOING_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
_INCOMING_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"
```

Notice that the outgoing and incoming date formats are different!

Now, update the stream constructor so it accepts a `cursor_field` parameter.

```python
class SurveyMonkeyBaseStream(HttpStream, ABC):
    def __init__(
        self, name: str, path: str, primary_key: Union[str, List[str]], data_field: Optional[str], cursor_field: Optional[str], **kwargs: Any
    ) -> None:
        self._name = name
        self._path = path
        self._primary_key = primary_key
        self._data_field = data_field
        self._cursor_field = cursor_field
        super().__init__(**kwargs)
```

And update the stream's creation:

```python
return [
    SurveyMonkeyBaseStream(
        name="surveys", path="/v3/surveys", primary_key="id", data_field="data", cursor_field="date_modified", authenticator=auth
    )
]
```

Finally, modify the configured catalog to run the stream in incremental mode:

```json
{
  "streams": [
    {
      "stream": {
        "name": "surveys",
        "json_schema": {},
        "supported_sync_modes": ["full_refresh", "incremental"]
      },
      "sync_mode": "incremental",
      "destination_sync_mode": "overwrite"
    }
  ]
}
```

Run another read operation. The state messages should include the cursor:

```json
{
  "type": "STATE",
  "state": {
    "type": "STREAM",
    "stream": {
      "stream_descriptor": {
        "name": "surveys",
        "namespace": null
      },
      "stream_state": {
        "date_modified": 1623348420.0
      }
    },
    "sourceStats": {
      "recordCount": 0.0
    }
  }
}
```

And update the sample state to a timestamp later than the first record, so that fewer records are read:

```json
[
  {
    "type": "STREAM",
    "stream": {
      "stream_descriptor": {
        "name": "surveys"
      },
      "stream_state": {
        "date_modified": 1711753326
      }
    }
  }
]
```

Run another read command, passing the `--state` flag:

```bash
poetry run source-survey-monkey-demo read --config secrets/config.json --catalog integration_tests/configured_catalog.json --state integration_tests/sample_state.json
```

Only more recent records should be read.

In the [next section](7-reading-from-a-subresource.md), we'll implement the survey responses stream, which depends on the surveys stream.