MachineLearningNotebooks/how-to-use-azureml/data-drift/azure-ml-datadrift.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Track Data Drift between Training and Inference Data in Production \n",
    "\n",
    "With this notebook, you will learn how to enable the DataDrift service to automatically track and determine whether your inference data is drifting from the data your model was initially trained on. The DataDrift service provides metrics and visualizations to help stakeholders identify which specific features cause the concept drift to occur.\n",
    "\n",
    "Please email driftfeedback@microsoft.com with any issues. A member from the DataDrift team will respond shortly. \n",
    "\n",
    "The DataDrift Public Preview API can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-contrib-datadrift/?view=azure-ml-py). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/contrib/datadrift/azureml-datadrift.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prerequisites and Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install the DataDrift package\n",
    "\n",
    "Install the azureml-contrib-datadrift, azureml-contrib-opendatasets and lightgbm packages before running this notebook.\n",
    "```\n",
    "pip install azureml-contrib-datadrift\n",
    "pip install azureml-contrib-datasets\n",
    "pip install lightgbm\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import time\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import requests\n",
    "from azureml.contrib.datadrift import DataDriftDetector, AlertConfiguration\n",
    "from azureml.contrib.opendatasets import NoaaIsdWeather\n",
    "from azureml.core import Dataset, Workspace, Run\n",
    "from azureml.core.compute import AksCompute, ComputeTarget\n",
    "from azureml.core.conda_dependencies import CondaDependencies\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.image import ContainerImage\n",
    "from azureml.core.model import Model\n",
    "from azureml.core.webservice import Webservice, AksWebservice\n",
    "from azureml.widgets import RunDetails\n",
    "from sklearn.externals import joblib\n",
    "from sklearn.model_selection import train_test_split\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up Configuraton and Create Azure ML Workspace\n",
    "\n",
    "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../configuration.ipynb) first if you haven't already to establish your connection to the AzureML Workspace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please type in your initials/alias. The prefix is prepended to the names of resources created by this notebook. \n",
    "prefix = \"dd\"\n",
    "\n",
    "# NOTE: Please do not change the model_name, as it's required by the score.py file\n",
    "model_name = \"driftmodel\"\n",
    "image_name = \"{}driftimage\".format(prefix)\n",
    "service_name = \"{}driftservice\".format(prefix)\n",
    "\n",
    "# optionally, set email address to receive an email alert for DataDrift\n",
    "email_address = \"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Train/Testing Data\n",
    "\n",
    "For this demo, we will use NOAA weather data from [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/). You may replace this step with your own dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "usaf_list = ['725724', '722149', '723090', '722159', '723910', '720279',\n",
    "             '725513', '725254', '726430', '720381', '723074', '726682',\n",
    "             '725486', '727883', '723177', '722075', '723086', '724053',\n",
    "             '725070', '722073', '726060', '725224', '725260', '724520',\n",
    "             '720305', '724020', '726510', '725126', '722523', '703333',\n",
    "             '722249', '722728', '725483', '722972', '724975', '742079',\n",
    "             '727468', '722193', '725624', '722030', '726380', '720309',\n",
    "             '722071', '720326', '725415', '724504', '725665', '725424',\n",
    "             '725066']\n",
    "\n",
    "columns = ['usaf', 'wban', 'datetime', 'latitude', 'longitude', 'elevation', 'windAngle', 'windSpeed', 'temperature', 'stationName', 'p_k']\n",
    "\n",
    "def enrich_weather_noaa_data(noaa_df):\n",
    "    hours_in_day = 23\n",
    "    week_in_year = 52\n",
    "    \n",
    "\n",
    "    noaa_df = noaa_df.assign(hour=noaa_df[\"datetime\"].dt.hour,\n",
    "                          weekofyear=noaa_df[\"datetime\"].dt.week,\n",
    "                          sine_weekofyear=noaa_df['datetime'].transform(lambda x: np.sin((2*np.pi*x.dt.week-1)/week_in_year)),\n",
    "                          cosine_weekofyear=noaa_df['datetime'].transform(lambda x: np.cos((2*np.pi*x.dt.week-1)/week_in_year)),\n",
    "                          sine_hourofday=noaa_df['datetime'].transform(lambda x: np.sin(2*np.pi*x.dt.hour/hours_in_day)),\n",
    "                          cosine_hourofday=noaa_df['datetime'].transform(lambda x: np.cos(2*np.pi*x.dt.hour/hours_in_day))\n",
    "                  )\n",
    "    \n",
    "    return noaa_df\n",
    "\n",
    "\n",
    "def add_window_col(input_df):\n",
    "    shift_interval = pd.Timedelta('-7 days') # your X days interval\n",
    "    df_shifted = input_df.copy()\n",
    "    df_shifted.loc[:,'datetime'] = df_shifted['datetime'] - shift_interval\n",
    "    df_shifted.drop(list(input_df.columns.difference(['datetime', 'usaf', 'wban', 'sine_hourofday', 'temperature'])), axis=1, inplace=True)\n",
    "\n",
    "    # merge, keeping only observations where -1 lag is present\n",
    "    df2 = pd.merge(input_df,\n",
    "                   df_shifted,\n",
    "                   on=['datetime', 'usaf', 'wban', 'sine_hourofday'],\n",
    "                   how='inner',  # use 'left' to keep observations without lags\n",
    "                   suffixes=['', '-7'])\n",
    "    return df2\n",
    "\n",
    "def get_noaa_data(start_time, end_time, cols, station_list):\n",
    "    isd = NoaaIsdWeather(start_time, end_time, cols=cols)\n",
    "    # Read into Pandas data frame.\n",
    "    noaa_df = isd.to_pandas_dataframe()\n",
    "    noaa_df = noaa_df.rename(columns={\"stationName\": \"station_name\"})\n",
    "    \n",
    "    df_filtered = noaa_df[noaa_df[\"usaf\"].isin(station_list)]\n",
    "    df_filtered.reset_index(drop=True)\n",
    "    \n",
    "    # Enrich with time features\n",
    "    df_enriched = enrich_weather_noaa_data(df_filtered)\n",
    "    \n",
    "    return df_enriched\n",
    "\n",
    "def get_featurized_noaa_df(start_time, end_time, cols, station_list):\n",
    "    df_1 = get_noaa_data(start_time - timedelta(days=7), start_time - timedelta(seconds=1), cols, station_list)\n",
    "    df_2 = get_noaa_data(start_time, end_time, cols, station_list)\n",
    "    noaa_df = pd.concat([df_1, df_2])\n",
    "    \n",
    "    print(\"Adding window feature\")\n",
    "    df_window = add_window_col(noaa_df)\n",
    "    \n",
    "    cat_columns = df_window.dtypes == object\n",
    "    cat_columns = cat_columns[cat_columns == True]\n",
    "    \n",
    "    print(\"Encoding categorical columns\")\n",
    "    df_encoded = pd.get_dummies(df_window, columns=cat_columns.keys().tolist())\n",
    "    \n",
    "    print(\"Dropping unnecessary columns\")\n",
    "    df_featurized = df_encoded.drop(['windAngle', 'windSpeed', 'datetime', 'elevation'], axis=1).dropna().drop_duplicates()\n",
    "    \n",
    "    return df_featurized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train model on Jan 1 - 14, 2009 data\n",
    "df = get_featurized_noaa_df(datetime(2009, 1, 1), datetime(2009, 1, 14, 23, 59, 59), columns, usaf_list)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label = \"temperature\"\n",
    "x_df = df.drop(label, axis=1)\n",
    "y_df = df[[label]]\n",
    "x_train, x_test, y_train, y_test = train_test_split(df, y_df, test_size=0.2, random_state=223)\n",
    "print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)\n",
    "\n",
    "training_dir = 'outputs/training'\n",
    "training_file = \"training.csv\"\n",
    "\n",
    "# Generate training dataframe to register as Training Dataset\n",
    "os.makedirs(training_dir, exist_ok=True)\n",
    "training_df = pd.merge(x_train.drop(label, axis=1), y_train, left_index=True, right_index=True)\n",
    "training_df.to_csv(training_dir + \"/\" + training_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create/Register Training Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_name = \"dataset\"\n",
    "name_suffix = datetime.utcnow().strftime(\"%Y-%m-%d-%H-%M-%S\")\n",
    "snapshot_name = \"snapshot-{}\".format(name_suffix)\n",
    "\n",
    "dstore = ws.get_default_datastore()\n",
    "dstore.upload(training_dir, \"data/training\", show_progress=True)\n",
    "dpath = dstore.path(\"data/training/training.csv\")\n",
    "trainingDataset = Dataset.auto_read_files(dpath, include_path=True)\n",
    "trainingDataset = trainingDataset.register(workspace=ws, name=dataset_name, description=\"dset\", exist_ok=True)\n",
    "\n",
    "trainingDataSnapshot = trainingDataset.create_snapshot(snapshot_name=snapshot_name, compute_target=None, create_data_snapshot=True)\n",
    "datasets = [(Dataset.Scenario.TRAINING, trainingDataSnapshot)]\n",
    "print(\"dataset registration done.\\n\")\n",
    "datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train and Save Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "\n",
    "train = lgb.Dataset(data=x_train, \n",
    "                    label=y_train)\n",
    "\n",
    "test = lgb.Dataset(data=x_test, \n",
    "                   label=y_test,\n",
    "                   reference=train)\n",
    "\n",
    "params = {'learning_rate'    : 0.1,\n",
    "          'boosting'         : 'gbdt',\n",
    "          'metric'           : 'rmse',\n",
    "          'feature_fraction' : 1,\n",
    "          'bagging_fraction' : 1,\n",
    "          'max_depth': 6,\n",
    "          'num_leaves'       : 31,\n",
    "          'objective'        : 'regression',\n",
    "          'bagging_freq'     : 1,\n",
    "          \"verbose\": -1,\n",
    "          'min_data_per_leaf': 100}\n",
    "\n",
    "model = lgb.train(params, \n",
    "                  num_boost_round=500,\n",
    "                  train_set=train,\n",
    "                  valid_sets=[train, test],\n",
    "                  verbose_eval=50,\n",
    "                  early_stopping_rounds=25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_file = 'outputs/{}.pkl'.format(model_name)\n",
    "\n",
    "os.makedirs('outputs', exist_ok=True)\n",
    "joblib.dump(model, model_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Register Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Model.register(model_path=model_file,\n",
    "                       model_name=model_name,\n",
    "                       workspace=ws,\n",
    "                       datasets=datasets)\n",
    "\n",
    "print(model_name, image_name, service_name, model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy Model To AKS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn', 'joblib', 'lightgbm', 'pandas'],\n",
    "                                 pip_packages=['azureml-monitoring', 'azureml-sdk[automl]'])\n",
    "\n",
    "with open(\"myenv.yml\",\"w\") as f:\n",
    "    f.write(myenv.serialize_to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Image creation may take up to 15 minutes.\n",
    "\n",
    "image_name = image_name + str(model.version)\n",
    "\n",
    "if not image_name in ws.images:\n",
    "    # Use the score.py defined in this directory as the execution script\n",
    "    # NOTE: The Model Data Collector must be enabled in the execution script for DataDrift to run correctly\n",
    "    image_config = ContainerImage.image_configuration(execution_script=\"score.py\",\n",
    "                                                      runtime=\"python\",\n",
    "                                                      conda_file=\"myenv.yml\",\n",
    "                                                      description=\"Image with weather dataset model\")\n",
    "    image = ContainerImage.create(name=image_name,\n",
    "                                  models=[model],\n",
    "                                  image_config=image_config,\n",
    "                                  workspace=ws)\n",
    "\n",
    "    image.wait_for_creation(show_output=True)\n",
    "else:\n",
    "    image = ws.images[image_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Compute Target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aks_name = 'dd-demo-e2e'\n",
    "prov_config = AksCompute.provisioning_configuration()\n",
    "\n",
    "if not aks_name in ws.compute_targets:\n",
    "    aks_target = ComputeTarget.create(workspace=ws,\n",
    "                                      name=aks_name,\n",
    "                                      provisioning_configuration=prov_config)\n",
    "\n",
    "    aks_target.wait_for_completion(show_output=True)\n",
    "    print(aks_target.provisioning_state)\n",
    "    print(aks_target.provisioning_errors)\n",
    "else:\n",
    "    aks_target=ws.compute_targets[aks_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy Service"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aks_service_name = service_name\n",
    "\n",
    "if not aks_service_name in ws.webservices:\n",
    "    aks_config = AksWebservice.deploy_configuration(collect_model_data=True, enable_app_insights=True)\n",
    "    aks_service = Webservice.deploy_from_image(workspace=ws,\n",
    "                                               name=aks_service_name,\n",
    "                                               image=image,\n",
    "                                               deployment_config=aks_config,\n",
    "                                               deployment_target=aks_target)\n",
    "    aks_service.wait_for_deployment(show_output=True)\n",
    "    print(aks_service.state)\n",
    "else:\n",
    "    aks_service = ws.webservices[aks_service_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run DataDrift Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Send Scoring Data to Service"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Scoring Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Score Model on March 15, 2016 data\n",
    "scoring_df = get_noaa_data(datetime(2016, 3, 15) - timedelta(days=7), datetime(2016, 3, 16),  columns, usaf_list)\n",
    "# Add the window feature column\n",
    "scoring_df = add_window_col(scoring_df)\n",
    "\n",
    "# Drop features not used by the model\n",
    "print(\"Dropping unnecessary columns\")\n",
    "scoring_df = scoring_df.drop(['windAngle', 'windSpeed', 'datetime', 'elevation'], axis=1).dropna()\n",
    "scoring_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One Hot Encode the scoring dataset to match the training dataset schema\n",
    "columns_dict = model.datasets[\"training\"][0].get_profile().columns\n",
    "extra_cols = ('Path', 'Column1')\n",
    "for k in extra_cols:\n",
    "    columns_dict.pop(k, None)\n",
    "training_columns = list(columns_dict.keys())\n",
    "\n",
    "categorical_columns = scoring_df.dtypes == object\n",
    "categorical_columns = categorical_columns[categorical_columns == True]\n",
    "\n",
    "test_df = pd.get_dummies(scoring_df[categorical_columns.keys().tolist()])\n",
    "encoded_df = scoring_df.join(test_df)\n",
    "\n",
    "# Populate missing OHE columns with 0 values to match traning dataset schema\n",
    "difference = list(set(training_columns) - set(encoded_df.columns.tolist()))\n",
    "for col in difference:\n",
    "    encoded_df[col] = 0\n",
    "encoded_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Serialize dataframe to list of row dictionaries\n",
    "encoded_dict = encoded_df.to_dict('records')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Submit Scoring Data to Service"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# retreive the API keys. AML generates two keys.\n",
    "key1, key2 = aks_service.get_keys()\n",
    "\n",
    "total_count = len(scoring_df)\n",
    "i = 0\n",
    "load = []\n",
    "for row in encoded_dict:\n",
    "    load.append(row)\n",
    "    i = i + 1\n",
    "    if i % 100 == 0:\n",
    "        payload = json.dumps({\"data\": load})\n",
    "        \n",
    "        # construct raw HTTP request and send to the service\n",
    "        payload_binary = bytes(payload,encoding = 'utf8')\n",
    "        headers = {'Content-Type':'application/json', 'Authorization': 'Bearer ' + key1}\n",
    "        resp = requests.post(aks_service.scoring_uri, payload_binary, headers=headers)\n",
    "        \n",
    "        print(\"prediction:\", resp.content, \"Progress: {}/{}\".format(i, total_count))   \n",
    "\n",
    "        load = []\n",
    "        time.sleep(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure DataDrift"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "services = [service_name]\n",
    "start = datetime.now() - timedelta(days=2)\n",
    "end = datetime(year=2020, month=1, day=22, hour=15, minute=16)\n",
    "feature_list = ['usaf', 'wban', 'latitude', 'longitude', 'station_name', 'p_k',  'sine_hourofday', 'cosine_hourofday', 'temperature-7']\n",
    "alert_config = AlertConfiguration([email_address]) if email_address else None\n",
    "\n",
    "# there will be an exception indicating using get() method if DataDrift object already exist\n",
    "try:\n",
    "    datadrift = DataDriftDetector.create(ws, model.name, model.version, services, frequency=\"Day\", alert_config=alert_config)\n",
    "except KeyError:\n",
    "    datadrift = DataDriftDetector.get(ws, model.name, model.version)\n",
    "    \n",
    "print(\"Details of DataDrift Object:\\n{}\".format(datadrift))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run an Adhoc DataDriftDetector Run"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_date = datetime.today()\n",
    "run = datadrift.run(target_date, services, feature_list=feature_list, create_compute_target=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exp = Experiment(ws, datadrift._id)\n",
    "dd_run = Run(experiment=exp, run_id=run)\n",
    "RunDetails(dd_run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get Drift Analysis Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "children = list(dd_run.get_children())\n",
    "for child in children:\n",
    "    child.wait_for_completion()\n",
    "\n",
    "drift_metrics = datadrift.get_output(start_time=start, end_time=end)\n",
    "drift_metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show all drift figures, one per serivice.\n",
    "# If setting with_details is False (by default), only drift will be shown; if it's True, all details will be shown.\n",
    "\n",
    "drift_figures = datadrift.show(with_details=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Enable DataDrift Schedule"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datadrift.enable_schedule()"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "rafarmah"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}