
# Enabling Data Collection for Models in Production
This notebook shows how to collect model input data from your Azure Machine Learning service into Azure Blob storage. Once data collection is enabled, the collected data lets you:

* Monitor data drifts as production data enters your model
* Make better decisions on when to retrain or optimize your model
* Retrain your model with the data collected

## What data is collected?
* Model input data (voice, images, and video are not supported) from services deployed in Azure Kubernetes Service (AKS)
* Model predictions made on production input data

**Note:** Pre-aggregation and pre-calculation on this data are left to the user and are not included in this version of the product.

## What is different compared to the standard production deployment process?
1. Update the scoring file.
2. Update the yml file with the new dependency.
3. Update the AKS configuration.
4. Build a new image and deploy it.

## 1. Import your dependencies

In [None]:
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.webservice import Webservice, AksWebservice
import azureml.core
print(azureml.core.VERSION)

## 2. Set up your configuration and create a workspace

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## 3. Register Model
Register an existing trained model and add a description and tags.

In [None]:
#Register the model
from azureml.core.model import Model
model = Model.register(model_path = "sklearn_regression_model.pkl", # this points to a local file
                       model_name = "sklearn_regression_model.pkl", # this is the name the model is registered as
                       tags = {'area': "diabetes", 'type': "regression"},
                       description = "Ridge regression model to predict diabetes",
                       workspace = ws)

print(model.name, model.description, model.version)

## 4. *Update your scoring file with Data Collection*
The file below, compared to the file used in notebook 11, has the following changes:
### a. Import the module
```python
from azureml.monitoring import ModelDataCollector
```
### b. In your init function add:
```python
global inputs_dc, prediction_dc
inputs_dc = ModelDataCollector("best_model", identifier="inputs", feature_names=["feat1", "feat2", "feat3", "feat4", "feat5", "Feat6"])
prediction_dc = ModelDataCollector("best_model", identifier="predictions", feature_names=["prediction1", "prediction2"])
```
    
* Identifier: used later to build the folder structure in your Blob storage; you can use it to separate "raw" data from "processed" data.
* CorrelationId: an optional parameter; you do not need to set it if your model doesn't require it. A correlationId makes it easier to map the collected data to other data (for example, LoanNumber or CustomerId).
* Feature Names: list these in the same order as your features so that the columns are named correctly when the .csv is created.
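
To make the ordering point concrete, here is a minimal sketch that uses only the standard library. It is not how `ModelDataCollector` writes its files; the helper `rows_to_csv` and the sample values are hypothetical and only illustrate that the n-th feature name labels the n-th value of each row.

```python
import csv
import io

def rows_to_csv(feature_names, rows):
    """Write rows under a header taken from feature_names, position by position."""
    for row in rows:
        if len(row) != len(feature_names):
            raise ValueError("each row needs exactly one value per feature name")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(feature_names)  # header row, in the order given
    writer.writerows(rows)          # data rows, in the same column order
    return buffer.getvalue()

# Hypothetical three-feature example
csv_text = rows_to_csv(["feat1", "feat2", "feat3"], [[1, 2, 3], [10, 9, 8]])
print(csv_text)
```

If the names were listed in a different order than the values, the header would still be written, but the columns would be mislabeled.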

### c. In your run function add:
```python
inputs_dc.collect(data)
prediction_dc.collect(result)
```
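
Before building an image, it can help to check the flow of the `init`/`run` changes locally. The sketch below is one way to do that under stated assumptions: `StubCollector` is a hypothetical in-memory stand-in (not part of `azureml.monitoring`), and `fake_predict` is a placeholder for `model.predict`.

```python
import json

class StubCollector:
    """Hypothetical stand-in that keeps rows in memory instead of writing to Blob storage."""

    def __init__(self, model_name, identifier, feature_names):
        self.model_name = model_name
        self.identifier = identifier
        self.feature_names = feature_names
        self.collected = []

    def collect(self, rows):
        # Record every row that would have been sent to storage
        self.collected.extend(rows)

def fake_predict(rows):
    # Placeholder for model.predict: returns one number per input row
    return [sum(row) for row in rows]

inputs_dc = StubCollector("best_model", identifier="inputs", feature_names=["feat1", "feat2"])
prediction_dc = StubCollector("best_model", identifier="predictions", feature_names=["prediction1"])

def run(raw_data):
    data = json.loads(raw_data)["data"]
    result = fake_predict(data)
    inputs_dc.collect(data)        # inputs go to the "inputs" collector
    prediction_dc.collect(result)  # predictions go to the "predictions" collector
    return result

output = run(json.dumps({"data": [[1, 2], [3, 4]]}))
print(output, inputs_dc.collected, prediction_dc.collected)
```

The stub only confirms that each collector receives the rows you expect; it says nothing about how the real collector batches or uploads data.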

In [None]:
%%writefile score.py
import pickle
import json
import numpy 
from sklearn.externals import joblib
from sklearn.linear_model import Ridge
from azureml.core.model import Model
from azureml.monitoring import ModelDataCollector
import time

def init():
    global model
    print ("model initialized" + time.strftime("%H:%M:%S"))
    # note here "sklearn_regression_model.pkl" is the name of the model registered under the workspace
    # this call should return the path to the model.pkl file on the local disk.
    model_path = Model.get_model_path(model_name = 'sklearn_regression_model.pkl')
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    global inputs_dc, prediction_dc
    # this setup will help us save our inputs under the "inputs" path in our Azure Blob
    inputs_dc = ModelDataCollector(model_name="sklearn_regression_model", identifier="inputs", feature_names=["feat1", "feat2"]) 
    # this setup will help us save our predictions under the "predictions" path in our Azure Blob
    prediction_dc = ModelDataCollector("sklearn_regression_model", identifier="predictions", feature_names=["prediction1", "prediction2"]) 
  
# note you can pass in multiple rows for scoring
def run(raw_data):
    global inputs_dc, prediction_dc
    try:
        data = json.loads(raw_data)['data']
        data = numpy.array(data)
        result = model.predict(data)
        print ("saving input data" + time.strftime("%H:%M:%S"))
        inputs_dc.collect(data) #this call is saving our input data into our blob
        prediction_dc.collect(result)#this call is saving our prediction data into our blob
        print ("saving prediction data" + time.strftime("%H:%M:%S"))
        # you can return any data type as long as it is JSON-serializable
        return result.tolist()
    except Exception as e:
        error = str(e)
        print (error + time.strftime("%H:%M:%S"))
        return error

## 5. *Update your myenv.yml file with the required module*

In [None]:
from azureml.core.conda_dependencies import CondaDependencies 

myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'])
myenv.add_pip_package("azureml-monitoring")

with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

## 6. Create your new Image

In [None]:
from azureml.core.image import ContainerImage

image_config = ContainerImage.image_configuration(execution_script = "score.py",
                                                  runtime = "python",
                                                  conda_file = "myenv.yml",
                                                  description = "Image with ridge regression model",
                                                  tags = {'area': "diabetes", 'type': "regression"}
                                                 )

image = ContainerImage.create(name = "myimage1",
                              # this is the model object
                              models = [model],
                              image_config = image_config,
                              workspace = ws)

image.wait_for_creation(show_output = True)

In [None]:
print(model.name, model.description, model.version)

## 7. Deploy to AKS service

### Create AKS compute if you haven't done so.

In [None]:
# Use the default configuration (can also provide parameters to customize)
prov_config = AksCompute.provisioning_configuration()

aks_name = 'my-aks-test1' 
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws, 
                                  name = aks_name, 
                                  provisioning_configuration = prov_config)

In [None]:
%%time
aks_target.wait_for_completion(show_output = True)
print(aks_target.provisioning_state)
print(aks_target.provisioning_errors)

If you already have a cluster you can attach the service to it:

```python 
    %%time
    resource_id = '/subscriptions/<subscriptionid>/resourcegroups/<resourcegroupname>/providers/Microsoft.ContainerService/managedClusters/<aksservername>'
    create_name= 'myaks4'
    attach_config = AksCompute.attach_configuration(resource_id=resource_id)
    aks_target = ComputeTarget.attach(workspace = ws, 
                                      name = create_name, 
                                      attach_configuration=attach_config)
    ## Wait for the operation to complete
    aks_target.wait_for_provisioning(True)```

### a. *Activate Data Collection and App Insights through updating AKS Webservice configuration*
In order to enable Data Collection and App Insights in your service you will need to update your AKS configuration file:

In [None]:
#Set the web service configuration
aks_config = AksWebservice.deploy_configuration(collect_model_data=True, enable_app_insights=True)

### b. Deploy your service

In [None]:
if aks_target.provisioning_state== "Succeeded": 
    aks_service_name ='aks-w-dc0'
    aks_service = Webservice.deploy_from_image(workspace = ws, 
                                               name = aks_service_name,
                                               image = image,
                                               deployment_config = aks_config,
                                               deployment_target = aks_target
                                               )
    aks_service.wait_for_deployment(show_output = True)
    print(aks_service.state)
else: 
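    # aks_service is not defined on this branch, so report the compute target's errors instead
    raise ValueError("AKS provisioning failed, can't deploy service. Error: ", aks_target.provisioning_errors)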

## 8. Test your service and send some data
**Note**: It takes around 15 minutes for your data to appear in your blob.
The data appears in your Azure Blob storage under a path of this format:

/modeldata/subscriptionid/resourcegroupname/workspacename/webservicename/modelname/modelversion/identifier/year/month/day/data.csv 
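
As a rough illustration of this format, the sketch below assembles a path from its components. The helper `build_blob_path` and all sample values are hypothetical, and writing month and day without zero-padding is an assumption; check the actual folder names in your storage account.

```python
from datetime import date

def build_blob_path(subscription_id, resource_group, workspace, webservice,
                    model_name, model_version, identifier, day):
    """Join the path components in the order given by the format above."""
    parts = [
        "modeldata", subscription_id, resource_group, workspace, webservice,
        model_name, str(model_version), identifier,
        str(day.year), str(day.month), str(day.day),  # assumption: no zero-padding
        "data.csv",
    ]
    return "/" + "/".join(parts)

# Hypothetical values for illustration only
example_path = build_blob_path("my-subscription-id", "my-resource-group", "my-workspace",
                               "aks-w-dc0", "sklearn_regression_model", 1,
                               "inputs", date(2019, 3, 5))
print(example_path)
```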

In [None]:
%%time
import json

test_sample = json.dumps({'data': [
    [1,2,3,4,54,6,7,8,88,10], 
    [10,9,8,37,36,45,4,33,2,1]
]})
test_sample = bytes(test_sample,encoding = 'utf8')

if aks_service.state == "Healthy":
    prediction = aks_service.run(input_data=test_sample)
    print(prediction)
else:
    raise ValueError("Service deployment isn't healthy, can't call the service. Error: ", aks_service.error)

## 9. Validate your data and analyze it
You can look into your data following this path format in your Azure Blob (it takes up to 15 minutes for the data to appear):

`/modeldata/<subscriptionid>/<resourcegroupname>/<workspacename>/<webservicename>/<modelname>/<modelversion>/<identifier>/<year>/<month>/<day>/data.csv`

For doing further analysis you have multiple options:

### a. Create a Databricks cluster and connect it to your blob
Follow https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal to create a workspace, or look for the template "Azure Blob Storage Import Example Notebook" in your Databricks workspace.


Here is an example for setting up the file location to extract the relevant data:

<code> file_location = "wasbs://mycontainer@storageaccountname.blob.core.windows.net/unknown/unknown/unknown-bigdataset-unknown/my_iterate_parking_inputs/2018/&deg;/&deg;/data.csv" 
file_type = "csv"</code>
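
The `wasbs://` location follows the pattern `wasbs://<container>@<storage account>.blob.core.windows.net/<path>`. A minimal sketch of composing it is shown here; the helper name `wasbs_url` and the sample path are hypothetical.

```python
def wasbs_url(container, storage_account, blob_path):
    """Compose a wasbs:// URL from a container, a storage account, and a blob path."""
    return "wasbs://{}@{}.blob.core.windows.net/{}".format(
        container, storage_account, blob_path.lstrip("/"))  # avoid a double slash

# Hypothetical container, account, and path
file_location = wasbs_url("mycontainer", "storageaccountname",
                          "/modeldata/my-subscription-id/my-resource-group/my-workspace/"
                          "aks-w-dc0/sklearn_regression_model/1/inputs/2019/3/5/data.csv")
file_type = "csv"
print(file_location)
```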


### b. Connect Blob to Power BI (Small Data only)
1. Download and open Power BI Desktop.
2. Select "Get Data" and click on "Azure Blob Storage" >> Connect.
3. Add your storage account and enter your storage key.
4. Select the container where your Data Collection is stored and click on Edit.
5. In the query editor, click under the "Name" column and add your storage account model path into the filter. Note: if you only want to look at files from a specific year or month, extend the filter path. For example, to look only at March data: `/modeldata/<subscriptionid>/<resourcegroupname>/<workspacename>/<webservicename>/<modelname>/<modelversion>/<identifier>/<year>/3`
6. Click on the double arrow beside the "Content" column to combine the files.
7. Click OK and the data will preload.
8. You can now click Close and Apply and start building your custom reports on your Model Input data.
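
The path filter in step 5 amounts to keeping only the files whose names start with a given prefix. The sketch below shows the same idea in plain Python, as an analogy rather than a description of how Power BI evaluates its filter; `filter_blobs` and the sample names are hypothetical.

```python
def filter_blobs(blob_names, prefix):
    """Keep names under the given path prefix, matching whole path segments."""
    # Appending "/" stops month "1" from also matching month "12"
    prefix = prefix.rstrip("/") + "/"
    return [name for name in blob_names if name.startswith(prefix)]

# Hypothetical blob names for one model and identifier
base = "/modeldata/sub/rg/ws/svc/model/1/inputs"
blob_names = [
    base + "/2019/1/15/data.csv",
    base + "/2019/3/5/data.csv",
    base + "/2019/12/2/data.csv",
]

january_only = filter_blobs(blob_names, base + "/2019/1")
whole_year = filter_blobs(blob_names, base + "/2019")
print(january_only)
print(len(whole_year))
```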

## Disable Data Collection

In [None]:
aks_service.update(collect_model_data=False)

## Clean up

In [None]:
%%time
aks_service.delete()
image.delete()
model.delete()