update to version 1.0.6
@@ -13,7 +13,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# Tutorial #1: Prepare data for regression modeling"
+ "# Tutorial (part 1): Prepare data for regression modeling"
  ]
  },
  {
@@ -101,7 +101,7 @@
  "all_columns = dprep.ColumnSelector(term=\".*\", use_regex=True)\n",
  "drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]\n",
  "useful_columns = [\n",
- "    \"cost\", \"distance\"\"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
+ "    \"cost\", \"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
  "    \"passengers\", \"pickup_datetime\", \"pickup_latitude\", \"pickup_longitude\", \"store_forward\", \"vendor\"\n",
  "]"
  ]
@@ -337,6 +337,23 @@
  "combined_df = combined_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Execute another `replace` function, this time on the `distance` field. This reformats distance values that are incorrectly labeled as `.00`, and fills any nulls with zeros. Convert the `distance` field to numerical format."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "combined_df = combined_df.replace(columns=\"distance\", find=\".00\", replace_with=0).fill_nulls(\"distance\", 0)\n",
+ "combined_df = combined_df.to_number([\"distance\"])"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
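The cell added above chains `replace`, `fill_nulls`, and `to_number` on the `distance` column. As a rough illustration of the same three steps, here is a pandas sketch (pandas stands in for `azureml.dataprep` here, and the sample values are made up):

```python
import pandas as pd

# Hypothetical sample mimicking the taxi data: ".00" placeholders and nulls.
df = pd.DataFrame({"distance": [".00", "3.2", None, "7.5"]})

# Same three steps as the added dataprep cell: replace ".00" with 0,
# fill remaining nulls with 0, then convert the column to numbers.
df["distance"] = df["distance"].replace(".00", 0)
df["distance"] = df["distance"].fillna(0)
df["distance"] = pd.to_numeric(df["distance"])

print(df["distance"].tolist())  # [0.0, 3.2, 0.0, 7.5]
```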
@@ -507,6 +524,23 @@
  "tmp_df.get_profile()"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before packaging the dataflow, perform two final filters on the data set. To eliminate incorrect data points, filter the dataflow on records where both the `cost` and `distance` are greater than zero."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tmp_df = tmp_df.filter(dprep.col(\"distance\") > 0)\n",
+ "tmp_df = tmp_df.filter(dprep.col(\"cost\") > 0)"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
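The filter cells added here keep only records with positive `cost` and `distance`. A pandas sketch of the equivalent row filtering (pandas and the sample rows are stand-ins, not the notebook's dataflow API):

```python
import pandas as pd

# Hypothetical rows standing in for the taxi dataflow.
df = pd.DataFrame({"cost": [5.0, 0.0, 12.5], "distance": [1.2, 0.0, -0.5]})

# Same two filters as the added cells: drop non-positive distance, then cost.
df = df[df["distance"] > 0]
df = df[df["cost"] > 0]

print(df["cost"].tolist())  # [5.0]
```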
@@ -520,9 +554,12 @@
  "metadata": {},
  "outputs": [],
  "source": [
+ "import os\n",
+ "file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
+ "\n",
  "dflow_prepared = tmp_df\n",
  "package = dprep.Package([dflow_prepared])\n",
- "package.save(\".\\dflows\")"
+ "package.save(file_path)"
  ]
  },
  {
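The fix above swaps the hard-coded relative Windows path `.\dflows` for an absolute, platform-neutral one. A minimal sketch of the new path construction:

```python
import os

# Build an absolute path to the package file in the current working
# directory; os.path.join uses the correct separator on any OS, unlike
# a hard-coded backslash path.
file_path = os.path.join(os.getcwd(), "dflows.dprep")

print(os.path.isabs(file_path))  # True
```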
@@ -536,7 +573,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Delete the file `dflows` (whether you are running locally or in Azure Notebooks) in your current directory if you do not wish to continue with part two of the tutorial. If you continue on to part two, you will need the `dflows` file in the current directory."
+ "Delete the file `dflows.dprep` (whether you are running locally or in Azure Notebooks) in your current directory if you do not wish to continue with part two of the tutorial. If you continue on to part two, you will need the `dflows.dprep` file in the current directory."
  ]
  },
  {
@@ -571,9 +608,9 @@
  }
  ],
  "kernelspec": {
- "display_name": "Python 3",
+ "display_name": "Python 3.6",
  "language": "python",
- "name": "python3"
+ "name": "python36"
  },
  "language_info": {
  "codemirror_mode": {
@@ -11,7 +11,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# Tutorial #2: Train a regression model with automated machine learning\n",
+ "# Tutorial (part 2): Use automated machine learning to build your regression model \n",
  "\n",
  "This tutorial is **part two of a two-part tutorial series**. In the previous tutorial, you [prepared the NYC taxi data for regression modeling](regression-part1-data-prep.ipynb).\n",
  "\n",
@@ -112,7 +112,11 @@
  "outputs": [],
  "source": [
  "import azureml.dataprep as dprep\n",
- "package_saved = dprep.Package.open(\".\\dflow\")\n",
+ "import os\n",
+ "\n",
+ "file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
+ "\n",
+ "package_saved = dprep.Package.open(file_path)\n",
  "dflow_prepared = package_saved.dataflows[0]\n",
  "dflow_prepared.get_profile()"
  ]
@@ -130,7 +134,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "dflow_X = dflow_prepared.keep_columns(['pickup_weekday', 'dropoff_latitude', 'dropoff_longitude','pickup_hour','pickup_longitude','pickup_latitude','passengers'])\n",
+ "dflow_X = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor'])\n",
  "dflow_y = dflow_prepared.keep_columns('cost')"
  ]
  },
@@ -155,7 +159,7 @@
  "x_df = dflow_X.to_pandas_dataframe()\n",
  "y_df = dflow_y.to_pandas_dataframe()\n",
  "\n",
- "x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=123)\n",
+ "x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)\n",
  "# flatten y_train to 1d array\n",
  "y_train.values.flatten()"
  ]
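Only the `random_state` seed changes in this hunk. A self-contained sketch of the split with synthetic arrays (the notebook's real `x_df`/`y_df` come from the dataflow):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for x_df / y_df: 100 rows, 5 feature columns.
rng = np.random.default_rng(0)
x_df = rng.normal(size=(100, 5))
y_df = rng.normal(size=(100, 1))

# test_size=0.2 holds out 20% of rows; a fixed random_state makes the
# shuffle reproducible (223 and 123 each give a stable but different split).
x_train, x_test, y_train, y_test = train_test_split(
    x_df, y_df, test_size=0.2, random_state=223)

# flatten y_train to a 1d array, as the notebook cell does
y_train = y_train.flatten()
print(x_train.shape, x_test.shape, y_train.shape)  # (80, 5) (20, 5) (80,)
```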
@@ -373,7 +377,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Compare the predicted cost values with the actual cost values. Use the `y_test` dataframe, and convert it to a list to compare to the predicted values. The function `mean_absolute_error` takes two arrays of values, and calculates the average absolute value error between them. In this example, a mean absolute error of 3.5 would mean that on average, the model predicts the cost within plus or minus 3.5 of the actual value."
+ "Create a scatter plot to visualize the predicted cost values compared to the actual cost values. The following code uses the `distance` feature as the x-axis, and trip `cost` as the y-axis. The first 100 predicted and actual cost values are created as separate series, in order to compare the variance of predicted cost at each trip distance value. Examining the plot shows that the distance/cost relationship is nearly linear, and the predicted cost values are in most cases very close to the actual cost values for the same trip distance."
  ]
  },
  {
@@ -382,10 +386,44 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "from sklearn.metrics import mean_absolute_error\n",
+ "import matplotlib.pyplot as plt\n",
  "\n",
+ "fig = plt.figure(figsize=(14, 10))\n",
+ "ax1 = fig.add_subplot(111)\n",
+ "\n",
+ "distance_vals = [x[4] for x in x_test.values]\n",
  "y_actual = y_test.values.flatten().tolist()\n",
- "mean_absolute_error(y_actual, y_predict)"
+ "\n",
+ "ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker=\"s\", label='Predicted')\n",
+ "ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker=\"o\", label='Actual')\n",
+ "\n",
+ "ax1.set_xlabel('distance (mi)')\n",
+ "ax1.set_title('Predicted and Actual Cost/Distance')\n",
+ "ax1.set_ylabel('Cost ($)')\n",
+ "\n",
+ "plt.legend(loc='upper left', prop={'size': 12})\n",
+ "plt.rcParams.update({'font.size': 14})\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Calculate the `root mean squared error` of the results. Use the `y_test` dataframe, and convert it to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values, and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable (cost), and indicates roughly how far your predictions are from the actual value. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import mean_squared_error\n",
+ "from math import sqrt\n",
+ "\n",
+ "rmse = sqrt(mean_squared_error(y_actual, y_predict))\n",
+ "rmse"
  ]
  },
  {
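This hunk swaps the mean-absolute-error printout for a scatter plot and adds an RMSE cell. Both metrics can be exercised in isolation with made-up cost values (the numbers below are illustrative only, not taxi data):

```python
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted trip costs, in dollars.
y_actual = [10.0, 7.5, 12.0, 5.0]
y_predict = [11.0, 7.0, 10.0, 6.0]

# MAE: mean of |actual - predicted|, in the same units as cost.
mae = mean_absolute_error(y_actual, y_predict)

# RMSE: square root of the mean squared error, also in cost units;
# it penalizes large errors more heavily than MAE does.
rmse = sqrt(mean_squared_error(y_actual, y_predict))

print(mae, rmse)  # 1.125 1.25
```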
@@ -444,9 +482,9 @@
  }
  ],
  "kernelspec": {
- "display_name": "Python 3",
+ "display_name": "Python 3.6",
  "language": "python",
- "name": "python3"
+ "name": "python36"
  },
  "language_info": {
  "codemirror_mode": {