From 01d391f5c2b0e017b08040b9adc81b0c78cf1f71 Mon Sep 17 00:00:00 2001
From: Parashar Shah <17348989+parasharshah@users.noreply.github.com>
Date: Fri, 30 Nov 2018 11:14:49 -0800
Subject: [PATCH] updated table

---
 databricks/automl_adb_readme.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/databricks/automl_adb_readme.md b/databricks/automl_adb_readme.md
index 600a4cc8..d3f21735 100644
--- a/databricks/automl_adb_readme.md
+++ b/databricks/automl_adb_readme.md
@@ -17,22 +17,21 @@ Select New Cluster and fill in following detail:
 - Databricks Runtime: Any 4.* runtime (NO GPU) or Recommended: 4.3(includes Apache spark 2.3.1, Scala 2.11))
 - Python version: **3**
 - Driver type – you may select a small driver node size (eg. Standard_DS3_v2 0.75 DBU)
-  - Worker node VM types: Memory optimized preferred. Please follow this table.
+  - Worker node VM types: **Memory optimized VM** preferred. Please follow this table.
 
 |**Dataset type** | **Dataset size** |**Preprocessed dataset size** | **Number of cross validations (cv)** |**Recommended memory per concurrency for VM**|**Total memory required for cluster** |
 |--|--|--|--|--|--|
-|String & Numeric | X |3X | 3 * X * cv |3 * X * cv * 3 | 3 * X * cv * 3 * number of concurrent runs |
-|Numeric | Y |Y| Y |3Y| 3 * Y * number of concurrent runs |
+|String & Numeric (preprocessing True) | X |5 * X | 5 * X * cv |5 * X * cv * 3 | 5 * X * cv * 3 * number of concurrent runs |
+|Numeric (preprocessing False) | Y |1.5 * Y| 1.5 * Y |5 * Y| 5 * Y * number of concurrent runs |
+|Numeric (preprocessing True) | Y |1.5 * Y| 1.5 * Y * cv |5 * Y * cv| 5 * cv * Y * number of concurrent runs |
 
-- Number of concurrent runs should be less than or equal to the number
-of cores in your Databricks cluster.
+- Number of concurrent runs should be less than or equal to the number of cores in your Databricks cluster.
 
 - For a 1 GB numeric only dataset, to do 10 cross validations with run 16 concurrent runs, the minimum usable cluster memory should be 1 GB X 16 concurrent runs X 3 = 48 GB. This is in addition to what Spark itself will use on your cluster.
-- For text dataset, with featurization (eg. one hot encoding) & cross validation this requirement is much higher. For a 500 MB
-string+numeric dataset, to do 5 cross validation with 4 concurrent
-runs, the minimum usable cluster memory should be 0.5 GB X 4
+- For string & numeric dataset, with featurization (eg. one hot encoding) & cross validation this requirement is much higher. For a 500 MB string+numeric dataset, to do 5 cross validation with 4 concurrent runs, the minimum usable cluster memory should be 0.5 GB X 4
 concurrent runs X 3 X 5 cross validations X 3 = 90 GB
+- For text dataset, TBD.
 
 - Uncheck _Enable Autoscaling_
 - Workers: 2 or higher
 
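The sizing rules in the patched table reduce to one multiplication: total memory = 5 × preprocessed-input size × concurrent runs, times cv when preprocessing is on, times an extra 3 for string featurization. A minimal sketch of that arithmetic follows; the function name and keyword arguments are invented for illustration and are not part of the AutoML SDK or Databricks. Note the worked examples in the README text still use the older 3× factor (48 GB, 90 GB), so the numbers below follow the new table rather than those examples.

```python
# Illustrative sketch of the memory-sizing arithmetic in the updated table.
# cluster_memory_gb and its parameters are hypothetical names, not an API.

def cluster_memory_gb(dataset_gb, cv, concurrent_runs,
                      has_strings=False, preprocess=True):
    """Rough minimum usable cluster memory (GB) per the updated table."""
    total = 5 * dataset_gb * concurrent_runs   # 5x expansion per concurrent run
    if preprocess:
        total *= cv                            # one working copy per CV fold
    if has_strings:
        total *= 3                             # extra featurization headroom
    return total

# String & numeric with preprocessing: 5 * 0.5 GB * 5 cv * 3 * 4 runs
print(cluster_memory_gb(0.5, cv=5, concurrent_runs=4, has_strings=True))   # 150.0
# Numeric with preprocessing off: 5 * 1 GB * 16 runs
print(cluster_memory_gb(1, cv=10, concurrent_runs=16, preprocess=False))   # 80
```

This is in addition to Spark's own memory use, and concurrent runs should stay at or below the cluster's core count, as the patched text notes.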