added table

Parashar Shah
2018-11-29 20:15:54 -08:00
committed by GitHub
parent df025e6a17
commit b0ff1e1a5d

@@ -6,21 +6,26 @@ Select New Cluster and fill in following detail:
- Databricks Runtime: Any 4.* runtime (no GPU). Recommended: 4.3 (includes Apache Spark 2.3.1, Scala 2.11)
- Python version: **3**
- Driver type: you may select a small driver node size (e.g. Standard_DS3_v2, 0.75 DBU)
- Worker node VM types: Memory optimized preferred. Please follow this table:

|**Dataset type** |**Dataset size** |**Preprocessed dataset size** |**Memory across cross validations (cv)** |**Recommended memory per concurrent run** |**Total memory required for cluster** |
|--|--|--|--|--|--|
|String & Numeric |X |3X |3 * X * cv |3 * X * cv * 3 |3 * X * cv * 3 * number of concurrent runs |
|Numeric |Y |Y |Y |3Y |3 * Y * number of concurrent runs |
- Number of concurrent runs should be less than or equal to the number
of cores in your Databricks cluster.
- For a 1 GB numeric-only dataset, to run 10 cross validations with 16 concurrent runs, the minimum usable cluster memory should be 1 GB X 16 concurrent runs X 3 = 48 GB. This is in addition to what Spark itself will use on your cluster.
- For a text dataset, with featurization (e.g. one hot encoding) and cross validation, this requirement is much higher. For a 500 MB string+numeric dataset, to run 5 cross validations with 4 concurrent runs, the minimum usable cluster memory should be 0.5 GB X 4 concurrent runs X 3 X 5 cross validations X 3 = 90 GB.
- Uncheck _Enable Autoscaling_
- Workers: 2 or higher
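The sizing rules above can be sketched as a small calculation. This is a minimal sketch of the rule of thumb from the table; `required_cluster_memory_gb` is a hypothetical helper, not part of any SDK.

```python
def required_cluster_memory_gb(dataset_gb, cross_validations, concurrent_runs,
                               is_text=False):
    """Estimate minimum usable cluster memory (GB) per the table above.

    Numeric-only data:       3 * dataset_size * concurrent_runs
    String+numeric (text):   3 * dataset_size * cv * 3 * concurrent_runs
    This is in addition to what Spark itself uses.
    """
    if is_text:
        return 3 * dataset_gb * cross_validations * 3 * concurrent_runs
    return 3 * dataset_gb * concurrent_runs

# Examples from the text:
# 1 GB numeric dataset, 16 concurrent runs
print(required_cluster_memory_gb(1, 10, 16))        # 48 GB
# 0.5 GB string+numeric dataset, 5 cv, 4 concurrent runs
print(required_cluster_memory_gb(0.5, 5, 4, True))  # 90 GB
```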
It will take a few minutes to create the cluster. Please ensure that the cluster state is running before proceeding further.
**Install Azure ML with Automated ML SDK**