update samples from Release-137 as a part of 1.0.53 SDK release

vizhur
2019-07-24 22:37:36 +00:00
parent ddfce6b24c
commit ee1da0ee19
57 changed files with 2778 additions and 511 deletions

View File

@@ -0,0 +1,45 @@
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQC/C0oc6vvF1UEc
y9JeGDXdtKynG11wTTIHIokFhNinHNSpJBLmNWFyFkqzvjJCPR4kWuqw4IXhCS3L
VoqRmT680SvUFFF6HnEaa75Bc1YSACn1ZsHuCRGrqO9BaTgt3mM0sRYC67+f+W0E
tA+k+EA0XnTtDdEBX3RLzvaYAR4yijEHIBQeeNemPYK4msW6Xw67ib1xn59blX4Z
a4Z85FjrekmoTl9493bFj6znDTX6wpKsPF7WLEF9S+oD/Lg4EHBi9BfefFxQpGZ9
FQHToFKyz1tA2iaY/9LjCtJcincMkuXt3KuQA4Nv2GiTzz4+FEy1pOqHnyNL2tFR
1G5n04BHAgMBAAECggEAAqcXeltQ76hMZSf3XdMcPF3b394jaAHKZgr2uBrmHzvp
QAf+MzAekET6+I/1hrHujzar95TGhx9ngWFMP0VPd7O31hQKJZXyoBlK5QHC+jEC
ZCPvIW0Cz81itRfO7eQeoIas9ZFscb4240/Uv8eqrI97NCdy9X/rz3mqNuYdEzqN
2v9XlwE/Fyx79O1PQqzPRiQt3n4ss9NO169y7X99KUZtYiZAiyBBGS8wYdaGF69G
URZ3qwoUE+nByZdeRfFLLTy+UDCOwQZV+0V4p0J++YLqQAac340A1F4D60qzMHnv
KVKnMc+RrYYVFOZU+USRlphSl3Ws5j0u94CiLitK4QKBgQDivJVHNmk1JleI/MPF
bx/YT5gzcVRFhGxkGso12JrQiFPs05JmoRFaqNBDNoZYDn2ggUrMwZVfPI5C6+7U
tCe2vrjVpvcAO9reK1u4N9ohpUpkocxWQy0nNHlrorDTZnyKreRtPC87W8xpiwl4
R/+nMgGd8vex7tGfchpThj8ZeQKBgQDXs2sgpE8vmnZBWrXAuGD8M9VnfcALEjwL
Fi3NR+XCr8jHkeIJVbSI2/asWsBGg8v6gV6Cdx9KV9r+fHDzdocS85X4P7crP83A
IX2rTT6Hsmc170SzCDa2jJJyLHQ6qtXBS9ZW8/dPFc1fiBf0NcmTLrRoNg5N8Px6
Qt0T51q3vwKBgQCYAfhOetMD2AW9iEAzwDFoUsxmSKdHx+TnI/LHMMVx4sPpNVqk
RX2d+ylMtmRQ6r4cejHMnkfnRnDVutkubu1lHe5LBpn35Sjx472k/oTWI7uBRdv5
RSYjb5GrsLG9uKrsSnKnLT85G20qoRUjN5nU3LiqzPZ0qviMXfH6ZzkseQKBgQCT
ft6MTY7QUGD4w5xxEiNPkeolgHmnmGpyclITg0x7WlSDEyBrna17wF3m8Y91KH58
56XGtMoyvezEBDgAY1ZuAR7VyEvqSRDahow2bPWLONUWrmxduAohvfIOHJPF4jeU
m9UPVHgSHih3YMpwda9G87LtZ7lUVqtutvYRvCvuZQKBgAypo514DZW7Y9lMCgkR
GpJLKCWFR0Sl9bQXI7N5nAG0YFz5ZhdA1PjS2tj+OKyWR6wekbv3g0CyVXT4XYsi
tKRu9PR2OUQLPv/h2qLAeSOYdScfWoOU5tlb4tkLoUNmj5/N9VpqbvLdDh6hPWQL
o4s+29QYKEoNmOrcZ6oRkRP8
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIICoTCCAYkCAgPoMA0GCSqGSIb3DQEBBQUAMBQxEjAQBgNVBAMMCUNMSS1Mb2dp
bjAiGA8yMDE5MDUwMzIwMDIwOVoYDzIwMjAwNTAzMjAwMjExWjAUMRIwEAYDVQQD
DAlDTEktTG9naW4wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC/C0oc
6vvF1UEcy9JeGDXdtKynG11wTTIHIokFhNinHNSpJBLmNWFyFkqzvjJCPR4kWuqw
4IXhCS3LVoqRmT680SvUFFF6HnEaa75Bc1YSACn1ZsHuCRGrqO9BaTgt3mM0sRYC
67+f+W0EtA+k+EA0XnTtDdEBX3RLzvaYAR4yijEHIBQeeNemPYK4msW6Xw67ib1x
n59blX4Za4Z85FjrekmoTl9493bFj6znDTX6wpKsPF7WLEF9S+oD/Lg4EHBi9Bfe
fFxQpGZ9FQHToFKyz1tA2iaY/9LjCtJcincMkuXt3KuQA4Nv2GiTzz4+FEy1pOqH
nyNL2tFR1G5n04BHAgMBAAEwDQYJKoZIhvcNAQEFBQADggEBAGz3pOgNPESr+QoO
OVCgSS6VtWlmrAcxl5JaiNBFpBGAqfvbfRe1eZY7Rn6fuw1jc3pPBVzNTf8Plel+
DcuLzDLJAEag2GpRE+Xg57DNSwPqP6jZfHRE/ufLwIRLcNG9wRUwqlBvdAu1Kign
nlTZvTEAwxlQdvmIIT1XrTLZ+OwtVXcgrf0vInmueZKz/UDqsSDPY+d426S9eOWt
60h2WgXPU3QvBYfA6Yd2ReeP3+SHwBd4/1ByNFWBytcI9ow3pp2JznU366dfX4IQ
Q0iOTvHzXbfPmtsxqho6+hBbLvXVNWJMg8e22Pp/TyXYqeV5V09k18EgCnuA/9Gd
kKDVROA=
-----END CERTIFICATE-----

View File

@@ -222,7 +222,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},

View File

@@ -47,6 +47,7 @@
"[Read PostgreSQL](#postgresql)<br>\n",
"[Read From Azure Blob](#azure-blob)<br>\n",
"[Read From ADLS](#adls)<br>\n",
"[Read From ADLSGen2](#adlsgen2)<br>\n",
"[Read Pandas DataFrame](#pandas-df)<br>"
]
},
@@ -315,6 +316,25 @@
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see in the results that the FBI Code column now contains some NaN values where before, when calling head, it didn't. By default, `to_pandas_dataframe` attempts to coalesce columns into a single type for better performance and lower memory overhead. This specific column has a mixture of both numbers and strings, and the strings were replaced with NaN values.\n",
"\n",
"If you wish to keep the mixed-type column in the Pandas DataFrame, you can set the `extended_types` argument to True when calling `to_pandas_dataframe`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow_skipped_rows.to_pandas_dataframe(extended_types=True)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -635,7 +655,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = dflow.to_pandas_dataframe()\n",
"df = dflow.to_pandas_dataframe(extended_types=True)\n",
"df.dtypes"
]
},
@@ -751,7 +771,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two ways the Data Prep API can acquire the necessary OAuth token to access Azure DataLake Storage:\n",
"Data Prep currently supports both ADLS and ADLSGen2. There are two ways the Data Prep API can acquire the necessary OAuth token to access Azure DataLake Storage:\n",
"1. Retrieve the access token from a recent login session of the user's [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) login.\n",
"2. Use a ServicePrincipal (SP) and a certificate as a secret."
]
@@ -883,6 +903,70 @@
"dflow.to_pandas_dataframe().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"adlsgen2\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read from ADLSGen2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please refer to the [Read From ADLS](#adls) section above for details on how to register a Service Principal and obtain an OAuth access token."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure ADLSGen2 Account for ServicePrincipal"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"certThumbprint = '23:66:84:6B:3A:14:9E:B1:17:CA:EE:E3:BB:2C:21:2D:20:B0:DF:F2'\n",
"certificate = ''\n",
"with open('../data/ADLSgen2-datapreptest.crt', 'rt', encoding='utf-8') as crtFile:\n",
" certificate = crtFile.read()\n",
"\n",
"servicePrincipalAppId = \"127a58c3-f307-46a1-969e-a6b63da3f411\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Acquire an OAuth Access Token for ADLSGen2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import adal\n",
"from azureml.dataprep.api.datasources import ADLSGen2\n",
"\n",
"ctx = adal.AuthenticationContext('https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47')\n",
"token = ctx.acquire_token_with_client_certificate('https://storage.azure.com/', servicePrincipalAppId, certificate, certThumbprint)\n",
"dflow = dprep.read_csv(path = ADLSGen2(path='https://adlsgen2datapreptest.dfs.core.windows.net/datapreptest/people.csv', accessToken=token['accessToken']))\n",
"dflow.to_pandas_dataframe().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -923,7 +1007,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"After loading in the data you can now do `read_pandas_dataframe`."
"After loading the data, you can now call `read_pandas_dataframe`. If you only need to consume the resulting Dataflow from the current environment, you can read the DataFrame in memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_df = dprep.read_pandas_dataframe(df, in_memory=True)\n",
"dflow_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, if you intend to use this Dataflow past the end of your current Python session (such as by saving the Dataflow to a file), you can provide a cache directory where the contents of the DataFrame will be stored so they can be retrieved later."
]
},
{

View File

@@ -183,6 +183,37 @@
"dflow_adls = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/input/crime0-10.csv'))\n",
"dflow_adls.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can read all the files in the `dataprep_adlsgen2` datastore, which references an ADLSGen2 Storage account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read a file from ADLSGen2\n",
"datastore = Datastore(workspace=workspace, name='adlsgen2')\n",
"dflow_adlsgen2 = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/testfolder/peopletest.csv'))\n",
"dflow_adlsgen2.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read all files from ADLSGen2 directory\n",
"datastore = Datastore(workspace=workspace, name='adlsgen2')\n",
"dflow_adlsgen2 = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/testfolder/testdir'))\n",
"dflow_adlsgen2.head()"
]
}
],
"metadata": {

View File

@@ -186,7 +186,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have successfully split the data into useful columns through examples. "
"Now we have successfully split the data into useful columns through examples."
]
}
],