update samples from Release-137 as a part of 1.0.53 SDK release
45
work-with-data/dataprep/data/ADLSgen2-datapreptest.crt
Normal file
@@ -0,0 +1,45 @@
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQC/C0oc6vvF1UEc
y9JeGDXdtKynG11wTTIHIokFhNinHNSpJBLmNWFyFkqzvjJCPR4kWuqw4IXhCS3L
VoqRmT680SvUFFF6HnEaa75Bc1YSACn1ZsHuCRGrqO9BaTgt3mM0sRYC67+f+W0E
tA+k+EA0XnTtDdEBX3RLzvaYAR4yijEHIBQeeNemPYK4msW6Xw67ib1xn59blX4Z
a4Z85FjrekmoTl9493bFj6znDTX6wpKsPF7WLEF9S+oD/Lg4EHBi9BfefFxQpGZ9
FQHToFKyz1tA2iaY/9LjCtJcincMkuXt3KuQA4Nv2GiTzz4+FEy1pOqHnyNL2tFR
1G5n04BHAgMBAAECggEAAqcXeltQ76hMZSf3XdMcPF3b394jaAHKZgr2uBrmHzvp
QAf+MzAekET6+I/1hrHujzar95TGhx9ngWFMP0VPd7O31hQKJZXyoBlK5QHC+jEC
ZCPvIW0Cz81itRfO7eQeoIas9ZFscb4240/Uv8eqrI97NCdy9X/rz3mqNuYdEzqN
2v9XlwE/Fyx79O1PQqzPRiQt3n4ss9NO169y7X99KUZtYiZAiyBBGS8wYdaGF69G
URZ3qwoUE+nByZdeRfFLLTy+UDCOwQZV+0V4p0J++YLqQAac340A1F4D60qzMHnv
KVKnMc+RrYYVFOZU+USRlphSl3Ws5j0u94CiLitK4QKBgQDivJVHNmk1JleI/MPF
bx/YT5gzcVRFhGxkGso12JrQiFPs05JmoRFaqNBDNoZYDn2ggUrMwZVfPI5C6+7U
tCe2vrjVpvcAO9reK1u4N9ohpUpkocxWQy0nNHlrorDTZnyKreRtPC87W8xpiwl4
R/+nMgGd8vex7tGfchpThj8ZeQKBgQDXs2sgpE8vmnZBWrXAuGD8M9VnfcALEjwL
Fi3NR+XCr8jHkeIJVbSI2/asWsBGg8v6gV6Cdx9KV9r+fHDzdocS85X4P7crP83A
IX2rTT6Hsmc170SzCDa2jJJyLHQ6qtXBS9ZW8/dPFc1fiBf0NcmTLrRoNg5N8Px6
Qt0T51q3vwKBgQCYAfhOetMD2AW9iEAzwDFoUsxmSKdHx+TnI/LHMMVx4sPpNVqk
RX2d+ylMtmRQ6r4cejHMnkfnRnDVutkubu1lHe5LBpn35Sjx472k/oTWI7uBRdv5
RSYjb5GrsLG9uKrsSnKnLT85G20qoRUjN5nU3LiqzPZ0qviMXfH6ZzkseQKBgQCT
ft6MTY7QUGD4w5xxEiNPkeolgHmnmGpyclITg0x7WlSDEyBrna17wF3m8Y91KH58
56XGtMoyvezEBDgAY1ZuAR7VyEvqSRDahow2bPWLONUWrmxduAohvfIOHJPF4jeU
m9UPVHgSHih3YMpwda9G87LtZ7lUVqtutvYRvCvuZQKBgAypo514DZW7Y9lMCgkR
GpJLKCWFR0Sl9bQXI7N5nAG0YFz5ZhdA1PjS2tj+OKyWR6wekbv3g0CyVXT4XYsi
tKRu9PR2OUQLPv/h2qLAeSOYdScfWoOU5tlb4tkLoUNmj5/N9VpqbvLdDh6hPWQL
o4s+29QYKEoNmOrcZ6oRkRP8
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIICoTCCAYkCAgPoMA0GCSqGSIb3DQEBBQUAMBQxEjAQBgNVBAMMCUNMSS1Mb2dp
bjAiGA8yMDE5MDUwMzIwMDIwOVoYDzIwMjAwNTAzMjAwMjExWjAUMRIwEAYDVQQD
DAlDTEktTG9naW4wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC/C0oc
6vvF1UEcy9JeGDXdtKynG11wTTIHIokFhNinHNSpJBLmNWFyFkqzvjJCPR4kWuqw
4IXhCS3LVoqRmT680SvUFFF6HnEaa75Bc1YSACn1ZsHuCRGrqO9BaTgt3mM0sRYC
67+f+W0EtA+k+EA0XnTtDdEBX3RLzvaYAR4yijEHIBQeeNemPYK4msW6Xw67ib1x
n59blX4Za4Z85FjrekmoTl9493bFj6znDTX6wpKsPF7WLEF9S+oD/Lg4EHBi9Bfe
fFxQpGZ9FQHToFKyz1tA2iaY/9LjCtJcincMkuXt3KuQA4Nv2GiTzz4+FEy1pOqH
nyNL2tFR1G5n04BHAgMBAAEwDQYJKoZIhvcNAQEFBQADggEBAGz3pOgNPESr+QoO
OVCgSS6VtWlmrAcxl5JaiNBFpBGAqfvbfRe1eZY7Rn6fuw1jc3pPBVzNTf8Plel+
DcuLzDLJAEag2GpRE+Xg57DNSwPqP6jZfHRE/ufLwIRLcNG9wRUwqlBvdAu1Kign
nlTZvTEAwxlQdvmIIT1XrTLZ+OwtVXcgrf0vInmueZKz/UDqsSDPY+d426S9eOWt
60h2WgXPU3QvBYfA6Yd2ReeP3+SHwBd4/1ByNFWBytcI9ow3pp2JznU366dfX4IQ
Q0iOTvHzXbfPmtsxqho6+hBbLvXVNWJMg8e22Pp/TyXYqeV5V09k18EgCnuA/9Gd
kKDVROA=
-----END CERTIFICATE-----
@@ -222,7 +222,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
@@ -47,6 +47,7 @@
"[Read PostgreSQL](#postgresql)<br>\n",
"[Read From Azure Blob](#azure-blob)<br>\n",
"[Read From ADLS](#adls)<br>\n",
"[Read From ADLSGen2](#adlsgen2)<br>\n",
"[Read Pandas DataFrame](#pandas-df)<br>"
]
},
@@ -315,6 +316,25 @@
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see in the results that the FBI Code column now contains some NaN values where before, when calling head, it didn't. By default, `to_pandas_dataframe` attempts to coalesce columns into a single type for better performance and lower memory overhead. This specific column has a mixture of both numbers and strings, and the strings were replaced with NaN values.\n",
"\n",
"If you wish to keep the mixed-type column in the Pandas DataFrame, you can set the `extended_types` argument to True when calling `to_pandas_dataframe`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow_skipped_rows.to_pandas_dataframe(extended_types=True)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -635,7 +655,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = dflow.to_pandas_dataframe()\n",
"df = dflow.to_pandas_dataframe(extended_types=True)\n",
"df.dtypes"
]
},
@@ -751,7 +771,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two ways the Data Prep API can acquire the necessary OAuth token to access Azure DataLake Storage:\n",
"Data Prep currently supports both ADLS and ADLSGen2. There are two ways the Data Prep API can acquire the necessary OAuth token to access Azure DataLake Storage:\n",
"1. Retrieve the access token from a recent login session of the user's [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) login.\n",
"2. Use a ServicePrincipal (SP) and a certificate as a secret."
]
@@ -883,6 +903,70 @@
"dflow.to_pandas_dataframe().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"adlsgen2\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read from ADLSGen2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please refer to the [Read From ADLS](#adls) section above for details on how to register a Service Principal and obtain an OAuth access token."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure ADLSGen2 Account for ServicePrincipal"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"certThumbprint = '23:66:84:6B:3A:14:9E:B1:17:CA:EE:E3:BB:2C:21:2D:20:B0:DF:F2'\n",
"certificate = ''\n",
"with open('../data/ADLSgen2-datapreptest.crt', 'rt', encoding='utf-8') as crtFile:\n",
"    certificate = crtFile.read()\n",
"\n",
"servicePrincipalAppId = \"127a58c3-f307-46a1-969e-a6b63da3f411\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Acquire an OAuth Access Token for ADLSGen2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import adal\n",
"from azureml.dataprep.api.datasources import ADLSGen2\n",
"\n",
"ctx = adal.AuthenticationContext('https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47')\n",
"token = ctx.acquire_token_with_client_certificate('https://storage.azure.com/', servicePrincipalAppId, certificate, certThumbprint)\n",
"dflow = dprep.read_csv(path = ADLSGen2(path='https://adlsgen2datapreptest.dfs.core.windows.net/datapreptest/people.csv', accessToken=token['accessToken']))\n",
"dflow.to_pandas_dataframe().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -923,7 +1007,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"After loading in the data you can now do `read_pandas_dataframe`."
"After loading in the data you can now do `read_pandas_dataframe`. If you only need to consume the Dataflow created from the current environment, you can read the DataFrame in memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_df = dprep.read_pandas_dataframe(df, in_memory=True)\n",
"dflow_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, if you intend to use this Dataflow past the end of your current Python session (such as by saving the Dataflow to a file), you can provide a cache directory where the contents of the DataFrame will be stored so they can be retrieved later."
]
},
{
@@ -183,6 +183,37 @@
"dflow_adls = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/input/crime0-10.csv'))\n",
"dflow_adls.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can read all the files in the `dataprep_adlsgen2` datastore which references an ADLSGen2 Storage account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read a file from ADLSGen2\n",
"datastore = Datastore(workspace=workspace, name='adlsgen2')\n",
"dflow_adlsgen2 = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/testfolder/peopletest.csv'))\n",
"dflow_adlsgen2.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read all files from ADLSGen2 directory\n",
"datastore = Datastore(workspace=workspace, name='adlsgen2')\n",
"dflow_adlsgen2 = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/testfolder/testdir'))\n",
"dflow_adlsgen2.head()"
]
}
],
"metadata": {
@@ -186,7 +186,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have successfully split the data into useful columns through examples. "
"Now we have successfully split the data into useful columns through examples."
]
}
],