IMPALA-3398: Add docs to main Impala branch.
These are refugees from doc_prototype. They can be rendered with the
DITA Open Toolkit version 2.3.3 by:

  /tmp/dita-ot-2.3.3/bin/dita \
    -i impala.ditamap \
    -f html5 \
    -o $(mktemp -d) \
    -filter impala_html.ditaval

Change-Id: I8861e99adc446f659a04463ca78c79200669484f
Reviewed-on: http://gerrit.cloudera.org:8080/5014
Reviewed-by: John Russell <jrussell@cloudera.com>
Tested-by: John Russell <jrussell@cloudera.com>
docs/Cloudera-Impala-Release-Notes.ditamap (new file, 10 lines)
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">
<map audience="standalone">
  <title>Cloudera Impala Release Notes</title>
  <topicref href="topics/impala_relnotes.xml" audience="HTML standalone"/>
  <topicref href="topics/impala_new_features.xml"/>
  <topicref href="topics/impala_incompatible_changes.xml"/>
  <topicref href="topics/impala_known_issues.xml"/>
  <topicref href="topics/impala_fixed_issues.xml"/>
</map>
docs/generatingImpalaDoc.md (new file, 73 lines)
@@ -0,0 +1,73 @@
# Generating HTML or a PDF of Apache Impala (Incubating) Documentation

## Prerequisites

Make sure that you have a recent Java JDK installed and that your JAVA\_HOME environment variable is set. This procedure has been tested with JDK 1.8.0. See [Setting JAVA\_HOME](#settingjavahome) at the end of these instructions.
* Open a terminal window and run the following commands to get the Impala documentation source files from Git:

<pre><code>git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
cd \<local\_directory\>
git checkout doc\_prototype</code></pre>

Where <code>doc\_prototype</code> is the branch where the Impala documentation source files are maintained, and <code>\<local\_directory\></code> is the directory created by the clone.
* Download the DITA Open Toolkit version 2.3.3 from the DITA Open Toolkit web site:

[https://github.com/dita-ot/dita-ot/releases/download/2.3.3/dita-ot-2.3.3.zip](https://github.com/dita-ot/dita-ot/releases/download/2.3.3/dita-ot-2.3.3.zip)

**Note:** A DITA-OT 2.3.3 User Guide is included in the toolkit. Look for <code>userguide.pdf</code> in the <code>doc</code> directory of the toolkit after you extract it. For example, if you extract the toolkit package to the <code>/Users/\<_username_\>/DITA-OT</code> directory on Mac OS, you will find <code>userguide.pdf</code> at the following location:

<code>/Users/\<_username_\>/DITA-OT/doc/userguide.pdf</code>
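For example, you can fetch and extract the toolkit from the command line. This is a minimal sketch, assuming <code>curl</code> and <code>unzip</code> are available and that you extract into your home directory; adjust the paths to taste:

<pre><code># Download the 2.3.3 release and unpack it; the toolkit lands in ~/dita-ot-2.3.3
curl -L -O https://github.com/dita-ot/dita-ot/releases/download/2.3.3/dita-ot-2.3.3.zip
unzip dita-ot-2.3.3.zip -d ~/</code></pre>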
## To generate HTML or PDF

1. In the directory where you cloned the Impala documentation files, you will find the following important configuration files in the <code>docs</code> subdirectory. These files are used to convert the XML source you downloaded from the Apache site to PDF and HTML:
   * <code>impala.ditamap</code>: Tells the DITA Open Toolkit what topics to include in the Impala User/Administration Guide. This guide also includes the Impala SQL Reference.
   * <code>impala\_sqlref.ditamap</code>: Tells the DITA Open Toolkit what topics to include in the Impala SQL Reference.
   * <code>impala\_html.ditaval</code>: Further defines what topics to include in the Impala HTML output.
   * <code>impala\_pdf.ditaval</code>: Further defines what topics to include in the Impala PDF output.
2. Extract the contents of the DITA-OT package into a directory where you want to generate the HTML or the PDF.
3. Open a terminal window and navigate to the directory where you extracted the DITA-OT package.
4. Run one of the following commands, depending on what you want to generate (a worked example follows the list):
* **To generate HTML output of the Impala User and Administration Guide, which includes the Impala SQL Reference, run the following command:**
|
||||
|
||||
<code>./bin/dita -input \<path\_to\_impala.ditamap\> -format html5 -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_html.ditaval\></code>
|
||||
|
||||
* **To generate PDF output of the Impala User and Administration Guide, which includes the Impala SQL Reference, run the following command:**
|
||||
|
||||
<code>./bin/dita -input \<path\_to\_impala.ditamap\> -format pdf -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_pdf.ditaval\></code>
|
||||
|
||||
* **To generate HTML output of the Impala SQL Reference, run the following command:**
|
||||
|
||||
<code>./bin/dita -input \<path\_to\_impala\_sqlref.ditamap\> -format html5 -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_html.ditaval\></code>
|
||||
|
||||
* **To generate PDF output of the Impala SQL Reference, run the following command:**
|
||||
|
||||
<code>./bin/dita -input \<path\_to\_impala\_sqlref.ditamap\> -format pdf -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_pdf.ditaval\></code>
|
||||
|
||||
**Note:** For a description of all command-line options, see the _DITA Open Toolkit User Guide_ in the <code>doc</code> directory of your downloaded DITA Open Toolkit.
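   For instance, a full HTML build of the User and Administration Guide might look like the following. This is a hypothetical invocation; the clone location and output directory are placeholders to substitute with your own paths:

<pre><code>cd ~/dita-ot-2.3.3
./bin/dita -input ~/incubator-impala/docs/impala.ditamap \
    -format html5 \
    -output /tmp/impala-html \
    -filter ~/incubator-impala/docs/impala_html.ditaval</code></pre>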
5. Go to the output directory that you specified in Step 4 to view the HTML or PDF that you generated. If you generated HTML, open the <code>index.html</code> file with a browser to view the output.
<a name="settingjavahome"></a>
# Setting JAVA\_HOME

Set your JAVA\_HOME environment variable to tell your computer where to find the Java executable file. For example, to set your JAVA\_HOME environment variable on Mac OS X when you have the 1.8.0\_101 version of the Java Development Kit (JDK) installed and you are using the Bash version 3.2 shell, perform the following steps:
1. Edit your <code>/Users/\<username\>/.bash\_profile</code> file and add the following lines to the end of the file:

<pre><code># Set JAVA_HOME
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home
export JAVA_HOME;</code></pre>

Where <code>jdk1.8.0\_101.jdk</code> is the version of the JDK that you have installed. For example, if you have installed <code>jdk1.8.0\_102.jdk</code>, you would use that value instead.
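The new setting takes effect in new terminal sessions. To apply it to the current shell without opening a new window, you can source the file (a small convenience step, assuming Bash):

<pre><code>source ~/.bash_profile</code></pre>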
2. Test to make sure you have set your JAVA\_HOME correctly:
   * Open a terminal window and type: <code>$JAVA\_HOME/bin/java -version</code>
   * Press return. If you see something like the following:

<pre><code>java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)</code></pre>

   Then you've successfully set your JAVA\_HOME environment variable to the binary stored in <code>/Library/Java/JavaVirtualMachines/jdk1.8.0\_101.jdk/Contents/Home</code>.
docs/images/howto_access_control.png (new binary file, 45 KiB; not shown)
docs/images/howto_per_node_peak_memory_usage.png (new binary file, 82 KiB; not shown)
docs/images/howto_show_histogram.png (new binary file, 18 KiB; not shown)
docs/images/howto_static_server_pools_config.png (new binary file, 44 KiB; not shown)
docs/images/impala_arch.jpeg (new binary file, 41 KiB; not shown)
docs/images/support_send_diagnostic_data.png (new binary file, 34 KiB; not shown)
docs/impala.ditamap (new file, 1172 lines; diff suppressed because it is too large)
docs/impala_html.ditaval (new file, 21 lines)
@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?><val>
<!-- Exclude Cloudera-only content. This is typically material that's permanently hidden,
     e.g. obsolete or abandoned. Use pre-release for material being actively worked on
     that's not ready for prime time. -->
<prop att="audience" val="Cloudera" action="exclude"/>
<!-- These two are backward: things marked HTML are excluded from the HTML and
     things marked PDF are excluded from the PDF. -->
<prop att="audience" val="HTML" action="exclude"/>
<prop att="audience" val="PDF" action="include"/>
<!-- standalone = upstream Impala docs, not part of any larger library
     integrated = any xrefs, topicrefs, or other residue from original downstream docs
     that don't resolve properly in the upstream context -->
<prop att="audience" val="integrated" action="exclude"/>
<prop att="audience" val="standalone" action="include"/>
<!-- John added this so he can work on Impala_Next in master without fear that
     it will show up too early in released docs -->
<prop att="audience" val="impala_next" action="exclude"/>
<!-- This DITAVAL specifically EXCLUDES things marked pre-release -->
<!-- It is safe to use for generating public artifacts. -->
<prop att="audience" val="pre-release" action="exclude"/>
</val>
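To illustrate how these rules apply: DITA elements carry an audience attribute that the prop entries above match against. A hypothetical topic fragment, not part of this commit, that this DITAVAL would filter during an HTML build:

    <p audience="pre-release">Draft text; dropped from all public builds.</p>
    <p audience="PDF">Kept in HTML output (val="PDF" is included here).</p>
    <p audience="HTML">Dropped from HTML output, per the backward convention noted above.</p>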
docs/impala_pdf.ditaval (new file, 21 lines)
@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?><val>
<!-- Exclude Cloudera-only content. This is typically material that's permanently hidden,
     e.g. obsolete or abandoned. Use pre-release for material being actively worked on
     that's not ready for prime time. -->
<prop att="audience" val="Cloudera" action="exclude"/>
<!-- These two are backward: things marked HTML are excluded from the HTML and
     things marked PDF are excluded from the PDF. -->
<prop att="audience" val="PDF" action="exclude"/>
<prop att="audience" val="HTML" action="include"/>
<!-- standalone = upstream Impala docs, not part of any larger library
     integrated = any xrefs, topicrefs, or other residue from original downstream docs
     that don't resolve properly in the upstream context -->
<prop att="audience" val="integrated" action="exclude"/>
<prop att="audience" val="standalone" action="include"/>
<!-- John added this so he can work on Impala_Next in master without fear that
     it will show up too early in released docs -->
<prop att="audience" val="impala_next" action="exclude"/>
<!-- This DITAVAL specifically EXCLUDES things marked pre-release -->
<!-- It is safe to use for generating public artifacts. -->
<prop att="audience" val="pre-release" action="exclude"/>
</val>
docs/impala_sqlref.ditamap (new file, 146 lines)
@@ -0,0 +1,146 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">
<map id="impala_sqlref">
  <title>Impala SQL Reference</title>
  <topicmeta>
    <prodinfo conref="shared/ImpalaVariables.xml#impala_vars/prodinfo_for_html">
      <prodname/>
      <vrmlist>
        <vrm version="version_dlq_gry_sm"/>
      </vrmlist>
    </prodinfo>
  </topicmeta>
  <topicref href="topics/impala_langref.xml"/>
  <topicref href="topics/impala_comments.xml"/>
  <topicref href="topics/impala_datatypes.xml">
    <topicref href="topics/impala_array.xml"/>
    <topicref href="topics/impala_bigint.xml"/>
    <topicref href="topics/impala_boolean.xml"/>
    <topicref href="topics/impala_char.xml"/>
    <topicref href="topics/impala_decimal.xml"/>
    <topicref href="topics/impala_double.xml"/>
    <topicref href="topics/impala_float.xml"/>
    <topicref href="topics/impala_int.xml"/>
    <topicref href="topics/impala_map.xml"/>
    <topicref href="topics/impala_real.xml"/>
    <topicref href="topics/impala_smallint.xml"/>
    <topicref href="topics/impala_string.xml"/>
    <topicref href="topics/impala_struct.xml"/>
    <topicref href="topics/impala_timestamp.xml"/>
    <topicref href="topics/impala_tinyint.xml"/>
    <topicref href="topics/impala_varchar.xml"/>
    <topicref href="topics/impala_complex_types.xml"/>
  </topicref>
  <topicref href="topics/impala_literals.xml"/>
  <topicref href="topics/impala_operators.xml"/>
  <topicref href="topics/impala_schema_objects.xml">
    <topicref href="topics/impala_aliases.xml"/>
    <topicref href="topics/impala_databases.xml"/>
    <topicref href="topics/impala_functions_overview.xml"/>
    <topicref href="topics/impala_identifiers.xml"/>
    <topicref href="topics/impala_tables.xml"/>
    <topicref href="topics/impala_views.xml"/>
  </topicref>
  <topicref href="topics/impala_langref_sql.xml">
    <topicref href="topics/impala_ddl.xml"/>
    <topicref href="topics/impala_dml.xml"/>
    <topicref href="topics/impala_alter_table.xml"/>
    <topicref href="topics/impala_alter_view.xml"/>
    <topicref href="topics/impala_compute_stats.xml"/>
    <topicref href="topics/impala_create_database.xml"/>
    <topicref href="topics/impala_create_function.xml"/>
    <topicref href="topics/impala_create_role.xml"/>
    <topicref href="topics/impala_create_table.xml"/>
    <topicref href="topics/impala_create_view.xml"/>
    <topicref audience="impala_next" href="topics/impala_delete.xml"/>
    <topicref href="topics/impala_describe.xml"/>
    <topicref href="topics/impala_drop_database.xml"/>
    <topicref href="topics/impala_drop_function.xml"/>
    <topicref href="topics/impala_drop_role.xml"/>
    <topicref href="topics/impala_drop_stats.xml"/>
    <topicref href="topics/impala_drop_table.xml"/>
    <topicref href="topics/impala_drop_view.xml"/>
    <topicref href="topics/impala_explain.xml"/>
    <topicref href="topics/impala_grant.xml"/>
    <topicref href="topics/impala_insert.xml"/>
    <topicref href="topics/impala_invalidate_metadata.xml"/>
    <topicref href="topics/impala_load_data.xml"/>
    <topicref href="topics/impala_refresh.xml"/>
    <topicref href="topics/impala_revoke.xml"/>
    <topicref href="topics/impala_select.xml">
      <topicref href="topics/impala_joins.xml"/>
      <topicref href="topics/impala_order_by.xml"/>
      <topicref href="topics/impala_group_by.xml"/>
      <topicref href="topics/impala_having.xml"/>
      <topicref href="topics/impala_limit.xml"/>
      <topicref href="topics/impala_offset.xml"/>
      <topicref href="topics/impala_union.xml"/>
      <topicref href="topics/impala_subqueries.xml"/>
      <topicref href="topics/impala_with.xml"/>
      <topicref href="topics/impala_distinct.xml"/>
      <topicref href="topics/impala_hints.xml"/>
    </topicref>
    <topicref href="topics/impala_set.xml"/>
    <topicref href="topics/impala_query_options.xml">
      <topicref href="topics/impala_abort_on_default_limit_exceeded.xml"/>
      <topicref href="topics/impala_abort_on_error.xml"/>
      <topicref href="topics/impala_allow_unsupported_formats.xml"/>
      <topicref href="topics/impala_appx_count_distinct.xml"/>
      <topicref href="topics/impala_batch_size.xml"/>
      <topicref href="topics/impala_compression_codec.xml"/>
      <topicref href="topics/impala_debug_action.xml"/>
      <topicref href="topics/impala_default_order_by_limit.xml"/>
      <topicref href="topics/impala_disable_codegen.xml"/>
      <topicref href="topics/impala_disable_unsafe_spills.xml"/>
      <topicref href="topics/impala_exec_single_node_rows_threshold.xml"/>
      <topicref href="topics/impala_explain_level.xml"/>
      <topicref href="topics/impala_hbase_cache_blocks.xml"/>
      <topicref href="topics/impala_hbase_caching.xml"/>
      <topicref href="topics/impala_live_progress.xml"/>
      <topicref href="topics/impala_live_summary.xml"/>
      <topicref href="topics/impala_max_errors.xml"/>
      <topicref href="topics/impala_max_io_buffers.xml"/>
      <topicref href="topics/impala_max_scan_range_length.xml"/>
      <topicref href="topics/impala_mem_limit.xml"/>
      <topicref href="topics/impala_num_nodes.xml"/>
      <topicref href="topics/impala_num_scanner_threads.xml"/>
      <topicref href="topics/impala_parquet_compression_codec.xml"/>
      <topicref href="topics/impala_parquet_file_size.xml"/>
      <topicref href="topics/impala_query_timeout_s.xml"/>
      <topicref href="topics/impala_request_pool.xml"/>
      <topicref href="topics/impala_reservation_request_timeout.xml"/>
      <topicref href="topics/impala_support_start_over.xml"/>
      <topicref href="topics/impala_sync_ddl.xml"/>
      <topicref href="topics/impala_v_cpu_cores.xml"/>
    </topicref>
    <topicref href="topics/impala_show.xml"/>
    <topicref href="topics/impala_truncate_table.xml"/>
    <topicref audience="impala_next" href="topics/impala_update.xml"/>
    <topicref href="topics/impala_use.xml"/>
  </topicref>
  <topicref href="topics/impala_functions.xml">
    <topicref href="topics/impala_math_functions.xml"/>
    <topicref href="topics/impala_bit_functions.xml"/>
    <topicref href="topics/impala_conversion_functions.xml"/>
    <topicref href="topics/impala_datetime_functions.xml"/>
    <topicref href="topics/impala_conditional_functions.xml"/>
    <topicref href="topics/impala_string_functions.xml"/>
    <topicref href="topics/impala_misc_functions.xml"/>
    <topicref href="topics/impala_aggregate_functions.xml">
      <topicref href="topics/impala_appx_median.xml"/>
      <topicref href="topics/impala_avg.xml"/>
      <topicref href="topics/impala_count.xml"/>
      <topicref href="topics/impala_group_concat.xml"/>
      <topicref href="topics/impala_max.xml"/>
      <topicref href="topics/impala_min.xml"/>
      <topicref href="topics/impala_ndv.xml"/>
      <topicref href="topics/impala_stddev.xml"/>
      <topicref href="topics/impala_sum.xml"/>
      <topicref href="topics/impala_variance.xml"/>
    </topicref>
    <topicref href="topics/impala_analytic_functions.xml"/>
    <topicref href="topics/impala_udf.xml"/>
  </topicref>
  <topicref href="topics/impala_langref_unsupported.xml"/>
  <topicref href="topics/impala_porting.xml"/>
</map>
docs/shared/ImpalaVariables.xml (new file, 52 lines)
@@ -0,0 +1,52 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_vars">
  <title>Cloudera Impala Variables</title>
  <prolog id="prolog_slg_nmv_km">
    <metadata id="metadata_ecq_qmv_km">
      <prodinfo audience="PDF" id="prodinfo_for_html">
        <prodname>Impala</prodname>
        <vrmlist>
          <vrm version="Impala 2.7.x / CDH 5.9.x"/>
        </vrmlist>
      </prodinfo>
      <prodinfo audience="HTML" id="prodinfo_for_pdf">
        <prodname></prodname>
        <vrmlist>
          <vrm version="Impala 2.7.x / CDH 5.9.x"/>
        </vrmlist>
      </prodinfo>
    </metadata>
  </prolog>
  <conbody>
    <p>Substitution variables for denoting features available in release X or higher.
      The upstream docs can refer to the Impala release number.
      The docs included with a distro can refer to the distro release number by
      editing the values here.
      <ul>
        <li><ph id="impala27">CDH 5.9</ph></li>
        <li><ph id="impala26">CDH 5.8</ph></li>
        <li><ph id="impala25">CDH 5.7</ph></li>
        <li><ph id="impala24">CDH 5.6</ph></li>
        <li><ph id="impala23">CDH 5.5</ph></li>
        <li><ph id="impala22">CDH 5.4</ph></li>
        <li><ph id="impala21">CDH 5.3</ph></li>
        <li><ph id="impala20">CDH 5.2</ph></li>
        <li><ph id="impala14">CDH 5.1</ph></li>
        <li><ph id="impala13">CDH 5.0</ph></li>
      </ul>
    </p>
    <p>Release Version Variable - <ph id="ReleaseVersion">Impala 2.7.x / CDH 5.9.x</ph></p>
    <p>Banner for examples showing shell version - <ph id="ShellBanner">(Shell
      build version: Impala Shell v2.7.x (<varname>hash</varname>) built on
      <varname>date</varname>)</ph></p>
    <p>Banner for examples showing impalad version - <ph id="ImpaladBanner">Server version: impalad version 2.7.x (build
      x.y.z)</ph></p>
    <data name="version-message" id="version-message">
      <foreign>
        <lines xml:space="preserve">This is the documentation for <data name="version"/>.
Documentation for other versions is available at <xref href="http://www.cloudera.com/content/support/en/documentation.html" scope="external" format="html">Cloudera Documentation</xref>.</lines>
      </foreign>
    </data>
  </conbody>
</concept>
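The ph ids above are designed to be pulled into topics by conref. A hypothetical reference, not part of this commit, that a topic under docs/topics/ could use to expand the Impala 2.7 variable:

    <p>This feature is available in <ph conref="../shared/ImpalaVariables.xml#impala_vars/impala27"/> and higher.</p>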
docs/shared/impala_common.xml (new file, 3690 lines; diff suppressed because it is too large)
docs/topics/impala.xml (new file, 77 lines)
@@ -0,0 +1,77 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="about_impala">

  <title>Apache Impala (incubating) - Interactive SQL</title>
  <titlealts audience="PDF"><navtitle>Impala Guide</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Components"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="SQL"/>
    </metadata>
  </prolog>

  <conbody>

    <p conref="../shared/impala_common.xml#common/impala_mission_statement"/>

    <p conref="../shared/impala_common.xml#common/impala_hive_compatibility"/>

    <p conref="../shared/impala_common.xml#common/impala_advantages"/>

    <p outputclass="toc"/>

    <p audience="integrated">
      <b>Related information throughout the CDH 5 library:</b>
    </p>

    <p audience="integrated">
      In CDH 5, the Impala documentation for Release Notes, Installation, Upgrading, and Security has been
      integrated alongside the corresponding information for other Hadoop components:
    </p>

    <!-- Same list is in impala.xml and Impala FAQs. Conref in both places. -->

    <ul>
      <li>
        <xref href="impala_new_features.xml#new_features">New features</xref>
      </li>
      <li>
        <xref href="impala_known_issues.xml#known_issues">Known and fixed issues</xref>
      </li>
      <li>
        <xref href="impala_incompatible_changes.xml#incompatible_changes">Incompatible changes</xref>
      </li>
      <li>
        <xref href="impala_install.xml#install">Installing Impala</xref>
      </li>
      <li>
        <xref href="impala_upgrading.xml#upgrading">Upgrading Impala</xref>
      </li>
      <li>
        <xref href="impala_config.xml#config">Configuring Impala</xref>
      </li>
      <li>
        <xref href="impala_processes.xml#processes">Starting Impala</xref>
      </li>
      <li>
        <xref href="impala_security.xml#security">Security for Impala</xref>
      </li>
      <li>
        <xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/CDH-Version-and-Packaging-Information.html" scope="external" format="html">CDH
        Version and Packaging Information</xref>
      </li>
    </ul>
  </conbody>
</concept>
docs/topics/impala_abort_on_default_limit_exceeded.xml (new file, 23 lines)
@@ -0,0 +1,23 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="obwl" id="abort_on_default_limit_exceeded">

  <title>ABORT_ON_DEFAULT_LIMIT_EXCEEDED Query Option</title>
  <titlealts audience="PDF"><navtitle>ABORT_ON_DEFAULT_LIMIT_EXCEEDED</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Impala Query Options"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p conref="../shared/impala_common.xml#common/obwl_query_options"/>

    <p conref="../shared/impala_common.xml#common/type_boolean"/>
    <p conref="../shared/impala_common.xml#common/default_false_0"/>
  </conbody>
</concept>
docs/topics/impala_abort_on_error.xml (new file, 44 lines)
@@ -0,0 +1,44 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="abort_on_error">

  <title>ABORT_ON_ERROR Query Option</title>
  <titlealts audience="PDF"><navtitle>ABORT_ON_ERROR</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Impala Query Options"/>
      <data name="Category" value="Troubleshooting"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="Cloudera">ABORT_ON_ERROR query option</indexterm>
      When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error,
      rather than continuing and possibly returning incomplete results. This option is disabled by default, to help
      gather maximum diagnostic information when an error occurs, for example, whether the same problem occurred on
      all nodes or only a single node. Currently, the errors that Impala can skip over involve data corruption,
      such as a column that contains a string value when it is expected to contain an integer value.
    </p>

    <p>
      To control how much logging Impala does for non-fatal errors when <codeph>ABORT_ON_ERROR</codeph> is turned
      off, use the <codeph>MAX_ERRORS</codeph> option.
    </p>

    <p conref="../shared/impala_common.xml#common/type_boolean"/>
    <p conref="../shared/impala_common.xml#common/default_false_0"/>

    <p conref="../shared/impala_common.xml#common/related_info"/>
    <p>
      <xref href="impala_max_errors.xml#max_errors"/>,
      <xref href="impala_logging.xml#logging"/>
    </p>

  </conbody>
</concept>
docs/topics/impala_admin.xml (new file, 60 lines)
@@ -0,0 +1,60 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="admin">

  <title>Impala Administration</title>
  <titlealts audience="PDF"><navtitle>Administration</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Administrators"/>
      <!-- Although there is a reasonable amount of info on the page, it could be better to use wiki-style embedding instead of linking hither and thither. -->
      <data name="Category" value="Stub Pages"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      As an administrator, you monitor Impala's use of resources and take action when necessary to keep Impala
      running smoothly and avoid conflicts with other Hadoop components running on the same cluster. When you
      detect that an issue has happened or could happen in the future, you reconfigure Impala or other components
      such as HDFS or even the hardware of the cluster itself to resolve or avoid problems.
    </p>

    <p outputclass="toc"/>

    <p>
      <b>Related tasks:</b>
    </p>

    <p>
      As an administrator, you can expect to perform installation, upgrade, and configuration tasks for Impala on
      all machines in a cluster. See <xref href="impala_install.xml#install"/>,
      <xref href="impala_upgrading.xml#upgrading"/>, and <xref href="impala_config.xml#config"/> for details.
    </p>

    <p>
      For security tasks typically performed by administrators, see <xref href="impala_security.xml#security"/>.
    </p>

    <p>
      Administrators also decide how to allocate cluster resources so that all Hadoop components can run smoothly
      together. For Impala, this task primarily involves:
      <ul>
        <li>
          Deciding how many Impala queries can run concurrently and with how much memory, through the admission
          control feature. See <xref href="impala_admission.xml#admission_control"/> for details.
        </li>
        <li>
          Dividing cluster resources such as memory between Impala and other components, using YARN for overall
          resource management, and Llama to mediate resource requests from Impala to YARN. See
          <xref href="impala_resource_management.xml#resource_management"/> for details.
        </li>
      </ul>
    </p>

    <!-- <p conref="../shared/impala_common.xml#common/impala_mr"/> -->
  </conbody>
</concept>
docs/topics/impala_admission.xml (new file, 947 lines)
@@ -0,0 +1,947 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.3.0" id="admission_control">

  <title>Admission Control and Query Queuing</title>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Admission Control"/>
      <data name="Category" value="Resource Management"/>
    </metadata>
  </prolog>

  <conbody>

    <p id="admission_control_intro">
      Admission control is an Impala feature that imposes limits on concurrent SQL queries, to avoid resource usage
      spikes and out-of-memory conditions on busy CDH clusters.
      It is a form of <q>throttling</q>.
      New queries are accepted and executed until
      certain conditions are met, such as too many queries or too much
      total memory used across the cluster.
      When one of these thresholds is reached,
      incoming queries wait to begin execution. These queries are
      queued and are admitted (that is, begin executing) when the resources become available.
    </p>
    <p>
      In addition to the threshold values for currently executing queries,
      you can place limits on the maximum number of queries that are
      queued (waiting) and a limit on the amount of time they might wait
      before returning with an error. These queue settings let you ensure that queries do
      not wait indefinitely, so that you can detect and correct <q>starvation</q> scenarios.
    </p>
    <p>
      Enable this feature if your cluster is
      underutilized at some times and overutilized at others. Overutilization is indicated by performance
      bottlenecks and queries being cancelled due to out-of-memory conditions, when those same queries are
      successful and perform well during times with less concurrent load. Admission control works as a safeguard to
      avoid out-of-memory conditions during heavy concurrent usage.
    </p>

    <note conref="../shared/impala_common.xml#common/impala_llama_obsolete"/>

    <p outputclass="toc inpage"/>
  </conbody>

  <concept id="admission_intro">

    <title>Overview of Impala Admission Control</title>
    <prolog>
      <metadata>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        On a busy CDH cluster, you might find there is an optimal number of Impala queries that run concurrently.
        For example, when the I/O capacity is fully utilized by I/O-intensive queries,
        you might not find any throughput benefit in running more concurrent queries.
        By allowing some queries to run at full speed while others wait, rather than having
        all queries contend for resources and run slowly, admission control can result in higher overall throughput.
      </p>

      <p>
        For another example, consider a memory-bound workload such as many large joins or aggregation queries.
        Each such query could briefly use many gigabytes of memory to process intermediate results.
        Because Impala by default cancels queries that exceed the specified memory limit,
        running multiple large-scale queries at once might require
        re-running some queries that are cancelled. In this case, admission control improves the
        reliability and stability of the overall workload by only allowing as many concurrent queries
        as the overall memory of the cluster can accommodate.
      </p>

      <p>
        The admission control feature lets you set an upper limit on the number of concurrent Impala
        queries and on the memory used by those queries. Any additional queries are queued until the earlier ones
        finish, rather than being cancelled or running slowly and causing contention. As other queries finish, the
        queued queries are allowed to proceed.
      </p>

      <p rev="2.5.0">
        In <keyword keyref="impala25_full"/> and higher, you can specify these limits and thresholds for each
        pool rather than globally. That way, you can balance the resource usage and throughput
        between steady well-defined workloads, rare resource-intensive queries, and ad hoc
        exploratory queries.
      </p>

      <p>
        For details on the internal workings of admission control, see
        <xref href="impala_admission.xml#admission_architecture"/>.
      </p>
    </conbody>
  </concept>

  <concept id="admission_concurrency">
    <title>Concurrent Queries and Admission Control</title>
    <conbody>
      <p>
        One way to limit resource usage through admission control is to set an upper limit
        on the number of concurrent queries. This is the initial technique you might use
        when you do not have extensive information about memory usage for your workload.
        This setting can be specified separately for each dynamic resource pool.
      </p>
      <p>
        You can combine this setting with the memory-based approach described in
        <xref href="impala_admission.xml#admission_memory"/>. If either the maximum number of
        or the expected memory usage of the concurrent queries is exceeded, subsequent queries
        are queued until the concurrent workload falls below the threshold again.
      </p>
      <p>
        See
        <xref audience="integrated" href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>
        for information about all these dynamic resource
        pool settings, how to use them together, and how to divide different parts of your workload among
        different pools.
      </p>
    </conbody>
  </concept>

  <concept id="admission_memory">
    <title>Memory Limits and Admission Control</title>
    <conbody>
      <p>
        Each dynamic resource pool can have an upper limit on the cluster-wide memory used by queries executing in that pool.
        This is the technique to use once you have a stable workload with well-understood memory requirements.
      </p>
      <p>
        Always specify the <uicontrol>Default Query Memory Limit</uicontrol> for the expected maximum amount of RAM
        that a query might require on each host, which is equivalent to setting the <codeph>MEM_LIMIT</codeph>
        query option for every query run in that pool. That value affects the execution of each query, preventing it
        from overallocating memory on each host, and potentially activating the spill-to-disk mechanism or cancelling
        the query when necessary.
      </p>
      <p>
        Optionally, specify the <uicontrol>Max Memory</uicontrol> setting, a cluster-wide limit that determines
        how many queries can be safely run concurrently, based on the upper memory limit per host multiplied by the
        number of Impala nodes in the cluster.
      </p>
      <p conref="../shared/impala_common.xml#common/admission_control_mem_limit_interaction"/>
      <note conref="../shared/impala_common.xml#common/max_memory_default_limit_caveat"/>
      <p>
        You can combine the memory-based settings with the upper limit on concurrent queries described in
        <xref href="impala_admission.xml#admission_concurrency"/>. If either the maximum number of
        or the expected memory usage of the concurrent queries is exceeded, subsequent queries
        are queued until the concurrent workload falls below the threshold again.
      </p>
      <p>
        See
        <xref audience="integrated" href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>
        for information about all these dynamic resource
        pool settings, how to use them together, and how to divide different parts of your workload among
        different pools.
      </p>
    </conbody>
  </concept>

  <concept id="admission_yarn">

    <title>How Impala Admission Control Relates to Other Resource Management Tools</title>
    <prolog>
      <metadata>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        The admission control feature is similar in some ways to the Cloudera Manager
        static partitioning feature, as well as the YARN resource management framework. These features
        can be used separately or together. This section describes some similarities and differences, to help you
        decide which combination of resource management features to use for Impala.
      </p>

      <p>
        Admission control is a lightweight, decentralized system that is suitable for workloads consisting
        primarily of Impala queries and other SQL statements. It sets <q>soft</q> limits that smooth out Impala
        memory usage during times of heavy load, rather than taking an all-or-nothing approach that cancels jobs
        that are too resource-intensive.
      </p>

      <p>
        Because the admission control system does not interact with other Hadoop workloads such as MapReduce jobs, you
        might use YARN with static service pools on CDH 5 clusters where resources are shared between
        Impala and other Hadoop components. This configuration is recommended when using Impala in a
        <term>multitenant</term> cluster. Devote a percentage of cluster resources to Impala, and allocate another
        percentage for MapReduce and other batch-style workloads. Let admission control handle the concurrency and
        memory usage for the Impala work within the cluster, and let YARN manage the work for other components within the
        cluster. In this scenario, Impala's resources are not managed by YARN.
      </p>

      <p>
        The Impala admission control feature uses the same configuration mechanism as the YARN resource manager to map users to
        pools and authenticate them.
      </p>

      <p rev="DOCS-648">
        Although the Impala admission control feature uses a <codeph>fair-scheduler.xml</codeph> configuration file
        behind the scenes, this file does not depend on which scheduler is used for YARN. You still use this file,
        and Cloudera Manager can generate it for you, even when YARN is using the capacity scheduler.
      </p>

    </conbody>
  </concept>

  <concept id="admission_architecture">

    <title>How Impala Schedules and Enforces Limits on Concurrent Queries</title>
    <prolog>
      <metadata>
        <data name="Category" value="Concepts"/>
        <data name="Category" value="Scheduling"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        The admission control system is decentralized, embedded in each Impala daemon and communicating through the
        statestore mechanism. Although the limits you set for memory usage and number of concurrent queries apply
        cluster-wide, each Impala daemon makes its own decisions about whether to allow each query to run
        immediately or to queue it for a less-busy time. These decisions are fast, meaning the admission control
        mechanism is low-overhead, but might be imprecise during times of heavy load across many coordinators. There could be times when
        more queries are queued (in aggregate across the cluster) than the specified limit, or when the number of
        admitted queries exceeds the expected number. Thus, you typically err on the
        high side for the size of the queue, because there is not a big penalty for having a large number of queued
        queries; and you typically err on the low side for configuring memory resources, to leave some headroom in case more
        queries are admitted than expected, without running out of memory and being cancelled as a result.
      </p>

      <!-- Commenting out as redundant.
      <p>
        The limit on the number of concurrent queries is a <q>soft</q> one, To achieve high throughput, Impala
        makes quick decisions at the host level about which queued queries to dispatch. Therefore, Impala might
        slightly exceed the limits from time to time.
      </p>
      -->

      <p>
        To avoid a large backlog of queued requests, you can set an upper limit on the size of the queue for
        queries that are queued. When the number of queued queries exceeds this limit, further queries are
        cancelled rather than being queued. You can also configure a timeout period per pool, after which queued queries are
        cancelled, to avoid indefinite waits. If a cluster reaches this state where queries are cancelled due to
        too many concurrent requests or long waits for query execution to begin, that is a signal for an
        administrator to take action, either by provisioning more resources, scheduling work on the cluster to
        smooth out the load, or by doing <xref href="impala_performance.xml#performance">Impala performance
        tuning</xref> to enable higher throughput.
      </p>
    </conbody>
  </concept>

  <concept id="admission_jdbc_odbc">

    <title>How Admission Control Works with Impala Clients (JDBC, ODBC, HiveServer2)</title>
    <prolog>
      <metadata>
        <data name="Category" value="JDBC"/>
        <data name="Category" value="ODBC"/>
        <data name="Category" value="HiveServer2"/>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Most aspects of admission control work transparently with client interfaces such as JDBC and ODBC:
      </p>

      <ul>
        <li>
          If a SQL statement is put into a queue rather than running immediately, the API call blocks until the
          statement is dequeued and begins execution. At that point, the client program can request to fetch
          results, which might also block until results become available.
        </li>
        <li>
          If a SQL statement is cancelled because it has been queued for too long or because it exceeded the memory
          limit during execution, the error is returned to the client program with a descriptive error message.
        </li>
      </ul>

      <p rev="CDH-27667">
        In Impala 2.0 and higher, you can submit
        a SQL <codeph>SET</codeph> statement from the client application
        to change the <codeph>REQUEST_POOL</codeph> query option.
        This option lets you submit queries to different resource pools,
        as described in <xref href="impala_request_pool.xml#request_pool"/>.
        <!-- Commenting out as starting to be too old to mention.
        Prior to Impala 2.0, that option was only settable
        for a session through the <cmdname>impala-shell</cmdname> <codeph>SET</codeph> command, or cluster-wide through an
        <cmdname>impalad</cmdname> startup option.
        -->
      </p>
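      <!-- Editor's sketch, not part of the original commit: switching pools from a
           client session with the SET statement described above. The pool and table
           names are hypothetical. -->
      <codeblock>SET REQUEST_POOL=reporting_pool;
SELECT COUNT(*) FROM sales_fact;</codeblock>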
      <p>
        At any time, the set of queued queries could include queries submitted through multiple different Impala
        daemon hosts. All the queries submitted through a particular host will be executed in order, so a
        <codeph>CREATE TABLE</codeph> followed by an <codeph>INSERT</codeph> on the same table would succeed.
        Queries submitted through different hosts are not guaranteed to be executed in the order they were
        received. Therefore, if you are using load-balancing or other round-robin scheduling where different
        statements are submitted through different hosts, set up all table structures ahead of time so that the
        statements controlled by the queuing system are primarily queries, where order is not significant. Or, if a
        sequence of statements needs to happen in strict order (such as an <codeph>INSERT</codeph> followed by a
        <codeph>SELECT</codeph>), submit all those statements through a single session, while connected to the same
        Impala daemon host.
      </p>

      <p>
        Admission control has the following limitations or special behavior when used with JDBC or ODBC
        applications:
      </p>

      <ul>
        <li>
          The other resource-related query options,
          <codeph>RESERVATION_REQUEST_TIMEOUT</codeph> and <codeph>V_CPU_CORES</codeph>, are no longer used. Those query options only
          applied to using Impala with Llama, which is no longer supported.
        </li>
      </ul>
    </conbody>
  </concept>

  <concept id="admission_schema_config">
    <title>SQL and Schema Considerations for Admission Control</title>
    <conbody>
      <p>
        When queries complete quickly and are tuned for optimal memory usage, there is less chance of
        performance or capacity problems during times of heavy load. Before setting up admission control,
        tune your Impala queries to ensure that the query plans are efficient and the memory estimates
        are accurate. Understanding the nature of your workload, and which queries are the most
        resource-intensive, helps you to plan how to divide the queries into different pools and
        decide what limits to define for each pool.
      </p>
      <p>
        For large tables, especially those involved in join queries, keep their statistics up to date
        after loading substantial amounts of new data or adding new partitions.
        Use the <codeph>COMPUTE STATS</codeph> statement for unpartitioned tables, and
        <codeph>COMPUTE INCREMENTAL STATS</codeph> for partitioned tables.
      </p>
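      <!-- Editor's sketch, not part of the original commit: keeping statistics
           current as described above. Table names are hypothetical. -->
      <codeblock>COMPUTE STATS unpartitioned_tab;
COMPUTE INCREMENTAL STATS partitioned_tab;</codeblock>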
      <p>
        When you use dynamic resource pools with a <uicontrol>Max Memory</uicontrol> setting enabled,
        you typically override the memory estimates that Impala makes based on the statistics from the
        <codeph>COMPUTE STATS</codeph> statement.
        You either set the <codeph>MEM_LIMIT</codeph> query option within a particular session to
        set an upper memory limit for queries within that session, or a default <codeph>MEM_LIMIT</codeph>
        setting for all queries processed by the <cmdname>impalad</cmdname> instance, or
        a default <codeph>MEM_LIMIT</codeph> setting for all queries assigned to a particular
        dynamic resource pool. By designating a consistent memory limit for a set of similar queries
        that use the same resource pool, you avoid unnecessary query queuing or out-of-memory conditions
        that can arise during high-concurrency workloads when memory estimates for some queries are inaccurate.
      </p>
      <p>
        Follow other steps from <xref href="impala_performance.xml#performance"/> to tune your queries.
      </p>
    </conbody>
  </concept>

  <concept id="admission_config">

    <title>Configuring Admission Control</title>
    <prolog>
      <metadata>
        <data name="Category" value="Configuring"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        The configuration options for admission control range from the simple (a single resource pool with a single
        set of options) to the complex (multiple resource pools with different options, each pool handling queries
        for a different set of users and groups). <ph rev="upstream">Cloudera</ph> recommends configuring the settings through the Cloudera Manager user
        interface.
        <!--
        , or on a system without Cloudera Manager by editing configuration files or through startup
        options to the <cmdname>impalad</cmdname> daemon.
        -->
      </p>

      <!-- To do: reconcile the similar notes in impala_admission.xml and admin_impala_admission_control.xml
           and make into a conref in both places. -->
      <note type="important">
        Although the following options are still present in the Cloudera Manager interface under the
        <uicontrol>Admission Control</uicontrol> configuration settings dialog,
        <ph rev="upstream">Cloudera</ph> recommends you not use them in <keyword keyref="impala25_full"/> and higher.
        These settings only apply if you enable admission control but leave dynamic resource pools disabled.
        In <keyword keyref="impala25_full"/> and higher, prefer to set up dynamic resource pools and
        customize the settings for each pool, as described in
        <ph audience="integrated"><xref href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr/section_p15_mhn_2v"/> and <xref href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr/section_gph_tnk_lm"/></ph>
        <xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>.
      </note>

      <section id="admission_flags">

        <title>Impala Service Flags for Admission Control (Advanced)</title>

        <p>
          The following Impala configuration options let you adjust the settings of the admission control feature. When supplying the
          options on the <cmdname>impalad</cmdname> command line, prepend the option name with <codeph>--</codeph>.
        </p>

        <dl id="admission_control_option_list">
          <dlentry id="queue_wait_timeout_ms">
            <dt>
              <codeph>queue_wait_timeout_ms</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--queue_wait_timeout_ms</indexterm>
              <b>Purpose:</b> Maximum amount of time (in milliseconds) that a
              request waits to be admitted before timing out.
              <p>
                <b>Type:</b> <codeph>int64</codeph>
              </p>
              <p>
                <b>Default:</b> <codeph>60000</codeph>
              </p>
            </dd>
          </dlentry>
          <dlentry id="default_pool_max_requests">
            <dt>
              <codeph>default_pool_max_requests</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--default_pool_max_requests</indexterm>
              <b>Purpose:</b> Maximum number of concurrent outstanding requests
              allowed to run before incoming requests are queued. Because this
              limit applies cluster-wide, but each Impala node makes independent
              decisions to run queries immediately or queue them, it is a soft
              limit; the overall number of concurrent queries might be slightly
              higher during times of heavy load. A negative value indicates no
              limit. Ignored if <codeph>fair_scheduler_config_path</codeph> and
              <codeph>llama_site_path</codeph> are set.
              <p>
                <b>Type:</b> <codeph>int64</codeph>
              </p>
              <p>
                <b>Default:</b> <ph rev="2.5.0">-1, meaning unlimited (prior to <keyword keyref="impala25_full"/> the default was 200)</ph>
              </p>
            </dd>
          </dlentry>
          <dlentry id="default_pool_max_queued">
            <dt>
              <codeph>default_pool_max_queued</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--default_pool_max_queued</indexterm>
              <b>Purpose:</b> Maximum number of requests allowed to be queued
              before rejecting requests. Because this limit applies
              cluster-wide, but each Impala node makes independent decisions to
              run queries immediately or queue them, it is a soft limit; the
              overall number of queued queries might be slightly higher during
              times of heavy load. A negative value or 0 indicates requests are
              always rejected once the maximum concurrent requests are
              executing. Ignored if <codeph>fair_scheduler_config_path</codeph>
              and <codeph>llama_site_path</codeph> are set.
              <p>
                <b>Type:</b> <codeph>int64</codeph>
              </p>
              <p>
                <b>Default:</b> <ph rev="2.5.0">unlimited</ph>
              </p>
            </dd>
          </dlentry>
          <dlentry id="default_pool_mem_limit">
            <dt>
              <codeph>default_pool_mem_limit</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--default_pool_mem_limit</indexterm>
              <b>Purpose:</b> Maximum amount of memory (across the entire
              cluster) that all outstanding requests in this pool can use before
              new requests to this pool are queued. Specified in bytes,
              megabytes, or gigabytes by a number followed by the suffix
              <codeph>b</codeph> (optional), <codeph>m</codeph>, or
              <codeph>g</codeph>, either uppercase or lowercase. You can
              specify floating-point values for megabytes and gigabytes, to
              represent fractional numbers such as <codeph>1.5</codeph>. You can
              also specify it as a percentage of the physical memory by
              specifying the suffix <codeph>%</codeph>. 0 or no setting
              indicates no limit. Defaults to bytes if no unit is given. Because
              this limit applies cluster-wide, but each Impala node makes
              independent decisions to run queries immediately or queue them, it
              is a soft limit; the overall memory used by concurrent queries
              might be slightly higher during times of heavy load. Ignored if
              <codeph>fair_scheduler_config_path</codeph> and
              <codeph>llama_site_path</codeph> are set.
              <note conref="../shared/impala_common.xml#common/admission_compute_stats"/>
              <p conref="../shared/impala_common.xml#common/type_string"/>
              <p>
                <b>Default:</b> <codeph>""</codeph> (empty string, meaning unlimited)
              </p>
            </dd>
          </dlentry>
          <!-- Possibly from here on down, command-line controls not applicable to CM. -->
          <dlentry id="disable_admission_control">
            <dt>
              <codeph>disable_admission_control</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--disable_admission_control</indexterm>
              <b>Purpose:</b> Turns off the admission control feature entirely,
              regardless of other configuration option settings.
              <p>
                <b>Type:</b> Boolean
              </p>
              <p>
                <b>Default:</b> <codeph>false</codeph>
              </p>
            </dd>
          </dlentry>
          <dlentry id="disable_pool_max_requests">
            <dt>
              <codeph>disable_pool_max_requests</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--disable_pool_max_requests</indexterm>
              <b>Purpose:</b> Disables all per-pool limits on the maximum number
              of running requests.
              <p>
                <b>Type:</b> Boolean
              </p>
              <p>
                <b>Default:</b> <codeph>false</codeph>
              </p>
            </dd>
          </dlentry>
          <dlentry id="disable_pool_mem_limits">
            <dt>
              <codeph>disable_pool_mem_limits</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--disable_pool_mem_limits</indexterm>
              <b>Purpose:</b> Disables all per-pool mem limits.
              <p>
                <b>Type:</b> Boolean
              </p>
              <p>
                <b>Default:</b> <codeph>false</codeph>
              </p>
            </dd>
          </dlentry>
          <dlentry id="fair_scheduler_allocation_path">
            <dt>
              <codeph>fair_scheduler_allocation_path</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--fair_scheduler_allocation_path</indexterm>
              <b>Purpose:</b> Path to the fair scheduler allocation file
              (<codeph>fair-scheduler.xml</codeph>).
              <p conref="../shared/impala_common.xml#common/type_string"/>
              <p>
                <b>Default:</b> <codeph>""</codeph> (empty string)
              </p>
              <p>
                <b>Usage notes:</b> Admission control only uses a small subset
                of the settings that can go in this file, as described below.
                For details about all the Fair Scheduler configuration settings,
                see the <xref href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Configuration" scope="external" format="html">Apache wiki</xref>.
              </p>
            </dd>
          </dlentry>
          <dlentry id="llama_site_path">
            <dt>
              <codeph>llama_site_path</codeph>
            </dt>
            <dd>
              <indexterm audience="Cloudera">--llama_site_path</indexterm>
              <b>Purpose:</b> Path to the configuration file used by admission control
              (<codeph>llama-site.xml</codeph>). If set,
              <codeph>fair_scheduler_allocation_path</codeph> must also be set.
              <p conref="../shared/impala_common.xml#common/type_string"/>
              <p>
                <b>Default:</b> <codeph>""</codeph> (empty string)
              </p>
              <p>
                <b>Usage notes:</b> Admission control only uses a few
                of the settings that can go in this file, as described below.
              </p>
            </dd>
          </dlentry>
        </dl>
|
||||
</section>
|
||||
</conbody>
|
||||
|
||||
<concept id="admission_config_cm">
|
||||
|
||||
<!-- TK: Maybe all this stuff overlaps with admin_impala_admission_control and can be delegated there. -->
|
||||
|
||||
<title>Configuring Admission Control Using Cloudera Manager</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Cloudera Manager"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
In Cloudera Manager, you can configure resource pools to manage queued Impala queries, set the limit
on the number of concurrent queries, and control how queries that exceed that limit are handled. For details, see
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_managing_resources.html" scope="external" format="html">Managing Resources with Cloudera Manager</xref>.
</p>

<p audience="Cloudera"><!-- Hiding link because that subtopic is now hidden. -->
|
||||
See <xref href="#admission_examples"/> for a sample setup for admission control under
|
||||
Cloudera Manager.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="admission_config_noncm">
|
||||
|
||||
<title>Configuring Admission Control Using the Command Line</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
If you do not use Cloudera Manager, you configure admission control through a combination of startup options
for the Impala daemon and, optionally, by editing or manually constructing the configuration files
<filepath>fair-scheduler.xml</filepath> and <filepath>llama-site.xml</filepath>.
</p>

<p>
For a straightforward configuration using a single resource pool named <codeph>default</codeph>, you can
specify configuration options on the command line and skip the <filepath>fair-scheduler.xml</filepath>
and <filepath>llama-site.xml</filepath> configuration files.
</p>
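
<p>
For example, a minimal sketch of such a single-pool setup, using the startup options described
earlier in this section (the limit values here are illustrative placeholders, not recommendations):
</p>

<codeblock>impalad --default_pool_max_requests=10 \
    --default_pool_max_queued=50 \
    --default_pool_mem_limit=100g</codeblock>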

<p>
For an advanced configuration with multiple resource pools using different settings, set up the
<filepath>fair-scheduler.xml</filepath> and <filepath>llama-site.xml</filepath> configuration files
manually. Provide the paths to each one using the <cmdname>impalad</cmdname> command-line options,
<codeph>--fair_scheduler_allocation_path</codeph> and <codeph>--llama_site_path</codeph> respectively.
</p>

<p>
The Impala admission control feature only uses the Fair Scheduler configuration settings to determine how
to map users and groups to different resource pools. For example, you might set up different resource
pools with separate memory limits, and maximum number of concurrent and queued queries, for different
categories of users within your organization. For details about all the Fair Scheduler configuration
settings, see the
<xref href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Configuration" scope="external" format="html">Apache
wiki</xref>.
</p>

<p>
The Impala admission control feature only uses a small subset of possible settings from the
<filepath>llama-site.xml</filepath> configuration file:
</p>

<codeblock>llama.am.throttling.maximum.placed.reservations.<varname>queue_name</varname>
llama.am.throttling.maximum.queued.reservations.<varname>queue_name</varname>
<ph rev="2.5.0 IMPALA-2538">impala.admission-control.pool-default-query-options.<varname>queue_name</varname>
impala.admission-control.pool-queue-timeout-ms.<varname>queue_name</varname></ph>
</codeblock>

<p rev="2.5.0 IMPALA-2538">
|
||||
The <codeph>impala.admission-control.pool-queue-timeout-ms</codeph>
|
||||
setting specifies the timeout value for this pool, in milliseconds.
|
||||
The<codeph>impala.admission-control.pool-default-query-options</codeph>
|
||||
settings designates the default query options for all queries that run
|
||||
in this pool. Its argument value is a comma-delimited string of
|
||||
'key=value' pairs, for example,<codeph>'key1=val1,key2=val2'</codeph>.
|
||||
For example, this is where you might set a default memory limit
|
||||
for all queries in the pool, using an argument such as <codeph>MEM_LIMIT=5G</codeph>.
|
||||
</p>
|
||||
|
||||
<p rev="2.5.0 IMPALA-2538">
|
||||
The <codeph>impala.admission-control.*</codeph> configuration settings are available in
|
||||
<keyword keyref="impala25_full"/> and higher.
|
||||
</p>
|
||||
|
||||
<p audience="Cloudera"><!-- Hiding link because that subtopic is now hidden. -->
|
||||
See <xref href="#admission_examples/section_etq_qgb_rq"/> for sample configuration files
|
||||
for admission control using multiple resource pools, without Cloudera Manager.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="admission_examples">
|
||||
<!-- Pruning the CM examples and screenshots because in Impala 2.5 the defaults match up much better with our recommendations. -->
|
||||
|
||||
<title>Examples of Admission Control Configurations</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<section id="section_fqn_qgb_rq">
|
||||
|
||||
<title>Example Admission Control Configurations Using Cloudera Manager</title>
|
||||
|
||||
<p>
|
||||
For full instructions about configuring dynamic resource pools through Cloudera Manager, see
|
||||
<xref audience="integrated" href="cm_mc_resource_pools.xml#xd_583c10bfdbd326ba--43d5fd93-1410993f8c2--7ff2"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>.
|
||||
</p>
|
||||
|
||||
</section>
|
||||
|
||||
<section id="section_etq_qgb_rq">
|
||||
|
||||
<title>Example Admission Control Configurations Using Configuration Files</title>
|
||||
|
||||
<p>
|
||||
For clusters not managed by Cloudera Manager, here are sample <filepath>fair-scheduler.xml</filepath>
|
||||
and <filepath>llama-site.xml</filepath> files that define resource pools <codeph>root.default</codeph>,
|
||||
<codeph>root.development</codeph>, and <codeph>root.production</codeph>.
|
||||
These sample files are stripped down: in a real deployment they
|
||||
might contain other settings for use with various aspects of the YARN component. The
|
||||
settings shown here are the significant ones for the Impala admission control feature.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<b>fair-scheduler.xml:</b>
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Although Impala does not use the <codeph>vcores</codeph> value, you must still specify it to satisfy
|
||||
YARN requirements for the file contents.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Each <codeph><aclSubmitApps></codeph> tag (other than the one for <codeph>root</codeph>) contains
|
||||
a comma-separated list of users, then a space, then a comma-separated list of groups; these are the
|
||||
users and groups allowed to submit Impala statements to the corresponding resource pool.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If you leave the <codeph><aclSubmitApps></codeph> element empty for a pool, nobody can submit
|
||||
directly to that pool; child pools can specify their own <codeph><aclSubmitApps></codeph> values
|
||||
to authorize users and groups to submit to those pools.
|
||||
</p>
|
||||
|
||||
<codeblock><![CDATA[<allocations>
|
||||
<queue name="root">
|
||||
<aclSubmitApps> </aclSubmitApps>
|
||||
<queue name="default">
|
||||
<maxResources>50000 mb, 0 vcores</maxResources>
|
||||
<aclSubmitApps>*</aclSubmitApps>
|
||||
</queue>
|
||||
<queue name="development">
|
||||
<maxResources>200000 mb, 0 vcores</maxResources>
|
||||
<aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
|
||||
</queue>
|
||||
<queue name="production">
|
||||
<maxResources>1000000 mb, 0 vcores</maxResources>
|
||||
<aclSubmitApps> ops,admin</aclSubmitApps>
|
||||
</queue>
|
||||
</queue>
|
||||
<queuePlacementPolicy>
|
||||
<rule name="specified" create="false"/>
|
||||
<rule name="default" />
|
||||
</queuePlacementPolicy>
|
||||
</allocations>
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
<b>llama-site.xml:</b>
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.5.0 IMPALA-2538"><![CDATA[
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<configuration>
|
||||
<property>
|
||||
<name>llama.am.throttling.maximum.placed.reservations.root.default</name>
|
||||
<value>10</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>llama.am.throttling.maximum.queued.reservations.root.default</name>
|
||||
<value>50</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>impala.admission-control.pool-default-query-options.root.default</name>
|
||||
<value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
|
||||
<value>30000</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>llama.am.throttling.maximum.placed.reservations.root.development</name>
|
||||
<value>50</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>llama.am.throttling.maximum.queued.reservations.root.development</name>
|
||||
<value>100</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>impala.admission-control.pool-default-query-options.root.development</name>
|
||||
<value>mem_limit=256m,query_timeout_s=30,max_io_buffers=10</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>impala.admission-control.pool-queue-timeout-ms.root.development</name>
|
||||
<value>15000</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>llama.am.throttling.maximum.placed.reservations.root.production</name>
|
||||
<value>100</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>llama.am.throttling.maximum.queued.reservations.root.production</name>
|
||||
<value>200</value>
|
||||
</property>
|
||||
<!--
|
||||
Default query options for the 'root.production' pool.
|
||||
THIS IS A NEW PARAMETER in CDH 5.7 / Impala 2.5.
|
||||
Note that the MEM_LIMIT query option still shows up in here even though it is a
|
||||
separate box in the UI. We do that because it is the most important query option
|
||||
that people will need (everything else is somewhat advanced).
|
||||
|
||||
MEM_LIMIT takes a per-node memory limit which is specified using one of the following:
|
||||
- '<int>[bB]?' -> bytes (default if no unit given)
|
||||
- '<float>[mM(bB)]' -> megabytes
|
||||
- '<float>[gG(bB)]' -> in gigabytes
|
||||
E.g. 'MEM_LIMIT=12345' (no unit) means 12345 bytes, and you can append m or g
|
||||
to specify megabytes or gigabytes, though that is not required.
|
||||
-->
|
||||
<property>
|
||||
<name>impala.admission-control.pool-default-query-options.root.production</name>
|
||||
<value>mem_limit=386m,query_timeout_s=30,max_io_buffers=10</value>
|
||||
</property>
|
||||
<!--
|
||||
Default queue timeout (ms) for the pool 'root.production'.
|
||||
If this isn’t set, the process-wide flag is used.
|
||||
THIS IS A NEW PARAMETER in CDH 5.7 / Impala 2.5.
|
||||
-->
|
||||
<property>
|
||||
<name>impala.admission-control.pool-queue-timeout-ms.root.production</name>
|
||||
<value>30000</value>
|
||||
</property>
|
||||
</configuration>
|
||||
]]>
|
||||
</codeblock>
|
||||
</section>
|
||||
</conbody>
|
||||
</concept>
|
||||
</concept>
|
||||
|
||||
<!-- End Config -->
|
||||
|
||||
<concept id="admission_guidelines">
|
||||
|
||||
<title>Guidelines for Using Admission Control</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Planning"/>
|
||||
<data name="Category" value="Guidelines"/>
|
||||
<data name="Category" value="Best Practices"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
To see how admission control works for particular queries, examine the profile output for the query. This
|
||||
information is available through the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname>
|
||||
immediately after running a query in the shell, on the <uicontrol>queries</uicontrol> page of the Impala
|
||||
debug web UI, or in the Impala log file (basic information at log level 1, more detailed information at log
|
||||
level 2). The profile output contains details about the admission decision, such as whether the query was
|
||||
queued or not and which resource pool it was assigned to. It also includes the estimated and actual memory
|
||||
usage for the query, so you can fine-tune the configuration for the memory limits of the resource pools.
|
||||
</p>
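
<p>
For example, here is a minimal sketch of checking the admission decision for a query in
<cmdname>impala-shell</cmdname> (the table name is hypothetical):
</p>

<codeblock>select count(*) from web_logs;
-- The PROFILE statement displays the profile of the query just executed,
-- including the admission decision and the resource pool it was assigned to.
profile;</codeblock>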

<p>
Where practical, use Cloudera Manager to configure the admission control parameters. The Cloudera Manager
GUI is much simpler than editing the configuration files directly.
</p>

<p>
Remember that the limits imposed by admission control are <q>soft</q> limits.
The decentralized nature of this mechanism means that each Impala node makes its own decisions about whether
to allow queries to run immediately or to queue them. These decisions rely on information passed back and forth
between nodes by the statestore service. If a sudden surge in requests causes more queries than anticipated to run
concurrently, then throughput could decrease due to queries spilling to disk or contending for resources;
or queries could be cancelled if they exceed the <codeph>MEM_LIMIT</codeph> setting while running.
</p>

<!--
<p>
If you have trouble getting a query to run because its estimated memory usage is too high, you can override
the estimate by setting the <codeph>MEM_LIMIT</codeph> query option in <cmdname>impala-shell</cmdname>,
then issuing the query through the shell in the same session. The <codeph>MEM_LIMIT</codeph> value is
treated as the estimated amount of memory, overriding the estimate that Impala would generate based on
table and column statistics. This value is used only for making admission control decisions, and is not
pre-allocated by the query.
</p>
-->

<p>
In <cmdname>impala-shell</cmdname>, you can also specify which resource pool to direct queries to by
setting the <codeph>REQUEST_POOL</codeph> query option.
</p>
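
<p>
For example, here is a sketch; the pool and table names are assumptions based on the
multi-pool configuration shown earlier:
</p>

<codeblock>set request_pool=root.development;
select count(*) from web_logs;</codeblock>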

<p>
The statements affected by the admission control feature are primarily queries, but also include statements
that write data such as <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>. Most write
operations in Impala are not resource-intensive, but inserting into a Parquet table can require substantial
memory due to buffering intermediate data before writing out each Parquet data block. See
<xref href="impala_parquet.xml#parquet_etl"/> for instructions about inserting data efficiently into
Parquet tables.
</p>

<p>
Although admission control does not scrutinize memory usage for other kinds of DDL statements, if a query
is queued due to a limit on concurrent queries or memory usage, subsequent statements in the same session
are also queued so that they are processed in the correct order:
</p>

<codeblock>-- This query could be queued to avoid out-of-memory at times of heavy load.
select * from huge_table join enormous_table using (id);
-- If so, this subsequent statement in the same session is also queued
-- until the previous statement completes.
drop table huge_table;
</codeblock>

<p>
If you set up different resource pools for different users and groups, consider reusing any classifications
you developed for use with Sentry security. See <xref href="impala_authorization.xml#authorization"/> for details.
</p>

<p>
For details about all the Fair Scheduler configuration settings, see
<xref href="https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Configuration" scope="external" format="html">Fair
Scheduler Configuration</xref>, in particular the tags such as <codeph>&lt;queue&gt;</codeph> and
<codeph>&lt;aclSubmitApps&gt;</codeph> to map users and groups to particular resource pools (queues).
</p>

<!-- Wait a sec. We say admission control doesn't use RESERVATION_REQUEST_TIMEOUT at all.
What's the real story here? Matt did refer to some timeout option that was
available through the shell but not the DB-centric APIs.
<p>
Because you cannot override query options such as
<codeph>RESERVATION_REQUEST_TIMEOUT</codeph>
in a JDBC or ODBC application, consider configuring timeout periods
on the application side to cancel queries that take
too long due to being queued during times of high load.
</p>
-->
</conbody>
</concept>
</concept>
<!-- Admission control -->
33
docs/topics/impala_aggregate_functions.xml
Normal file
@@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="aggregate_functions">

<title>Impala Aggregate Functions</title>
<titlealts audience="PDF"><navtitle>Aggregate Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>

<conbody>

<p conref="../shared/impala_common.xml#common/aggr1"/>

<codeblock conref="../shared/impala_common.xml#common/aggr2"/>

<p conref="../shared/impala_common.xml#common/aggr3"/>

<p>
<indexterm audience="Cloudera">aggregate functions</indexterm>
</p>

<p outputclass="toc"/>
</conbody>
</concept>
87
docs/topics/impala_aliases.xml
Normal file
@@ -0,0 +1,87 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="aliases">

<title>Overview of Impala Aliases</title>
<titlealts audience="PDF"><navtitle>Aliases</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>

<conbody>

<p>
When you write the names of tables, columns, or column expressions in a query, you can assign an alias at the
same time. Then you can specify the alias rather than the original name when making other references to the
table or column in the same statement. You typically specify aliases that are shorter than the original
names, easier to remember, or both. The aliases are printed in the query header, making them useful for
self-documenting output.
</p>

<p>
To set up an alias, add the <codeph>AS <varname>alias</varname></codeph> clause immediately after any table,
column, or expression name in the <codeph>SELECT</codeph> list or <codeph>FROM</codeph> list of a query. The
<codeph>AS</codeph> keyword is optional; you can also specify the alias immediately after the original name.
</p>

<codeblock>-- Make the column headers of the result set easier to understand.
SELECT c1 AS name, c2 AS address, c3 AS phone FROM table_with_terse_columns;
SELECT SUM(ss_xyz_dollars_net) AS total_sales FROM table_with_cryptic_columns;
-- The alias can be a quoted string for extra readability.
SELECT c1 AS "Employee ID", c2 AS "Date of hire" FROM t1;
-- The AS keyword is optional.
SELECT c1 "Employee ID", c2 "Date of hire" FROM t1;

-- The table aliases assigned in the FROM clause can be used both earlier
-- in the query (the SELECT list) and later (the WHERE clause).
SELECT one.name, two.address, three.phone
FROM census one, building_directory two, phonebook three
WHERE one.id = two.id and two.id = three.id;

-- The aliases c1 and c2 let the query handle columns with the same names from 2 joined tables.
-- The aliases t1 and t2 let the query abbreviate references to long or cryptically named tables.
SELECT t1.column_n AS c1, t2.column_n AS c2 FROM long_name_table AS t1, very_long_name_table2 AS t2
WHERE c1 = c2;
SELECT t1.column_n c1, t2.column_n c2 FROM table1 t1, table2 t2
WHERE c1 = c2;
</codeblock>

<p>
To use an alias name that matches one of the Impala reserved keywords (listed in
<xref href="impala_reserved_words.xml#reserved_words"/>), surround the identifier with either single or
double quotation marks, or <codeph>``</codeph> characters (backticks).
</p>
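
<p>
For example, here is a minimal sketch of quoting an alias that collides with a reserved keyword:
</p>

<codeblock>-- SELECT is a reserved word, so the alias must be quoted or enclosed in backticks.
SELECT c1 AS `select` FROM t1;</codeblock>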

<p>
<ph conref="../shared/impala_common.xml#common/aliases_vs_identifiers"/>
</p>

<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

<p rev="2.3.0">
Queries involving the complex types (<codeph>ARRAY</codeph>,
<codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) typically make
extensive use of table aliases. These queries involve join clauses
where the complex type column is treated as a joined table.
To construct two-part or three-part qualified names for the
complex column elements in the <codeph>FROM</codeph> list,
sometimes it is syntactically required to construct a table
alias for the complex column where it is referenced in the join clause.
See <xref href="impala_complex_types.xml#complex_types"/> for details and examples.
</p>
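
<p>
For example, here is a minimal sketch, assuming a hypothetical Parquet table
<codeph>customers</codeph> with an <codeph>ARRAY&lt;STRING&gt;</codeph> column named
<codeph>phone_numbers</codeph>:
</p>

<codeblock>-- The alias PN lets the query treat the array column like a joined table
-- and refer to its elements through the ITEM pseudocolumn.
SELECT c.id, pn.item FROM customers c, c.phone_numbers pn;</codeblock>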

<p>
<b>Alternatives:</b>
</p>

<p conref="../shared/impala_common.xml#common/views_vs_identifiers"/>
</conbody>
</concept>
31
docs/topics/impala_allow_unsupported_formats.xml
Normal file
@@ -0,0 +1,31 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="allow_unsupported_formats">

<title>ALLOW_UNSUPPORTED_FORMATS Query Option</title>
<titlealts audience="PDF"><navtitle>ALLOW_UNSUPPORTED_FORMATS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Deprecated Features"/>
</metadata>
</prolog>

<conbody>

<!--
The original brief explanation with not enough detail comes from the comments at:
http://github.sf.cloudera.com/CDH/Impala/raw/master/common/thrift/ImpalaService.thrift
Removing that wording from here after discussions with dev team. Just recording the URL for posterity.
-->

<p>
An obsolete query option from early work on support for file formats. Do not use. Might be removed in the
future.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
</conbody>
</concept>
21
docs/topics/impala_alter_function.xml
Normal file
@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" rev="1.x" id="alter_function">

<title>ALTER FUNCTION Statement</title>
<titlealts audience="PDF"><navtitle>ALTER FUNCTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p/>
</conbody>
</concept>
806
docs/topics/impala_alter_table.xml
Normal file
@@ -0,0 +1,806 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="alter_table">

<title>ALTER TABLE Statement</title>
<titlealts audience="PDF"><navtitle>ALTER TABLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="S3"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">ALTER TABLE statement</indexterm>
The <codeph>ALTER TABLE</codeph> statement changes the structure or properties of an existing Impala table.
</p>
<p>
In Impala, this is primarily a logical operation that updates the table metadata in the metastore database that Impala
shares with Hive. Most <codeph>ALTER TABLE</codeph> operations do not actually rewrite, move, or otherwise modify the actual data
files. (The <codeph>RENAME TO</codeph> clause is the one exception; it can cause HDFS files to be moved to different paths.)
When you do an <codeph>ALTER TABLE</codeph> operation, you typically need to perform corresponding physical filesystem operations,
such as rewriting the data files to include extra fields, or converting them to a different file format.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>ALTER TABLE [<varname>old_db_name</varname>.]<varname>old_table_name</varname> RENAME TO [<varname>new_db_name</varname>.]<varname>new_table_name</varname>

ALTER TABLE <varname>name</varname> ADD COLUMNS (<varname>col_spec</varname>[, <varname>col_spec</varname> ...])
ALTER TABLE <varname>name</varname> DROP [COLUMN] <varname>column_name</varname>
ALTER TABLE <varname>name</varname> CHANGE <varname>column_name</varname> <varname>new_name</varname> <varname>new_type</varname>
ALTER TABLE <varname>name</varname> REPLACE COLUMNS (<varname>col_spec</varname>[, <varname>col_spec</varname> ...])

ALTER TABLE <varname>name</varname> { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (<varname>partition_spec</varname>) <ph rev="2.3.0">[PURGE]</ph>
<ph rev="2.3.0 IMPALA-1568 CDH-36799">ALTER TABLE <varname>name</varname> RECOVER PARTITIONS</ph>

ALTER TABLE <varname>name</varname> [PARTITION (<varname>partition_spec</varname>)]
  SET { FILEFORMAT <varname>file_format</varname>
      | LOCATION '<varname>hdfs_path_of_directory</varname>'
      | TBLPROPERTIES (<varname>table_properties</varname>)
      | SERDEPROPERTIES (<varname>serde_properties</varname>) }

<ph rev="2.6.0 IMPALA-3369">ALTER TABLE <varname>name</varname> SET COLUMN STATS <varname>colname</varname>
  ('<varname>statsKey</varname>'='<varname>val</varname>', ...)

statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>

<ph rev="1.4.0">ALTER TABLE <varname>name</varname> [PARTITION (<varname>partition_spec</varname>)] SET { CACHED IN '<varname>pool_name</varname>' <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED }</ph>

<varname>new_name</varname> ::= [<varname>new_database</varname>.]<varname>new_table_name</varname>

<varname>col_spec</varname> ::= <varname>col_name</varname> <varname>type_name</varname>

<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>

<varname>table_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]

<varname>serde_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]

<varname>file_format</varname> ::= { PARQUET | TEXTFILE | RCFILE | SEQUENCEFILE | AVRO }
</codeblock>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, the <codeph>ALTER TABLE</codeph> statement can
change the metadata for tables containing complex types (<codeph>ARRAY</codeph>,
<codeph>STRUCT</codeph>, and <codeph>MAP</codeph>).
For example, you can use an <codeph>ADD COLUMNS</codeph>, <codeph>DROP COLUMN</codeph>, or <codeph>CHANGE</codeph>
clause to modify the table layout for complex type columns.
Although Impala queries only work for complex type columns in Parquet tables, the complex type support in the
<codeph>ALTER TABLE</codeph> statement applies to all file formats.
For example, you can use Impala to update metadata for a staging table in a non-Parquet file format where the
data is populated by Hive. Or you can use <codeph>ALTER TABLE SET FILEFORMAT</codeph> to change the format
of an existing table to Parquet so that Impala can query it. Remember that changing the file format for a table does
not convert the data files within the table; you must prepare any Parquet data files containing complex types
outside Impala, and bring them into the table using <codeph>LOAD DATA</codeph> or updating the table's
<codeph>LOCATION</codeph> property.
See <xref href="impala_complex_types.xml#complex_types"/> for details about using complex types.
</p>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
Whenever you specify partitions in an <codeph>ALTER TABLE</codeph> statement, through the <codeph>PARTITION
(<varname>partition_spec</varname>)</codeph> clause, you must include all the partitioning columns in the
specification.
</p>
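
<p>
For example, here is a minimal sketch, assuming a hypothetical table <codeph>logs</codeph>
partitioned by <codeph>(year, month)</codeph>; both partitioning columns must appear in the
specification:
</p>

<codeblock>alter table logs add partition (year=2016, month=5);
alter table logs drop partition (year=2016, month=5);</codeblock>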

<p>
Most of the <codeph>ALTER TABLE</codeph> operations work the same for internal tables (managed by Impala) as
for external tables (with data files located in arbitrary locations). The exception is renaming a table; for
an external table, the underlying data directory is not renamed or moved.
</p>

<p conref="../shared/impala_common.xml#common/s3_blurb"/>

<p rev="2.6.0 CDH-39913 IMPALA-1878">
You can specify an <codeph>s3a://</codeph> prefix on the <codeph>LOCATION</codeph> attribute of a table or partition
to make Impala query data from the Amazon S3 filesystem. In <keyword keyref="impala26_full"/> and higher, Impala automatically
handles creating or removing the associated folders when you issue <codeph>ALTER TABLE</codeph> statements
with the <codeph>ADD PARTITION</codeph> or <codeph>DROP PARTITION</codeph> clauses.
</p>

<p conref="../shared/impala_common.xml#common/s3_ddl"/>

<p rev="1.4.0">
<b>HDFS caching (CACHED IN clause):</b>
</p>

<p rev="1.4.0">
If you specify the <codeph>CACHED IN</codeph> clause, any existing or future data files in the table
directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching
mechanism. See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details about using the HDFS
caching feature.
</p>
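
<p>
For example, here is a minimal sketch, assuming an HDFS cache pool named
<codeph>four_gig_pool</codeph> was already created with the <cmdname>hdfs cacheadmin</cmdname> command:
</p>

<codeblock>-- Designate the table's data files to be cached, with 3 cached replicas.
alter table t1 set cached in 'four_gig_pool' with replication = 3;
-- Later, remove the table's data from the HDFS cache.
alter table t1 set uncached;</codeblock>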

<p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>

<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

<p>
The following sections show examples of the use cases for various <codeph>ALTER TABLE</codeph> clauses.
</p>

<p>
<b>To rename a table (RENAME TO clause):</b>
</p>

<!-- Beefing up the syntax in its original location up to, don't need to repeat it here.
<codeblock>ALTER TABLE <varname>old_name</varname> RENAME TO <varname>new_name</varname>;</codeblock>
-->

<p>
The <codeph>RENAME TO</codeph> clause lets you change the name of an existing table, and optionally which
database it is located in.
</p>

<p>
For internal tables, this operation physically renames the directory within HDFS that contains the data files;
the original directory name no longer exists. By qualifying the table names with database names, you can use
this technique to move an internal table (and its associated data directory) from one database to another.
For example:
</p>

<codeblock>create database d1;
create database d2;
create database d3;
use d1;
create table mobile (x int);
use d2;
-- Move table from another database to the current one.
alter table d1.mobile rename to mobile;
use d1;
-- Move table from one database to another.
alter table d2.mobile rename to d3.mobile;</codeblock>

<p>
For external tables, renaming the table only changes the table name in the metastore database;
the data files and the directory that contains them are not renamed or moved.
</p>

<p>
<b>To change the physical location where Impala looks for data files associated with a table or
partition:</b>
</p>

<codeblock>ALTER TABLE <varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)] SET LOCATION '<varname>hdfs_path_of_directory</varname>';</codeblock>

<p>
The path you specify is the full HDFS path where the data files reside, or will be created. Impala does not
create any additional subdirectory named after the table. Impala does not move any data files to this new
location or change any data files that might already exist in that directory.
</p>

<p>
To set the location for a single partition, include the <codeph>PARTITION</codeph> clause. Specify all the
same partitioning columns for the table, with a constant value for each, to precisely identify the single
partition affected by the statement:
</p>

<codeblock>create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Redirect queries, INSERT, and LOAD DATA for one partition
-- to a specific different directory.
alter table p1 partition (month=1, day=1) set location '/usr/external_data/new_years_day';
</codeblock>

<note conref="../shared/impala_common.xml#common/add_partition_set_location"/>

<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
<b>To automatically detect new partition directories added through Hive or HDFS operations:</b>
|
||||
</p>
|
||||
|
||||
<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
In <keyword keyref="impala23_full"/> and higher, the <codeph>RECOVER PARTITIONS</codeph> clause scans
|
||||
a partitioned table to detect if any new partition directories were added outside of Impala,
|
||||
such as by Hive <codeph>ALTER TABLE</codeph> statements or by <cmdname>hdfs dfs</cmdname>
|
||||
or <cmdname>hadoop fs</cmdname> commands. The <codeph>RECOVER PARTITIONS</codeph> clause
|
||||
automatically recognizes any data files present in these new directories, the same as
|
||||
the <codeph>REFRESH</codeph> statement does.
|
||||
</p>
|
||||
|
||||
<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
For example, here is a sequence of examples showing how you might create a partitioned table in Impala,
|
||||
create new partitions through Hive, copy data files into the new partitions with the <cmdname>hdfs</cmdname>
|
||||
command, and have Impala recognize the new partitions and new data:
|
||||
</p>
|
||||
|
||||
<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
In Impala, create the table, and a single partition for demonstration purposes:
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
<![CDATA[
|
||||
create database recover_partitions;
|
||||
use recover_partitions;
|
||||
create table t1 (s string) partitioned by (yy int, mm int);
|
||||
insert into t1 partition (yy = 2016, mm = 1) values ('Partition exists');
|
||||
show files in t1;
|
||||
+---------------------------------------------------------------------+------+--------------+
|
||||
| Path | Size | Partition |
|
||||
+---------------------------------------------------------------------+------+--------------+
|
||||
| /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt | 17B | yy=2016/mm=1 |
|
||||
+---------------------------------------------------------------------+------+--------------+
|
||||
quit;
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
In Hive, create some new partitions. In a real use case, you might create the
|
||||
partitions and populate them with data as the final stages of an ETL pipeline.
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
<![CDATA[
|
||||
hive> use recover_partitions;
|
||||
OK
|
||||
hive> alter table t1 add partition (yy = 2016, mm = 2);
|
||||
OK
|
||||
hive> alter table t1 add partition (yy = 2016, mm = 3);
|
||||
OK
|
||||
hive> quit;
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
For demonstration purposes, manually copy data (a single row) into these
|
||||
new partitions, using manual HDFS operations:
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
<![CDATA[
|
||||
$ hdfs dfs -ls /user/hive/warehouse/recover_partitions.db/t1/yy=2016/
|
||||
Found 3 items
|
||||
drwxr-xr-x - impala hive 0 2016-05-09 16:06 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1
|
||||
drwxr-xr-x - jrussell hive 0 2016-05-09 16:14 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2
|
||||
drwxr-xr-x - jrussell hive 0 2016-05-09 16:13 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3
|
||||
|
||||
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
|
||||
/user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2/data.txt
|
||||
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
|
||||
/user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3/data.txt
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
<![CDATA[
|
||||
hive> select * from t1;
|
||||
OK
|
||||
Partition exists 2016 1
|
||||
Partition exists 2016 2
|
||||
Partition exists 2016 3
|
||||
hive> quit;
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
In Impala, initially the partitions and data are not visible.
|
||||
Running <codeph>ALTER TABLE</codeph> with the <codeph>RECOVER PARTITIONS</codeph>
|
||||
clause scans the table data directory to find any new partition directories, and
|
||||
the data files inside them:
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
|
||||
<![CDATA[
|
||||
select * from t1;
|
||||
+------------------+------+----+
|
||||
| s | yy | mm |
|
||||
+------------------+------+----+
|
||||
| Partition exists | 2016 | 1 |
|
||||
+------------------+------+----+
|
||||
|
||||
alter table t1 recover partitions;
|
||||
select * from t1;
|
||||
+------------------+------+----+
|
||||
| s | yy | mm |
|
||||
+------------------+------+----+
|
||||
| Partition exists | 2016 | 1 |
|
||||
| Partition exists | 2016 | 3 |
|
||||
| Partition exists | 2016 | 2 |
|
||||
+------------------+------+----+
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="1.2">
|
||||
<b>To change the key-value pairs of the TBLPROPERTIES and SERDEPROPERTIES fields:</b>
|
||||
</p>
|
||||
|
||||
<codeblock>ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>'[, ...]);
|
||||
ALTER TABLE <varname>table_name</varname> SET SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>'[, ...]);</codeblock>
|
||||
|
||||
<p>
|
||||
The <codeph>TBLPROPERTIES</codeph> clause is primarily a way to associate arbitrary user-specified data items
|
||||
with a particular table.
|
||||
</p>
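
<p>
For example, here is a minimal sketch of attaching arbitrary annotations to a table; the
key names and values are purely illustrative:
</p>

<codeblock>alter table t1 set tblproperties ('notes'='reloaded nightly', 'contact'='etl_team');
-- DESCRIBE FORMATTED shows the properties along with the other table metadata.
describe formatted t1;</codeblock>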

<p>
The <codeph>SERDEPROPERTIES</codeph> clause sets up metadata defining how tables are read or written, needed
in some cases by Hive but not used extensively by Impala. You would use this clause primarily to change the
delimiter in an existing text table or partition, by setting the <codeph>'serialization.format'</codeph> and
<codeph>'field.delim'</codeph> property values to the new delimiter character:
</p>

<codeblock>-- This table begins life as pipe-separated text format.
create table change_to_csv (s1 string, s2 string) row format delimited fields terminated by '|';
-- Then we change it to a CSV table.
alter table change_to_csv set SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=',');
insert overwrite change_to_csv values ('stop','go'), ('yes','no');
!hdfs dfs -cat 'hdfs://<varname>hostname</varname>:8020/<varname>data_directory</varname>/<varname>dbname</varname>.db/change_to_csv/<varname>data_file</varname>';
stop,go
yes,no</codeblock>

<p>
Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the current values of these properties for an
existing table. See <xref href="impala_create_table.xml#create_table"/> for more details about these clauses.
See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for an example of using table properties to
fine-tune the performance-related table statistics.
</p>

<p>
<b>To manually set or update table or column statistics:</b>
</p>

<p>
Although for most tables the <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
statement is all you need to keep table and column statistics up to date for a table,
sometimes for a very large table or one that is updated frequently, the length of time to recompute
all the statistics might make it impractical to run those statements as often as needed.
As a workaround, you can use the <codeph>ALTER TABLE</codeph> statement to set table statistics
at the level of the entire table or a single partition, or column statistics at the level of
the entire table.
</p>

<p>
You can set the <codeph>numrows</codeph> value for table statistics by changing the
<codeph>TBLPROPERTIES</codeph> setting for a table or partition.
For example:
<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/>
<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/>
See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details.
</p>

<p rev="2.6.0 IMPALA-3369">
In <keyword keyref="impala26_full"/> and higher, you can use the <codeph>SET COLUMN STATS</codeph> clause
to set a specific stats value for a particular column.
</p>

<p conref="../shared/impala_common.xml#common/set_column_stats_example"/>

<p>
<b>To reorganize columns for a table:</b>
</p>

<codeblock>ALTER TABLE <varname>table_name</varname> ADD COLUMNS (<varname>column_defs</varname>);
ALTER TABLE <varname>table_name</varname> REPLACE COLUMNS (<varname>column_defs</varname>);
ALTER TABLE <varname>table_name</varname> CHANGE <varname>column_name</varname> <varname>new_name</varname> <varname>new_type</varname>;
ALTER TABLE <varname>table_name</varname> DROP <varname>column_name</varname>;</codeblock>

<p>
The <varname>col_spec</varname> is the same as in the <codeph>CREATE TABLE</codeph> statement: the column
name, then its data type, then an optional comment. You can add multiple columns at a time. The parentheses
are required whether you add a single column or multiple columns. When you replace columns, all the original
column definitions are discarded. You might use this technique if you receive a new set of data files with
different data types or columns in a different order. (The data files are retained, so if the new columns are
incompatible with the old ones, use <codeph>INSERT OVERWRITE</codeph> or <codeph>LOAD DATA OVERWRITE</codeph>
to replace all the data before issuing any further queries.)
</p>

<p rev="CDH-37178">
|
||||
For example, here is how you might add columns to an existing table.
|
||||
The first <codeph>ALTER TABLE</codeph> adds two new columns, and the second
|
||||
<codeph>ALTER TABLE</codeph> adds one new column.
|
||||
A single Impala query reads both the old and new data files, containing different numbers of columns.
|
||||
For any columns not present in a particular data file, all the column values are
|
||||
considered to be <codeph>NULL</codeph>.
|
||||
</p>
|
||||
|
||||
<codeblock rev="CDH-37178">
|
||||
create table t1 (x int);
|
||||
insert into t1 values (1), (2);
|
||||
|
||||
alter table t1 add columns (s string, t timestamp);
|
||||
insert into t1 values (3, 'three', now());
|
||||
|
||||
alter table t1 add columns (b boolean);
|
||||
insert into t1 values (4, 'four', now(), true);
|
||||
|
||||
select * from t1 order by x;
|
||||
+---+-------+-------------------------------+------+
|
||||
| x | s | t | b |
|
||||
+---+-------+-------------------------------+------+
|
||||
| 1 | NULL | NULL | NULL |
|
||||
| 2 | NULL | NULL | NULL |
|
||||
| 3 | three | 2016-05-11 11:19:45.054457000 | NULL |
|
||||
| 4 | four | 2016-05-11 11:20:20.260733000 | true |
|
||||
+---+-------+-------------------------------+------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
You might use the <codeph>CHANGE</codeph> clause to rename a single column, or to treat an existing column as
|
||||
a different type than before, such as to switch between treating a column as <codeph>STRING</codeph> and
|
||||
<codeph>TIMESTAMP</codeph>, or between <codeph>INT</codeph> and <codeph>BIGINT</codeph>. You can only drop a
|
||||
single column at a time; to drop multiple columns, issue multiple <codeph>ALTER TABLE</codeph> statements, or
|
||||
define the new set of columns with a single <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> statement.
|
||||
</p>
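
<p>
For example, here is a minimal sketch of renaming a column while keeping its original type;
the table and column names are hypothetical:
</p>

<codeblock>alter table t2 change x x_renamed int;</codeblock>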

<p rev="CDH-37178">
The following examples show some safe operations to drop or change columns. Dropping the final column
in a table lets Impala ignore the data without causing any disruption to existing data files. Changing the type
of a column works if existing data values can be safely converted to the new type. The type conversion
rules depend on the file format of the underlying table. For example, in a text table, the same value
can be interpreted as a <codeph>STRING</codeph> or a numeric value, while in a binary format such as
Parquet, the rules are stricter and type conversions only work between certain sizes of integers.
</p>

<codeblock rev="CDH-37178">
create table optional_columns (x int, y int, z int, a1 int, a2 int);
insert into optional_columns values (1,2,3,0,0), (2,3,4,100,100);

-- When the last column in the table is dropped, Impala ignores the
-- values that are no longer needed. (Dropping A1 but leaving A2
-- would cause problems, as we will see in a subsequent example.)
alter table optional_columns drop column a2;
alter table optional_columns drop column a1;

select * from optional_columns;
+---+---+---+
| x | y | z |
+---+---+---+
| 1 | 2 | 3 |
| 2 | 3 | 4 |
+---+---+---+
</codeblock>

<codeblock rev="CDH-37178">
create table int_to_string (s string, x int);
insert into int_to_string values ('one', 1), ('two', 2);

-- What was an INT column will now be interpreted as STRING.
-- This technique works for text tables but not other file formats.
-- The second X represents the new name of the column, which we keep the same.
alter table int_to_string change x x string;

-- Once the type is changed, we can insert non-integer values into the X column
-- and treat that column as a string, for example by uppercasing or concatenating.
insert into int_to_string values ('three', 'trois');
select s, upper(x) from int_to_string;
+-------+----------+
| s     | upper(x) |
+-------+----------+
| one   | 1        |
| two   | 2        |
| three | TROIS    |
+-------+----------+
</codeblock>

<p rev="CDH-37178">
Remember that Impala does not actually do any conversion for the underlying data files as a result of
<codeph>ALTER TABLE</codeph> statements. If you use <codeph>ALTER TABLE</codeph> to create a table
layout that does not agree with the contents of the underlying files, you must replace the files
yourself, such as using <codeph>LOAD DATA</codeph> to load a new set of data files, or
<codeph>INSERT OVERWRITE</codeph> to copy from another table and replace the original data.
</p>

<p rev="CDH-37178">
The following example shows what happens if you delete the middle column from a Parquet table containing three columns.
The underlying data files still contain three columns of data. Because the columns are interpreted based on their positions in
the data file instead of the specific column names, a <codeph>SELECT *</codeph> query now reads the first and second
columns from the data file, potentially leading to unexpected results or conversion errors.
For this reason, if you expect to someday drop a column, declare it as the last column in the table, where its data
can be ignored by queries after the column is dropped. Or, re-run your ETL process and create new data files
if you drop or change the type of a column in a way that causes problems with existing data files.
</p>

<codeblock rev="CDH-37178">
-- Parquet table showing how dropping a column can produce unexpected results.
create table p1 (s1 string, s2 string, s3 string) stored as parquet;

insert into p1 values ('one', 'un', 'uno'), ('two', 'deux', 'dos'),
  ('three', 'trois', 'tres');
select * from p1;
+-------+-------+------+
| s1    | s2    | s3   |
+-------+-------+------+
| one   | un    | uno  |
| two   | deux  | dos  |
| three | trois | tres |
+-------+-------+------+

alter table p1 drop column s2;
-- The S3 column contains unexpected results.
-- Because S2 and S3 have compatible types, the query reads
-- values from the dropped S2, because the existing data files
-- still contain those values as the second column.
select * from p1;
+-------+-------+
| s1    | s3    |
+-------+-------+
| one   | un    |
| two   | deux  |
| three | trois |
+-------+-------+
</codeblock>

<codeblock rev="CDH-37178">
-- Parquet table showing how dropping a column can produce conversion errors.
create table p2 (s1 string, x int, s3 string) stored as parquet;

insert into p2 values ('one', 1, 'uno'), ('two', 2, 'dos'), ('three', 3, 'tres');
select * from p2;
+-------+---+------+
| s1    | x | s3   |
+-------+---+------+
| one   | 1 | uno  |
| two   | 2 | dos  |
| three | 3 | tres |
+-------+---+------+

alter table p2 drop column x;
select * from p2;
WARNINGS:
File '<varname>hdfs_filename</varname>' has an incompatible Parquet schema for column 'add_columns.p2.s3'.
Column type: STRING, Parquet schema:
optional int32 x [i:1 d:1 r:0]

File '<varname>hdfs_filename</varname>' has an incompatible Parquet schema for column 'add_columns.p2.s3'.
Column type: STRING, Parquet schema:
optional int32 x [i:1 d:1 r:0]
</codeblock>

<p rev="IMPALA-3092">
In <keyword keyref="impala26_full"/> and higher, if an Avro table is created without column definitions in the
<codeph>CREATE TABLE</codeph> statement, and columns are later
added through <codeph>ALTER TABLE</codeph>, the resulting
table is now queryable. Missing values from the newly added
columns now default to <codeph>NULL</codeph>.
</p>

<p>
<b>To change the file format that Impala expects data to be in, for a table or partition:</b>
</p>

<p>
|
||||
Use an <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> clause. You can include an optional <codeph>PARTITION
|
||||
(<varname>col1</varname>=<varname>val1</varname>, <varname>col2</varname>=<varname>val2</varname>,
|
||||
...</codeph> clause so that the file format is changed for a specific partition rather than the entire table.
|
||||
</p>
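
<p>
For example, a minimal sketch of the table-level form (the table name is hypothetical):
</p>

<codeblock>-- From now on, Impala expects the data files for T1 to be in Parquet format.
alter table t1 set fileformat parquet;
</codeblock>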

<p>
Because this operation only changes the table metadata, you must do any conversion of existing data using
regular Hadoop techniques outside of Impala. Any new data created by the Impala <codeph>INSERT</codeph>
statement will be in the new format. You cannot specify the delimiter for Text files; the data files must be
comma-delimited.
<!-- Although Impala can read Avro tables
created through Hive, you cannot specify the Avro file format in an Impala
<codeph>ALTER TABLE</codeph> statement. -->
</p>

<p>
To set the file format for a single partition, include the <codeph>PARTITION</codeph> clause. Specify all the
same partitioning columns for the table, with a constant value for each, to precisely identify the single
partition affected by the statement:
</p>

<codeblock>create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Queries and INSERT statements will read and write files
-- in this format for this specific partition.
alter table p1 partition (month=2, day=2) set fileformat parquet;
</codeblock>

<p>
<b>To add or drop partitions for a table</b>, the table must already be partitioned (that is, created with a
<codeph>PARTITIONED BY</codeph> clause). The partition is a physical directory in HDFS, with a name that
encodes a particular column value (the <b>partition key</b>). The Impala <codeph>INSERT</codeph> statement
already creates the partition if necessary, so the <codeph>ALTER TABLE ... ADD PARTITION</codeph> is
primarily useful for importing data by moving or copying existing data files into the HDFS directory
corresponding to a partition. (You can use the <codeph>LOAD DATA</codeph> statement to move files into the
partition directory, or <codeph>ALTER TABLE ... PARTITION (...) SET LOCATION</codeph> to point a partition at
a directory that already contains data files.)
</p>

<p>
The <codeph>DROP PARTITION</codeph> clause is used to remove the HDFS directory and associated data files for
a particular set of partition key values; for example, if you always analyze the last 3 months' worth of data,
at the beginning of each month you might drop the oldest partition that is no longer needed. Removing
partitions reduces the amount of metadata associated with the table and the complexity of calculating the
optimal query plan, which can simplify and speed up queries on partitioned tables, particularly join queries.
Here is an example showing the <codeph>ADD PARTITION</codeph> and <codeph>DROP PARTITION</codeph> clauses.
</p>

<p>
To avoid errors while adding or dropping partitions whose existence is not certain,
add the optional <codeph>IF [NOT] EXISTS</codeph> clause between the <codeph>ADD</codeph> or
<codeph>DROP</codeph> keyword and the <codeph>PARTITION</codeph> keyword. That is, the entire
clause becomes <codeph>ADD IF NOT EXISTS PARTITION</codeph> or <codeph>DROP IF EXISTS PARTITION</codeph>.
The following example shows how partitions can be created automatically through <codeph>INSERT</codeph>
statements, or manually through <codeph>ALTER TABLE</codeph> statements. The <codeph>IF [NOT] EXISTS</codeph>
clauses let the <codeph>ALTER TABLE</codeph> statements succeed even if a new requested partition already
exists, or a partition to be dropped does not exist.
</p>

<p>
Inserting 2 year values creates 2 partitions:
</p>

<codeblock>
create table partition_t (s string) partitioned by (y int);
insert into partition_t (s,y) values ('two thousand',2000), ('nineteen ninety',1990);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2000  | -1    | 1      | 13B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 2      | 29B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
</codeblock>

<p>
Without the <codeph>IF NOT EXISTS</codeph> clause, an attempt to add a new partition might fail:
</p>

<codeblock>
alter table partition_t add partition (y=2000);
ERROR: AnalysisException: Partition spec already exists: (y=2000).
</codeblock>

<p>
The <codeph>IF NOT EXISTS</codeph> clause makes the statement succeed whether or not there was already a
partition with the specified key value:
</p>

<codeblock>
alter table partition_t add if not exists partition (y=2000);
alter table partition_t add if not exists partition (y=2010);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2000  | -1    | 1      | 13B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2010  | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 2      | 29B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
</codeblock>

<p>
Likewise, the <codeph>IF EXISTS</codeph> clause lets <codeph>DROP PARTITION</codeph> succeed whether or not the partition is already
in the table:
</p>

<codeblock>
alter table partition_t drop if exists partition (y=2000);
alter table partition_t drop if exists partition (y=1950);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2010  | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 1      | 16B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
</codeblock>

<p rev="2.3.0"> The optional <codeph>PURGE</codeph> keyword, available in
|
||||
<keyword keyref="impala23_full"/> and higher, is used with the <codeph>DROP
|
||||
PARTITION</codeph> clause to remove associated HDFS data files
|
||||
immediately rather than going through the HDFS trashcan mechanism. Use
|
||||
this keyword when dropping a partition if it is crucial to remove the data
|
||||
as quickly as possible to free up space, or if there is a problem with the
|
||||
trashcan, such as the trash cannot being configured or being in a
|
||||
different HDFS encryption zone than the data files. </p>
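
<p rev="2.3.0">
For example, the following sketch (using the <codeph>PART_T</codeph> table from the
next example) skips the trashcan entirely:
</p>

<codeblock rev="2.3.0">-- The data files under the month=1 partition directory are deleted
-- immediately, not moved to the HDFS trashcan.
alter table part_t drop partition (month=1) purge;
</codeblock>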

<!--
To do: Make example more general by partitioning by year/month/day.
Then could show inserting into fixed year, variable month and day;
dropping particular year/month/day partition.
-->

<codeblock>-- Create an empty table and define the partitioning scheme.
create table part_t (x int) partitioned by (month int);
-- Create an empty partition into which you could copy data files from some other source.
alter table part_t add partition (month=1);
-- After changing the underlying data, issue a REFRESH statement to make the data visible in Impala.
refresh part_t;
-- Later, do the same for the next month.
alter table part_t add partition (month=2);

-- Now you no longer need the older data.
alter table part_t drop partition (month=1);
-- If the table was partitioned by month and year, you would issue a statement like:
-- alter table part_t drop partition (year=2003,month=1);
-- which would require 12 ALTER TABLE statements to remove a year's worth of data.

-- If the data files for subsequent months were in a different file format,
-- you could add the new partition and then set a different file format for it.
alter table part_t add partition (month=3);
alter table part_t partition (month=3) set fileformat parquet;
</codeblock>


<p>
The value specified for a partition key can be an arbitrary constant expression, without any references to
columns. For example:
</p>

<codeblock>alter table time_data add partition (month=concat('Decem','ber'));
alter table sales_data add partition (zipcode = cast(9021 * 10 as string));</codeblock>

<note>
<p>
An alternative way to reorganize a table and its associated data files is to use <codeph>CREATE
TABLE</codeph> to create a variation of the original table, then use <codeph>INSERT</codeph> to copy the
transformed or reordered data to the new table. The advantage of <codeph>ALTER TABLE</codeph> is that it
avoids making a duplicate copy of the data files, allowing you to reorganize huge volumes of data in a
space-efficient way using familiar Hadoop techniques.
</p>
</note>
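
<p>
As a sketch, the copy-based alternative looks like the following (the table and
column names are hypothetical):
</p>

<codeblock>-- Copy the transformed or reordered data into a new table, temporarily
-- doubling the storage used by the data files.
create table sales_reordered stored as parquet as
  select id, region, amount from sales;
-- After verifying the copy, drop the original table.
drop table sales;
</codeblock>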

<p>
<b>To switch a table between internal and external:</b>
</p>

<p conref="../shared/impala_common.xml#common/switch_internal_external_table"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
Most <codeph>ALTER TABLE</codeph> clauses do not actually
read or write any HDFS files, and so do not depend on
specific HDFS permissions. For example, the <codeph>SET FILEFORMAT</codeph>
clause does not actually check the file format of existing data files or
convert them to the new format, and the <codeph>SET LOCATION</codeph> clause
does not require any special permissions on the new location.
(Any permission-related failures would come later, when you
actually query or insert into the table.)
</p>
<!-- Haven't rigorously tested all the assertions in the following paragraph. -->
<!-- Most testing so far has been around RENAME TO clause. -->
<p>
In general, <codeph>ALTER TABLE</codeph> clauses that do touch
HDFS files and directories require the same HDFS permissions
as corresponding <codeph>CREATE</codeph>, <codeph>INSERT</codeph>,
or <codeph>SELECT</codeph> statements.
The permissions must allow
the user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, to read or write
files or directories, or (in the case of the execute bit) descend into a directory.
The <codeph>RENAME TO</codeph> clause requires read, write, and execute permission in the
source and destination database directories and in the table data directory,
and read and write permission for the data files within the table.
The <codeph>ADD PARTITION</codeph> and <codeph>DROP PARTITION</codeph> clauses
require write and execute permissions for the associated partition directory.
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_tables.xml#tables"/>,
<xref href="impala_create_table.xml#create_table"/>, <xref href="impala_drop_table.xml#drop_table"/>,
<xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
<xref href="impala_tables.xml#external_tables"/>
</p>
</conbody>
</concept>
86
docs/topics/impala_alter_view.xml
Normal file
@@ -0,0 +1,86 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="alter_view">

<title>ALTER VIEW Statement</title>
<titlealts audience="PDF"><navtitle>ALTER VIEW</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Views"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">ALTER VIEW statement</indexterm>
Changes the characteristics of a view. The syntax has two forms:
</p>

<ul>
<li>
The <codeph>AS</codeph> clause associates the view with a different query.
</li>
<li>
The <codeph>RENAME TO</codeph> clause changes the name of the view, moves the view to
a different database, or both.
</li>
</ul>

<p>
Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
<codeph>ALTER VIEW</codeph> only involves changes to metadata in the metastore database, not any data files
in HDFS.
</p>

<!-- View _permissions_ don't rely on underlying table. -->

<!-- Could use views to grant access only to certain columns. -->

<!-- Treated like a table for authorization. -->

<!-- ALTER VIEW that queries another view - possibly a runtime error. -->

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>ALTER VIEW [<varname>database_name</varname>.]<varname>view_name</varname> AS <varname>select_statement</varname>
ALTER VIEW [<varname>database_name</varname>.]<varname>view_name</varname> RENAME TO [<varname>database_name</varname>.]<varname>view_name</varname></codeblock>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>create table t1 (x int, y int, s string);
create table t2 like t1;
create view v1 as select * from t1;
alter view v1 as select * from t2;
alter view v1 as select x, upper(s) s from t2;</codeblock>
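
<p>
The <codeph>RENAME TO</codeph> form works the same way. For example (the
<codeph>reporting</codeph> database is hypothetical):
</p>

<codeblock>-- Rename the view within the same database.
alter view v1 rename to v2;
-- Move it to another database, renaming it again at the same time.
alter view v2 rename to reporting.v3;</codeblock>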

<!-- Repeat the same blurb + example to see the definition of a view, as in CREATE VIEW. -->

<p conref="../shared/impala_common.xml#common/describe_formatted_view"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_views.xml#views"/>, <xref href="impala_create_view.xml#create_view"/>,
<xref href="impala_drop_view.xml#drop_view"/>
</p>
</conbody>
</concept>
1739
docs/topics/impala_analytic_functions.xml
Normal file
File diff suppressed because it is too large
81
docs/topics/impala_appx_count_distinct.xml
Normal file
@@ -0,0 +1,81 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="appx_count_distinct">

<title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.0.0">
<indexterm audience="Cloudera">APPX_COUNT_DISTINCT query option</indexterm>
Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
approximate rather than precise.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>

<p conref="../shared/impala_common.xml#common/default_false_0"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> query option lets you work around the restriction
where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
column. By default, you can count the distinct values of one column or another, but not both in a single
query:
</p>

<codeblock>[localhost:21000] > select count(distinct x) from int_t;
+-------------------+
| count(distinct x) |
+-------------------+
| 10                |
+-------------------+
[localhost:21000] > select count(distinct property) from int_t;
+--------------------------+
| count(distinct property) |
+--------------------------+
| 7                        |
+--------------------------+
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
as count(DISTINCT x); deviating function: count(DISTINCT property)
</codeblock>

<p>
When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, the query with multiple
<codeph>COUNT(DISTINCT)</codeph> expressions works. The reason this behavior requires a query option is that each
<codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
which provides an approximate result rather than a precise count.
</p>

<codeblock>[localhost:21000] > set APPX_COUNT_DISTINCT=true;
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
+-------------------+--------------------------+
| count(distinct x) | count(distinct property) |
+-------------------+--------------------------+
| 10                | 7                        |
+-------------------+--------------------------+
</codeblock>

<p conref="../shared/impala_common.xml#common/related_info"/>
|
||||
|
||||
<p>
|
||||
<xref href="impala_count.xml#count"/>,
|
||||
<xref href="impala_distinct.xml#distinct"/>,
|
||||
<xref href="impala_ndv.xml#ndv"/>
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
</concept>
|
||||
124
docs/topics/impala_appx_median.xml
Normal file
@@ -0,0 +1,124 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2.1" id="appx_median">

<title>APPX_MEDIAN Function</title>
<titlealts audience="PDF"><navtitle>APPX_MEDIAN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">appx_median() function</indexterm>
An aggregate function that returns a value that is approximately the median (midpoint) of values in the set
of input values.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>APPX_MEDIAN([DISTINCT | ALL] <varname>expression</varname>)
</codeblock>

<p>
This function works with any input type, because the only requirement is that the type supports less-than and
greater-than comparison operators.
</p>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
|
||||
<p>
|
||||
Because the return value represents the estimated midpoint, it might not reflect the precise midpoint value,
|
||||
especially if the cardinality of the input values is very high. If the cardinality is low (up to
|
||||
approximately 20,000), the result is more accurate because the sampling considers all or almost all of the
|
||||
different values.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/return_type_same_except_string"/>
|
||||
|
||||
<p>
|
||||
The return value is always the same as one of the input values, not an <q>in-between</q> value produced by
|
||||
averaging.
|
||||
</p>
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/restrictions_sliding_window"/> -->

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
The following example uses a table of a million random floating-point numbers ranging up to approximately
50,000. The average is approximately 25,000. Because of the random distribution, we would expect the median
to be close to this same number. Computing the precise median is a more intensive operation than computing
the average, because it requires keeping track of every distinct value and how many times each occurs. The
<codeph>APPX_MEDIAN()</codeph> function uses a sampling algorithm to return an approximate result, which in
this case is close to the expected value. To make sure that the value is not substantially out of range due
to a skewed distribution, subsequent queries confirm that there are approximately 500,000 values higher than
the <codeph>APPX_MEDIAN()</codeph> value, and approximately 500,000 values lower than the
<codeph>APPX_MEDIAN()</codeph> value.
</p>

<codeblock>[localhost:21000] > select min(x), max(x), avg(x) from million_numbers;
+-------------------+-------------------+-------------------+
| min(x)            | max(x)            | avg(x)            |
+-------------------+-------------------+-------------------+
| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
+-------------------+-------------------+-------------------+
[localhost:21000] > select appx_median(x) from million_numbers;
+----------------+
| appx_median(x) |
+----------------+
| 24721.6        |
+----------------+
[localhost:21000] > select count(x) as higher from million_numbers where x > (select appx_median(x) from million_numbers);
+--------+
| higher |
+--------+
| 502013 |
+--------+
[localhost:21000] > select count(x) as lower from million_numbers where x < (select appx_median(x) from million_numbers);
+--------+
| lower  |
+--------+
| 497987 |
+--------+
</codeblock>

<p>
The following example computes the approximate median using a subset of the values from the table, and then
confirms that the result is a reasonable estimate for the midpoint.
</p>

<codeblock>[localhost:21000] > select appx_median(x) from million_numbers where x between 1000 and 5000;
+-------------------+
| appx_median(x)    |
+-------------------+
| 3013.107787358159 |
+-------------------+
[localhost:21000] > select count(x) as higher from million_numbers where x between 1000 and 5000 and x > 3013.107787358159;
+--------+
| higher |
+--------+
| 37692  |
+--------+
[localhost:21000] > select count(x) as lower from million_numbers where x between 1000 and 5000 and x < 3013.107787358159;
+-------+
| lower |
+-------+
| 37089 |
+-------+
</codeblock>
</conbody>
</concept>
269
docs/topics/impala_array.xml
Normal file
@@ -0,0 +1,269 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="array">

<title>ARRAY Complex Type (<keyword keyref="impala23"/> or higher only)</title>

<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
A complex data type that can represent an arbitrary number of ordered elements.
The elements can be scalars or another complex type (<codeph>ARRAY</codeph>,
<codeph>STRUCT</codeph>, or <codeph>MAP</codeph>).
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<!-- To do: make sure there is sufficient syntax info under the SELECT statement to understand how to query all the complex types. -->

<codeblock><varname>column_name</varname> ARRAY < <varname>type</varname> >

type ::= <varname>primitive_type</varname> | <varname>complex_type</varname>
</codeblock>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p conref="../shared/impala_common.xml#common/complex_types_combo"/>

<p>
The elements of the array have no names. You refer to the value of the array item using the
<codeph>ITEM</codeph> pseudocolumn, or its position in the array with the <codeph>POS</codeph>
pseudocolumn. See <xref href="impala_complex_types.xml#item"/> for information about
these pseudocolumns.
</p>

<!-- Array is a frequently used idiom; don't recommend MAP right up front, since that is more rarely used. STRUCT has all different considerations.
<p>
If it would be logical to have a fixed number of elements and give each one a name, consider using a
<codeph>MAP</codeph> (when all elements are of the same type) or a <codeph>STRUCT</codeph> (if different
elements have different types) instead of an <codeph>ARRAY</codeph>.
</p>
-->

<p>
Each row can have a different number of elements (including none) in the array for that row.
</p>

<!-- Since you don't use numeric indexes, this assertion and advice doesn't make sense.
<p>
If you attempt to refer to a non-existent array element, the result is <codeph>NULL</codeph>. Therefore,
when using operations such as addition or string concatenation involving array elements, you might use
conditional functions to substitute default values such as 0 or <codeph>""</codeph> in the place of missing
array elements.
</p>
-->

<p>
When an array contains items of scalar types, you can use aggregation functions on the array elements without using join notation. For
example, you can find the <codeph>COUNT()</codeph>, <codeph>AVG()</codeph>, <codeph>SUM()</codeph>, and so on of numeric array
elements, or the <codeph>MAX()</codeph> and <codeph>MIN()</codeph> of any scalar array elements by referring to
<codeph><varname>table_name</varname>.<varname>array_column</varname></codeph> in the <codeph>FROM</codeph> clause of the query. When
you need to cross-reference values from the array with scalar values from the same row, such as by including a <codeph>GROUP
BY</codeph> clause to produce a separate aggregated result for each row, then the join clause is required.
</p>
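
<p>
For example, a sketch of both forms, using the <codeph>ARRAY_DEMO</codeph> table
defined later in this topic:
</p>

<codeblock>-- Count all array elements across all rows, with no explicit join notation.
SELECT COUNT(item) FROM array_demo.pets;

-- Produce one aggregate row per person by joining the table with its array column.
SELECT name, COUNT(pets.item) AS num_pets
  FROM array_demo, array_demo.pets
  GROUP BY name;
</codeblock>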

<p>
A common usage pattern with complex types is to have an array as the top-level type for the column:
an array of structs, an array of maps, or an array of arrays.
For example, you can model a denormalized table by creating a column that is an <codeph>ARRAY</codeph>
of <codeph>STRUCT</codeph> elements; each item in the array represents a row from a table that would
normally be used in a join query. This kind of data structure lets you essentially denormalize tables by
associating multiple rows from one table with the matching row in another table.
</p>

<p>
You typically do not create more than one top-level <codeph>ARRAY</codeph> column, because if there is
some relationship between the elements of multiple arrays, it is convenient to model the data as
an array of another complex type element (either <codeph>STRUCT</codeph> or <codeph>MAP</codeph>).
</p>

<p conref="../shared/impala_common.xml#common/complex_types_describe"/>

<p conref="../shared/impala_common.xml#common/added_in_230"/>

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<ul conref="../shared/impala_common.xml#common/complex_types_restrictions">
<li/>
</ul>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>

<p>
The following example shows how to construct a table with various kinds of <codeph>ARRAY</codeph> columns,
both at the top level and nested within other complex types.
Whenever the <codeph>ARRAY</codeph> consists of scalar values, such as in the <codeph>PETS</codeph>
column or the <codeph>CHILDREN</codeph> field, you can see that future expansion is limited.
For example, you could not easily evolve the schema to record the kind of pet or the child's birthday alongside the name.
Therefore, it is more common to use an <codeph>ARRAY</codeph> whose elements are of <codeph>STRUCT</codeph> type,
to associate multiple fields with each array element.
</p>

<note>
Practice the <codeph>CREATE TABLE</codeph> and query notation for complex type columns
using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements reliably.
</note>

<!-- To do: verify and flesh out this example. -->

<codeblock><![CDATA[CREATE TABLE array_demo
(
  id BIGINT,
  name STRING,
  -- An ARRAY of scalar type as a top-level column.
  pets ARRAY <STRING>,

  -- An ARRAY with elements of complex type (STRUCT).
  places_lived ARRAY < STRUCT <
    place: STRING,
    start_year: INT
  >>,

  -- An ARRAY as a field (CHILDREN) within a STRUCT.
  -- (The STRUCT is inside another ARRAY, because it is rare
  -- for a STRUCT to be a top-level column.)
  marriages ARRAY < STRUCT <
    spouse: STRING,
    children: ARRAY <STRING>
  >>,

  -- An ARRAY as the value part of a MAP.
  -- The first MAP field (the key) would be a value such as
  -- 'Parent' or 'Grandparent', and the corresponding array would
  -- represent 2 parents, 4 grandparents, and so on.
  ancestors MAP < STRING, ARRAY <STRING> >
)
STORED AS PARQUET;
]]>
</codeblock>

<p>
The following example shows how to examine the structure of a table containing one or more <codeph>ARRAY</codeph> columns by using the
<codeph>DESCRIBE</codeph> statement. You can visualize each <codeph>ARRAY</codeph> as its own two-column table, with columns
<codeph>ITEM</codeph> and <codeph>POS</codeph>.
</p>

<!-- To do: extend the examples to include MARRIAGES and ANCESTORS columns, or get rid of those columns. -->

<codeblock><![CDATA[DESCRIBE array_demo;
+--------------+---------------------------+
| name         | type                      |
+--------------+---------------------------+
| id           | bigint                    |
| name         | string                    |
| pets         | array<string>             |
| marriages    | array<struct<             |
|              |   spouse:string,          |
|              |   children:array<string>  |
|              | >>                        |
| places_lived | array<struct<             |
|              |   place:string,           |
|              |   start_year:int          |
|              | >>                        |
| ancestors    | map<string,array<string>> |
+--------------+---------------------------+

DESCRIBE array_demo.pets;
+------+--------+
| name | type   |
+------+--------+
| item | string |
| pos  | bigint |
+------+--------+

DESCRIBE array_demo.marriages;
+------+--------------------------+
| name | type                     |
+------+--------------------------+
| item | struct<                  |
|      |   spouse:string,         |
|      |   children:array<string> |
|      | >                        |
| pos  | bigint                   |
+------+--------------------------+

DESCRIBE array_demo.places_lived;
+------+------------------+
| name | type             |
+------+------------------+
| item | struct<          |
|      |   place:string,  |
|      |   start_year:int |
|      | >                |
| pos  | bigint           |
+------+------------------+

DESCRIBE array_demo.ancestors;
+-------+---------------+
| name  | type          |
+-------+---------------+
| key   | string        |
| value | array<string> |
+-------+---------------+
]]>
</codeblock>

<p>
The following example shows queries involving <codeph>ARRAY</codeph> columns containing elements of scalar or complex types. You
<q>unpack</q> each <codeph>ARRAY</codeph> column by referring to it in a join query, as if it were a separate table with
<codeph>ITEM</codeph> and <codeph>POS</codeph> columns. If the array element is a scalar type, you refer to its value using the
<codeph>ITEM</codeph> pseudocolumn. If the array element is a <codeph>STRUCT</codeph>, you refer to the <codeph>STRUCT</codeph> fields
using dot notation and the field names. If the array element is another <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>, you use
another level of join to unpack the nested collection elements.
</p>

<!-- To do: have some sample output to show for these queries. -->

<codeblock><![CDATA[-- Array of scalar values.
-- Each array element represents a single string, plus we know its position in the array.
SELECT id, name, pets.pos, pets.item FROM array_demo, array_demo.pets;

-- Array of structs.
-- Now each array element has named fields, possibly of different types.
-- You can consider an ARRAY of STRUCT to represent a table inside another table.
SELECT id, name, places_lived.pos, places_lived.item.place, places_lived.item.start_year
  FROM array_demo, array_demo.places_lived;

-- The .ITEM name is optional for array elements that are structs.
-- The following query is equivalent to the previous one, with .ITEM
-- removed from the column references.
SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
  FROM array_demo, array_demo.places_lived;

-- To filter specific items from the array, do comparisons against the .POS or .ITEM
-- pseudocolumns, or names of struct fields, in the WHERE clause.
SELECT id, name, pets.item FROM array_demo, array_demo.pets
  WHERE pets.pos in (0, 1, 3);

SELECT id, name, pets.item FROM array_demo, array_demo.pets
  WHERE pets.item LIKE 'Mr. %';

SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
  FROM array_demo, array_demo.places_lived
  WHERE places_lived.place like '%California%';
]]>
</codeblock>

<p conref="../shared/impala_common.xml#common/related_info"/>
|
||||
|
||||
<p>
|
||||
<xref href="impala_complex_types.xml#complex_types"/>,
|
||||
<!-- <xref href="impala_array.xml#array"/>, -->
|
||||
<xref href="impala_struct.xml#struct"/>, <xref href="impala_map.xml#map"/>
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
260
docs/topics/impala_auditing.xml
Normal file
@@ -0,0 +1,260 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="auditing">

<title>Auditing Impala Operations</title>
<titlealts audience="PDF"><navtitle>Auditing</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Auditing"/>
<data name="Category" value="Governance"/>
<data name="Category" value="Navigator"/>
<data name="Category" value="Security"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>

<conbody>

<p>
To monitor how Impala data is being used within your organization, ensure that your Impala authorization and
authentication policies are effective, and detect attempts at intrusion or unauthorized access to Impala
data, you can use the auditing feature in Impala 1.2.1 and higher:
</p>

<ul>
<li>
Enable auditing by including the option <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
in your <cmdname>impalad</cmdname> startup options for a cluster not managed by Cloudera Manager, or
<xref audience="integrated" href="cn_iu_audit_log.xml#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d6f/section_v25_lmy_bn">configuring Impala Daemon logging in Cloudera Manager</xref><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cn_iu_service_audit.html" scope="external" format="html">configuring Impala Daemon logging in Cloudera Manager</xref>.
The log directory must be a local directory on the
server, not an HDFS directory. (See the sketch after this list.)
</li>

<li>
Decide how many queries will be represented in each log file. By default, Impala starts a new log file
every 5000 queries. To specify a different number, <ph audience="standalone">include
the option <codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph> in the
<cmdname>impalad</cmdname> startup
options</ph><xref href="cn_iu_audit_log.xml#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d6f/section_v25_lmy_bn" audience="integrated">configure
Impala Daemon logging in Cloudera Manager</xref>.
</li>

<li> Configure Cloudera Navigator to collect and consolidate the audit
logs from all the hosts in the cluster. </li>

<li>
Use Cloudera Navigator or Cloudera Manager to filter, visualize, and produce reports based on the audit
data. (The Impala auditing feature works with Cloudera Manager 4.7 to 5.1 and Cloudera Navigator 2.1 and
higher.) Check the audit data to ensure that all activity is authorized and detect attempts at
unauthorized access.
</li>
</ul>
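
<p>
As a sketch, the relevant startup flags on a cluster not managed by Cloudera Manager
might look like the following (the directory path is an example, not a recommendation):
</p>

<codeblock>impalad -audit_event_log_dir=/var/log/impala/audit \
        -max_audit_event_log_file_size=5000 \
        <varname>other_startup_options</varname>
</codeblock>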

<p outputclass="toc inpage"/>
</conbody>

<concept id="auditing_performance">
|
||||
|
||||
<title>Durability and Performance Considerations for Impala Auditing</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Performance"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The auditing feature only imposes performance overhead while auditing is enabled.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Because any Impala host can process a query, enable auditing on all hosts where the
|
||||
<ph audience="standalone"><cmdname>impalad</cmdname> daemon</ph>
|
||||
<ph audience="integrated">Impala Daemon role</ph> runs. Each host stores its own log
|
||||
files, in a directory in the local filesystem. The log data is periodically flushed to disk (through an
|
||||
<codeph>fsync()</codeph> system call) to avoid loss of audit data in case of a crash.
|
||||
</p>
|
||||
|
||||
<p> The runtime overhead of auditing applies to whichever host serves as the coordinator for the query, that is, the host you connect to when you issue the query. This might be the same host for all queries, or different applications or users might connect to and issue queries through different hosts. </p>
|
||||
|
||||
<p> To avoid excessive I/O overhead on busy coordinator hosts, Impala syncs the audit log data (using the <codeph>fsync()</codeph> system call) periodically rather than after every query. Currently, the <codeph>fsync()</codeph> calls are issued at a fixed interval, every 5 seconds. </p>
|
||||
|
||||
<p>
|
||||
By default, Impala avoids losing any audit log data in the case of an error during a logging operation
|
||||
(such as a disk full error), by immediately shutting down
|
||||
<cmdname audience="standalone">impalad</cmdname><ph audience="integrated">the Impala
|
||||
Daemon role</ph> on the host where the auditing problem occurred.
|
||||
<ph audience="standalone">You can override this setting by specifying the option
|
||||
<codeph>-abort_on_failed_audit_event=false</codeph> in the <cmdname>impalad</cmdname> startup options.</ph>
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="auditing_format">
|
||||
|
||||
<title>Format of the Audit Log Files</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Logs"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p> The audit log files represent the query information in JSON format, one query per line. Typically, rather than looking at the log files themselves, you use the Cloudera Navigator product to consolidate the log data from all Impala hosts and filter and visualize the results in useful ways. (If you do examine the raw log data, you might run the files through a JSON pretty-printer first.) </p>
|
||||
|
||||
<p>
|
||||
All the information about schema objects accessed by the query is encoded in a single nested record on the
|
||||
same line. For example, the audit log for an <codeph>INSERT ... SELECT</codeph> statement records that a
|
||||
select operation occurs on the source table and an insert operation occurs on the destination table. The
|
||||
audit log for a query against a view records the base table accessed by the view, or multiple base tables
|
||||
in the case of a view that includes a join query. Every Impala operation that corresponds to a SQL
|
||||
statement is recorded in the audit logs, whether the operation succeeds or fails. Impala records more
|
||||
information for a successful operation than for a failed one, because an unauthorized query is stopped
|
||||
immediately, before all the query planning is completed.
|
||||
</p>
|
||||
|
||||
<!-- Opportunity to conref at the phrase level here... the content of this paragraph is the same as part
|
||||
of a list bullet earlier on. -->
|
||||
|
||||
<p>
|
||||
The information logged for each query includes:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
Client session state:
|
||||
<ul>
|
||||
<li>
|
||||
Session ID
|
||||
</li>
|
||||
|
||||
<li>
|
||||
User name
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Network address of the client connection
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
SQL statement details:
|
||||
<ul>
|
||||
<li>
|
||||
Query ID
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Statement Type - DML, DDL, and so on
|
||||
</li>
|
||||
|
||||
<li>
|
||||
SQL statement text
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Execution start time, in local time
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Execution Status - Details on any errors that were encountered
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Target Catalog Objects:
|
||||
<ul>
|
||||
<li>
|
||||
Object Type - Table, View, or Database
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Fully qualified object name
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Privilege - How the object is being used (<codeph>SELECT</codeph>, <codeph>INSERT</codeph>,
|
||||
<codeph>CREATE</codeph>, and so on)
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<!-- Delegating actual examples to the Cloudera Navigator doc for the moment.
|
||||
<p>
|
||||
Here is an excerpt from a sample audit log file:
|
||||
</p>
|
||||
<codeblock></codeblock>
|
||||
-->
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="auditing_exceptions">
|
||||
|
||||
<title>Which Operations Are Audited</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The kinds of SQL queries represented in the audit log are:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
Queries that are prevented due to lack of authorization.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Queries that Impala can analyze and parse to determine that they are authorized. The audit data is
|
||||
recorded immediately after Impala finishes its analysis, before the query is actually executed.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a
|
||||
query that fails due to a syntax error is not recorded in the audit log. The audit log also does not
|
||||
contain queries that fail due to a reference to a table that does not exist, if you would be authorized to
|
||||
access the table if it did exist.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Certain statements in the <cmdname>impala-shell</cmdname> interpreter, such as <codeph>CONNECT</codeph>,
|
||||
<codeph rev="1.4.0">SUMMARY</codeph>, <codeph>PROFILE</codeph>, <codeph>SET</codeph>, and
|
||||
<codeph>QUIT</codeph>, do not correspond to actual SQL queries, and these statements are not reflected in
|
||||
the audit log.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="auditing_reviewing">
|
||||
|
||||
<title>Reviewing the Audit Logs</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Logs"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
You typically do not review the audit logs in raw form. The Cloudera Manager Agent periodically transfers
|
||||
the log information into a back-end database where it can be examined in consolidated form. See
|
||||
<ph audience="standalone">the <xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/Navigator/latest/Cloudera-Navigator-Installation-and-User-Guide/Cloudera-Navigator-Installation-and-User-Guide.html"
|
||||
scope="external" format="html">Cloudera Navigator documentation</xref> for details</ph>
|
||||
<xref href="cn_iu_audits.xml#cn_topic_7" audience="integrated" />.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
</concept>
|
||||
39
docs/topics/impala_authentication.xml
Normal file
@@ -0,0 +1,39 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="authentication">

<title>Impala Authentication</title>
<prolog>
<metadata>
<data name="Category" value="Security"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Authentication"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>

<conbody>

<p>
Authentication is the mechanism to ensure that only specified hosts and users can connect to Impala. It also
verifies that when clients connect to Impala, they are connected to a legitimate server. This feature
prevents spoofing such as <term>impersonation</term> (setting up a phony client system with the same account
and group names as a legitimate user) and <term>man-in-the-middle attacks</term> (intercepting application
requests before they reach Impala and eavesdropping on sensitive information in the requests or the results).
</p>

<p>
Impala supports authentication using either Kerberos or LDAP.
</p>

<note conref="../shared/impala_common.xml#common/authentication_vs_authorization"/>

<p outputclass="toc"/>

<p>
Once you are finished setting up authentication, move on to authorization, which involves specifying what
databases, tables, HDFS directories, and so on can be accessed by particular users when they connect through
Impala. See <xref href="impala_authorization.xml#authorization"/> for details.
</p>
</conbody>
</concept>

1621
docs/topics/impala_authorization.xml
Normal file
File diff suppressed because it is too large
225
docs/topics/impala_avg.xml
Normal file
@@ -0,0 +1,225 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="avg">

<title>AVG Function</title>
<titlealts audience="PDF"><navtitle>AVG</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Analytic Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">avg() function</indexterm>
An aggregate function that returns the average value from a set of numbers or <codeph>TIMESTAMP</codeph> values.
Its single argument can be a numeric column, or the numeric result of a function or expression applied to the
column value. Rows with a <codeph>NULL</codeph> value for the specified column are ignored. If the table is empty,
or all the values supplied to <codeph>AVG</codeph> are <codeph>NULL</codeph>, <codeph>AVG</codeph> returns
<codeph>NULL</codeph>.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
|
||||
<codeblock>AVG([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of
|
||||
grouping values.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<b>Return type:</b> <codeph>DOUBLE</codeph> for numeric values; <codeph>TIMESTAMP</codeph> for
|
||||
<codeph>TIMESTAMP</codeph> values
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<codeblock>-- Average all the non-NULL values in a column.
insert overwrite avg_t values (2),(4),(6),(null),(null);
-- The average of the above values is 4: (2+4+6) / 3. The 2 NULL values are ignored.
select avg(x) from avg_t;
-- Average only certain values from the column.
select avg(x) from t1 where month = 'January' and year = '2013';
-- Apply a calculation to the value of the column before averaging.
select avg(x/3) from t1;
-- Apply a function to the value of the column before averaging.
-- Here we are substituting a value of 0 for all NULLs in the column,
-- so that those rows do factor into the return value.
select avg(isnull(x,0)) from t1;
-- Apply some number-returning function to a string column and average the results.
-- If column s contains any NULLs, length(s) also returns NULL and those rows are ignored.
select avg(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, avg(page_visits) from web_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select avg(distinct x) from t1;
-- Filter the output after performing the calculation.
select avg(x) from t1 group by y having avg(x) between 1 and 20;
</codeblock>

<p rev="2.0.0">
|
||||
The following examples show how to use <codeph>AVG()</codeph> in an analytic context. They use a table
|
||||
containing integers from 1 to 10. Notice how the <codeph>AVG()</codeph> is reported for each input value, as
|
||||
opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set.
|
||||
<codeblock>select x, property, avg(x) over (partition by property) as avg from int_t where property in ('odd','even');
|
||||
+----+----------+-----+
|
||||
| x | property | avg |
|
||||
+----+----------+-----+
|
||||
| 2 | even | 6 |
|
||||
| 4 | even | 6 |
|
||||
| 6 | even | 6 |
|
||||
| 8 | even | 6 |
|
||||
| 10 | even | 6 |
|
||||
| 1 | odd | 5 |
|
||||
| 3 | odd | 5 |
|
||||
| 5 | odd | 5 |
|
||||
| 7 | odd | 5 |
|
||||
| 9 | odd | 5 |
|
||||
+----+----------+-----+
|
||||
</codeblock>
|
||||
|
||||
Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving
set of rows (the <q>window</q>). The following examples use <codeph>AVG()</codeph> in an analytic context
(that is, with an <codeph>OVER()</codeph> clause) to produce a running average of all the even values,
then a running average of all the odd values. The basic <codeph>ORDER BY x</codeph> clause implicitly
activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
therefore all of these examples produce the same results:
<codeblock>select x, property,
|
||||
avg(x) over (partition by property <b>order by x</b>) as 'cumulative average'
|
||||
from int_t where property in ('odd','even');
|
||||
+----+----------+--------------------+
|
||||
| x | property | cumulative average |
|
||||
+----+----------+--------------------+
|
||||
| 2 | even | 2 |
|
||||
| 4 | even | 3 |
|
||||
| 6 | even | 4 |
|
||||
| 8 | even | 5 |
|
||||
| 10 | even | 6 |
|
||||
| 1 | odd | 1 |
|
||||
| 3 | odd | 2 |
|
||||
| 5 | odd | 3 |
|
||||
| 7 | odd | 4 |
|
||||
| 9 | odd | 5 |
|
||||
+----+----------+--------------------+
|
||||
|
||||
select x, property,
|
||||
avg(x) over
|
||||
(
|
||||
partition by property
|
||||
<b>order by x</b>
|
||||
<b>range between unbounded preceding and current row</b>
|
||||
) as 'cumulative average'
|
||||
from int_t where property in ('odd','even');
|
||||
+----+----------+--------------------+
|
||||
| x | property | cumulative average |
|
||||
+----+----------+--------------------+
|
||||
| 2 | even | 2 |
|
||||
| 4 | even | 3 |
|
||||
| 6 | even | 4 |
|
||||
| 8 | even | 5 |
|
||||
| 10 | even | 6 |
|
||||
| 1 | odd | 1 |
|
||||
| 3 | odd | 2 |
|
||||
| 5 | odd | 3 |
|
||||
| 7 | odd | 4 |
|
||||
| 9 | odd | 5 |
|
||||
+----+----------+--------------------+
|
||||
|
||||
select x, property,
|
||||
avg(x) over
|
||||
(
|
||||
partition by property
|
||||
<b>order by x</b>
|
||||
<b>rows between unbounded preceding and current row</b>
|
||||
) as 'cumulative average'
|
||||
from int_t where property in ('odd','even');
|
||||
+----+----------+--------------------+
|
||||
| x | property | cumulative average |
|
||||
+----+----------+--------------------+
|
||||
| 2 | even | 2 |
|
||||
| 4 | even | 3 |
|
||||
| 6 | even | 4 |
|
||||
| 8 | even | 5 |
|
||||
| 10 | even | 6 |
|
||||
| 1 | odd | 1 |
|
||||
| 3 | odd | 2 |
|
||||
| 5 | odd | 3 |
|
||||
| 7 | odd | 4 |
|
||||
| 9 | odd | 5 |
|
||||
+----+----------+--------------------+
|
||||
</codeblock>

The following examples show how to construct a moving window, with a running average taking into account 1 row before
and 1 row after the current row, within the same partition (all the even values or all the odd values).
Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of
moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph>
clause:
<codeblock>select x, property,
  avg(x) over
  (
    partition by property
    <b>order by x</b>
    <b>rows between 1 preceding and 1 following</b>
  ) as 'moving average'
  from int_t where property in ('odd','even');
+----+----------+----------------+
| x  | property | moving average |
+----+----------+----------------+
| 2  | even     | 3              |
| 4  | even     | 4              |
| 6  | even     | 6              |
| 8  | even     | 8              |
| 10 | even     | 9              |
| 1  | odd      | 2              |
| 3  | odd      | 3              |
| 5  | odd      | 5              |
| 7  | odd      | 7              |
| 9  | odd      | 8              |
+----+----------+----------------+

-- Doesn't work because of syntax restriction on RANGE clause.
select x, property,
  avg(x) over
  (
    partition by property
    <b>order by x</b>
    <b>range between 1 preceding and 1 following</b>
  ) as 'moving average'
  from int_t where property in ('odd','even');
ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
</codeblock>
</p>

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<!-- This conref appears under SUM(), AVG(), FLOAT, and DOUBLE topics. -->

<p conref="../shared/impala_common.xml#common/sum_double"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_analytic_functions.xml#analytic_functions"/>, <xref href="impala_max.xml#max"/>,
<xref href="impala_min.xml#min"/>
</p>
</conbody>
</concept>
556
docs/topics/impala_avro.xml
Normal file
@@ -0,0 +1,556 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="avro">

<title>Using the Avro File Format with Impala Tables</title>
<titlealts audience="PDF"><navtitle>Avro Data Files</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Avro"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="1.4.0">
<indexterm audience="Cloudera">Avro support in Impala</indexterm>
Impala supports using tables whose data files use the Avro file format. Impala can query Avro
tables, and in Impala 1.4.0 and higher can create them, but currently cannot insert data into them. For
insert operations, use Hive, then switch back to Impala to run queries.
</p>

<table>
<title>Avro Format Support in Impala</title>
<tgroup cols="5">
<colspec colname="1" colwidth="10*"/>
<colspec colname="2" colwidth="10*"/>
<colspec colname="3" colwidth="20*"/>
<colspec colname="4" colwidth="30*"/>
<colspec colname="5" colwidth="30*"/>
<thead>
<row>
<entry>
File Type
</entry>
<entry>
Format
</entry>
<entry>
Compression Codecs
</entry>
<entry>
Impala Can CREATE?
</entry>
<entry>
Impala Can INSERT?
</entry>
</row>
</thead>
<tbody>
<row conref="impala_file_formats.xml#file_formats/avro_support">
<entry/>
</row>
</tbody>
</tgroup>
</table>

<p outputclass="toc inpage"/>
</conbody>

<concept id="avro_create_table">
|
||||
|
||||
<title>Creating Avro Tables</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
To create a new table using the Avro file format, issue the <codeph>CREATE TABLE</codeph> statement through
|
||||
Impala with the <codeph>STORED AS AVRO</codeph> clause, or through Hive. If you create the table through
|
||||
Impala, you must include column definitions that match the fields specified in the Avro schema. With Hive,
|
||||
you can omit the columns and just specify the Avro schema.
|
||||
</p>
|
||||
|
||||
<p rev="2.3.0">
|
||||
In <keyword keyref="impala23_full"/> and higher, the <codeph>CREATE TABLE</codeph> for Avro tables can include
|
||||
SQL-style column definitions rather than specifying Avro notation through the <codeph>TBLPROPERTIES</codeph>
|
||||
clause. Impala issues warning messages if there are any mismatches between the types specified in the
|
||||
SQL column definitions and the underlying types; for example, any <codeph>TINYINT</codeph> or
|
||||
<codeph>SMALLINT</codeph> columns are treated as <codeph>INT</codeph> in the underlying Avro files,
|
||||
and therefore are displayed as <codeph>INT</codeph> in any <codeph>DESCRIBE</codeph> or
|
||||
<codeph>SHOW CREATE TABLE</codeph> output.
|
||||
</p>
|
||||
|
||||
<note>
|
||||
<p conref="../shared/impala_common.xml#common/avro_no_timestamp"/>
|
||||
</note>
|
||||
|
||||
<!--
|
||||
To do: Expand these examples to show switching between impala-shell and Hive, loading some data, and then
|
||||
doing DESCRIBE and querying the table.
|
||||
-->
|
||||
|
||||
<p>
|
||||
The following examples demonstrate creating an Avro table in Impala, using either an inline column
|
||||
specification or one taken from a JSON file stored in HDFS:
|
||||
</p>
|
||||
|
||||
<codeblock><![CDATA[
|
||||
[localhost:21000] > CREATE TABLE avro_only_sql_columns
|
||||
> (
|
||||
> id INT,
|
||||
> bool_col BOOLEAN,
|
||||
> tinyint_col TINYINT, /* Gets promoted to INT */
|
||||
> smallint_col SMALLINT, /* Gets promoted to INT */
|
||||
> int_col INT,
|
||||
> bigint_col BIGINT,
|
||||
> float_col FLOAT,
|
||||
> double_col DOUBLE,
|
||||
> date_string_col STRING,
|
||||
> string_col STRING
|
||||
> )
|
||||
> STORED AS AVRO;
|
||||
|
||||
[localhost:21000] > CREATE TABLE impala_avro_table
|
||||
> (bool_col BOOLEAN, int_col INT, long_col BIGINT, float_col FLOAT, double_col DOUBLE, string_col STRING, nullable_int INT)
|
||||
> STORED AS AVRO
|
||||
> TBLPROPERTIES ('avro.schema.literal'='{
|
||||
> "name": "my_record",
|
||||
> "type": "record",
|
||||
> "fields": [
|
||||
> {"name":"bool_col", "type":"boolean"},
|
||||
> {"name":"int_col", "type":"int"},
|
||||
> {"name":"long_col", "type":"long"},
|
||||
> {"name":"float_col", "type":"float"},
|
||||
> {"name":"double_col", "type":"double"},
|
||||
> {"name":"string_col", "type":"string"},
|
||||
> {"name": "nullable_int", "type": ["null", "int"]}]}');
|
||||
|
||||
[localhost:21000] > CREATE TABLE avro_examples_of_all_types (
|
||||
> id INT,
|
||||
> bool_col BOOLEAN,
|
||||
> tinyint_col TINYINT,
|
||||
> smallint_col SMALLINT,
|
||||
> int_col INT,
|
||||
> bigint_col BIGINT,
|
||||
> float_col FLOAT,
|
||||
> double_col DOUBLE,
|
||||
> date_string_col STRING,
|
||||
> string_col STRING
|
||||
> )
|
||||
> STORED AS AVRO
|
||||
> TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/avro_schemas/alltypes.json');
|
||||
]]>
|
||||
</codeblock>

<p>
The following example demonstrates creating an Avro table in Hive:
</p>

<codeblock><![CDATA[
hive> CREATE TABLE hive_avro_table
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > TBLPROPERTIES ('avro.schema.literal'='{
    >   "name": "my_record",
    >   "type": "record",
    >   "fields": [
    >     {"name":"bool_col", "type":"boolean"},
    >     {"name":"int_col", "type":"int"},
    >     {"name":"long_col", "type":"long"},
    >     {"name":"float_col", "type":"float"},
    >     {"name":"double_col", "type":"double"},
    >     {"name":"string_col", "type":"string"},
    >     {"name": "nullable_int", "type": ["null", "int"]}]}');
]]>
</codeblock>

<p>
Each field of the record becomes a column of the table. Note that any other information, such as the record
name, is ignored.
</p>

<!-- Have not got a working example of this syntax yet from Lenni.
<p>
The schema can be specified either through the <codeph>TBLPROPERTIES</codeph> clause or the
<codeph>WITH SERDEPROPERTIES</codeph> clause.
For best compatibility with future versions of Hive, use the <codeph>WITH SERDEPROPERTIES</codeph> clause
for this information.
</p>
-->

<note>
For nullable Avro columns, make sure to put the <codeph>"null"</codeph> entry before the actual type name.
In Impala, all columns are nullable; Impala currently does not have a <codeph>NOT NULL</codeph> clause. Any
non-nullable property is only enforced on the Avro side.
</note>

<p>
Most column types map directly from Avro to Impala under the same names. These are the exceptions and
special cases to consider:
</p>

<ul>
<li>
The <codeph>DECIMAL</codeph> type is defined in Avro as a <codeph>BYTE</codeph> type with the
<codeph>logicalType</codeph> property set to <codeph>"decimal"</codeph> and a specified precision and
scale. Use <codeph>DECIMAL</codeph> in Avro tables only under CDH 5. The infrastructure and components
under CDH 4 do not have reliable <codeph>DECIMAL</codeph> support.
</li>

<li>
The Avro <codeph>long</codeph> type maps to <codeph>BIGINT</codeph> in Impala.
</li>
</ul>

<p>
If you create the table through Hive, switch back to <cmdname>impala-shell</cmdname> and issue an
<codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement. Then you can run queries for
that table through <cmdname>impala-shell</cmdname>.
</p>

<p rev="2.3.0">
In rare instances, a mismatch could occur between the Avro schema and the column definitions in the
metastore database. In <keyword keyref="impala23_full"/> and higher, Impala checks for such inconsistencies during
a <codeph>CREATE TABLE</codeph> statement and each time it loads the metadata for a table (for example,
after <codeph>INVALIDATE METADATA</codeph>). Impala uses the following rules to determine how to treat
mismatching columns, a process known as <term>schema reconciliation</term>:
<ul>
<li>
If there is a mismatch in the number of columns, Impala uses the column
definitions from the Avro schema.
</li>
<li>
If there is a mismatch in column name or type, Impala uses the column definition from the Avro schema.
Because a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> column in Impala maps to an Avro <codeph>STRING</codeph>,
this case is not considered a mismatch and the column is preserved as <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph>
in the reconciled schema. <ph rev="2.7.0 IMPALA-3687 CDH-43731">Prior to <keyword keyref="impala27_full"/>, the column
name and comment for such <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns were also taken from the SQL column definition.
In <keyword keyref="impala27_full"/> and higher, the column name and comment from the Avro schema file take precedence for such columns,
and only the <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> type is preserved from the SQL column definition.</ph>
</li>
<li>
An Impala <codeph>TIMESTAMP</codeph> column definition maps to an Avro <codeph>STRING</codeph> and is presented as a <codeph>STRING</codeph>
in the reconciled schema, because Avro has no binary <codeph>TIMESTAMP</codeph> representation.
As a result, no Avro table can have a <codeph>TIMESTAMP</codeph> column; this restriction is the same as
in earlier CDH and Impala releases.
</li>
</ul>
</p>

<p conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/>

</conbody>
</concept>

<concept id="avro_map_table">

<title>Using a Hive-Created Avro Table in Impala</title>

<conbody>

<p>
If you have an Avro table created through Hive, you can use it in Impala as long as it contains only
Impala-compatible data types. It cannot contain:
<ul>
<li>
Complex types: <codeph>array</codeph>, <codeph>map</codeph>, <codeph>record</codeph>,
<codeph>struct</codeph>, <codeph>union</codeph> other than
<codeph>[<varname>supported_type</varname>,null]</codeph> or
<codeph>[null,<varname>supported_type</varname>]</codeph>
</li>

<li>
The Avro-specific types <codeph>enum</codeph>, <codeph>bytes</codeph>, and <codeph>fixed</codeph>
</li>

<li>
Any scalar type other than those listed in <xref href="impala_datatypes.xml#datatypes"/>
</li>
</ul>
Because Impala and Hive share the same metastore database, Impala can directly access the table definitions
and data for tables that were created in Hive.
</p>

<p>
If you create an Avro table in Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement the next time you
connect to Impala through <cmdname>impala-shell</cmdname>. This is a one-time operation to make Impala
aware of the new table. You can issue the statement while connected to any Impala node, and the catalog
service broadcasts the change to all other Impala nodes.
</p>

<p>
If you load new data into an Avro table through Hive, either through a Hive <codeph>LOAD DATA</codeph> or
<codeph>INSERT</codeph> statement, or by manually copying or moving files into the data directory for the
table, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement the next time you connect
to Impala through <cmdname>impala-shell</cmdname>. You can issue the statement while connected to any
Impala node, and the catalog service broadcasts the change to all other Impala nodes. If you issue the
<codeph>LOAD DATA</codeph> statement through Impala, you do not need a <codeph>REFRESH</codeph> afterward.
</p>
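
<p>
For example, the following statements sketch this sequence in <cmdname>impala-shell</cmdname>.
(This is only an illustration; it reuses the <codeph>hive_avro_table</codeph> name from the earlier
Hive example.)
</p>

<codeblock>-- One time only, after the table is created in Hive:
invalidate metadata hive_avro_table;

-- After each round of new data files is added through Hive:
refresh hive_avro_table;
select count(*) from hive_avro_table;</codeblock>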

<p>
Impala only supports fields of type <codeph>boolean</codeph>, <codeph>int</codeph>, <codeph>long</codeph>,
<codeph>float</codeph>, <codeph>double</codeph>, and <codeph>string</codeph>, or unions of these types with
null; for example, <codeph>["string", "null"]</codeph>. Unions with <codeph>null</codeph> essentially
create a nullable type.
</p>
</conbody>
</concept>

<concept id="avro_json">

<title>Specifying the Avro Schema through JSON</title>

<conbody>

<p>
While you can embed a schema directly in your <codeph>CREATE TABLE</codeph> statement, as shown above,
column width restrictions in the Hive metastore limit the length of schema you can specify. If you
encounter problems with long schema literals, try storing your schema as a <codeph>JSON</codeph> file in
HDFS instead. Specify your schema in HDFS using table properties similar to the following:
</p>

<codeblock>tblproperties ('avro.schema.url'='hdfs://your-name-node:port/path/to/schema.json');</codeblock>
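
<p>
As an illustration, such a <codeph>schema.json</codeph> file might contain an Avro record definition like
the following sketch. (The record and field names here are hypothetical, not taken from the examples above.)
</p>

<codeblock>{
  "name": "my_record",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "val", "type": ["null", "string"]}
  ]
}</codeblock>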
</conbody>
</concept>

<concept id="avro_load_data">

<title>Loading Data into an Avro Table</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
</metadata>
</prolog>

<conbody>

<p rev="DOCS-1523">
Currently, Impala cannot write Avro data files. Therefore, an Avro table cannot be used as the destination
of an Impala <codeph>INSERT</codeph> statement or <codeph>CREATE TABLE AS SELECT</codeph>.
</p>

<p>
To copy data from another table, issue any <codeph>INSERT</codeph> statements through Hive. For information
about loading data into Avro tables through Hive, see the
<xref href="https://cwiki.apache.org/confluence/display/Hive/AvroSerDe" scope="external" format="html">Avro
page on the Hive wiki</xref>.
</p>

<p>
If you already have data files in Avro format, you can also issue <codeph>LOAD DATA</codeph> in either
Impala or Hive. Impala can move existing Avro data files into an Avro table; it just cannot create new
Avro data files.
</p>
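
<p>
For example, a statement such as the following sketch moves existing Avro data files into an Avro table
from within <cmdname>impala-shell</cmdname>. (The HDFS path and table name are hypothetical.)
</p>

<codeblock>-- The files under the staging path must already be in Avro format.
LOAD DATA INPATH '/user/etl/staging/avro_files' INTO TABLE my_avro_table;</codeblock>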

</conbody>
</concept>

<concept id="avro_compression">

<title>Enabling Compression for Avro Tables</title>
<prolog>
<metadata>
<data name="Category" value="Compression"/>
<data name="Category" value="Snappy"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">compression</indexterm>
To enable compression for Avro tables, specify settings in the Hive shell to enable compression and to
specify a codec, then issue a <codeph>CREATE TABLE</codeph> statement as in the preceding examples. Impala
supports the <codeph>snappy</codeph> and <codeph>deflate</codeph> codecs for Avro tables.
</p>

<p>
For example:
</p>

<codeblock>hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=snappy;</codeblock>
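
<p>
Putting the pieces together, a complete Hive session might look like the following sketch, which creates a
Snappy-compressed Avro table and then writes compressed data files into it. (The table and column names
are hypothetical; the <codeph>set</codeph> commands affect the data files written by subsequent Hive
<codeph>INSERT</codeph> statements.)
</p>

<codeblock>hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=snappy;
hive> CREATE TABLE compressed_avro_table (id INT, s STRING)
    > STORED AS AVRO;
hive> INSERT INTO TABLE compressed_avro_table
    > SELECT id, s FROM some_source_table;</codeblock>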
</conbody>
</concept>

<concept rev="1.1" id="avro_schema_evolution">

<title>How Impala Handles Avro Schema Evolution</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>

<conbody>

<p>
Starting in Impala 1.1, Impala can deal with Avro data files that employ <term>schema evolution</term>,
where different data files within the same table use slightly different type definitions. (You would
perform the schema evolution operation by issuing an <codeph>ALTER TABLE</codeph> statement in the Hive
shell.) The old and new types for any changed columns must be compatible; for example, a column might start
as an <codeph>int</codeph> and later change to a <codeph>bigint</codeph> or <codeph>float</codeph>.
</p>

<p>
As with any other tables where the definitions are changed or data is added outside of the current
<cmdname>impalad</cmdname> node, ensure that Impala loads the latest metadata for the table if the Avro
schema is modified through Hive. Issue a <codeph>REFRESH <varname>table_name</varname></codeph> or
<codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement. <codeph>REFRESH</codeph>
reloads the metadata immediately, while <codeph>INVALIDATE METADATA</codeph> reloads the metadata the next time
the table is accessed.
</p>

<p>
When Avro data files or columns are not consulted during a query, Impala does not check for consistency.
Thus, if you issue <codeph>SELECT c1, c2 FROM t1</codeph>, Impala does not return any error if the column
<codeph>c3</codeph> changed in an incompatible way. If a query retrieves data from some partitions but not
others, Impala does not check the data files for the unused partitions.
</p>

<p>
In the Hive DDL statements, you can specify an <codeph>avro.schema.literal</codeph> table property (if the
schema definition is short) or an <codeph>avro.schema.url</codeph> property (if the schema definition is
long, or to allow convenient editing for the definition).
</p>

<p>
For example, running the following SQL code in the Hive shell creates a table using the Avro file format
and puts some sample data into it:
</p>

<codeblock>CREATE TABLE avro_table (a string, b string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "my_record",
    "fields": [
      {"name": "a", "type": "int"},
      {"name": "b", "type": "string"}
    ]}');

INSERT OVERWRITE TABLE avro_table SELECT 1, "avro" FROM functional.alltypes LIMIT 1;
</codeblock>

<p>
Once the Avro table is created and contains data, you can query it through the
<cmdname>impala-shell</cmdname> command:
</p>

<codeblock>[localhost:21000] > select * from avro_table;
+---+------+
| a | b    |
+---+------+
| 1 | avro |
+---+------+
</codeblock>

<p>
Now in the Hive shell, you change the type of a column and add a new column with a default value:
</p>

<codeblock>-- Promote column "a" from INT to FLOAT (no need to update Avro schema)
ALTER TABLE avro_table CHANGE A A FLOAT;

-- Add column "c" with default
ALTER TABLE avro_table ADD COLUMNS (c int);
ALTER TABLE avro_table SET TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "my_record",
    "fields": [
      {"name": "a", "type": "int"},
      {"name": "b", "type": "string"},
      {"name": "c", "type": "int", "default": 10}
    ]}');
</codeblock>

<p>
Once again in <cmdname>impala-shell</cmdname>, you can query the Avro table based on its latest schema
definition. Because the table metadata was changed outside of Impala, you issue a <codeph>REFRESH</codeph>
statement first so that Impala has up-to-date metadata for the table.
</p>

<codeblock>[localhost:21000] > refresh avro_table;
[localhost:21000] > select * from avro_table;
+---+------+----+
| a | b    | c  |
+---+------+----+
| 1 | avro | 10 |
+---+------+----+
</codeblock>
</conbody>
</concept>

<concept id="avro_data_types">

<title>Data Type Considerations for Avro Tables</title>

<conbody>

<p>
The Avro format defines a set of data types whose names differ from the names of the corresponding Impala
data types. If you are preparing Avro files using other Hadoop components such as Pig or MapReduce, you
might need to work with the type names defined by Avro. The following figure lists the Avro-defined types
and the equivalent types in Impala.
</p>

<codeblock><![CDATA[Primitive Types (Avro -> Impala)
--------------------------------
STRING -> STRING
STRING -> CHAR
STRING -> VARCHAR
INT -> INT
BOOLEAN -> BOOLEAN
LONG -> BIGINT
FLOAT -> FLOAT
DOUBLE -> DOUBLE

Logical Types
-------------
BYTES + logicalType = "decimal" -> DECIMAL

Avro Types with No Impala Equivalent
------------------------------------
RECORD, MAP, ARRAY, UNION, ENUM, FIXED, NULL

Impala Types with No Avro Equivalent
------------------------------------
TIMESTAMP
]]>
</codeblock>

<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>

</conbody>
</concept>

<concept id="avro_performance">

<title>Query Performance for Impala Avro Tables</title>

<conbody>

<p>
In general, expect query performance with Avro tables to be
faster than with tables using text data, but slower than with
Parquet tables. See <xref href="impala_parquet.xml#parquet"/>
for information about using the Parquet file format for
high-performance analytic queries.
</p>

<p conref="../shared/impala_common.xml#common/s3_block_splitting"/>

</conbody>
</concept>

</concept>
38
docs/topics/impala_batch_size.xml
Normal file
@@ -0,0 +1,38 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="batch_size">

<title>BATCH_SIZE Query Option</title>
<titlealts audience="PDF"><navtitle>BATCH_SIZE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">BATCH_SIZE query option</indexterm>
Number of rows evaluated at a time by SQL operators. Unspecified or a size of 0 uses a predefined default
size. Using a large number improves responsiveness, especially for scan operations, at the cost of a higher memory footprint.
</p>

<p>
This option is primarily for testing during Impala development, or for use under the direction of <keyword keyref="support_org"/>.
</p>

<p>
<b>Type:</b> numeric
</p>

<p>
<b>Default:</b> 0 (meaning the predefined default of 1024)
</p>
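
<p>
For example, the following sketch of an <cmdname>impala-shell</cmdname> session tries a larger batch size
for one query and then restores the default. (The table name is hypothetical.)
</p>

<codeblock>set batch_size=2048;
select count(*) from some_big_table;
-- Revert to the predefined default:
set batch_size=0;</codeblock>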
</conbody>
</concept>
102
docs/topics/impala_bigint.xml
Normal file
@@ -0,0 +1,102 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="bigint">

<title>BIGINT Data Type</title>
<titlealts audience="PDF"><navtitle>BIGINT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>

<conbody>

<p>
An 8-byte integer data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
statements.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>

<codeblock><varname>column_name</varname> BIGINT</codeblock>

<p>
<b>Range:</b> -9223372036854775808 .. 9223372036854775807. There is no <codeph>UNSIGNED</codeph> subtype.
</p>

<p>
<b>Conversions:</b> Impala automatically converts to a floating-point type (<codeph>FLOAT</codeph> or
<codeph>DOUBLE</codeph>). Use <codeph>CAST()</codeph> to convert to <codeph>TINYINT</codeph>,
<codeph>SMALLINT</codeph>, <codeph>INT</codeph>, <codeph>STRING</codeph>, or <codeph>TIMESTAMP</codeph>.
<ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>CREATE TABLE t1 (x BIGINT);
SELECT CAST(1000 AS BIGINT);
</codeblock>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
<codeph>BIGINT</codeph> is a convenient type to use for column declarations because you can use any kind of
integer values in <codeph>INSERT</codeph> statements and they are promoted to <codeph>BIGINT</codeph> where
necessary. However, <codeph>BIGINT</codeph> also requires the most bytes of any integer type on disk and in
memory, meaning your queries are not as efficient and scalable as possible if you overuse this type.
Therefore, prefer to use the smallest integer type with sufficient range to hold all input values, and
<codeph>CAST()</codeph> when necessary to the appropriate type.
</p>

<p>
For a convenient and automated way to check the bounds of the <codeph>BIGINT</codeph> type, call the
functions <codeph>MIN_BIGINT()</codeph> and <codeph>MAX_BIGINT()</codeph>.
</p>
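
<p>
For example, the following query (a minimal sketch) returns the two boundary values listed under
<b>Range</b> above:
</p>

<codeblock>-- Returns -9223372036854775808 and 9223372036854775807.
select min_bigint(), max_bigint();</codeblock>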

<p>
If an integer value is too large to be represented as a <codeph>BIGINT</codeph>, use a
<codeph>DECIMAL</codeph> instead with sufficient digits of precision.
</p>

<p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/>

<p conref="../shared/impala_common.xml#common/partitioning_good"/>

<p conref="../shared/impala_common.xml#common/hbase_ok"/>

<!-- <p conref="../shared/impala_common.xml#common/parquet_blurb"/> -->

<p conref="../shared/impala_common.xml#common/text_bulky"/>

<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->

<p conref="../shared/impala_common.xml#common/internals_8_bytes"/>

<p conref="../shared/impala_common.xml#common/added_forever"/>

<p conref="../shared/impala_common.xml#common/column_stats_constant"/>

<p conref="../shared/impala_common.xml#common/sqoop_blurb"/>

<p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>,
<xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>,
<xref href="impala_math_functions.xml#math_functions"/>
</p>
</conbody>
</concept>
794
docs/topics/impala_bit_functions.xml
Normal file
@@ -0,0 +1,794 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="bit_functions" rev="2.3.0">

<title>Impala Bit Functions</title>
<titlealts audience="PDF"><navtitle>Bit Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>

<conbody>

<p rev="2.3.0">
Bit manipulation functions perform bitwise operations involved in scientific processing or computer science algorithms.
For example, these functions include setting, clearing, or testing bits within an integer value, or changing the
positions of bits with or without wraparound.
</p>

<p>
If a function takes two integer arguments that are required to be of the same type, the smaller argument is promoted
to the type of the larger one if required. For example, <codeph>BITAND(1,4096)</codeph> treats both arguments as
<codeph>SMALLINT</codeph>, because 1 can be represented as a <codeph>TINYINT</codeph> but 4096 requires a <codeph>SMALLINT</codeph>.
</p>

<p>
Remember that all Impala integer values are signed. Therefore, when dealing with binary values where the most significant
bit is 1, the specified or returned values might be negative when represented in base 10.
</p>

<p>
Whenever any argument is <codeph>NULL</codeph>, whether the input value, the bit position, or the number of shift or rotate positions,
the return value from any of these functions is also <codeph>NULL</codeph>.
</p>
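
<p>
For example, each of the following queries (a minimal sketch) returns <codeph>NULL</codeph> because one of
the arguments is <codeph>NULL</codeph>:
</p>

<codeblock>-- NULL input value, NULL shift count, and NULL position, respectively.
select bitand(null, 255);
select shiftleft(1, null);
select getbit(null, 0);</codeblock>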

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
The bit functions operate on all the integral data types: <xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>, <xref href="impala_smallint.xml#smallint"/>, and
<xref href="impala_tinyint.xml#tinyint"/>.
</p>

<p>
<b>Function reference:</b>
</p>

<p>
Impala supports the following bit functions:
</p>

<!--
bitand
bitnot
bitor
bitxor
countset
getbit
rotateleft
rotateright
setbit
shiftleft
shiftright
-->

<dl>

<dlentry id="bitand">

<dt>
<codeph>bitand(integer_type a, same_type b)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">bitand() function</indexterm>
<b>Purpose:</b> Returns an integer value representing the bits that are set to 1 in both of the arguments.
If the arguments are of different sizes, the smaller is promoted to the type of the larger.
<p>
<b>Usage notes:</b> The <codeph>bitand()</codeph> function is equivalent to the <codeph>&</codeph> binary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show the results of ANDing integer values.
255 contains all 1 bits in its lowermost 8 bits.
32767 contains all 1 bits in its lowermost 15 bits.
<!--
Negative numbers have a 1 in the sign bit and the value is the
<xref href="https://en.wikipedia.org/wiki/Two%27s_complement" scope="external" format="html">two's complement</xref>
of the positive equivalent.
-->
You can use the <codeph>bin()</codeph> function to check the binary representation of any
integer value, although the result is always represented as a 64-bit value.
If necessary, the smaller argument is promoted to the
type of the larger one.
</p>
<codeblock>select bitand(255, 32767); /* 0000000011111111 & 0111111111111111 */
+--------------------+
| bitand(255, 32767) |
+--------------------+
| 255                |
+--------------------+

select bitand(32767, 1); /* 0111111111111111 & 0000000000000001 */
+------------------+
| bitand(32767, 1) |
+------------------+
| 1                |
+------------------+

select bitand(32, 16); /* 00100000 & 00010000 */
+----------------+
| bitand(32, 16) |
+----------------+
| 0              |
+----------------+

select bitand(12,5); /* 00001100 & 00000101 */
+---------------+
| bitand(12, 5) |
+---------------+
| 4             |
+---------------+

select bitand(-1,15); /* 11111111 & 00001111 */
+----------------+
| bitand(-1, 15) |
+----------------+
| 15             |
+----------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="bitnot">

<dt>
<codeph>bitnot(integer_type a)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">bitnot() function</indexterm>
<b>Purpose:</b> Inverts all the bits of the input argument.
<p>
<b>Usage notes:</b> The <codeph>bitnot()</codeph> function is equivalent to the <codeph>~</codeph> unary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples illustrate what happens when you flip all the bits of an integer value.
The sign always changes. The decimal representation differs by one between the positive and
negative values.
<!--
because negative values are represented as the
<xref href="https://en.wikipedia.org/wiki/Two%27s_complement" scope="external" format="html">two's complement</xref>
of the corresponding positive value.
-->
</p>
<codeblock>select bitnot(127); /* 01111111 -> 10000000 */
+-------------+
| bitnot(127) |
+-------------+
| -128        |
+-------------+

select bitnot(16); /* 00010000 -> 11101111 */
+------------+
| bitnot(16) |
+------------+
| -17        |
+------------+

select bitnot(0); /* 00000000 -> 11111111 */
+-----------+
| bitnot(0) |
+-----------+
| -1        |
+-----------+

select bitnot(-128); /* 10000000 -> 01111111 */
+--------------+
| bitnot(-128) |
+--------------+
| 127          |
+--------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="bitor">

<dt>
<codeph>bitor(integer_type a, same_type b)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">bitor() function</indexterm>
<b>Purpose:</b> Returns an integer value representing the bits that are set to 1 in either of the arguments.
If the arguments are of different sizes, the smaller is promoted to the type of the larger.
<p>
<b>Usage notes:</b> The <codeph>bitor()</codeph> function is equivalent to the <codeph>|</codeph> binary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show the results of ORing integer values.
</p>
<codeblock>select bitor(1,4); /* 00000001 | 00000100 */
+-------------+
| bitor(1, 4) |
+-------------+
| 5           |
+-------------+

select bitor(16,48); /* 00010000 | 00110000 */
+---------------+
| bitor(16, 48) |
+---------------+
| 48            |
+---------------+

select bitor(0,7); /* 00000000 | 00000111 */
+-------------+
| bitor(0, 7) |
+-------------+
| 7           |
+-------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="bitxor">

<dt>
<codeph>bitxor(integer_type a, same_type b)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">bitxor() function</indexterm>
<b>Purpose:</b> Returns an integer value representing the bits that are set to 1 in one but not both of the arguments.
If the arguments are of different sizes, the smaller is promoted to the type of the larger.
<p>
<b>Usage notes:</b> The <codeph>bitxor()</codeph> function is equivalent to the <codeph>^</codeph> binary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show the results of XORing integer values.
XORing a non-zero value with zero returns the non-zero value.
XORing two identical values returns zero, because all the 1 bits from the first argument are also 1 bits in the second argument.
XORing different non-zero values turns off some bits and leaves others turned on, based on whether the same bit is set in both arguments.
</p>
<codeblock>select bitxor(0,15); /* 00000000 ^ 00001111 */
+---------------+
| bitxor(0, 15) |
+---------------+
| 15            |
+---------------+

select bitxor(7,7); /* 00000111 ^ 00000111 */
+--------------+
| bitxor(7, 7) |
+--------------+
| 0            |
+--------------+

select bitxor(8,4); /* 00001000 ^ 00000100 */
+--------------+
| bitxor(8, 4) |
+--------------+
| 12           |
+--------------+

select bitxor(3,7); /* 00000011 ^ 00000111 */
+--------------+
| bitxor(3, 7) |
+--------------+
| 4            |
+--------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="countset">

<dt>
<codeph>countset(integer_type a [, int zero_or_one])</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">countset() function</indexterm>
<b>Purpose:</b> By default, returns the number of 1 bits in the specified integer value.
If the optional second argument is set to zero, it returns the number of 0 bits instead.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
In discussions of information theory, this operation is referred to as the
<q><xref href="https://en.wikipedia.org/wiki/Hamming_weight" scope="external" format="html">population count</xref></q>
or <q>popcount</q>.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show how to count the number of 1 bits in an integer value.
</p>
<codeblock>select countset(1); /* 00000001 */
+-------------+
| countset(1) |
+-------------+
| 1           |
+-------------+

select countset(3); /* 00000011 */
+-------------+
| countset(3) |
+-------------+
| 2           |
+-------------+

select countset(16); /* 00010000 */
+--------------+
| countset(16) |
+--------------+
| 1            |
+--------------+

select countset(17); /* 00010001 */
+--------------+
| countset(17) |
+--------------+
| 2            |
+--------------+

select countset(7,1); /* 00000111 = 3 1 bits; the function counts 1 bits by default */
+----------------+
| countset(7, 1) |
+----------------+
| 3              |
+----------------+

select countset(7,0); /* 00000111 = 5 0 bits; second argument can only be 0 or 1 */
+----------------+
| countset(7, 0) |
+----------------+
| 5              |
+----------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="getbit">

<dt>
<codeph>getbit(integer_type a, int position)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">getbit() function</indexterm>
<b>Purpose:</b> Returns a 0 or 1 representing the bit at a
specified position. The positions are numbered right to left, starting at zero.
The position argument cannot be negative.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
When you use a literal input value, it is treated as a value of the smallest
appropriate type (8-bit, 16-bit, and so on).
The type of the input value limits the range of the positions.
Cast the input value to the appropriate type if you need to
ensure it is treated as a 64-bit, 32-bit, and so on value.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show how to test a specific bit within an integer value.
</p>
<codeblock>select getbit(1,0); /* 00000001 */
+--------------+
| getbit(1, 0) |
+--------------+
| 1            |
+--------------+

select getbit(16,1); /* 00010000 */
+---------------+
| getbit(16, 1) |
+---------------+
| 0             |
+---------------+

select getbit(16,4); /* 00010000 */
+---------------+
| getbit(16, 4) |
+---------------+
| 1             |
+---------------+

select getbit(16,5); /* 00010000 */
+---------------+
| getbit(16, 5) |
+---------------+
| 0             |
+---------------+

select getbit(-1,3); /* 11111111 */
+---------------+
| getbit(-1, 3) |
+---------------+
| 1             |
+---------------+

select getbit(-1,25); /* 11111111 */
ERROR: Invalid bit position: 25

select getbit(cast(-1 as int),25); /* 11111111111111111111111111111111 */
+-----------------------------+
| getbit(cast(-1 as int), 25) |
+-----------------------------+
| 1                           |
+-----------------------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="rotateleft">

<dt>
<codeph>rotateleft(integer_type a, int positions)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">rotateleft() function</indexterm>
<b>Purpose:</b> Rotates an integer value left by a specified number of bits.
As the most significant bit is taken out of the original value,
if it is a 1 bit, it is <q>rotated</q> back to the least significant bit.
Therefore, the final value has the same number of 1 bits as the original value,
just in different positions.
In computer science terms, this operation is a
<q><xref href="https://en.wikipedia.org/wiki/Circular_shift" scope="external" format="html">circular shift</xref></q>.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Specifying a second argument of zero leaves the original value unchanged.
Rotating a -1 value by any number of positions still returns -1,
because the original value has all 1 bits and all the 1 bits are
preserved during rotation.
Similarly, rotating a 0 value by any number of positions still returns 0.
Rotating a value by the same number of bits as in the value returns the same value.
Because this is a circular operation, the number of positions is not limited
to the number of bits in the input value.
For example, rotating an 8-bit value by 1, 9, 17, and so on positions returns an
identical result in each case.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select rotateleft(1,4); /* 00000001 -> 00010000 */
+------------------+
| rotateleft(1, 4) |
+------------------+
| 16               |
+------------------+

select rotateleft(-1,155); /* 11111111 -> 11111111 */
+---------------------+
| rotateleft(-1, 155) |
+---------------------+
| -1                  |
+---------------------+

select rotateleft(-128,1); /* 10000000 -> 00000001 */
+---------------------+
| rotateleft(-128, 1) |
+---------------------+
| 1                   |
+---------------------+

select rotateleft(-127,3); /* 10000001 -> 00001100 */
+---------------------+
| rotateleft(-127, 3) |
+---------------------+
| 12                  |
+---------------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="rotateright">

<dt>
<codeph>rotateright(integer_type a, int positions)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">rotateright() function</indexterm>
<b>Purpose:</b> Rotates an integer value right by a specified number of bits.
As the least significant bit is taken out of the original value,
if it is a 1 bit, it is <q>rotated</q> back to the most significant bit.
Therefore, the final value has the same number of 1 bits as the original value,
just in different positions.
In computer science terms, this operation is a
<q><xref href="https://en.wikipedia.org/wiki/Circular_shift" scope="external" format="html">circular shift</xref></q>.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Specifying a second argument of zero leaves the original value unchanged.
Rotating a -1 value by any number of positions still returns -1,
because the original value has all 1 bits and all the 1 bits are
preserved during rotation.
Similarly, rotating a 0 value by any number of positions still returns 0.
Rotating a value by the same number of bits as in the value returns the same value.
Because this is a circular operation, the number of positions is not limited
to the number of bits in the input value.
For example, rotating an 8-bit value by 1, 9, 17, and so on positions returns an
identical result in each case.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select rotateright(16,4); /* 00010000 -> 00000001 */
+--------------------+
| rotateright(16, 4) |
+--------------------+
| 1                  |
+--------------------+

select rotateright(-1,155); /* 11111111 -> 11111111 */
+----------------------+
| rotateright(-1, 155) |
+----------------------+
| -1                   |
+----------------------+

select rotateright(-128,1); /* 10000000 -> 01000000 */
+----------------------+
| rotateright(-128, 1) |
+----------------------+
| 64                   |
+----------------------+

select rotateright(-127,3); /* 10000001 -> 00110000 */
+----------------------+
| rotateright(-127, 3) |
+----------------------+
| 48                   |
+----------------------+
</codeblock>
</dd>

</dlentry>

<dlentry id="setbit">

<dt>
<codeph>setbit(integer_type a, int position [, int zero_or_one])</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">setbit() function</indexterm>
<b>Purpose:</b> By default, changes a bit at a specified position to a 1, if it is not already.
If the optional third argument is set to zero, the specified bit is set to 0 instead.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
If the bit at the specified position was already 1 (by default)
or 0 (with a third argument of zero), the return value is
the same as the first argument.
The positions are numbered right to left, starting at zero.
(Therefore, the return value could be different from the first argument
even if the position argument is zero.)
The position argument cannot be negative.
<p>
When you use a literal input value, it is treated as a value of the smallest
appropriate type (8-bit, 16-bit, and so on).
The type of the input value limits the range of the positions.
Cast the input value to the appropriate type if you need to
ensure it is treated as a 64-bit, 32-bit, and so on value.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select setbit(0,0); /* 00000000 -> 00000001 */
+--------------+
| setbit(0, 0) |
+--------------+
| 1            |
+--------------+

select setbit(0,3); /* 00000000 -> 00001000 */
+--------------+
| setbit(0, 3) |
+--------------+
| 8            |
+--------------+

select setbit(7,3); /* 00000111 -> 00001111 */
+--------------+
| setbit(7, 3) |
+--------------+
| 15           |
+--------------+

select setbit(15,3); /* 00001111 -> 00001111 */
+---------------+
| setbit(15, 3) |
+---------------+
| 15            |
+---------------+

select setbit(0,32); /* By default, 0 is a TINYINT with only 8 bits. */
ERROR: Invalid bit position: 32

select setbit(cast(0 as bigint),32); /* For BIGINT, the position can be 0..63. */
+-------------------------------+
| setbit(cast(0 as bigint), 32) |
+-------------------------------+
| 4294967296                    |
+-------------------------------+

select setbit(7,3,1); /* 00000111 -> 00001111; setting to 1 is the default */
+-----------------+
| setbit(7, 3, 1) |
+-----------------+
| 15              |
+-----------------+

select setbit(7,2,0); /* 00000111 -> 00000011; third argument of 0 clears instead of sets */
+-----------------+
| setbit(7, 2, 0) |
+-----------------+
| 3               |
+-----------------+
</codeblock>
</dd>

</dlentry>
|
||||
|
||||
<dlentry id="shiftleft">
|
||||
|
||||
<dt>
|
||||
<codeph>shiftleft(integer_type a, int positions)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">shiftleft() function</indexterm>
|
||||
<b>Purpose:</b> Shifts an integer value left by a specified number of bits.
|
||||
As the most significant bit is taken out of the original value,
|
||||
it is discarded and the least significant bit becomes 0.
|
||||
In computer science terms, this operation is a <q><xref href="https://en.wikipedia.org/wiki/Logical_shift" scope="external" format="html">logical shift</xref></q>.
|
||||
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
<p>
|
||||
The final value has either the same number of 1 bits as the original value, or fewer.
|
||||
Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces
|
||||
a result of zero.
|
||||
</p>
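<p>
For example, assuming the literal is cast so that it is treated as an 8-bit value,
the following query sketches the rule described above. The expected result is noted
in the comment rather than shown as verbatim output.
</p>
<codeblock>select shiftleft(cast(1 as tinyint), 8); /* all 8 bits shifted out -> 0 */
</codeblock>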
|
||||
<p>
|
||||
Specifying a second argument of zero leaves the original value unchanged.
|
||||
Shifting any value by 1 is the same as multiplying it by 2,
|
||||
as long as the value is small enough; larger values eventually
|
||||
become negative when shifted, as the sign bit is set.
|
||||
Starting with the value 1 and shifting it left by N positions gives
|
||||
the same result as 2 to the Nth power, or <codeph>pow(2,<varname>N</varname>)</codeph>.
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/return_type_same"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<codeblock>select shiftleft(1,0); /* 00000001 -> 00000001 */
|
||||
+-----------------+
|
||||
| shiftleft(1, 0) |
|
||||
+-----------------+
|
||||
| 1 |
|
||||
+-----------------+
|
||||
|
||||
select shiftleft(1,3); /* 00000001 -> 00001000 */
|
||||
+-----------------+
|
||||
| shiftleft(1, 3) |
|
||||
+-----------------+
|
||||
| 8 |
|
||||
+-----------------+
|
||||
|
||||
select shiftleft(8,2); /* 00001000 -> 00100000 */
|
||||
+-----------------+
|
||||
| shiftleft(8, 2) |
|
||||
+-----------------+
|
||||
| 32 |
|
||||
+-----------------+
|
||||
|
||||
select shiftleft(127,1); /* 01111111 -> 11111110 */
|
||||
+-------------------+
|
||||
| shiftleft(127, 1) |
|
||||
+-------------------+
|
||||
| -2 |
|
||||
+-------------------+
|
||||
|
||||
select shiftleft(127,5); /* 01111111 -> 11100000 */
|
||||
+-------------------+
|
||||
| shiftleft(127, 5) |
|
||||
+-------------------+
|
||||
| -32 |
|
||||
+-------------------+
|
||||
|
||||
select shiftleft(-1,4); /* 11111111 -> 11110000 */
|
||||
+------------------+
|
||||
| shiftleft(-1, 4) |
|
||||
+------------------+
|
||||
| -16 |
|
||||
+------------------+
|
||||
</codeblock>
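<p>
The following query sketches a quick check of the <codeph>pow(2,<varname>N</varname>)</codeph>
equivalence described in the usage notes. The expected values are noted in the comment
rather than shown as verbatim output.
</p>
<codeblock>select shiftleft(1,4) as shifted, pow(2,4) as power; /* both represent 16 */
</codeblock>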
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="shiftright">
|
||||
|
||||
<dt>
|
||||
<codeph>shiftright(integer_type a, int positions)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">shiftright() function</indexterm>
|
||||
<b>Purpose:</b> Shifts an integer value right by a specified number of bits.
|
||||
As the least significant bit is taken out of the original value,
|
||||
it is discarded and the most significant bit becomes 0.
|
||||
In computer science terms, this operation is a <q><xref href="https://en.wikipedia.org/wiki/Logical_shift" scope="external" format="html">logical shift</xref></q>.
|
||||
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
<p>
|
||||
The final value has either the same number of 1 bits as the original value, or fewer.
|
||||
Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces
|
||||
a result of zero.
|
||||
</p>
|
||||
<p>
|
||||
Specifying a second argument of zero leaves the original value unchanged.
|
||||
Shifting any positive value right by 1 is the same as dividing it by 2.
|
||||
Negative values become positive when shifted right.
|
||||
</p>
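<p>
For example, the following query sketches the halving behavior for a positive value;
the expected result is noted in the comment.
</p>
<codeblock>select shiftright(10,1); /* 00001010 -> 00000101, that is, 10 / 2 = 5 */
</codeblock>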
|
||||
<p conref="../shared/impala_common.xml#common/return_type_same"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<codeblock>select shiftright(16,0); /* 00010000 -> 00010000 */
|
||||
+-------------------+
|
||||
| shiftright(16, 0) |
|
||||
+-------------------+
|
||||
| 16 |
|
||||
+-------------------+
|
||||
|
||||
select shiftright(16,4); /* 00010000 -> 00000001 */
|
||||
+-------------------+
|
||||
| shiftright(16, 4) |
|
||||
+-------------------+
|
||||
| 1 |
|
||||
+-------------------+
|
||||
|
||||
select shiftright(16,5); /* 00010000 -> 00000000 */
|
||||
+-------------------+
|
||||
| shiftright(16, 5) |
|
||||
+-------------------+
|
||||
| 0 |
|
||||
+-------------------+
|
||||
|
||||
select shiftright(-1,1); /* 11111111 -> 01111111 */
|
||||
+-------------------+
|
||||
| shiftright(-1, 1) |
|
||||
+-------------------+
|
||||
| 127 |
|
||||
+-------------------+
|
||||
|
||||
select shiftright(-1,5); /* 11111111 -> 00000111 */
|
||||
+-------------------+
|
||||
| shiftright(-1, 5) |
|
||||
+-------------------+
|
||||
| 7 |
|
||||
+-------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
</dl>
|
||||
</conbody>
|
||||
</concept>
|
||||
154
docs/topics/impala_boolean.xml
Normal file
@@ -0,0 +1,154 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="boolean">
|
||||
|
||||
<title>BOOLEAN Data Type</title>
|
||||
<titlealts audience="PDF"><navtitle>BOOLEAN</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Impala Data Types"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Schemas"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
A data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements, representing a
|
||||
single true/false choice.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
|
||||
<p>
|
||||
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
|
||||
</p>
|
||||
|
||||
<codeblock><varname>column_name</varname> BOOLEAN</codeblock>
|
||||
|
||||
<p>
|
||||
<b>Range:</b> <codeph>TRUE</codeph> or <codeph>FALSE</codeph>. Do not use quotation marks around the
|
||||
<codeph>TRUE</codeph> and <codeph>FALSE</codeph> literal values. You can write the literal values in
|
||||
uppercase, lowercase, or mixed case. The values queried from a table are always returned in lowercase,
|
||||
<codeph>true</codeph> or <codeph>false</codeph>.
|
||||
</p>
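<p>
For example, all of the following literal forms are accepted, and each displays
as lowercase <codeph>true</codeph> in the result set:
</p>
<codeblock>select TRUE as t1, true as t2, True as t3;
</codeblock>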
|
||||
|
||||
<p>
|
||||
<b>Conversions:</b> Impala does not automatically convert any other type to <codeph>BOOLEAN</codeph>. All
|
||||
conversions must use an explicit call to the <codeph>CAST()</codeph> function.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can use <codeph>CAST()</codeph> to convert
|
||||
<!--
|
||||
<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>,
|
||||
<codeph>INT</codeph>, <codeph>BIGINT</codeph>, <codeph>FLOAT</codeph>, or <codeph>DOUBLE</codeph>
|
||||
-->
|
||||
any integer or floating-point type to
|
||||
<codeph>BOOLEAN</codeph>: a value of 0 represents <codeph>false</codeph>, and any non-zero value is converted
|
||||
to <codeph>true</codeph>.
|
||||
</p>
|
||||
|
||||
<codeblock>SELECT CAST(42 AS BOOLEAN) AS nonzero_int, CAST(99.44 AS BOOLEAN) AS nonzero_decimal,
|
||||
CAST(000 AS BOOLEAN) AS zero_int, CAST(0.0 AS BOOLEAN) AS zero_decimal;
|
||||
+-------------+-----------------+----------+--------------+
|
||||
| nonzero_int | nonzero_decimal | zero_int | zero_decimal |
|
||||
+-------------+-----------------+----------+--------------+
|
||||
| true | true | false | false |
|
||||
+-------------+-----------------+----------+--------------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
When you cast the opposite way, from <codeph>BOOLEAN</codeph> to a numeric type,
|
||||
the result becomes either 1 or 0:
|
||||
</p>
|
||||
|
||||
<codeblock>SELECT CAST(true AS INT) AS true_int, CAST(true AS DOUBLE) AS true_double,
|
||||
CAST(false AS INT) AS false_int, CAST(false AS DOUBLE) AS false_double;
|
||||
+----------+-------------+-----------+--------------+
|
||||
| true_int | true_double | false_int | false_double |
|
||||
+----------+-------------+-----------+--------------+
|
||||
| 1 | 1 | 0 | 0 |
|
||||
+----------+-------------+-----------+--------------+
|
||||
</codeblock>
|
||||
|
||||
<p rev="1.4.0">
|
||||
<!-- BOOLEAN-to-DECIMAL casting requested in IMPALA-991. As of Sept. 2014, designated "won't fix". -->
|
||||
You can cast <codeph>DECIMAL</codeph> values to <codeph>BOOLEAN</codeph>, with the same treatment of zero and
|
||||
non-zero values as the other numeric types. You cannot cast a <codeph>BOOLEAN</codeph> to a
|
||||
<codeph>DECIMAL</codeph>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You cannot cast a <codeph>STRING</codeph> value to <codeph>BOOLEAN</codeph>, although you can cast a
|
||||
<codeph>BOOLEAN</codeph> value to <codeph>STRING</codeph>, returning <codeph>'1'</codeph> for
|
||||
<codeph>true</codeph> values and <codeph>'0'</codeph> for <codeph>false</codeph> values.
|
||||
</p>
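<p>
For example, the following query sketches the <codeph>BOOLEAN</codeph>-to-<codeph>STRING</codeph>
conversion; the expected values are noted in the comment rather than shown as verbatim output.
</p>
<codeblock>select cast(true as string) as t, cast(false as string) as f; /* '1' and '0' */
</codeblock>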
|
||||
|
||||
<p>
|
||||
Although you can cast a <codeph>TIMESTAMP</codeph> to a <codeph>BOOLEAN</codeph> or a
|
||||
<codeph>BOOLEAN</codeph> to a <codeph>TIMESTAMP</codeph>, the results are unlikely to be useful. Any non-zero
|
||||
<codeph>TIMESTAMP</codeph> (that is, any value other than <codeph>1970-01-01 00:00:00</codeph>) becomes
|
||||
<codeph>TRUE</codeph> when converted to <codeph>BOOLEAN</codeph>, while <codeph>1970-01-01 00:00:00</codeph>
|
||||
becomes <codeph>FALSE</codeph>. A value of <codeph>FALSE</codeph> becomes <codeph>1970-01-01
00:00:00</codeph> when converted to <codeph>TIMESTAMP</codeph>, and <codeph>TRUE</codeph> becomes one second
|
||||
past this epoch date, that is, <codeph>1970-01-01 00:00:01</codeph>.
|
||||
</p>
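<p>
For example, assuming the conversion rules described above, casting the <codeph>BOOLEAN</codeph>
literals to <codeph>TIMESTAMP</codeph> might look like this:
</p>
<codeblock>select cast(true as timestamp) as true_ts,   /* 1970-01-01 00:00:01 */
       cast(false as timestamp) as false_ts; /* 1970-01-01 00:00:00 */
</codeblock>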
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/null_null_arguments"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
|
||||
|
||||
<p>
|
||||
Do not use a <codeph>BOOLEAN</codeph> column as a partition key. Although you can create such a table,
|
||||
subsequent operations produce errors:
|
||||
</p>
|
||||
|
||||
<codeblock>[localhost:21000] > create table truth_table (assertion string) partitioned by (truth boolean);
|
||||
[localhost:21000] > insert into truth_table values ('Pigs can fly',false);
|
||||
ERROR: AnalysisException: INSERT into table with BOOLEAN partition column (truth) is not supported: partitioning.truth_table
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<codeblock>SELECT 1 &lt; 2;
SELECT 2 = 5;
SELECT 100 &lt; NULL, 100 > NULL;
|
||||
CREATE TABLE assertions (claim STRING, really BOOLEAN);
|
||||
INSERT INTO assertions VALUES
|
||||
("1 is less than 2", 1 < 2),
|
||||
("2 is the same as 5", 2 = 5),
|
||||
("Grass is green", true),
|
||||
("The moon is made of green cheese", false);
|
||||
SELECT claim FROM assertions WHERE really = TRUE;
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/parquet_ok"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/text_bulky"/>
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/internals_blurb"/> -->
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> -->
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/related_info"/> -->
|
||||
|
||||
<p>
|
||||
<b>Related information:</b> <xref href="impala_literals.xml#boolean_literals"/>,
|
||||
<xref href="impala_operators.xml#operators"/>,
|
||||
<xref href="impala_conditional_functions.xml#conditional_functions"/>
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
256
docs/topics/impala_breakpad.xml
Normal file
@@ -0,0 +1,256 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="breakpad" rev="2.6.0 IMPALA-2686 CDH-40238">
|
||||
|
||||
<title>Breakpad Minidumps for Impala (<keyword keyref="impala26"/> or higher only)</title>
|
||||
<titlealts audience="PDF"><navtitle>Breakpad Minidumps</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Troubleshooting"/>
|
||||
<data name="Category" value="Support"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p rev="2.6.0 IMPALA-2686 CDH-40238">
|
||||
The <xref href="https://chromium.googlesource.com/breakpad/breakpad/" scope="external" format="html">breakpad</xref>
|
||||
project is an open-source framework for crash reporting.
|
||||
In <keyword keyref="impala26_full"/> and higher, Impala can use <codeph>breakpad</codeph> to record stack information and
|
||||
register values when any of the Impala-related daemons crash due to an error such as <codeph>SIGSEGV</codeph>
|
||||
or unhandled exceptions.
|
||||
The dump files are much smaller than traditional core dump files. The dump mechanism itself uses very little
|
||||
memory, which improves reliability if the crash occurs while the system is low on memory.
|
||||
</p>
|
||||
|
||||
<note type="important">
|
||||
Because of the internal mechanisms involving Impala memory allocation and Linux
|
||||
signalling for out-of-memory (OOM) errors, if an Impala-related daemon experiences a
|
||||
crash due to an OOM condition, it does <i>not</i> generate a minidump for that error.
|
||||
|
||||
</note>
|
||||
|
||||
|
||||
<p outputclass="toc inpage" audience="PDF"/>
|
||||
|
||||
</conbody>
|
||||
|
||||
<concept id="breakpad_minidump_enable">
|
||||
<title>Enabling or Disabling Minidump Generation</title>
|
||||
<conbody>
|
||||
<p>
|
||||
By default, a minidump file is generated when an Impala-related daemon crashes.
|
||||
To turn off generation of the minidump files, change the
|
||||
<uicontrol>minidump_path</uicontrol> configuration setting of one or more Impala-related daemons
|
||||
to the empty string, and restart the corresponding services or daemons.
|
||||
</p>
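<p>
For example, on a system where you edit the daemon startup options directly rather than
through a management console, turning off minidump generation might look like the following
illustrative flag setting:
</p>
<codeblock>-minidump_path=
</codeblock>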
|
||||
|
||||
<p rev="IMPALA-3677 CDH-43745">
|
||||
In <keyword keyref="impala27_full"/> and higher,
|
||||
you can send a <codeph>SIGUSR1</codeph> signal to any Impala-related daemon to write a
|
||||
Breakpad minidump. For advanced troubleshooting, you can now produce a minidump
|
||||
without triggering a crash.
|
||||
</p>
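<p>
For example, the following shell commands sketch how you might trigger a minidump from a
running <cmdname>impalad</cmdname> without crashing it. The <cmdname>pgrep</cmdname> pattern
is illustrative; adjust it for your environment.
</p>
<codeblock># Find the process ID of the running impalad, then ask it to write a minidump.
kill -s SIGUSR1 $(pgrep -x impalad)
</codeblock>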
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="breakpad_minidump_location" rev="IMPALA-3581">
|
||||
<title>Specifying the Location for Minidump Files</title>
|
||||
<conbody>
|
||||
<p>
|
||||
By default, all minidump files are written to the following location
|
||||
on the host where a crash occurs:
|
||||
<!-- Location stated in IMPALA-3581; overridden by different location from IMPALA-2686?
|
||||
<filepath><varname>log_directory</varname>/minidumps/<varname>daemon_name</varname></filepath> -->
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
Clusters managed by Cloudera Manager: <filepath>/var/log/impala-minidumps/<varname>daemon_name</varname></filepath>
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
Clusters not managed by Cloudera Manager:
|
||||
<filepath><varname>impala_log_dir</varname>/<varname>daemon_name</varname>/minidumps/<varname>daemon_name</varname></filepath>
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
The minidump files for <cmdname>impalad</cmdname>, <cmdname>catalogd</cmdname>,
|
||||
and <cmdname>statestored</cmdname> are each written to a separate directory.
|
||||
</p>
|
||||
<p>
|
||||
To specify a different location, set the
|
||||
<!-- Again, IMPALA-3581 says one thing and IMPALA-2686 / observation of CM interface says another.
|
||||
<codeph>log_dir</codeph> -->
|
||||
<uicontrol>minidump_path</uicontrol>
|
||||
configuration setting of one or more Impala-related daemons, and restart the corresponding services or daemons.
|
||||
</p>
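<p>
For example, pointing all minidumps at a dedicated volume might look like the following
illustrative flag setting (the path is hypothetical):
</p>
<codeblock>-minidump_path=/data0/impala-minidumps
</codeblock>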
|
||||
<p>
|
||||
If you specify a relative path for this setting, the value is interpreted relative to
|
||||
the default <uicontrol>minidump_path</uicontrol> directory.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="breakpad_minidump_number">
|
||||
<title>Controlling the Number of Minidump Files</title>
|
||||
<conbody>
|
||||
<p>
|
||||
As with any files used for logging or troubleshooting, consider limiting the number of
|
||||
minidump files, or removing unneeded ones, depending on the amount of free storage
|
||||
space on the hosts in the cluster.
|
||||
</p>
|
||||
<p>
|
||||
Because the minidump files are only used for problem resolution, you can remove any such files that
|
||||
are not needed to debug current issues.
|
||||
</p>
|
||||
<p>
|
||||
To control how many minidump files Impala keeps around at any one time,
|
||||
set the <uicontrol>max_minidumps</uicontrol> configuration setting
of one or more Impala-related daemons, and restart the corresponding services or daemons.
|
||||
The default for this setting is 9. A zero or negative value is interpreted as
|
||||
<q>unlimited</q>.
|
||||
</p>
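<p>
For example, keeping the 30 most recent minidump files per daemon might look like the
following illustrative flag setting:
</p>
<codeblock>-max_minidumps=30
</codeblock>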
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="breakpad_minidump_logging">
|
||||
<title>Detecting Crash Events</title>
|
||||
<conbody>
|
||||
<p>
|
||||
You can see in the Impala log files, or in the Cloudera Manager charts for Impala,
when a crash event occurs that generates a minidump file. Because each restart begins
|
||||
a new log file, the <q>crashed</q> message is always at or near the bottom of the
|
||||
log file. (There might be another later message if core dumps are also enabled.)
|
||||
</p>
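<p>
For example, a quick way to scan for these events from a shell might look like the following;
the log path and message text are illustrative, based on the demonstration later in this topic.
</p>
<codeblock># List the impalad .INFO log files that record a minidump being written.
grep -l "Wrote minidump" /var/log/impalad/*.INFO.*
</codeblock>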
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="breakpad_support_process" rev="CDH-39818">
|
||||
<title>Using the Minidump Files for Problem Resolution</title>
|
||||
<conbody>
|
||||
<p>
|
||||
Typically, you provide minidump files to <keyword keyref="support_org"/> as part of problem resolution,
|
||||
in the same way that you might provide a core dump. The <uicontrol>Send Diagnostic Data</uicontrol> item
under the <uicontrol>Support</uicontrol> menu in Cloudera Manager guides you through the
|
||||
process of selecting a time period and volume of diagnostic data, then collects the data
|
||||
from all hosts and transmits the relevant information for you.
|
||||
</p>
|
||||
<fig id="fig_pqw_gvx_pr">
|
||||
<title>Send Diagnostic Data choice under Support menu</title>
|
||||
<image href="../images/support_send_diagnostic_data.png" scalefit="yes" placement="break"/>
|
||||
</fig>
|
||||
<p>
|
||||
You might get additional instructions from <keyword keyref="support_org"/> about collecting minidumps to better isolate a specific problem.
|
||||
Because the information in the minidump files is limited to stack traces and register contents,
|
||||
the possibility of including sensitive information is much lower than with core dump files.
|
||||
If any sensitive information is included in the minidump, <keyword keyref="support_org"/> preserves the confidentiality of that information.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="breakpad_demo">
|
||||
<title>Demonstration of Breakpad Feature</title>
|
||||
<conbody>
|
||||
<p>
|
||||
The following example uses the command <cmdname>kill -11</cmdname> to
|
||||
simulate a <codeph>SIGSEGV</codeph> crash for an <cmdname>impalad</cmdname>
|
||||
process on a single DataNode, then examines the relevant log files and minidump file.
|
||||
</p>
|
||||
<p>
|
||||
First, as root on a worker node, we kill the <cmdname>impalad</cmdname> process with a
|
||||
<codeph>SIGSEGV</codeph> error. The original process ID was 23114. (Cloudera Manager
|
||||
restarts the process with a new pid, as shown by the second <cmdname>ps</cmdname> command.)
|
||||
</p>
|
||||
<codeblock><![CDATA[
|
||||
# ps ax | grep impalad
|
||||
23114 ? Sl 0:18 /opt/cloudera/parcels/<parcel_version>/lib/impala/sbin-retail/impalad --flagfile=/var/run/cloudera-scm-agent/process/114-impala-IMPALAD/impala-conf/impalad_flags
|
||||
31259 pts/0 S+ 0:00 grep impalad
|
||||
#
|
||||
# kill -11 23114
|
||||
#
|
||||
# ps ax | grep impalad
|
||||
31374 ? Rl 0:04 /opt/cloudera/parcels/<parcel_version>/lib/impala/sbin-retail/impalad --flagfile=/var/run/cloudera-scm-agent/process/114-impala-IMPALAD/impala-conf/impalad_flags
|
||||
31475 pts/0 S+ 0:00 grep impalad
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
We locate the log directory underneath <filepath>/var/log</filepath>.
|
||||
There is a <codeph>.INFO</codeph>, <codeph>.WARNING</codeph>, and <codeph>.ERROR</codeph>
|
||||
log file for the 23114 process ID. The minidump message is written to the
|
||||
<codeph>.INFO</codeph> file and the <codeph>.ERROR</codeph> file, but not the
|
||||
<codeph>.WARNING</codeph> file. In this case, a large core file was also produced.
|
||||
</p>
|
||||
<codeblock><![CDATA[
|
||||
# cd /var/log/impalad
|
||||
# ls -la | grep 23114
|
||||
-rw------- 1 impala impala 3539079168 Jun 23 15:20 core.23114
|
||||
-rw-r--r-- 1 impala impala 99057 Jun 23 15:20 hs_err_pid23114.log
|
||||
-rw-r--r-- 1 impala impala 351 Jun 23 15:20 impalad.worker_node_123.impala.log.ERROR.20160623-140343.23114
|
||||
-rw-r--r-- 1 impala impala 29101 Jun 23 15:20 impalad.worker_node_123.impala.log.INFO.20160623-140343.23114
|
||||
-rw-r--r-- 1 impala impala 228 Jun 23 14:03 impalad.worker_node_123.impala.log.WARNING.20160623-140343.23114
|
||||
]]>
|
||||
</codeblock>
|
||||
<p>
|
||||
The <codeph>.INFO</codeph> log includes the location of the minidump file, followed by
|
||||
a report of a core dump. With the breakpad minidump feature enabled, we might now
|
||||
disable core dumps or keep fewer of them around.
|
||||
</p>
|
||||
<codeblock><![CDATA[
|
||||
# cat impalad.worker_node_123.impala.log.INFO.20160623-140343.23114
|
||||
...
|
||||
Wrote minidump to /var/log/impala-minidumps/impalad/0980da2d-a905-01e1-25ff883a-04ee027a.dmp
|
||||
#
|
||||
# A fatal error has been detected by the Java Runtime Environment:
|
||||
#
|
||||
# SIGSEGV (0xb) at pc=0x00000030c0e0b68a, pid=23114, tid=139869541455968
|
||||
#
|
||||
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
|
||||
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
|
||||
# Problematic frame:
|
||||
# C [libpthread.so.0+0xb68a] pthread_cond_wait+0xca
|
||||
#
|
||||
# Core dump written. Default location: /var/log/impalad/core or core.23114
|
||||
#
|
||||
# An error report file with more information is saved as:
|
||||
# /var/log/impalad/hs_err_pid23114.log
|
||||
#
|
||||
# If you would like to submit a bug report, please visit:
|
||||
# http://bugreport.sun.com/bugreport/crash.jsp
|
||||
# The crash happened outside the Java Virtual Machine in native code.
|
||||
# See problematic frame for where to report the bug.
|
||||
...
|
||||
|
||||
# cat impalad.worker_node_123.impala.log.ERROR.20160623-140343.23114
|
||||
|
||||
Log file created at: 2016/06/23 14:03:43
|
||||
Running on machine: worker_node_123
|
||||
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
|
||||
E0623 14:03:43.911002 23114 logging.cc:118] stderr will be logged to this file.
|
||||
Wrote minidump to /var/log/impala-minidumps/impalad/0980da2d-a905-01e1-25ff883a-04ee027a.dmp
|
||||
]]>
|
||||
</codeblock>
|
||||
<p>
|
||||
The resulting minidump file is much smaller than the corresponding core file,
|
||||
making it much easier to supply diagnostic information to <keyword keyref="support_org"/>.
|
||||
The transmission process for the minidump files is automated through Cloudera Manager.
|
||||
</p>
|
||||
<codeblock><![CDATA[
|
||||
# pwd
|
||||
/var/log/impalad
|
||||
# cd ../impala-minidumps/impalad
|
||||
# ls
|
||||
0980da2d-a905-01e1-25ff883a-04ee027a.dmp
|
||||
# du -kh *
|
||||
2.4M 0980da2d-a905-01e1-25ff883a-04ee027a.dmp
|
||||
]]>
|
||||
</codeblock>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
</concept>
|
||||
25
docs/topics/impala_cdh.xml
Normal file
@@ -0,0 +1,25 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="impala_cdh">
|
||||
|
||||
<title>How Impala Works with CDH</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Concepts"/>
|
||||
<data name="Category" value="CDH"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/impala_overview_diagram"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/component_list"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/query_overview"/>
|
||||
</conbody>
|
||||
</concept>
|
||||
278
docs/topics/impala_char.xml
Normal file
@@ -0,0 +1,278 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="char" rev="2.0.0">
|
||||
|
||||
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
|
||||
<titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Impala Data Types"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Schemas"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p rev="2.0.0">
|
||||
<indexterm audience="Cloudera">CHAR data type</indexterm>
|
||||
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
|
||||
values are longer than the specified length, Impala truncates any trailing characters.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
|
||||
<p>
|
||||
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
|
||||
</p>
|
||||
|
||||
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
|
||||
|
||||
<p>
|
||||
The maximum length you can specify is 255.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<b>Semantics of trailing spaces:</b>
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
|
||||
the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
|
||||
the column definition.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
|
||||
not stored in the data file. When the value is retrieved by a query, the result could have a different
|
||||
number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
|
||||
specified length of the column.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
|
||||
values are considered identical.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/partitioning_bad"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/hbase_no"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/parquet_blurb"/>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
This type can be read from and written to Parquet files.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
There is no requirement for a particular level of Parquet.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Parquet files generated by Impala and containing this type can be freely interchanged with other components
|
||||
such as Hive and MapReduce.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Parquet data files might contain values that are longer than allowed by the
|
||||
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
|
||||
it processes those values during a query.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/text_blurb"/>
|
||||
|
||||
<p>
|
||||
Text data files might contain values that are longer than allowed for a particular
|
||||
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
|
||||
processes those values during a query. Text data files can also contain values that are shorter than the
|
||||
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
|
||||
files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
|
||||
<codeph>CHAR</codeph> columns.
|
||||
</p>
|
||||
|
||||
<p><b>Avro considerations:</b></p>
|
||||
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
|
||||
|
||||
<p>
|
||||
This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are
|
||||
no compatibility issues with other components when exchanging data files or running Impala on CDH 4.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Some other database systems make the length specification optional. For Impala, the length is required.
|
||||
</p>
|
||||
|
||||
<!--
|
||||
<p>
|
||||
The Impala maximum length is larger than for the <codeph>CHAR</codeph> data type in Hive.
|
||||
If a Hive query encounters a <codeph>CHAR</codeph> value longer than 255 during processing,
|
||||
it silently treats the value as length 255.
|
||||
</p>
|
||||
-->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/internals_max_bytes"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
|
||||
|
||||
<!-- Seems like a logical design decision but don't think it's currently implemented like this.
|
||||
<p>
|
||||
Because both the maximum and average length are always known and always the same for
|
||||
any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
|
||||
in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
|
||||
<codeph>COMPUTE STATS</codeph> on the table.
|
||||
</p>
|
||||
-->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<p>
|
||||
These examples show how trailing spaces are not considered significant when comparing or processing
|
||||
<codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
|
||||
length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
|
||||
spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
|
||||
including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
|
||||
trailing spaces.
|
||||
</p>
|
||||
|
||||
<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
|
||||
+--------------------------+
|
||||
| unpadded equal to padded |
|
||||
+--------------------------+
|
||||
| true |
|
||||
+--------------------------+
|
||||
|
||||
create table char_length(c char(3));
|
||||
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
|
||||
select concat("[",c,"]") as c, length(c) from char_length;
|
||||
+-------+-----------+
|
||||
| c | length(c) |
|
||||
+-------+-----------+
|
||||
| [1 ] | 3 |
|
||||
| [12 ] | 3 |
|
||||
| [123] | 3 |
|
||||
| [123] | 3 |
|
||||
+-------+-----------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
|
||||
is a logical data type to use.
|
||||
<!--
|
||||
Because all the <codeph>CHAR</codeph> values have a constant predictable length,
|
||||
Impala can efficiently analyze how best to use these values in join queries,
|
||||
aggregation queries, and other contexts where column length is significant.
|
||||
-->
|
||||
</p>
|
||||
|
||||
<codeblock>create table addresses
|
||||
(id bigint,
|
||||
street_name string,
|
||||
state_abbreviation char(2),
|
||||
country_abbreviation char(2));
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
The following example shows how values written by Impala do not physically include the trailing spaces. It
|
||||
creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
|
||||
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
|
||||
same behavior applies to binary-format Parquet data files.
|
||||
</p>
|
||||
|
||||
<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
|
||||
row format delimited fields terminated by ',';
|
||||
|
||||
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
|
||||
|
||||
-- Running this Linux command inside impala-shell using the ! shortcut.
|
||||
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
|
||||
foo,bar,baz
|
||||
hello,goodbye,aloha
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
The following example further illustrates the treatment of spaces. It replaces the contents of the previous
|
||||
table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
|
||||
within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
|
||||
leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
|
||||
</p>
|
||||
|
||||
<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
|
||||
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
|
||||
trailing, leading and trailing, leading
|
||||
|
||||
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
|
||||
+------------------------+----------------------------------+--------------------------------------------+
|
||||
| a | b | c |
|
||||
+------------------------+----------------------------------+--------------------------------------------+
|
||||
| [trailing ] | [ leading and trailing ] | [ leading ] |
|
||||
+------------------------+----------------------------------+--------------------------------------------+
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
|
||||
|
||||
<p>
|
||||
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
|
||||
scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
|
||||
that column.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
|
||||
|
||||
<p>
|
||||
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
|
||||
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
|
||||
first, with trailing spaces preserved.
|
||||
</p>
|
||||
|
||||
<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
|
||||
+----------------------+
|
||||
| char equal to string |
|
||||
+----------------------+
|
||||
| false |
|
||||
+----------------------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
This behavior differs from other popular database systems. To get the expected result of
|
||||
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
|
||||
length:
|
||||
</p>
|
||||
|
||||
<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
|
||||
+----------------------+
|
||||
| char equal to string |
|
||||
+----------------------+
|
||||
| true |
|
||||
+----------------------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
This behavior is subject to change in future releases.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/related_info"/>
|
||||
|
||||
<p>
|
||||
<xref href="impala_string.xml#string"/>, <xref href="impala_varchar.xml#varchar"/>,
|
||||
<xref href="impala_literals.xml#string_literals"/>,
|
||||
<xref href="impala_string_functions.xml#string_functions"/>
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
353
docs/topics/impala_cluster_sizing.xml
Normal file
@@ -0,0 +1,353 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="cluster_sizing">
|
||||
|
||||
<title>Cluster Sizing Guidelines for Impala</title>
|
||||
<titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Clusters"/>
|
||||
<data name="Category" value="Planning"/>
|
||||
<data name="Category" value="Sizing"/>
|
||||
<data name="Category" value="Deploying"/>
|
||||
<!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
|
||||
<data name="Category" value="Sectionated Pages"/>
|
||||
<data name="Category" value="Memory"/>
|
||||
<data name="Category" value="Scalability"/>
|
||||
<data name="Category" value="Proof of Concept"/>
|
||||
<data name="Category" value="Requirements"/>
|
||||
<data name="Category" value="Guidelines"/>
|
||||
<data name="Category" value="Best Practices"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
<indexterm audience="Cloudera">cluster sizing</indexterm>
|
||||
This document provides a very rough guideline to estimate the size of a cluster needed for a specific
|
||||
customer application. You can use this information when planning how much and what type of hardware to
|
||||
acquire for a new cluster, or when adding Impala workloads to an existing cluster.
|
||||
</p>
|
||||
|
||||
<note>
|
||||
Before making purchase or deployment decisions, consult your Cloudera representative to verify the
|
||||
conclusions about hardware requirements based on your data volume and workload.
|
||||
</note>
|
||||
|
||||
<!-- <p outputclass="toc inpage"/> -->
|
||||
|
||||
<p>
|
||||
Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
|
||||
Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
|
||||
Because work can be distributed in different ways for different queries, if some hosts are overloaded
|
||||
compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
|
||||
and overall slowness.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM,
|
||||
12 x 2 TB hard drives, 2 x E5-2630L CPUs with 12 cores in total, and a 10 Gb network), the following table estimates the number of
|
||||
DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads
|
||||
similar to TPC-DS benchmark queries:
|
||||
</p>
|
||||
|
||||
<table>
|
||||
<title>Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</title>
|
||||
<tgroup cols="6">
|
||||
<colspec colnum="1" colname="col1"/>
|
||||
<colspec colnum="2" colname="col2"/>
|
||||
<colspec colnum="3" colname="col3"/>
|
||||
<colspec colnum="4" colname="col4"/>
|
||||
<colspec colnum="5" colname="col5"/>
|
||||
<colspec colnum="6" colname="col6"/>
|
||||
<thead>
|
||||
<row>
|
||||
<entry>
|
||||
Data Size
|
||||
</entry>
|
||||
<entry>
|
||||
1 query
|
||||
</entry>
|
||||
<entry>
|
||||
10 queries
|
||||
</entry>
|
||||
<entry>
|
||||
100 queries
|
||||
</entry>
|
||||
<entry>
|
||||
1000 queries
|
||||
</entry>
|
||||
<entry>
|
||||
2000 queries
|
||||
</entry>
|
||||
</row>
|
||||
</thead>
|
||||
<tbody>
|
||||
<row>
|
||||
<entry>
|
||||
<b>250 GB</b>
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
5
|
||||
</entry>
|
||||
<entry>
|
||||
35
|
||||
</entry>
|
||||
<entry>
|
||||
70
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<b>500 GB</b>
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
10
|
||||
</entry>
|
||||
<entry>
|
||||
70
|
||||
</entry>
|
||||
<entry>
|
||||
135
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<b>1 TB</b>
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
15
|
||||
</entry>
|
||||
<entry>
|
||||
135
|
||||
</entry>
|
||||
<entry>
|
||||
270
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<b>15 TB</b>
|
||||
</entry>
|
||||
<entry>
|
||||
2
|
||||
</entry>
|
||||
<entry>
|
||||
20
|
||||
</entry>
|
||||
<entry>
|
||||
200
|
||||
</entry>
|
||||
<entry>
|
||||
N/A
|
||||
</entry>
|
||||
<entry>
|
||||
N/A
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<b>30 TB</b>
|
||||
</entry>
|
||||
<entry>
|
||||
4
|
||||
</entry>
|
||||
<entry>
|
||||
40
|
||||
</entry>
|
||||
<entry>
|
||||
400
|
||||
</entry>
|
||||
<entry>
|
||||
N/A
|
||||
</entry>
|
||||
<entry>
|
||||
N/A
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<b>60 TB</b>
|
||||
</entry>
|
||||
<entry>
|
||||
8
|
||||
</entry>
|
||||
<entry>
|
||||
80
|
||||
</entry>
|
||||
<entry>
|
||||
800
|
||||
</entry>
|
||||
<entry>
|
||||
N/A
|
||||
</entry>
|
||||
<entry>
|
||||
N/A
|
||||
</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
<section id="sizing_factors">
|
||||
|
||||
<title>Factors Affecting Scalability</title>
|
||||
|
||||
<p>
|
||||
A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
|
||||
node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
|
||||
cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
|
||||
memory.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If the workload is already network bound (on a 10 Gb network), increasing the cluster size won’t reduce
|
||||
the network load; in fact, a larger cluster could increase network traffic because some queries involve
|
||||
<q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
|
||||
throughput in a network-constrained environment.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
|
||||
concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor
|
||||
network is saturated yet. This can happen because currently Impala uses only a single core per node to
|
||||
process join and aggregation queries. For a node with 128 GB of RAM, if a join consumes 50 GB of that memory, the system
|
||||
cannot run more than 2 such queries at the same time.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound
|
||||
workload. It’s just that the CPU will not be saturated. Per-node throughput will be lower than 1.6
|
||||
GB/sec. Consider increasing the memory per node.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the
|
||||
throughput estimate.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section id="sizing_details">
|
||||
|
||||
<title>A More Precise Approach</title>
|
||||
|
||||
<p>
|
||||
A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
|
||||
size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
|
||||
data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:
|
||||
</p>
|
||||
|
||||
<codeblock>Eq 1: N > QPM * D / 100 GB
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
|
||||
required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can
|
||||
estimate the number of nodes using our equation above.
|
||||
</p>
|
||||
|
||||
<codeblock>N > QPM * D / 100GB
|
||||
N > 400 * 50GB / 100GB
|
||||
N > 200
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Depending on the complexity of the query, the processing rate might change. If the query has more
joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans and
filtering on numbers, the processing rate can be higher.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section id="sizing_mem_estimate">
|
||||
|
||||
<title>Estimating Memory Requirements</title>
|
||||
<!--
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Memory"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
-->
|
||||
|
||||
<p>
|
||||
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
|
||||
joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
|
||||
STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
|
||||
below to calculate the minimum memory requirement.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Suppose you are running the following join:
|
||||
</p>
|
||||
|
||||
<codeblock>select a.*, b.col_1, b.col_2, … b.col_n
|
||||
from a, b
|
||||
where a.key = b.key
|
||||
and b.col_1 in (1,2,4...)
|
||||
and b.col_4 in (....);
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The memory requirement for the query is that the right-hand table (<codeph>B</codeph>), after decompression,
filtering (<codeph>b.col_n in ...</codeph>), and projection (keeping only the columns used), must be smaller
than the total memory of the entire cluster.
|
||||
</p>
|
||||
|
||||
<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
|
||||
selectivity factor from the predicate *
|
||||
projection factor * compression ratio
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
|
||||
predicate on <codeph>B</codeph> (<codeph>b.col_1 in ...and b.col_4 in ...</codeph>) will select only 10% of
|
||||
the rows from <codeph>B</codeph> and for projection, we are only projecting 5 columns out of 200 columns.
|
||||
Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor.
|
||||
</p>
|
||||
|
||||
<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
|
||||
selectivity factor from the predicate *
|
||||
projection factor * compression ratio
|
||||
= 100TB * 10% * 5/200 * 3
|
||||
= 0.75TB
|
||||
= 750GB
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
So, if you have a 10-node cluster, each node has 128 GB of RAM and you give 80% to Impala, then you have 1
|
||||
TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle join queries
|
||||
of this magnitude.
|
||||
</p>
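<p>
Spelling out the arithmetic behind that conclusion:
</p>
<codeblock>Usable memory = 10 nodes * 128 GB * 80% = 1024 GB (about 1 TB)
1024 GB > 750 GB, so the join fits in cluster memory
</codeblock>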
|
||||
</section>
|
||||
</conbody>
|
||||
</concept>
|
||||
56
docs/topics/impala_cm_installation.xml
Normal file
@@ -0,0 +1,56 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="cm_installation">
|
||||
|
||||
<title>Installing Impala with Cloudera Manager</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Installing"/>
|
||||
<data name="Category" value="Cloudera Manager"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
Before installing Impala through the Cloudera Manager interface, make sure all applicable nodes have the
|
||||
appropriate hardware configuration and levels of operating system and CDH. See
|
||||
<xref href="impala_prereqs.xml#prereqs"/> for details.
|
||||
</p>
|
||||
|
||||
<note rev="1.2.0">
|
||||
<p rev="1.2.0">
|
||||
To install the latest Impala under CDH 4, upgrade Cloudera Manager to 4.8 or higher. Cloudera Manager 4.8 is
|
||||
the first release that can manage the Impala catalog service introduced in Impala 1.2. Cloudera Manager 4.8
|
||||
requires this service to be present, so if you upgrade to Cloudera Manager 4.8, also upgrade Impala to the
|
||||
most recent version at the same time.
|
||||
<!-- Not so relevant now for 1.1.1, but maybe someday we'll capture all this history in a compatibility grid.
|
||||
Upgrade to Cloudera Manager 4.6.2 or higher to enable Cloudera Manager to
|
||||
handle access control for the Impala web UI, available by default through
|
||||
port 25000 on each Impala host.
|
||||
-->
|
||||
</p>
|
||||
</note>
|
||||
|
||||
<p>
|
||||
For information on installing Impala in a Cloudera Manager-managed environment, see
|
||||
<xref audience="integrated" href="cm_ig_install_impala.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_impala.html" scope="external" format="html">Installing Impala</xref>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Managing your Impala installation through Cloudera Manager has a number of advantages. For example, when you
|
||||
make configuration changes to CDH components using Cloudera Manager, it automatically applies changes to the
|
||||
copies of configuration files, such as <codeph>hive-site.xml</codeph>, that Impala keeps under
|
||||
<filepath>/etc/impala/conf</filepath>. It also sets up the Hive Metastore service that is required for
|
||||
Impala running under CDH 4.1.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular
|
||||
component configuration details in some of the free-form option fields on the Impala configuration pages
|
||||
within Cloudera Manager. <ph conref="../shared/impala_common.xml#common/safety_valve"/>
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
53
docs/topics/impala_comments.xml
Normal file
@@ -0,0 +1,53 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="comments">
|
||||
|
||||
<title>Comments</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
<indexterm audience="Cloudera">comments (SQL)</indexterm>
|
||||
Impala supports the familiar styles of SQL comments:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
All text from a <codeph>--</codeph> sequence to the end of the line is considered a comment and ignored.
|
||||
This type of comment can occur on a single line by itself, or after all or part of a statement.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
All text from a <codeph>/*</codeph> sequence to the next <codeph>*/</codeph> sequence is considered a
|
||||
comment and ignored. This type of comment can stretch over multiple lines, and can occur
on one or more lines by itself, in the middle of a statement, or before or after a statement.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
For example:
|
||||
</p>
|
||||
|
||||
<codeblock>-- This line is a comment about a table.
|
||||
create table ...;
|
||||
|
||||
/*
|
||||
This is a multi-line comment about a query.
|
||||
*/
|
||||
select ...;
|
||||
|
||||
select * from t /* This is an embedded comment about a query. */ where ...;
|
||||
|
||||
select * from t -- This is a trailing comment within a multi-line command.
|
||||
where ...;
|
||||
</codeblock>
|
||||
</conbody>
|
||||
</concept>
|
||||
2737
docs/topics/impala_complex_types.xml
Normal file
File diff suppressed because it is too large
180
docs/topics/impala_components.xml
Normal file
@@ -0,0 +1,180 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="intro_components">
|
||||
|
||||
<title>Components of the Impala Server</title>
|
||||
<titlealts audience="PDF"><navtitle>Components</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Concepts"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of
|
||||
different daemon processes that run on specific hosts within your CDH cluster.
|
||||
</p>
|
||||
|
||||
<p outputclass="toc inpage"/>
|
||||
</conbody>
|
||||
|
||||
<concept id="intro_impalad">
|
||||
|
||||
<title>The Impala Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented
|
||||
by the <codeph>impalad</codeph> process. It reads and writes to data files; accepts queries transmitted
|
||||
from the <codeph>impala-shell</codeph> command, Hue, JDBC, or ODBC; parallelizes the queries and
|
||||
distributes work across the cluster; and transmits intermediate query results back to the
|
||||
central coordinator node.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the
|
||||
<term>coordinator node</term> for that query. The other nodes transmit partial results back to the
|
||||
coordinator, which constructs the final result set for a query. When running experiments with functionality
|
||||
through the <codeph>impala-shell</codeph> command, you might always connect to the same Impala daemon for
|
||||
convenience. For clusters running production workloads, you might load-balance by
|
||||
submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
|
||||
</p>

<p>
The Impala daemons are in constant communication with the <term>statestore</term>, to confirm which nodes
are healthy and can accept new work.
</p>

<p rev="1.2">
They also receive broadcast messages from the <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2)
whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an
<codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statement is processed through Impala. This
background communication minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
METADATA</codeph> statements that were needed to coordinate metadata across nodes prior to Impala 1.2.
</p>

<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>,
<xref href="impala_ports.xml#ports"/>, <xref href="impala_proxy.xml#proxy"/>
</p>
</conbody>
</concept>

<concept id="intro_statestore">

<title>The Impala Statestore</title>

<conbody>

<p>
The Impala component known as the <term>statestore</term> checks on the health of Impala daemons on all the
DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically
represented by a daemon process named <codeph>statestored</codeph>; you only need such a process on one
host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue,
or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making
requests to the unreachable node.
</p>

<p>
Because the statestore's purpose is to help when things go wrong, it is not critical to the normal
operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons
continue running and distributing work among themselves as usual; the cluster just becomes less robust if
other Impala daemons fail while the statestore is offline. When the statestore comes back online, it re-establishes
communication with the Impala daemons and resumes its monitoring function.
</p>

<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>

<p>
<b>Related information:</b>
</p>

<p>
<xref href="impala_scalability.xml#statestore_scalability"/>,
<xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>,
<xref href="impala_timeouts.xml#statestore_timeout"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>

<concept rev="1.2" id="intro_catalogd">

<title>The Impala Catalog Service</title>

<conbody>

<p>
The Impala component known as the <term>catalog service</term> relays the metadata changes from Impala SQL
statements to all the DataNodes in a cluster. It is physically represented by a daemon process named
<codeph>catalogd</codeph>; you only need such a process on one host in the cluster. Because the requests
are passed through the statestore daemon, it makes sense to run the <cmdname>statestored</cmdname> and
<cmdname>catalogd</cmdname> services on the same host.
</p>

<p>
The catalog service avoids the need to issue
<codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements when the metadata changes are
performed by statements issued through Impala. When you create a table, load data, and so on through Hive,
you do need to issue <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an Impala node
before executing a query there.
</p>
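
<p>
For example, after a Hive-side change you might run statements like the following on one Impala node before
querying there (the table names in this sketch are placeholders):
</p>

<codeblock>-- After creating a table through Hive:
invalidate metadata new_table;

-- After adding data files to an existing table through Hive or directly in HDFS:
refresh existing_table;</codeblock>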

<p>
This feature touches a number of aspects of Impala:
</p>

<!-- This was formerly a conref, but since the list of links also included a link
to this same topic, materializing the list here and removing that
circular link. (The conref is still used in Incompatible Changes.)

<ul conref="../shared/impala_common.xml#common/catalogd_xrefs">
<li/>
</ul>
-->

<ul id="catalogd_xrefs">
<li>
<p>
See <xref href="impala_install.xml#install"/>, <xref href="impala_upgrading.xml#upgrading"/>, and
<xref href="impala_processes.xml#processes"/> for usage information for the
<cmdname>catalogd</cmdname> daemon.
</p>
</li>

<li>
<p>
The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements are not needed
when the <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other table-changing or
data-changing operation is performed through Impala. These statements are still needed if such
operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the
statements only need to be issued on one Impala node rather than on all nodes. See
<xref href="impala_refresh.xml#refresh"/> and
<xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> for the latest usage information for
those statements.
</p>
</li>
</ul>

<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>

<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>

<note>
<p conref="../shared/impala_common.xml#common/catalog_server_124"/>
</note>

<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
</concept>
98
docs/topics/impala_compression_codec.xml
Normal file
@@ -0,0 +1,98 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="compression_codec">

<title>COMPRESSION_CODEC Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>COMPRESSION_CODEC</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Compression"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="Snappy"/>
<data name="Category" value="Gzip"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<!-- The initial part of this paragraph is copied straight from the #parquet_compression topic. -->

<!-- Could turn into a conref. -->

<p rev="2.0.0">
<indexterm audience="Cloudera">COMPRESSION_CODEC query option</indexterm>
When Impala writes Parquet data files using the <codeph>INSERT</codeph> statement, the underlying compression
is controlled by the <codeph>COMPRESSION_CODEC</codeph> query option.
</p>

<note>
Prior to Impala 2.0, this option was named <codeph>PARQUET_COMPRESSION_CODEC</codeph>. In Impala 2.0 and
later, the <codeph>PARQUET_COMPRESSION_CODEC</codeph> name is not recognized. Use the more general name
<codeph>COMPRESSION_CODEC</codeph> for new code.
</note>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>SET COMPRESSION_CODEC=<varname>codec_name</varname>;</codeblock>

<p>
The allowed values for this query option are <codeph>SNAPPY</codeph> (the default), <codeph>GZIP</codeph>,
and <codeph>NONE</codeph>.
</p>

<note>
A Parquet file created with <codeph>COMPRESSION_CODEC=NONE</codeph> is still typically smaller than the
original data, due to encoding schemes such as run-length encoding and dictionary encoding that are applied
separately from compression.
</note>

<p>
The option value is not case-sensitive.
</p>

<p>
If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option
setting, not just queries involving Parquet tables. (The value <codeph>BZIP2</codeph> is also recognized, but
is not compatible with Parquet tables.)
</p>

<p>
<b>Type:</b> <codeph>STRING</codeph>
</p>

<p>
<b>Default:</b> <codeph>SNAPPY</codeph>
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>set compression_codec=gzip;
insert into parquet_table_highly_compressed select * from t1;

set compression_codec=snappy;
insert into parquet_table_compression_plus_fast_queries select * from t1;

set compression_codec=none;
insert into parquet_table_no_compression select * from t1;

set compression_codec=foo;
select * from t1 limit 5;
ERROR: Invalid compression codec: foo
</codeblock>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
For information about how compressing Parquet data files affects query performance, see
<xref href="impala_parquet.xml#parquet_compression"/>.
</p>
</conbody>
</concept>
432
docs/topics/impala_compute_stats.xml
Normal file
@@ -0,0 +1,432 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2.2" id="compute_stats">

<title>COMPUTE STATS Statement</title>
<titlealts audience="PDF"><navtitle>COMPUTE STATS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">COMPUTE STATS statement</indexterm>
Gathers information about volume and distribution of data in a table and all associated columns and
partitions. The information is stored in the metastore database, and used by Impala to help optimize queries.
For example, if Impala can determine that a table is large or small, or has many or few distinct values, it
can organize and parallelize the work appropriately for a join query or insert operation. For details about the
kinds of information gathered by this statement, see <xref href="impala_perf_stats.xml#perf_stats"/>.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname>
COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)]

<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
</codeblock>

<p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
Originally, Impala relied on users to run the Hive <codeph>ANALYZE TABLE</codeph> statement, but that method
of gathering statistics proved unreliable and difficult to use. The Impala <codeph>COMPUTE STATS</codeph>
statement is built from the ground up to improve the reliability and user-friendliness of this operation.
<codeph>COMPUTE STATS</codeph> does not require any setup steps or special configuration. You only run a
single Impala <codeph>COMPUTE STATS</codeph> statement to gather both table and column statistics, rather
than separate Hive <codeph>ANALYZE TABLE</codeph> statements for each kind of statistics.
</p>

<p rev="2.1.0">
The <codeph>COMPUTE INCREMENTAL STATS</codeph> variation is a shortcut for partitioned tables that works on a
subset of partitions rather than the entire table. The incremental nature makes it suitable for large tables
with many partitions, where a full <codeph>COMPUTE STATS</codeph> operation takes too long to be practical
each time a partition is added or dropped. See <xref href="impala_perf_stats.xml#perf_stats_incremental"/>
for full usage details.
</p>

<p>
<codeph>COMPUTE INCREMENTAL STATS</codeph> only applies to partitioned tables. If you use the
<codeph>INCREMENTAL</codeph> clause for an unpartitioned table, Impala automatically uses the original
<codeph>COMPUTE STATS</codeph> statement. Such tables display <codeph>false</codeph> under the
<codeph>Incremental stats</codeph> column of the <codeph>SHOW TABLE STATS</codeph> output.
</p>
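
<p>
For example, a common pattern after adding a single partition is to gather statistics for just that
partition rather than the whole table (the table and partition names in this sketch are placeholders):
</p>

<codeblock>compute incremental stats sales partition (year=2016);</codeblock>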

<note>
Because many of the most performance-critical and resource-intensive operations rely on table and column
statistics to construct accurate and efficient plans, <codeph>COMPUTE STATS</codeph> is an important step at
the end of your ETL process. Run <codeph>COMPUTE STATS</codeph> on all tables as your first step during
performance tuning for slow queries, or troubleshooting for out-of-memory conditions:
<ul>
<li>
Accurate statistics help Impala construct an efficient query plan for join queries, improving performance
and reducing memory usage.
</li>

<li>
Accurate statistics help Impala distribute the work effectively for insert operations into Parquet
tables, improving performance and reducing memory usage.
</li>

<li rev="1.3.0">
Accurate statistics help Impala estimate the memory required for each query, which is important when you
use resource management features, such as admission control and the YARN resource management framework.
The statistics help Impala to achieve high concurrency, full utilization of available memory, and avoid
contention with workloads from other Hadoop components.
</li>
</ul>
</note>

<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

<p rev="2.3.0">
Currently, the statistics created by the <codeph>COMPUTE STATS</codeph> statement do not include
information about complex type columns. The column stats metrics for complex columns are always shown
as -1. For queries involving complex type columns, Impala uses
heuristics to estimate the data distribution within such columns.
</p>

<p conref="../shared/impala_common.xml#common/hbase_blurb"/>

<p>
<codeph>COMPUTE STATS</codeph> works for HBase tables also. The statistics gathered for HBase tables are
somewhat different than for HDFS-backed tables, but that metadata is still used for optimization when HBase
tables are involved in join queries.
</p>

<p conref="../shared/impala_common.xml#common/s3_blurb"/>

<p rev="2.2.0">
<codeph>COMPUTE STATS</codeph> also works for tables where data resides in the Amazon Simple Storage Service (S3).
See <xref href="impala_s3.xml#s3"/> for details.
</p>

<p conref="../shared/impala_common.xml#common/performance_blurb"/>

<p>
The statistics collected by <codeph>COMPUTE STATS</codeph> are used to optimize join queries,
<codeph>INSERT</codeph> operations into Parquet tables, and other resource-intensive kinds of SQL statements.
See <xref href="impala_perf_stats.xml#perf_stats"/> for details.
</p>

<p>
For large tables, the <codeph>COMPUTE STATS</codeph> statement itself might take a long time and you
might need to tune its performance. The <codeph>COMPUTE STATS</codeph> statement does not work with the
<codeph>EXPLAIN</codeph> statement, or the <codeph>SUMMARY</codeph> command in <cmdname>impala-shell</cmdname>.
You can use the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> to examine timing information
for the statement as a whole. If a basic <codeph>COMPUTE STATS</codeph> statement takes a long time for a
partitioned table, consider switching to the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax so that only
newly added partitions are analyzed each time.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
This example shows two tables, <codeph>T1</codeph> and <codeph>T2</codeph>, with a small number of distinct
values linked by a parent-child relationship between <codeph>T1.ID</codeph> and <codeph>T2.PARENT</codeph>.
<codeph>T1</codeph> is tiny, while <codeph>T2</codeph> has approximately 100K rows. Initially, the statistics
include physical measurements such as the number of files, the total size, and size measurements for
fixed-length columns such as those with the <codeph>INT</codeph> type. Unknown values are represented by -1. After
running <codeph>COMPUTE STATS</codeph> for each table, much more information is available through the
<codeph>SHOW STATS</codeph> statements. If you were running a join query involving both of these tables, you
would need statistics for both tables to get the most effective optimization for the query.
</p>

<!-- Note: chopped off any excess characters at position 87 and after,
to avoid weird wrapping in PDF.
Applies to any subsequent examples with output from SHOW ... STATS too. -->

<codeblock>[localhost:21000] > show table stats t1;
Query: show table stats t1
+-------+--------+------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+------+--------+
| -1    | 1      | 33B  | TEXT   |
+-------+--------+------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > show table stats t2;
Query: show table stats t2
+-------+--------+----------+--------+
| #Rows | #Files | Size     | Format |
+-------+--------+----------+--------+
| -1    | 28     | 960.00KB | TEXT   |
+-------+--------+----------+--------+
Returned 1 row(s) in 0.01s
[localhost:21000] > show column stats t1;
Query: show column stats t1
+--------+--------+------------------+--------+----------+----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| id     | INT    | -1               | -1     | 4        | 4        |
| s      | STRING | -1               | -1     | -1       | -1       |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 1.71s
[localhost:21000] > show column stats t2;
Query: show column stats t2
+--------+--------+------------------+--------+----------+----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| parent | INT    | -1               | -1     | 4        | 4        |
| s      | STRING | -1               | -1     | -1       | -1       |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.01s
[localhost:21000] > compute stats t1;
Query: compute stats t1
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 5.30s
[localhost:21000] > show table stats t1;
Query: show table stats t1
+-------+--------+------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+------+--------+
| 3     | 1      | 33B  | TEXT   |
+-------+--------+------+--------+
Returned 1 row(s) in 0.01s
[localhost:21000] > show column stats t1;
Query: show column stats t1
+--------+--------+------------------+--------+----------+----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| id     | INT    | 3                | -1     | 4        | 4        |
| s      | STRING | 3                | -1     | -1       | -1       |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
[localhost:21000] > compute stats t2;
Query: compute stats t2
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 5.70s
[localhost:21000] > show table stats t2;
Query: show table stats t2
+-------+--------+----------+--------+
| #Rows | #Files | Size     | Format |
+-------+--------+----------+--------+
| 98304 | 1      | 960.00KB | TEXT   |
+-------+--------+----------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] > show column stats t2;
Query: show column stats t2
+--------+--------+------------------+--------+----------+----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| parent | INT    | 3                | -1     | 4        | 4        |
| s      | STRING | 6                | -1     | 14       | 9.3      |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.01s</codeblock>

<p rev="2.1.0">
The following example shows how to use the <codeph>INCREMENTAL</codeph> clause, available in Impala 2.1.0 and
higher. The <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax lets you collect statistics for newly added or
changed partitions, without rescanning the entire table.
</p>

<codeblock>-- Initially the table has no incremental stats, as indicated
-- by -1 under #Rows and false under Incremental stats.
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books       | -1    | 1      | 223.74KB | NOT CACHED   | PARQUET | false
| Children    | -1    | 1      | 230.05KB | NOT CACHED   | PARQUET | false
| Electronics | -1    | 1      | 232.67KB | NOT CACHED   | PARQUET | false
| Home        | -1    | 1      | 232.56KB | NOT CACHED   | PARQUET | false
| Jewelry     | -1    | 1      | 223.72KB | NOT CACHED   | PARQUET | false
| Men         | -1    | 1      | 231.25KB | NOT CACHED   | PARQUET | false
| Music       | -1    | 1      | 237.90KB | NOT CACHED   | PARQUET | false
| Shoes       | -1    | 1      | 234.90KB | NOT CACHED   | PARQUET | false
| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
| Women       | -1    | 1      | 226.27KB | NOT CACHED   | PARQUET | false
| Total       | -1    | 10     | 2.25MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+------------------

-- After the first COMPUTE INCREMENTAL STATS,
-- all partitions have stats.
compute incremental stats item_partitioned;
+-------------------------------------------+
| summary                                   |
+-------------------------------------------+
| Updated 10 partition(s) and 21 column(s). |
+-------------------------------------------+
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
| Total       | 17957 | 10     | 2.25MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+------------------

-- Add a new partition...
alter table item_partitioned add partition (i_category='Camping');
-- Add or replace files in HDFS outside of Impala,
-- rendering the stats for a partition obsolete.
!import_data_into_sports_partition.sh
refresh item_partitioned;
drop incremental stats item_partitioned partition (i_category='Sports');
-- Now some partitions have incremental stats
-- and some do not.
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
| Camping     | -1    | 1      | 408.02KB | NOT CACHED   | PARQUET | false
| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
| Total       | 17957 | 11     | 2.65MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+------------------

-- After another COMPUTE INCREMENTAL STATS,
-- all partitions have incremental stats, and only the 2
-- partitions without incremental stats were scanned.
compute incremental stats item_partitioned;
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 2 partition(s) and 21 column(s). |
+------------------------------------------+
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
| Camping     | 5328  | 1      | 408.02KB | NOT CACHED   | PARQUET | true
| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
| Total       | 17957 | 11     | 2.65MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+------------------
</codeblock>

<p conref="../shared/impala_common.xml#common/file_format_blurb"/>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with tables created with any of the file formats supported
by Impala. See <xref href="impala_file_formats.xml#file_formats"/> for details about working with the
different file formats. The following considerations apply to <codeph>COMPUTE STATS</codeph> depending on the
file format of the table.
</p>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with text tables with no restrictions. These tables can be
created through either Impala or Hive.
</p>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with Parquet tables. These tables can be created through
either Impala or Hive.
</p>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with Avro tables without restriction in CDH 5.4 / Impala 2.2
and higher. In earlier releases, <codeph>COMPUTE STATS</codeph> worked only for Avro tables created through Hive,
and required the <codeph>CREATE TABLE</codeph> statement to use SQL-style column names and types rather than an
Avro-style schema specification.
</p>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with RCFile tables with no restrictions. These tables can
be created through either Impala or Hive.
</p>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with SequenceFile tables with no restrictions. These
tables can be created through either Impala or Hive.
</p>

<p>
The <codeph>COMPUTE STATS</codeph> statement works with partitioned tables, whether all the partitions use
the same file format, or some partitions are defined through <codeph>ALTER TABLE</codeph> to use different
file formats.
</p>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_maybe"/>

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<p conref="../shared/impala_common.xml#common/decimal_no_stats"/>

<note conref="../shared/impala_common.xml#common/compute_stats_nulls"/>

<p conref="../shared/impala_common.xml#common/internals_blurb"/>
<p>
Behind the scenes, the <codeph>COMPUTE STATS</codeph> statement
executes two statements: one to count the rows of each partition
in the table (or the entire table if unpartitioned) through the
<codeph>COUNT(*)</codeph> function,
and another to count the approximate number of distinct values
in each column through the <codeph>NDV()</codeph> function.
You might see these queries in your monitoring and diagnostic displays.
The same factors that affect the performance, scalability, and
execution of other queries (such as parallel execution, memory usage,
admission control, and timeouts) also apply to the queries run by the
<codeph>COMPUTE STATS</codeph> statement.
</p>
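
<p>
For example, for a simple unpartitioned table <codeph>t1</codeph> with columns <codeph>id</codeph> and
<codeph>s</codeph> (a hypothetical table used only for illustration), the internally generated statements
are similar in spirit to this sketch:
</p>

<codeblock>-- Approximate form of the queries that COMPUTE STATS runs internally:
select count(*) from t1;
select ndv(id), ndv(s) from t1;</codeblock>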

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have read
permission for all affected files in the source directory:
all the files in the table, whether partitioned or not, in the
case of <codeph>COMPUTE STATS</codeph>;
or all the files in partitions without incremental stats in
the case of <codeph>COMPUTE INCREMENTAL STATS</codeph>.
It must also have read and execute permissions for all
relevant directories holding the data files.
(Essentially, <codeph>COMPUTE STATS</codeph> requires the
same permissions as the underlying <codeph>SELECT</codeph> queries it runs
against the table.)
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_drop_stats.xml#drop_stats"/>, <xref href="impala_show.xml#show_table_stats"/>,
<xref href="impala_show.xml#show_column_stats"/>, <xref href="impala_perf_stats.xml#perf_stats"/>
</p>
</conbody>
</concept>
296
docs/topics/impala_concepts.xml
Normal file
@@ -0,0 +1,296 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="concepts">

<title>Impala Concepts and Architecture</title>
<titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Stub Pages"/>
</metadata>
</prolog>

<conbody>
<draft-comment author="-dita-use-conref-target" audience="integrated"
conref="../shared/cdh_cm_common.xml#id_dgz_rhr_kv/draft-comment-test"/>

<p>
The following sections provide background information to help you become productive using Impala and
its features. Where appropriate, the explanations include context to help understand how aspects of Impala
relate to other technologies you might already be familiar with, such as relational database management
systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase.
</p>

<p outputclass="toc"/>
</conbody>

<!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. -->

<concept id="intro_data_lifecycle" audience="Cloudera">

<title>Overview of the Data Lifecycle for Impala</title>

<conbody/>
</concept>

<concept id="intro_etl" audience="Cloudera">

<title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>

<conbody/>
</concept>

<concept id="intro_hadoop_data" audience="Cloudera">

<title>How Impala Works with Hadoop Data Files</title>

<conbody/>
</concept>

<concept id="intro_web_ui" audience="Cloudera">

<title>Overview of the Impala Web Interface</title>

<conbody/>
</concept>

<concept id="intro_bi" audience="Cloudera">

<title>Using Impala with Business Intelligence Tools</title>

<conbody/>
</concept>

<concept id="intro_ha" audience="Cloudera">

<title>Overview of Impala Availability and Fault Tolerance</title>

<conbody/>
</concept>

<!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance",
and if it should be split out into a separate file, and then take out the audience= attribute
to make it visible.
-->

<concept id="intro_llvm" audience="Cloudera">

<title>Overview of Impala Runtime Code Generation</title>

<conbody>

<!-- Adapted from the CIDR15 paper written by the Impala team. -->

<p>
Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time
(JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation
technique improves query execution times by generating native code optimized for the architecture of each
host in your particular cluster. Performance gains of 5 times or more are typical for representative
workloads.
</p>

<p>
Impala uses runtime code generation to produce query-specific versions of functions that are critical to
performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
parse each record into Impala’s in-memory tuple format. For queries scanning large tables, billions of
records could result in billions of function calls. This function must therefore be extremely efficient for
good query performance, and removing even a few instructions from each function call can result in large
query speedups.
</p>

<p>
Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
Inlining is especially valuable for functions used internally to evaluate expressions, where the function
call itself is more expensive than the function body (for example, a function that adds two numbers).
Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
optimizations such as subexpression elimination across expressions.
</p>

<p>
Impala generates runtime query code automatically, so you do not need to do anything special to get this
performance benefit. This technique is most effective for complex and long-running queries that process
large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
feature to avoid the overhead of compilation time for each query. In this case, issue the statement
<codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
current session.
</p>
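
<p>
A minimal sketch of toggling the option within a session (the queries themselves are placeholders):
</p>

<codeblock>-- Skip the codegen overhead for a series of short queries:
set disable_codegen=true;
select ...;

-- Re-enable codegen before running longer, more complex queries:
set disable_codegen=false;</codeblock>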

<!--
<p>
Without code generation,
functions tend to be suboptimal
to handle situations that cannot be predicted in advance.
For example,
a record-parsing function that
only handles integer types will be faster at parsing an integer-only file
than a function that handles other data types
such as strings and floating-point numbers.
However, the schemas of the files to
be scanned are unknown at compile time,
and so a general-purpose function must be used, even if at runtime
it is known that more limited functionality is sufficient.
</p>

<p>
A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance
penalty, particularly when the called function is very simple, as the calls cannot be inlined.
If the type of the object instance is known at runtime, we can use code generation to replace the virtual
function call with a call directly to the correct function, which can then be inlined. This is especially
valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a
tree of individual operators and functions.
</p>

<p>
Each type of expression that can appear in a query is implemented internally by overriding a virtual function.
Many of these expression functions are quite simple, for example, adding two numbers.
The virtual function call can be more expensive than the function body itself. By resolving the virtual
function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions
directly with no function call overhead. Inlining functions also increases
instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression
elimination across expressions.
</p>
-->
</conbody>
</concept>

<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

<concept audience="Cloudera" id="intro_io">

<title>Overview of Impala I/O</title>

<conbody>

<p>
Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform
data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called
<term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala
can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all
available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/sec.
Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed,
and saves CPU cycles as there is no need to copy or checksum data blocks within memory.
</p>

<p>
The I/O manager component interfaces with storage devices to read and write data. The I/O manager assigns a
fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per
SSD), providing an asynchronous interface to clients (<term>scanner threads</term>).
</p>
</conbody>
</concept>

<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

<!-- Although good idea to get some answers from Henry first. -->

<concept audience="Cloudera" id="intro_state_distribution">

<title>State distribution</title>

<conbody>

<p>
As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize
its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept
and execute queries, and thus each node needs up-to-date versions of the system catalog and knowledge of
which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and
remote procedure calls to retrieve metadata during query planning, Impala implements a simple
publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of
subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes).
</p>

<p>
The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>,
<varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term>, where
<varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a
64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the
contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not
persisted across service restarts. Processes that receive updates to any topic are called
<term>subscribers</term>, and express their interest by registering with the statestore at startup and
providing a list of topics. The statestore responds to registration by sending the subscriber an initial
topic update for each registered topic, which consists of all the entries currently in that topic.
</p>

<!-- Henry: OK, but in practice, what is in these topic messages for Impala? -->

<p>
After registration, the statestore periodically sends two kinds of messages to each subscriber. The first
kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries,
and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a
per-topic most-recent-version identifier which allows the statestore to only send the delta between
updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its
subscribed topics. Those changes are guaranteed to have been applied by the time the next update is
received.
</p>

<p>
The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called
<term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each
subscriber, which would otherwise time out its subscription and attempt to re-register.
</p>

<p>
Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these
messages could be very large in instances with thousands of tables, partitions, data files, and so on,
Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted
and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala
nodes become unavailable.
</p>

<p>
If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it
stops sending updates to that node.
<!-- Henry: what are examples of these transient topic entries? -->
Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are
removed.
</p>

<p>
Although the asynchronous nature of this mechanism means that metadata updates might take some time to
propagate across the entire cluster, that does not affect the consistency of query planning or results.
Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of
the existence of the relevant tables, data files, and so on, it can distribute the query work to other
nodes even if those other nodes have not received the latest metadata updates.
<!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? -->
<!--
For example, query planning is performed on a single node based on the
catalog metadata topic, and once a full plan has been computed, all information required to execute that
plan is distributed directly to the executing nodes.
There is no requirement that an executing node should
know about the same version of the catalog metadata topic.
-->
</p>

<p>
We have found that the statestore process with default settings scales well to medium-sized clusters, and
can serve our largest deployments with some configuration changes.
<!-- Henry: elaborate on the configuration changes. -->
</p>

<p>
<!-- Henry: other examples like load information? How is load information used? -->
The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
its subscribers (for example, load information). Therefore, should a statestore restart, its state can be
recovered during the initial subscriber registration phase. Or if the machine that the statestore is
running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS
entry to force subscribers to automatically move to the new process instance.
<!-- Henry: translate that last sentence into instructions / guidelines. -->
</p>
</conbody>
</concept>
</concept>
443
docs/topics/impala_conditional_functions.xml
Normal file
@@ -0,0 +1,443 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="conditional_functions">

<title>Impala Conditional Functions</title>
<titlealts audience="PDF"><navtitle>Conditional Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>

<conbody>

<p>
Impala supports the following conditional functions for testing equality, comparison operators, and nullity:
</p>

<dl>
<dlentry id="case">

<dt>
<codeph>CASE a WHEN b THEN c [WHEN d THEN e]... [ELSE f] END</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">CASE expression</indexterm>
<b>Purpose:</b> Compares an expression to one or more possible values, and returns a corresponding result
when a match is found.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
In this form of the <codeph>CASE</codeph> expression, the initial value <codeph>a</codeph>
being evaluated for each row is typically a column reference, or an expression involving
a column. This form can only compare against a set of specified values, not ranges,
multi-value comparisons such as <codeph>BETWEEN</codeph> or <codeph>IN</codeph>,
regular expressions, or <codeph>NULL</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
Although this example is split across multiple lines, you can put any or all parts of a <codeph>CASE</codeph> expression
on a single line, with no punctuation or other separators between the <codeph>WHEN</codeph>,
<codeph>ELSE</codeph>, and <codeph>END</codeph> clauses.
</p>
<codeblock>select case x
    when 1 then 'one'
    when 2 then 'two'
    when 0 then 'zero'
    else 'out of range'
  end
from t1;
</codeblock>
</dd>

</dlentry>

<dlentry id="case2">

<dt>
<codeph>CASE WHEN a THEN b [WHEN c THEN d]... [ELSE e] END</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">CASE expression</indexterm>
<b>Purpose:</b> Tests whether any of a sequence of expressions is true, and returns a corresponding
result for the first true expression.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
<codeph>CASE</codeph> expressions without an initial test value have more flexibility.
For example, they can test different columns in different <codeph>WHEN</codeph> clauses,
or use comparison operators such as <codeph>BETWEEN</codeph>, <codeph>IN</codeph>, and <codeph>IS NULL</codeph>
rather than comparing against discrete values.
</p>
<p>
<codeph>CASE</codeph> expressions are often the foundation of long queries that
summarize and format results for easy-to-read reports. For example, you might
use a <codeph>CASE</codeph> expression to turn values from a numeric column
into category strings corresponding to integer values, or labels such as <q>Small</q>,
<q>Medium</q>, and <q>Large</q> based on ranges. Then subsequent parts of the
query might aggregate based on the transformed values, such as how many
values are classified as small, medium, or large. You can also use <codeph>CASE</codeph>
to signal problems with out-of-bounds values, <codeph>NULL</codeph> values,
and so on.
</p>
<p>
By using operators such as <codeph>OR</codeph>, <codeph>IN</codeph>,
<codeph>REGEXP</codeph>, and so on in <codeph>CASE</codeph> expressions,
you can build extensive tests and transformations into a single query.
Therefore, applications that construct SQL statements often rely heavily on <codeph>CASE</codeph>
calls in the generated SQL code.
</p>
<p>
Because this flexible form of the <codeph>CASE</codeph> expression allows you to perform
many comparisons and call multiple functions when evaluating each row, be careful applying
elaborate <codeph>CASE</codeph> expressions to queries that process large amounts of data.
For example, when practical, evaluate and transform values through <codeph>CASE</codeph>
after applying operations such as aggregations that reduce the size of the result set;
transform numbers to strings after performing joins with the original numeric values.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
Although this example is split across multiple lines, you can put any or all parts of a <codeph>CASE</codeph> expression
on a single line, with no punctuation or other separators between the <codeph>WHEN</codeph>,
<codeph>ELSE</codeph>, and <codeph>END</codeph> clauses.
</p>
<codeblock>select case
    when dayname(now()) in ('Saturday','Sunday') then 'result undefined on weekends'
    when x > y then 'x greater than y'
    when x = y then 'x and y are equal'
    when x is null or y is null then 'one of the columns is null'
    else null
  end
from t1;
</codeblock>
</dd>

</dlentry>

<dlentry id="coalesce">

<dt>
<codeph>coalesce(type v1, type v2, ...)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">coalesce() function</indexterm>
<b>Purpose:</b> Returns the first specified argument that is not <codeph>NULL</codeph>, or
<codeph>NULL</codeph> if all arguments are <codeph>NULL</codeph>.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
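<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following sketch uses a hypothetical <codeph>contacts</codeph> table whose columns are all
<codeph>STRING</codeph>, so the argument types match as required:
</p>
<codeblock>select name, coalesce(mobile_phone, home_phone, email, 'no contact info')
from contacts;</codeblock>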
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.0.0" id="decode">
|
||||
|
||||
<dt>
|
||||
<codeph>decode(type expression, type search1, type result1 [, type search2, type result2 ...] [, type
|
||||
default] )</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">decode() function</indexterm>
|
||||
<b>Purpose:</b> Compares an expression to one or more possible values, and returns a corresponding result
|
||||
when a match is found.
|
||||
<p conref="../shared/impala_common.xml#common/return_same_type"/>
|
||||
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
<p>
|
||||
Can be used as shorthand for a <codeph>CASE</codeph> expression.
|
||||
</p>
|
||||
<p>
|
||||
The original expression and the search expressions must of the same type or convertible types. The
|
||||
result expression can be a different type, but all result expressions must be of the same type.
|
||||
</p>
|
||||
<p>
|
||||
Returns a successful match if the original expression is <codeph>NULL</codeph> and a search expression
is also <codeph>NULL</codeph>.
|
||||
</p>
|
||||
<p>
|
||||
Returns <codeph>NULL</codeph> if the final <codeph>default</codeph> value is omitted and none of the
|
||||
search expressions match the original expression.
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p>
|
||||
The following example translates numeric day values into descriptive names:
|
||||
</p>
|
||||
<codeblock>SELECT event, decode(day_of_week, 1, "Monday", 2, "Tuesday", 3, "Wednesday",
|
||||
4, "Thursday", 5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")
|
||||
FROM calendar;
|
||||
</codeblock>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="if">
|
||||
|
||||
<dt>
|
||||
<codeph>if(boolean condition, type ifTrue, type ifFalseOrNull)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">if() function</indexterm>
|
||||
<b>Purpose:</b> Tests an expression and returns a corresponding result depending on whether the condition is
true, false, or <codeph>NULL</codeph>.
|
||||
<p>
|
||||
<b>Return type:</b> Same as the <codeph>ifTrue</codeph> argument value
|
||||
</p>
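        <p conref="../shared/impala_common.xml#common/example_blurb"/>
        <p>
          A minimal sketch, assuming a numeric column <codeph>x</codeph> in a hypothetical table
          <codeph>t1</codeph>:
        </p>
<codeblock>-- The third branch is taken when x is not greater than 0, and also when
-- x is NULL, because a NULL condition is treated the same as false.
select x, if(x > 0, 'positive', 'zero, negative, or NULL') from t1;</codeblock>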
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="1.3.0" id="ifnull">
|
||||
|
||||
<dt>
|
||||
<codeph>ifnull(type a, type ifNull)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">isnull() function</indexterm>
|
||||
<b>Purpose:</b> Alias for the <codeph>isnull()</codeph> function, with the same behavior. Use it to simplify
porting SQL with vendor extensions to Impala.
|
||||
<p conref="../shared/impala_common.xml#common/added_in_130"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="isfalse" rev="2.2.0">
|
||||
|
||||
<dt>
|
||||
<codeph>isfalse(<varname>boolean</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">isfalse() function</indexterm>
|
||||
<b>Purpose:</b> Tests if a Boolean expression is <codeph>false</codeph> or not.
|
||||
Returns <codeph>true</codeph> if so.
|
||||
If the argument is <codeph>NULL</codeph>, returns <codeph>false</codeph>.
|
||||
Identical to <codeph>isnottrue()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
|
||||
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_220"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="isnotfalse" rev="2.2.0">
|
||||
|
||||
<dt>
|
||||
<codeph>isnotfalse(<varname>boolean</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">isnotfalse() function</indexterm>
|
||||
<b>Purpose:</b> Tests if a Boolean expression is not <codeph>false</codeph> (that is, either <codeph>true</codeph> or <codeph>NULL</codeph>).
|
||||
Returns <codeph>true</codeph> if so.
|
||||
If the argument is <codeph>NULL</codeph>, returns <codeph>true</codeph>.
|
||||
Identical to <codeph>istrue()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
|
||||
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
|
||||
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_220"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="isnottrue" rev="2.2.0">
|
||||
|
||||
<dt>
|
||||
<codeph>isnottrue(<varname>boolean</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">isnottrue() function</indexterm>
|
||||
<b>Purpose:</b> Tests if a Boolean expression is not <codeph>true</codeph> (that is, either <codeph>false</codeph> or <codeph>NULL</codeph>).
|
||||
Returns <codeph>true</codeph> if so.
|
||||
If the argument is <codeph>NULL</codeph>, returns <codeph>true</codeph>.
|
||||
Identical to <codeph>isfalse()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
|
||||
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_220"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="isnull">
|
||||
|
||||
<dt>
|
||||
<codeph>isnull(type a, type ifNull)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">isnull() function</indexterm>
|
||||
<b>Purpose:</b> Tests if an expression is <codeph>NULL</codeph>, and returns the expression result value
|
||||
if not. If the first argument is <codeph>NULL</codeph>, returns the second argument.
|
||||
<p>
|
||||
<b>Compatibility notes:</b> Equivalent to the <codeph>nvl()</codeph> function from Oracle Database or
|
||||
<codeph>ifnull()</codeph> from MySQL. The <codeph>nvl()</codeph> and <codeph>ifnull()</codeph>
|
||||
functions are also available in Impala.
|
||||
</p>
|
||||
<p>
|
||||
<b>Return type:</b> Same as the first argument value
|
||||
</p>
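        <p conref="../shared/impala_common.xml#common/example_blurb"/>
        <p>
          For illustration, assuming a hypothetical <codeph>STRING</codeph> column
          <codeph>comment</codeph> that is sometimes <codeph>NULL</codeph>:
        </p>
<codeblock>-- Substitutes a fixed string wherever comment is NULL.
select isnull(comment, 'no comment') from feedback;</codeblock>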
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="istrue" rev="2.2.0">
|
||||
|
||||
<dt>
|
||||
<codeph>istrue(<varname>boolean</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">istrue() function</indexterm>
|
||||
<b>Purpose:</b> Tests if a Boolean expression is <codeph>true</codeph> or not.
|
||||
Returns <codeph>true</codeph> if so.
|
||||
If the argument is <codeph>NULL</codeph>, returns <codeph>false</codeph>.
|
||||
Identical to <codeph>isnotfalse()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
|
||||
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
|
||||
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_220"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="nonnullvalue" rev="2.2.0">
|
||||
|
||||
<dt>
|
||||
<codeph>nonnullvalue(<varname>expression</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">function</indexterm>
|
||||
<b>Purpose:</b> Tests if an expression (of any type) is <codeph>NULL</codeph> or not.
|
||||
Returns <codeph>false</codeph> if so.
|
||||
The converse of <codeph>nullvalue()</codeph>.
|
||||
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
|
||||
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_220"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="1.3.0" id="nullif">
|
||||
|
||||
<dt>
|
||||
<codeph>nullif(<varname>expr1</varname>,<varname>expr2</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">nullif() function</indexterm>
|
||||
<b>Purpose:</b> Returns <codeph>NULL</codeph> if the two specified arguments are equal. If the specified
|
||||
arguments are not equal, returns the value of <varname>expr1</varname>. The data types of the expressions
|
||||
must be compatible, according to the conversion rules from <xref href="impala_datatypes.xml#datatypes"/>.
|
||||
You cannot use an expression that evaluates to <codeph>NULL</codeph> for <varname>expr1</varname>; that
|
||||
way, you can distinguish a return value of <codeph>NULL</codeph> from an argument value of
|
||||
<codeph>NULL</codeph>, which would never match <varname>expr2</varname>.
|
||||
<p>
|
||||
<b>Usage notes:</b> This function is effectively shorthand for a <codeph>CASE</codeph> expression of
|
||||
the form:
|
||||
</p>
|
||||
<codeblock>CASE
|
||||
WHEN <varname>expr1</varname> = <varname>expr2</varname> THEN NULL
|
||||
ELSE <varname>expr1</varname>
|
||||
END</codeblock>
|
||||
<p>
|
||||
It is commonly used in division expressions, to produce a <codeph>NULL</codeph> result instead of a
|
||||
divide-by-zero error when the divisor is equal to zero:
|
||||
</p>
|
||||
<codeblock>select 1.0 / nullif(c1,0) as reciprocal from t1;</codeblock>
|
||||
<p>
|
||||
You might also use it for compatibility with other database systems that support the same
|
||||
<codeph>NULLIF()</codeph> function.
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/return_same_type"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_130"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="1.3.0" id="nullifzero">
|
||||
|
||||
<dt>
|
||||
<codeph>nullifzero(<varname>numeric_expr</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">nullifzero() function</indexterm>
|
||||
<b>Purpose:</b> Returns <codeph>NULL</codeph> if the numeric expression evaluates to 0, otherwise returns
|
||||
the result of the expression.
|
||||
<p>
|
||||
<b>Usage notes:</b> Used to avoid error conditions such as divide-by-zero in numeric calculations.
|
||||
Serves as shorthand for a more elaborate <codeph>CASE</codeph> expression, to simplify porting SQL with
|
||||
vendor extensions to Impala.
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/return_same_type"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_130"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="nullvalue" rev="2.2.0">
|
||||
|
||||
<dt>
|
||||
<codeph>nullvalue(<varname>expression</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">function</indexterm>
|
||||
<b>Purpose:</b> Tests if an expression (of any type) is <codeph>NULL</codeph> or not.
|
||||
Returns <codeph>true</codeph> if so.
|
||||
The converse of <codeph>nonnullvalue()</codeph>.
|
||||
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
|
||||
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_220"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry id="nvl" rev="1.1">
|
||||
|
||||
<dt>
|
||||
<codeph>nvl(type a, type ifNull)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">nvl() function</indexterm>
|
||||
<b>Purpose:</b> Alias for the <codeph>isnull()</codeph> function. Tests if an expression is
|
||||
<codeph>NULL</codeph>, and returns the expression result value if not. If the first argument is
|
||||
<codeph>NULL</codeph>, returns the second argument. Equivalent to the <codeph>nvl()</codeph> function
|
||||
from Oracle Database or <codeph>ifnull()</codeph> from MySQL.
|
||||
<p>
|
||||
<b>Return type:</b> Same as the first argument value
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_11"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="1.3.0" id="zeroifnull">
|
||||
|
||||
<dt>
|
||||
<codeph>zeroifnull(<varname>numeric_expr</varname>)</codeph>
|
||||
</dt>
|
||||
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">zeroifnull() function</indexterm>
|
||||
<b>Purpose:</b> Returns 0 if the numeric expression evaluates to <codeph>NULL</codeph>, otherwise returns
|
||||
the result of the expression.
|
||||
<p>
|
||||
<b>Usage notes:</b> Used to avoid unexpected results due to unexpected propagation of
|
||||
<codeph>NULL</codeph> values in numeric calculations. Serves as shorthand for a more elaborate
|
||||
<codeph>CASE</codeph> expression, to simplify porting SQL with vendor extensions to Impala.
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/return_same_type"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_130"/>
|
||||
</dd>
|
||||
|
||||
</dlentry>
|
||||
</dl>
|
||||
</conbody>
|
||||
</concept>
|
||||
57
docs/topics/impala_config.xml
Normal file
@@ -0,0 +1,57 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="config">
|
||||
|
||||
<title>Managing Impala</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
<data name="Category" value="Configuring"/>
|
||||
<data name="Category" value="JDBC"/>
|
||||
<data name="Category" value="ODBC"/>
|
||||
<data name="Category" value="Stub Pages"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
This section explains how to configure Impala to accept connections from applications that use popular
|
||||
programming APIs:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<xref href="impala_config_performance.xml#config_performance"/>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<xref href="impala_odbc.xml#impala_odbc"/>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<xref href="impala_jdbc.xml#impala_jdbc"/>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
This type of configuration is especially useful when using Impala in combination with Business Intelligence
|
||||
tools, which use these standard interfaces to query different kinds of database and Big Data systems.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can also configure these other aspects of Impala:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<xref href="impala_security.xml#security"/>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<xref href="impala_config_options.xml#config_options"/>
|
||||
</li>
|
||||
</ul>
|
||||
</conbody>
|
||||
</concept>
|
||||
593
docs/topics/impala_config_options.xml
Normal file
@@ -0,0 +1,593 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="config_options">
|
||||
|
||||
<title>Modifying Impala Startup Options</title>
|
||||
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Configuring"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
<indexterm audience="Cloudera">defaults file</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">configuration file</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">options</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">IMPALA_STATE_STORE_PORT</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">IMPALA_BACKEND_PORT</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">IMPALA_LOG_DIR</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">IMPALA_STATE_STORE_ARGS</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">IMPALA_SERVER_ARGS</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">ENABLE_CORE_DUMPS</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">core dumps</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">restarting services</indexterm>
|
||||
|
||||
<indexterm audience="Cloudera">services</indexterm>
|
||||
The configuration options for the Impala-related daemons let you choose which hosts and
|
||||
ports to use for the services that run on a single host, specify directories for logging,
|
||||
control resource usage and security, and specify other aspects of the Impala software.
|
||||
</p>
|
||||
|
||||
<p outputclass="toc inpage"/>
|
||||
|
||||
</conbody>
|
||||
|
||||
<concept id="config_options_cm">
|
||||
|
||||
<title>Configuring Impala Startup Options through Cloudera Manager</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
If you manage your cluster through Cloudera Manager, configure the settings for all the
|
||||
Impala-related daemons by navigating to this page:
|
||||
<menucascade><uicontrol>Clusters</uicontrol><uicontrol>Impala</uicontrol><uicontrol>Configuration</uicontrol><uicontrol>View
|
||||
and Edit</uicontrol></menucascade>. See the Cloudera Manager documentation for
|
||||
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_impala_service.html" scope="external" format="html">instructions
|
||||
about how to configure Impala through Cloudera Manager</xref>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If the Cloudera Manager interface does not yet have a form field for a newly added
|
||||
option, or if you need to use special options for debugging and troubleshooting, the
|
||||
<uicontrol>Advanced</uicontrol> option page for each daemon includes one or more fields
|
||||
where you can enter option names directly.
|
||||
<ph conref="../shared/impala_common.xml#common/safety_valve"/> There is also a free-form
|
||||
field for query options, on the top-level <uicontrol>Impala Daemon</uicontrol> options
|
||||
page.
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept id="config_options_noncm">
|
||||
|
||||
<title>Configuring Impala Startup Options through the Command Line</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
When you run Impala in a non-Cloudera Manager environment, the Impala server,
|
||||
statestore, and catalog services start up using values provided in a defaults file,
|
||||
<filepath>/etc/default/impala</filepath>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
This file includes information about many resources used by Impala. Most of the defaults
|
||||
included in this file should be effective in most cases. For example, typically you
|
||||
would not change the definition of the <codeph>CLASSPATH</codeph> variable, but you
|
||||
would always set the address used by the statestore server. Some of the content you
|
||||
might modify includes:
|
||||
</p>
|
||||
|
||||
<!-- Note: Update the following example for each release with the associated lines from /etc/default/impala
|
||||
from a non-CM-managed system. -->
|
||||
|
||||
<codeblock rev="ver">IMPALA_STATE_STORE_HOST=127.0.0.1
|
||||
IMPALA_STATE_STORE_PORT=24000
|
||||
IMPALA_BACKEND_PORT=22000
|
||||
IMPALA_LOG_DIR=/var/log/impala
|
||||
IMPALA_CATALOG_SERVICE_HOST=...
|
||||
IMPALA_STATE_STORE_HOST=...
|
||||
|
||||
export IMPALA_STATE_STORE_ARGS=${IMPALA_STATE_STORE_ARGS:- \
|
||||
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}}
|
||||
IMPALA_SERVER_ARGS=" \
|
||||
-log_dir=${IMPALA_LOG_DIR} \
|
||||
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
|
||||
-state_store_port=${IMPALA_STATE_STORE_PORT} \
|
||||
-use_statestore \
|
||||
-state_store_host=${IMPALA_STATE_STORE_HOST} \
|
||||
-be_port=${IMPALA_BACKEND_PORT}"
|
||||
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
|
||||
|
||||
<p>
|
||||
To use alternate values, edit the defaults file, then restart all the Impala-related
|
||||
services so that the changes take effect. Restart the Impala server using the following
|
||||
commands:
|
||||
</p>
|
||||
|
||||
<codeblock>$ sudo service impala-server restart
|
||||
Stopping Impala Server: [ OK ]
|
||||
Starting Impala Server: [ OK ]</codeblock>
|
||||
|
||||
<p>
|
||||
Restart the Impala statestore using the following commands:
|
||||
</p>
|
||||
|
||||
<codeblock>$ sudo service impala-state-store restart
|
||||
Stopping Impala State Store Server: [ OK ]
|
||||
Starting Impala State Store Server: [ OK ]</codeblock>
|
||||
|
||||
<p>
|
||||
Restart the Impala catalog service using the following commands:
|
||||
</p>
|
||||
|
||||
<codeblock>$ sudo service impala-catalog restart
|
||||
Stopping Impala Catalog Server: [ OK ]
|
||||
Starting Impala Catalog Server: [ OK ]</codeblock>
|
||||
|
||||
<p>
|
||||
Some common settings to change include:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
Statestore address. Where practical, put the statestore on a separate host not
|
||||
running the <cmdname>impalad</cmdname> daemon. In that recommended configuration,
|
||||
the <cmdname>impalad</cmdname> daemon cannot refer to the statestore server using
|
||||
the loopback address. If the statestore is hosted on a machine with an IP address of
|
||||
192.168.0.27, change:
|
||||
</p>
|
||||
<codeblock>IMPALA_STATE_STORE_HOST=127.0.0.1</codeblock>
|
||||
<p>
|
||||
to:
|
||||
</p>
|
||||
<codeblock>IMPALA_STATE_STORE_HOST=192.168.0.27</codeblock>
|
||||
</li>
|
||||
|
||||
<li rev="1.2">
|
||||
<p>
|
||||
Catalog server address (including both the hostname and the port number). Update the
|
||||
value of the <codeph>IMPALA_CATALOG_SERVICE_HOST</codeph> variable. Cloudera
|
||||
recommends the catalog server be on the same host as the statestore. In that
|
||||
recommended configuration, the <cmdname>impalad</cmdname> daemon cannot refer to the
|
||||
catalog server using the loopback address. If the catalog service is hosted on a
|
||||
machine with an IP address of 192.168.0.27, add the following line:
|
||||
</p>
|
||||
<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000</codeblock>
|
||||
<p>
|
||||
The <filepath>/etc/default/impala</filepath> defaults file currently does not define
|
||||
an <codeph>IMPALA_CATALOG_ARGS</codeph> environment variable, but if you add one it
|
||||
will be recognized by the service startup/shutdown script. Add a definition for this
|
||||
variable to <filepath>/etc/default/impala</filepath> and add the option
|
||||
<codeph>-catalog_service_host=<varname>hostname</varname></codeph>. If the port is
|
||||
different than the default 26000, also add the option
|
||||
<codeph>-catalog_service_port=<varname>port</varname></codeph>.
|
||||
</p>
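          <p>
            For example, the added lines in <filepath>/etc/default/impala</filepath> might look like
            the following sketch, where the IP address and port number are placeholders for your own
            values:
          </p>
<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27
export IMPALA_CATALOG_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
    -catalog_service_port=26000"</codeblock>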
|
||||
</li>
|
||||
|
||||
<li id="mem_limit">
|
||||
<p>
|
||||
Memory limits. You can limit the amount of memory available to Impala. For example,
|
||||
to allow Impala to use no more than 70% of system memory, change:
|
||||
</p>
|
||||
<!-- Note: also needs to be updated for each release to reflect latest /etc/default/impala. -->
|
||||
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
|
||||
-log_dir=${IMPALA_LOG_DIR} \
|
||||
-state_store_port=${IMPALA_STATE_STORE_PORT} \
|
||||
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
|
||||
-be_port=${IMPALA_BACKEND_PORT}}</codeblock>
|
||||
<p>
|
||||
to:
|
||||
</p>
|
||||
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
|
||||
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
|
||||
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
|
||||
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}</codeblock>
|
||||
<p>
|
||||
You can specify the memory limit using absolute notation such as
|
||||
<codeph>500m</codeph> or <codeph>2G</codeph>, or as a percentage of physical memory
|
||||
such as <codeph>60%</codeph>.
|
||||
</p>
|
||||
|
||||
<note>
|
||||
Queries that exceed the specified memory limit are aborted. Percentage limits are
|
||||
based on the physical memory of the machine and do not consider cgroups.
|
||||
</note>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
Core dump enablement. To enable core dumps on systems not managed by Cloudera
|
||||
Manager, change:
|
||||
</p>
|
||||
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
|
||||
<p>
|
||||
to:
|
||||
</p>
|
||||
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}</codeblock>
|
||||
<p>
|
||||
On systems managed by Cloudera Manager, enable the <uicontrol>Enable Core
|
||||
Dump</uicontrol> setting for the Impala service.
|
||||
</p>
|
||||
|
||||
<note conref="../shared/impala_common.xml#common/core_dump_considerations"/>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
Authorization using the open source Sentry plugin. Specify the
|
||||
<codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph>
|
||||
options as part of the <codeph>IMPALA_SERVER_ARGS</codeph> and
|
||||
<codeph>IMPALA_STATE_STORE_ARGS</codeph> settings to enable the core Impala support
|
||||
for authentication. See <xref href="impala_authorization.xml#secure_startup"/> for
|
||||
details.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
Auditing for successful or blocked Impala queries, another aspect of security.
|
||||
Specify the <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
|
||||
option and optionally the
|
||||
<codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph>
|
||||
and <codeph>-abort_on_failed_audit_event</codeph> options as part of the
|
||||
<codeph>IMPALA_SERVER_ARGS</codeph> settings, for each Impala node, to enable and
|
||||
customize auditing. See <xref href="impala_auditing.xml#auditing"/> for details.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
Password protection for the Impala web UI, which listens on port 25000 by default.
|
||||
This feature involves adding some or all of the
|
||||
<codeph>--webserver_password_file</codeph>,
|
||||
<codeph>--webserver_authentication_domain</codeph>, and
|
||||
<codeph>--webserver_certificate_file</codeph> options to the
|
||||
<codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph>
|
||||
settings. See <xref href="impala_security_guidelines.xml#security_guidelines"/> for
|
||||
details.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li id="default_query_options">
|
||||
<p rev="DOCS-677">
|
||||
Another setting you might add to <codeph>IMPALA_SERVER_ARGS</codeph> is a
|
||||
comma-separated list of query options and values:
|
||||
<codeblock>-default_query_options='<varname>option</varname>=<varname>value</varname>,<varname>option</varname>=<varname>value</varname>,...'
|
||||
</codeblock>
|
||||
These options control the behavior of queries performed by this
|
||||
<cmdname>impalad</cmdname> instance. The option values you specify here override the
|
||||
default values for <xref href="impala_query_options.xml#query_options">Impala query
|
||||
options</xref>, as shown by the <codeph>SET</codeph> statement in
|
||||
<cmdname>impala-shell</cmdname>.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
|
||||
<li rev="1.2">
|
||||
<p>
|
||||
Options for resource management, in conjunction with the YARN component. These options include
|
||||
<codeph>-enable_rm</codeph> and <codeph>-cgroup_hierarchy_path</codeph>.
|
||||
<ph rev="1.4.0">Additional options to help fine-tune the resource estimates are
|
||||
<codeph>-—rm_always_use_defaults</codeph>,
|
||||
<codeph>-—rm_default_memory=<varname>size</varname></codeph>, and
|
||||
<codeph>-—rm_default_cpu_cores</codeph>.</ph> For details about these options, see
|
||||
<xref href="impala_resource_management.xml#rm_options"/>. See
|
||||
<xref href="impala_resource_management.xml#resource_management"/> for information about resource
|
||||
management in general.
|
||||
</p>
|
||||
</li>
|
||||
-->
|
||||
|
||||
<li>
|
||||
<p>
|
||||
During troubleshooting, <keyword keyref="support_org"/> might direct you to change other values,
|
||||
particularly for <codeph>IMPALA_SERVER_ARGS</codeph>, to work around issues or
|
||||
gather debugging information.
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
|
||||
<p conref="impala_resource_management.xml#rm_options/resource_management_impalad_options"/>
|
||||
-->
|
||||
|
||||
<note>
|
||||
<p>
|
||||
These startup options for the <cmdname>impalad</cmdname> daemon are different from the
|
||||
command-line options for the <cmdname>impala-shell</cmdname> command. For the
|
||||
<cmdname>impala-shell</cmdname> options, see
|
||||
<xref href="impala_shell_options.xml#shell_options"/>.
|
||||
</p>
|
||||
</note>
|
||||
|
||||
<p audience="Cloudera" outputclass="toc inpage"/>
|
||||
|
||||
</conbody>
|
||||
|
||||
<concept audience="Cloudera" id="config_options_impalad_details">
|
||||
|
||||
<title>Configuration Options for impalad Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
Some common settings to change include:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
Statestore address. Where practical, put the statestore on a separate host not
|
||||
running the <cmdname>impalad</cmdname> daemon. In that recommended configuration,
|
||||
the <cmdname>impalad</cmdname> daemon cannot refer to the statestore server using
|
||||
the loopback address. If the statestore is hosted on a machine with an IP address
|
||||
of 192.168.0.27, change:
|
||||
</p>
|
||||
<codeblock>IMPALA_STATE_STORE_HOST=127.0.0.1</codeblock>
|
||||
<p>
|
||||
to:
|
||||
</p>
|
||||
<codeblock>IMPALA_STATE_STORE_HOST=192.168.0.27</codeblock>
|
||||
</li>
|
||||
|
||||
<li rev="1.2">
|
||||
<p>
|
||||
Catalog server address. Update the <codeph>IMPALA_CATALOG_SERVICE_HOST</codeph>
|
||||
variable, including both the hostname and the port number in the value. Cloudera
|
||||
recommends the catalog server be on the same host as the statestore. In that
|
||||
recommended configuration, the <cmdname>impalad</cmdname> daemon cannot refer to
|
||||
the catalog server using the loopback address. If the catalog service is hosted on
|
||||
a machine with an IP address of 192.168.0.27, add the following line:
|
||||
</p>
|
||||
<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000</codeblock>
|
||||
<p>
|
||||
The <filepath>/etc/default/impala</filepath> defaults file currently does not
|
||||
define an <codeph>IMPALA_CATALOG_ARGS</codeph> environment variable, but if you
|
||||
add one it will be recognized by the service startup/shutdown script. Add a
|
||||
definition for this variable to <filepath>/etc/default/impala</filepath> and add
|
||||
the option <codeph>-catalog_service_host=<varname>hostname</varname></codeph>. If
|
||||
the port is different than the default 26000, also add the option
|
||||
<codeph>-catalog_service_port=<varname>port</varname></codeph>.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li id="mem_limit">
|
||||
Memory limits. You can limit the amount of memory available to Impala. For example,
|
||||
to allow Impala to use no more than 70% of system memory, change:
|
||||
<!-- Note: also needs to be updated for each release to reflect latest /etc/default/impala. -->
|
||||
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
|
||||
-log_dir=${IMPALA_LOG_DIR} \
|
||||
-state_store_port=${IMPALA_STATE_STORE_PORT} \
|
||||
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
|
||||
-be_port=${IMPALA_BACKEND_PORT}}</codeblock>
|
||||
<p>
|
||||
to:
|
||||
</p>
|
||||
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
|
||||
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
|
||||
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
|
||||
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}</codeblock>
|
||||
<p>
|
||||
You can specify the memory limit using absolute notation such as
|
||||
<codeph>500m</codeph> or <codeph>2G</codeph>, or as a percentage of physical
|
||||
memory such as <codeph>60%</codeph>.
|
||||
</p>
|
||||
|
||||
<note>
|
||||
Queries that exceed the specified memory limit are aborted. Percentage limits are
|
||||
based on the physical memory of the machine and do not consider cgroups.
|
||||
</note>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Core dump enablement. To enable core dumps, change:
|
||||
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
|
||||
<p>
|
||||
to:
|
||||
</p>
|
||||
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}</codeblock>
|
||||
<note>
|
||||
The location of core dump files may vary according to your operating system
|
||||
configuration. Other security settings may prevent Impala from writing core dumps
|
||||
even when this option is enabled.
|
||||
</note>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Authorization using the open source Sentry plugin. Specify the
|
||||
<codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph>
|
||||
options as part of the <codeph>IMPALA_SERVER_ARGS</codeph> and
|
||||
<codeph>IMPALA_STATE_STORE_ARGS</codeph> settings to enable the core Impala support
|
||||
for authentication. See <xref href="impala_authorization.xml#secure_startup"/> for
|
||||
details.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Auditing for successful or blocked Impala queries, another aspect of security.
|
||||
Specify the <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
|
||||
option and optionally the
|
||||
<codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph>
|
||||
and <codeph>-abort_on_failed_audit_event</codeph> options as part of the
|
||||
<codeph>IMPALA_SERVER_ARGS</codeph> settings, for each Impala node, to enable and
|
||||
customize auditing. See <xref href="impala_auditing.xml#auditing"/> for details.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Password protection for the Impala web UI, which listens on port 25000 by default.
|
||||
This feature involves adding some or all of the
|
||||
<codeph>--webserver_password_file</codeph>,
|
||||
<codeph>--webserver_authentication_domain</codeph>, and
|
||||
<codeph>--webserver_certificate_file</codeph> options to the
|
||||
<codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph>
|
||||
settings. See <xref href="impala_security_webui.xml"/> for details.
|
||||
</li>
|
||||
|
||||
<li id="default_query_options">
|
||||
Another setting you might add to <codeph>IMPALA_SERVER_ARGS</codeph> is:
|
||||
<codeblock>-default_query_options='<varname>option</varname>=<varname>value</varname>,<varname>option</varname>=<varname>value</varname>,...'
|
||||
</codeblock>
|
||||
These options control the behavior of queries performed by this
|
||||
<cmdname>impalad</cmdname> instance. The option values you specify here override the
|
||||
default values for <xref href="impala_query_options.xml#query_options">Impala query
|
||||
options</xref>, as shown by the <codeph>SET</codeph> statement in
|
||||
<cmdname>impala-shell</cmdname>.
|
||||
</li>
|
||||
|
||||
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
|
||||
<li rev="1.2">
|
||||
Options for resource management, in conjunction with the YARN component. These options
|
||||
include <codeph>-enable_rm</codeph> and <codeph>-cgroup_hierarchy_path</codeph>.
|
||||
<ph rev="1.4.0">Additional options to help fine-tune the resource estimates are
|
||||
<codeph>-—rm_always_use_defaults</codeph>,
|
||||
<codeph>-—rm_default_memory=<varname>size</varname></codeph>, and
|
||||
<codeph>-—rm_default_cpu_cores</codeph>.</ph> For details about these options, see
|
||||
<xref href="impala_resource_management.xml#rm_options"/>. See
|
||||
<xref href="impala_resource_management.xml#resource_management"/> for information about resource
|
||||
management in general.
|
||||
</li>
|
||||
-->
|
||||
|
||||
<li>
|
||||
During troubleshooting, <keyword keyref="support_org"/> might direct you to change other values,
|
||||
particularly for <codeph>IMPALA_SERVER_ARGS</codeph>, to work around issues or
|
||||
gather debugging information.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
|
||||
<p conref="impala_resource_management.xml#rm_options/resource_management_impalad_options"/>
|
||||
-->
|
||||
|
||||
<note>
|
||||
<p>
|
||||
These startup options for the <cmdname>impalad</cmdname> daemon are different from
|
||||
the command-line options for the <cmdname>impala-shell</cmdname> command. For the
|
||||
<cmdname>impala-shell</cmdname> options, see
|
||||
<xref href="impala_shell_options.xml#shell_options"/>.
|
||||
</p>
|
||||
</note>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept audience="Cloudera" id="config_options_statestored_details">
|
||||
|
||||
<title>Configuration Options for statestored Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p></p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept audience="Cloudera" id="config_options_catalogd_details">
|
||||
|
||||
<title>Configuration Options for catalogd Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p></p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept id="config_options_checking">
|
||||
|
||||
<title>Checking the Values of Impala Configuration Options</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
You can check the current runtime value of all these settings through the Impala web
|
||||
interface, available by default at
|
||||
<codeph>http://<varname>impala_hostname</varname>:25000/varz</codeph> for the
|
||||
<cmdname>impalad</cmdname> daemon,
|
||||
<codeph>http://<varname>impala_hostname</varname>:25010/varz</codeph> for the
|
||||
<cmdname>statestored</cmdname> daemon, or
|
||||
<codeph>http://<varname>impala_hostname</varname>:25020/varz</codeph> for the
|
||||
<cmdname>catalogd</cmdname> daemon. In the Cloudera Manager interface, you can see the
|
||||
link to the appropriate <uicontrol><varname>service_name</varname> Web UI</uicontrol>
|
||||
page when you look at the status page for a specific daemon on a specific host.
|
||||
</p>
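      <p>
        For example, you could spot-check a particular setting from the command line; the hostname
        and the option name here are placeholders for your own values:
      </p>
<codeblock>$ curl -s http://impala-host.example.com:25000/varz | grep mem_limit</codeblock>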
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept id="config_options_impalad">
|
||||
|
||||
<title>Startup Options for impalad Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The <codeph>impalad</codeph> daemon implements the main Impala service, which performs
|
||||
query processing and reads and writes the data files.
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept id="config_options_statestored">
|
||||
|
||||
<title>Startup Options for statestored Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The <cmdname>statestored</cmdname> daemon implements the Impala statestore service,
|
||||
which monitors the availability of Impala services across the cluster, and handles
|
||||
situations such as nodes becoming unavailable or becoming available again.
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
<concept rev="1.2" id="config_options_catalogd">
|
||||
|
||||
<title>Startup Options for catalogd Daemon</title>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The <cmdname>catalogd</cmdname> daemon implements the Impala catalog service, which
|
||||
broadcasts metadata changes to all the Impala nodes when Impala creates a table, inserts
|
||||
data, or performs other kinds of DDL and DML operations.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
</concept>
|
||||
179
docs/topics/impala_config_performance.xml
Normal file
@@ -0,0 +1,179 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="config_performance">
|
||||
|
||||
<title>Post-Installation Configuration for Impala</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Performance"/>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Configuring"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p id="p_24">
|
||||
This section describes the mandatory and recommended configuration settings for Impala. If Impala is
|
||||
installed using Cloudera Manager, some of these configurations are completed automatically; you must still
|
||||
configure short-circuit reads manually. If you installed Impala without Cloudera Manager, or if you want to
|
||||
customize your environment, consider making the changes described in this topic.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<!-- Could conref this paragraph from ciiu_install.xml. -->
|
||||
In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular
|
||||
component configuration details in one of the free-form fields on the Impala configuration pages within
|
||||
Cloudera Manager. <ph conref="../shared/impala_common.xml#common/safety_valve"/>
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
You must enable short-circuit reads, whether or not Impala was installed through Cloudera Manager. This
|
||||
setting goes in the Impala configuration settings, not the Hadoop-wide settings.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you installed Impala in an environment that is not managed by Cloudera Manager, you must enable block
|
||||
location tracking, and you can optionally enable native checksumming for optimal performance.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you deployed Impala using Cloudera Manager, see
|
||||
<xref href="impala_perf_testing.xml#performance_testing"/> to confirm proper configuration.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<section id="section_fhq_wyv_ls">
|
||||
<title>Mandatory: Short-Circuit Reads</title>
|
||||
<p> Enabling short-circuit reads allows Impala to read local data directly
|
||||
from the file system. This removes the need to communicate through the
|
||||
DataNodes, improving performance. This setting also minimizes the number
|
||||
of additional copies of data. Short-circuit reads require
|
||||
<codeph>libhadoop.so</codeph>
|
||||
<!-- This link went stale. Not obvious how to keep it in sync with whatever Hadoop CDH is using behind the scenes. So hide the link for now. -->
|
||||
<!-- (the <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>) -->
|
||||
(the Hadoop Native Library) to be accessible to both the server and the
|
||||
client. <codeph>libhadoop.so</codeph> is not available if you have
|
||||
installed from a tarball. You must install from an
|
||||
<codeph>.rpm</codeph>, <codeph>.deb</codeph>, or parcel to use
|
||||
short-circuit local reads. <note> If you use Cloudera Manager, you can
|
||||
enable short-circuit reads through a checkbox in the user interface
|
||||
and that setting takes effect for Impala as well. </note>
|
||||
</p>
|
||||
<p>
|
||||
<b>To configure DataNodes for short-circuit reads:</b>
|
||||
</p>
|
||||
<ol id="ol_qlq_wyv_ls">
|
||||
<li id="copy_config_files"> Copy the client
|
||||
<codeph>core-site.xml</codeph> and <codeph>hdfs-site.xml</codeph>
|
||||
configuration files from the Hadoop configuration directory to the
|
||||
Impala configuration directory. The default Impala configuration
|
||||
location is <codeph>/etc/impala/conf</codeph>. </li>
|
||||
<li>
|
||||
<indexterm audience="Cloudera"
|
||||
>dfs.client.read.shortcircuit</indexterm>
|
||||
<indexterm audience="Cloudera">dfs.domain.socket.path</indexterm>
|
||||
<indexterm audience="Cloudera"
|
||||
>dfs.client.file-block-storage-locations.timeout.millis</indexterm>
|
||||
On all Impala nodes, configure the following properties in <!-- Exact timing is unclear, since we say farther down to copy /etc/hadoop/conf/hdfs-site.xml to /etc/impala/conf.
|
||||
Which wouldn't work if we already modified the Impala version of the file here. Not to mention that this
|
||||
doesn't take the CM interface into account, where these /etc files might not exist in those locations. -->
|
||||
<!-- <codeph>/etc/impala/conf/hdfs-site.xml</codeph> as shown: -->
|
||||
Impala's copy of <codeph>hdfs-site.xml</codeph> as shown: <codeblock><property>
|
||||
<name>dfs.client.read.shortcircuit</name>
|
||||
<value>true</value>
|
||||
</property>
|
||||
|
||||
<property>
|
||||
<name>dfs.domain.socket.path</name>
|
||||
<value>/var/run/hdfs-sockets/dn</value>
|
||||
</property>
|
||||
|
||||
<property>
|
||||
<name>dfs.client.file-block-storage-locations.timeout.millis</name>
|
||||
<value>10000</value>
|
||||
</property></codeblock>
|
||||
<!-- Former socket.path value: <value>/var/run/hadoop-hdfs/dn._PORT</value> -->
|
||||
<!--
|
||||
<note>
|
||||
The text <codeph>_PORT</codeph> appears just as shown; you do not need to
|
||||
substitute a number.
|
||||
</note>
|
||||
-->
|
||||
</li>
|
||||
<li>
|
||||
<p> If <codeph>/var/run/hadoop-hdfs/</codeph> is group-writable, make
|
||||
sure its group is <codeph>root</codeph>. </p>
|
||||
<note> If you are also going to enable block location tracking, you
|
||||
can skip copying configuration files and restarting DataNodes and go
|
||||
straight to <xref href="#config_performance/block_location_tracking"
|
||||
>Optional: Block Location Tracking</xref>.
|
||||
Configuring short-circuit reads and block location tracking require
|
||||
the same process of copying files and restarting services, so you
|
||||
can complete that process once when you have completed all
|
||||
configuration changes. Whether you copy files and restart services
|
||||
now or during configuring block location tracking, short-circuit
|
||||
reads are not enabled until you complete those final steps. </note>
|
||||
</li>
|
||||
<li id="restart_all_datanodes"> After applying these changes, restart
|
||||
all DataNodes. </li>
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
<section id="block_location_tracking">
|
||||
|
||||
<title>Mandatory: Block Location Tracking</title>
|
||||
|
||||
<p>
|
||||
Enabling block location metadata allows Impala to know which disks data blocks are located on, allowing
better utilization of the underlying disks. Impala will not start unless this setting is enabled.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<b>To enable block location tracking:</b>
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>
|
||||
On each DataNode, add the following to the <codeph>hdfs-site.xml</codeph> file:
|
||||
<codeblock><property>
|
||||
<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
|
||||
<value>true</value>
|
||||
</property> </codeblock>
|
||||
</li>
|
||||
|
||||
<li conref="#config_performance/copy_config_files"/>
|
||||
|
||||
<li conref="#config_performance/restart_all_datanodes"/>
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
<section id="native_checksumming">
|
||||
|
||||
<title>Optional: Native Checksumming</title>
|
||||
|
||||
<p>
|
||||
Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if
|
||||
that library is available.
|
||||
</p>
|
||||
|
||||
<p id="p_29">
|
||||
<b>To enable native checksumming:</b>
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If you installed CDH from packages, the native checksumming library is installed and set up correctly. In
|
||||
such a case, no additional steps are required. Conversely, if you installed by other means, such as with
|
||||
tarballs, native checksumming may not be available due to missing shared objects. Finding the message
|
||||
"<codeph>Unable to load native-hadoop library for your platform... using builtin-java classes where
|
||||
applicable</codeph>" in the Impala logs indicates native checksumming may be unavailable. To enable native
|
||||
checksumming, you must build and install <codeph>libhadoop.so</codeph> (the
|
||||
<!-- Another instance of stale link. -->
|
||||
<!-- <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>). -->
|
||||
Hadoop Native Library).
|
||||
</p>
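    <p>
      For example, you could check the Impala logs for that message as follows, assuming the default
      log directory shown earlier in this document:
    </p>
<codeblock>$ grep "Unable to load native-hadoop library" /var/log/impala/*.INFO*</codeblock>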
|
||||
</section>
|
||||
</conbody>
|
||||
</concept>
|
||||
202
docs/topics/impala_connecting.xml
Normal file
@@ -0,0 +1,202 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="connecting">
|
||||
|
||||
<title>Connecting to impalad through impala-shell</title>
|
||||
<titlealts audience="PDF"><navtitle>Connecting to impalad</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="impala-shell"/>
|
||||
<data name="Category" value="Network"/>
|
||||
<data name="Category" value="DataNode"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<!--
|
||||
TK: This would be a good theme for a tutorial topic.
|
||||
Lots of nuances to illustrate through sample code.
|
||||
-->
|
||||
|
||||
<p>
|
||||
Within an <cmdname>impala-shell</cmdname> session, you can only issue queries while connected to an instance
|
||||
of the <cmdname>impalad</cmdname> daemon. You can specify the connection information:
|
||||
<ul>
|
||||
<li>
|
||||
Through command-line options when you run the <cmdname>impala-shell</cmdname> command.
|
||||
</li>
|
||||
<li>
|
||||
Through a configuration file that is read when you run the <cmdname>impala-shell</cmdname> command.
|
||||
</li>
|
||||
<li>
|
||||
During an <cmdname>impala-shell</cmdname> session, by issuing a <codeph>CONNECT</codeph> command.
|
||||
</li>
|
||||
</ul>
|
||||
See <xref href="impala_shell_options.xml"/> for the command-line and configuration file options you can use.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can connect to any DataNode where an instance of <cmdname>impalad</cmdname> is running,
|
||||
and that host coordinates the execution of all queries sent to it.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
For simplicity during development, you might always connect to the same host, perhaps running <cmdname>impala-shell</cmdname> on
|
||||
the same host as <cmdname>impalad</cmdname> and specifying the hostname as <codeph>localhost</codeph>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
In a production environment, you might enable load balancing, in which you connect to a specific host/port combination
|
||||
but queries are forwarded to arbitrary hosts. This technique spreads the overhead of acting as the coordinator
|
||||
node among all the DataNodes in the cluster. See <xref href="impala_proxy.xml"/> for details.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<b>To connect the Impala shell during shell startup:</b>
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>
|
||||
Locate the hostname of a DataNode within the cluster that is running an instance of the
|
||||
<cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something
|
||||
other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the
|
||||
port number also.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Use the <codeph>-i</codeph> option to the
|
||||
<cmdname>impala-shell</cmdname> interpreter to specify the connection information for
|
||||
that instance of <cmdname>impalad</cmdname>:
|
||||
<codeblock>
|
||||
# When you are logged into the same machine running impalad.
|
||||
# The prompt will reflect the current hostname.
|
||||
$ impala-shell
|
||||
|
||||
# When you are logged into the same machine running impalad.
|
||||
# The host will reflect the hostname 'localhost'.
|
||||
$ impala-shell -i localhost
|
||||
|
||||
# When you are logged onto a different host, perhaps a client machine
|
||||
# outside the Hadoop cluster.
|
||||
$ impala-shell -i <varname>some.other.hostname</varname>
|
||||
|
||||
# When you are logged onto a different host, and impalad is listening
|
||||
# on a non-default port. Perhaps a load balancer is forwarding requests
|
||||
# to a different host/port combination behind the scenes.
|
||||
$ impala-shell -i <varname>some.other.hostname</varname>:<varname>port_number</varname>
|
||||
</codeblock>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
<b>To connect the Impala shell after shell startup:</b>
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>
|
||||
Start the Impala shell with no connection:
|
||||
<codeblock>$ impala-shell</codeblock>
|
||||
<p>
|
||||
You should see a prompt like the following:
|
||||
</p>
|
||||
<codeblock>Welcome to the Impala shell. Press TAB twice to see a list of available commands.
|
||||
|
||||
Copyright (c) <varname>year</varname> Cloudera, Inc. All rights reserved.
|
||||
|
||||
<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/>
|
||||
[Not connected] > </codeblock>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Locate the hostname of a DataNode within the cluster that is running an instance of the
|
||||
<cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something
|
||||
other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the
|
||||
port number also.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Use the <codeph>connect</codeph> command to connect to an Impala instance. Enter a command of the form:
|
||||
<codeblock>[Not connected] > connect <varname>impalad-host</varname>
|
||||
[<varname>impalad-host</varname>:21000] ></codeblock>
|
||||
<note>
|
||||
Replace <varname>impalad-host</varname> with the hostname you have configured for any DataNode running
|
||||
Impala in your environment. The changed prompt indicates a successful connection.
|
||||
</note>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
<b>To start <cmdname>impala-shell</cmdname> in a specific database:</b>
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can use all the same connection options as in previous examples.
|
||||
For simplicity, these examples assume that you are logged into one of
|
||||
the DataNodes that is running the <cmdname>impalad</cmdname> daemon.
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>
|
||||
Find the name of the database containing the relevant tables, views, and so
|
||||
on that you want to operate on.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Use the <codeph>-d</codeph> option to the
|
||||
<cmdname>impala-shell</cmdname> interpreter to connect and immediately
|
||||
switch to the specified database, without the need for a <codeph>USE</codeph>
|
||||
statement or fully qualified names:
|
||||
<codeblock>
|
||||
# Subsequent queries with unqualified names operate on
|
||||
# tables, views, and so on inside the database named 'staging'.
|
||||
$ impala-shell -i localhost -d staging
|
||||
|
||||
# It is common during development, ETL, benchmarking, and so on
|
||||
# to have different databases containing the same table names
|
||||
# but with different contents or layouts.
|
||||
$ impala-shell -i localhost -d parquet_snappy_compression
|
||||
$ impala-shell -i localhost -d parquet_gzip_compression
|
||||
</codeblock>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
<b>To run one or several statements in non-interactive mode:</b>
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can use all the same connection options as in previous examples.
|
||||
For simplicity, these examples assume that you are logged into one of
|
||||
the DataNodes that is running the <cmdname>impalad</cmdname> daemon.
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>
|
||||
Construct a statement, or a file containing a sequence of statements,
|
||||
that you want to run in an automated way, without typing or copying
|
||||
and pasting each time.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Invoke <cmdname>impala-shell</cmdname> with the <codeph>-q</codeph> option to run a single statement, or
|
||||
the <codeph>-f</codeph> option to run a sequence of statements from a file.
|
||||
The <cmdname>impala-shell</cmdname> command returns immediately, without going into
|
||||
the interactive interpreter.
|
||||
<codeblock>
|
||||
# A utility command that you might run while developing shell scripts
|
||||
# to manipulate HDFS files.
|
||||
$ impala-shell -i localhost -d database_of_interest -q 'show tables'
|
||||
|
||||
# A sequence of CREATE TABLE, CREATE VIEW, and similar DDL statements
|
||||
# can go into a file to make the setup process repeatable.
|
||||
$ impala-shell -i localhost -d database_of_interest -f recreate_tables.sql
|
||||
</codeblock>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
</conbody>
|
||||
</concept>
|
||||
758
docs/topics/impala_conversion_functions.xml
Normal file
@@ -0,0 +1,758 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="conversion_functions">

<title>Impala Type Conversion Functions</title>
<titlealts audience="PDF"><navtitle>Type Conversion Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>

<conbody>

<p>
Conversion functions are usually used in combination with other functions, to explicitly pass the expected
data types. Impala has strict rules regarding data types for function parameters. For example, Impala does
not automatically convert a <codeph>DOUBLE</codeph> value to <codeph>FLOAT</codeph>, a
<codeph>BIGINT</codeph> value to <codeph>INT</codeph>, or perform other conversions where precision could be lost or
overflow could occur. Also, for reporting or dealing with loosely defined schemas in big data contexts,
you might frequently need to convert values to or from the <codeph>STRING</codeph> type.
</p>

<note>
Although in CDH 5.5.0, the <codeph>SHOW FUNCTIONS</codeph> output for
database <codeph>_IMPALA_BUILTINS</codeph> contains some function signatures
matching the pattern <codeph>castto*</codeph>, these functions are not intended
for public use and are expected to be hidden in the future.
</note>

<p>
<b>Function reference:</b>
</p>

<p>
Impala supports the following type conversion functions:
</p>

<dl>

<dlentry id="cast">
<dt>
<codeph>cast(<varname>expr</varname> AS <varname>type</varname>)</codeph>
</dt>

<dd>
<indexterm audience="Cloudera">cast() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to any other type.
If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Usage notes:</b>
Use <codeph>CAST</codeph> when passing a column value or literal to a function that
expects a parameter with a different type.
Frequently used in SQL operations such as <codeph>CREATE TABLE AS SELECT</codeph>
and <codeph>INSERT ... VALUES</codeph> to ensure that values from various sources
are of the appropriate type for the destination columns.
Where practical, do a one-time <codeph>CAST()</codeph> operation during the ingestion process
to make each column into the appropriate type, rather than using many <codeph>CAST()</codeph>
operations in each query; doing type conversions for each row during each query can be expensive
for tables with millions or billions of rows.
</p>
<p conref="../shared/impala_common.xml#common/timezone_conversion_caveat"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select concat('Here are the first ',10,' results.'); -- Fails
select concat('Here are the first ',cast(10 as string),' results.'); -- Succeeds
</codeblock>
<p>
The following example starts with a text table where every column has a type of <codeph>STRING</codeph>,
which might be how you ingest data of unknown schema until you can verify the cleanliness of the underlying values.
Then it uses <codeph>CAST()</codeph> to create a new Parquet table with the same data, but using specific
numeric data types for the columns with numeric data. Using numeric types of appropriate sizes can result in
substantial space savings on disk and in memory, and performance improvements in queries,
over using strings or larger-than-necessary numeric types.
</p>
<codeblock>create table t1 (name string, x string, y string, z string);

create table t2 stored as parquet
as select
  name,
  cast(x as bigint) x,
  cast(y as timestamp) y,
  cast(z as smallint) z
from t1;

describe t2;
+------+-----------+---------+
| name | type      | comment |
+------+-----------+---------+
| name | string    |         |
| x    | bigint    |         |
| y    | timestamp |         |
| z    | smallint  |         |
+------+-----------+---------+
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<!-- TK: Can you cast to or from MAP, ARRAY, STRUCT? -->
For details of casts from each kind of data type, see the description of
the appropriate type:
<xref href="impala_tinyint.xml#tinyint"/>,
<xref href="impala_smallint.xml#smallint"/>,
<xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>,
<xref href="impala_float.xml#float"/>,
<xref href="impala_double.xml#double"/>,
<xref href="impala_decimal.xml#decimal"/>,
<xref href="impala_string.xml#string"/>,
<xref href="impala_char.xml#char"/>,
<xref href="impala_varchar.xml#varchar"/>,
<xref href="impala_timestamp.xml#timestamp"/>,
<xref href="impala_boolean.xml#boolean"/>
</p>
</dd>
</dlentry>

<dlentry rev="2.3.0" id="casttobigint" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttobigint(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttobigint() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>BIGINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>bigint</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>create table small_types (x tinyint, y smallint, z int);
|
||||
|
||||
create table big_types as
|
||||
select casttobigint(x) as x, casttobigint(y) as y, casttobigint(z) as z
|
||||
from small_types;
|
||||
|
||||
describe big_types;
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| x | bigint | |
|
||||
| y | bigint | |
|
||||
| z | bigint | |
|
||||
+------+--------+---------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttoboolean" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttoboolean(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttoboolean() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>BOOLEAN</codeph>.
|
||||
Numeric values of 0 evaluate to <codeph>false</codeph>, and non-zero values evaluate to <codeph>true</codeph>.
|
||||
If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
In particular, <codeph>STRING</codeph> values (even <codeph>'1'</codeph>, <codeph>'0'</codeph>, <codeph>'true'</codeph>
|
||||
or <codeph>'false'</codeph>) always return <codeph>NULL</codeph> when converted to <codeph>BOOLEAN</codeph>.
|
||||
<p><b>Return type:</b> <codeph>boolean</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttoboolean(0);
|
||||
+------------------+
|
||||
| casttoboolean(0) |
|
||||
+------------------+
|
||||
| false |
|
||||
+------------------+
|
||||
|
||||
select casttoboolean(1);
|
||||
+------------------+
|
||||
| casttoboolean(1) |
|
||||
+------------------+
|
||||
| true |
|
||||
+------------------+
|
||||
|
||||
select casttoboolean(99);
|
||||
+-------------------+
|
||||
| casttoboolean(99) |
|
||||
+-------------------+
|
||||
| true |
|
||||
+-------------------+
|
||||
|
||||
select casttoboolean(0.0);
|
||||
+--------------------+
|
||||
| casttoboolean(0.0) |
|
||||
+--------------------+
|
||||
| false |
|
||||
+--------------------+
|
||||
|
||||
select casttoboolean(0.5);
|
||||
+--------------------+
|
||||
| casttoboolean(0.5) |
|
||||
+--------------------+
|
||||
| true |
|
||||
+--------------------+
|
||||
|
||||
select casttoboolean('');
|
||||
+-------------------+
|
||||
| casttoboolean('') |
|
||||
+-------------------+
|
||||
| NULL |
|
||||
+-------------------+
|
||||
|
||||
select casttoboolean('yes');
|
||||
+----------------------+
|
||||
| casttoboolean('yes') |
|
||||
+----------------------+
|
||||
| NULL |
|
||||
+----------------------+
|
||||
|
||||
select casttoboolean('0');
|
||||
+--------------------+
|
||||
| casttoboolean('0') |
|
||||
+--------------------+
|
||||
| NULL |
|
||||
+--------------------+
|
||||
|
||||
select casttoboolean('true');
|
||||
+-----------------------+
|
||||
| casttoboolean('true') |
|
||||
+-----------------------+
|
||||
| NULL |
|
||||
+-----------------------+
|
||||
|
||||
select casttoboolean('false');
|
||||
+------------------------+
|
||||
| casttoboolean('false') |
|
||||
+------------------------+
|
||||
| NULL |
|
||||
+------------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttochar" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttochar(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttochar() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>CHAR</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>char</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>create table char_types as select casttochar('hello world') as c1, casttochar('xyz') as c2, casttochar('x') as c3;
|
||||
+-------------------+
|
||||
| summary |
|
||||
+-------------------+
|
||||
| Inserted 1 row(s) |
|
||||
+-------------------+
|
||||
|
||||
describe char_types;
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| c1 | string | |
|
||||
| c2 | string | |
|
||||
| c3 | string | |
|
||||
+------+--------+---------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttodecimal" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttodecimal(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttodecimal() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>DECIMAL</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>decimal</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttodecimal(5.4);
|
||||
+--------------------+
|
||||
| casttodecimal(5.4) |
|
||||
+--------------------+
|
||||
| 5.4 |
|
||||
+--------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttodouble" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttodouble(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttodouble() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>DOUBLE</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>double</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttodouble(5);
|
||||
+-----------------+
|
||||
| casttodouble(5) |
|
||||
+-----------------+
|
||||
| 5 |
|
||||
+-----------------+
|
||||
|
||||
select casttodouble('3.141');
|
||||
+-----------------------+
|
||||
| casttodouble('3.141') |
|
||||
+-----------------------+
|
||||
| 3.141 |
|
||||
+-----------------------+
|
||||
|
||||
select casttodouble(1e6);
|
||||
+--------------------+
|
||||
| casttodouble(1e+6) |
|
||||
+--------------------+
|
||||
| 1000000 |
|
||||
+--------------------+
|
||||
|
||||
select casttodouble(true);
|
||||
+--------------------+
|
||||
| casttodouble(true) |
|
||||
+--------------------+
|
||||
| 1 |
|
||||
+--------------------+
|
||||
|
||||
select casttodouble(now());
|
||||
+---------------------+
|
||||
| casttodouble(now()) |
|
||||
+---------------------+
|
||||
| 1447622306.031178 |
|
||||
+---------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttofloat" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttofloat(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttofloat() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>FLOAT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>float</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttofloat(5);
|
||||
+----------------+
|
||||
| casttofloat(5) |
|
||||
+----------------+
|
||||
| 5 |
|
||||
+----------------+
|
||||
|
||||
select casttofloat('3.141');
|
||||
+----------------------+
|
||||
| casttofloat('3.141') |
|
||||
+----------------------+
|
||||
| 3.141000032424927 |
|
||||
+----------------------+
|
||||
|
||||
select casttofloat(1e6);
|
||||
+-------------------+
|
||||
| casttofloat(1e+6) |
|
||||
+-------------------+
|
||||
| 1000000 |
|
||||
+-------------------+
|
||||
|
||||
select casttofloat(true);
|
||||
+-------------------+
|
||||
| casttofloat(true) |
|
||||
+-------------------+
|
||||
| 1 |
|
||||
+-------------------+
|
||||
|
||||
select casttofloat(now());
|
||||
+--------------------+
|
||||
| casttofloat(now()) |
|
||||
+--------------------+
|
||||
| 1447622400 |
|
||||
+--------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttoint" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttoint(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttoint() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>INT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>int</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttoint(5.4);
|
||||
+----------------+
|
||||
| casttoint(5.4) |
|
||||
+----------------+
|
||||
| 5 |
|
||||
+----------------+
|
||||
|
||||
select casttoint(true);
|
||||
+-----------------+
|
||||
| casttoint(true) |
|
||||
+-----------------+
|
||||
| 1 |
|
||||
+-----------------+
|
||||
|
||||
select casttoint(now());
|
||||
+------------------+
|
||||
| casttoint(now()) |
|
||||
+------------------+
|
||||
| 1447622487 |
|
||||
+------------------+
|
||||
|
||||
select casttoint('3.141');
|
||||
+--------------------+
|
||||
| casttoint('3.141') |
|
||||
+--------------------+
|
||||
| NULL |
|
||||
+--------------------+
|
||||
|
||||
select casttoint('3');
|
||||
+----------------+
|
||||
| casttoint('3') |
|
||||
+----------------+
|
||||
| 3 |
|
||||
+----------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttosmallint" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttosmallint(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttosmallint() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>SMALLINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>smallint</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>create table big_types (x bigint, y int, z smallint);
|
||||
|
||||
create table small_types as
|
||||
select casttosmallint(x) as x, casttosmallint(y) as y, casttosmallint(z) as z
|
||||
from big_types;
|
||||
|
||||
describe small_types;
|
||||
+------+----------+---------+
|
||||
| name | type | comment |
|
||||
+------+----------+---------+
|
||||
| x | smallint | |
|
||||
| y | smallint | |
|
||||
| z | smallint | |
|
||||
+------+----------+---------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttostring" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttostring(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttostring() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>STRING</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>string</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>create table numeric_types (x int, y bigint, z tinyint);
|
||||
|
||||
create table string_types as
|
||||
select casttostring(x) as x, casttostring(y) as y, casttostring(z) as z
|
||||
from numeric_types;
|
||||
|
||||
describe string_types;
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| x | string | |
|
||||
| y | string | |
|
||||
| z | string | |
|
||||
+------+--------+---------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttotimestamp" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttotimestamp(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttotimestamp() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>TIMESTAMP</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>timestamp</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttotimestamp(1000);
|
||||
+-----------------------+
|
||||
| casttotimestamp(1000) |
|
||||
+-----------------------+
|
||||
| 1970-01-01 00:16:40 |
|
||||
+-----------------------+
|
||||
|
||||
select casttotimestamp(1000.0);
|
||||
+-------------------------+
|
||||
| casttotimestamp(1000.0) |
|
||||
+-------------------------+
|
||||
| 1970-01-01 00:16:40 |
|
||||
+-------------------------+
|
||||
|
||||
select casttotimestamp('1000');
|
||||
+-------------------------+
|
||||
| casttotimestamp('1000') |
|
||||
+-------------------------+
|
||||
| NULL |
|
||||
+-------------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttotinyint" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttotinyint(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttotinyint() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>TINYINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>tinyint</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>create table big_types (x bigint, y int, z smallint);
|
||||
|
||||
create table tiny_types as
|
||||
select casttotinyint(x) as x, casttotinyint(y) as y, casttotinyint(z) as z
|
||||
from big_types;
|
||||
|
||||
describe tiny_types;
|
||||
+------+---------+---------+
|
||||
| name | type | comment |
|
||||
+------+---------+---------+
|
||||
| x | tinyint | |
|
||||
| y | tinyint | |
|
||||
| z | tinyint | |
|
||||
+------+---------+---------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="casttovarchar" audience="Cloudera">
|
||||
<dt>
|
||||
<codeph>casttovarchar(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">casttovarchar() function</indexterm>
|
||||
<b>Purpose:</b> Converts the value of an expression to <codeph>VARCHAR</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
|
||||
<p><b>Return type:</b> <codeph>varchar</codeph></p>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
|
||||
<codeblock>select casttovarchar('abcd');
|
||||
+-----------------------+
|
||||
| casttovarchar('abcd') |
|
||||
+-----------------------+
|
||||
| abcd |
|
||||
+-----------------------+
|
||||
|
||||
select casttovarchar(999);
|
||||
+--------------------+
|
||||
| casttovarchar(999) |
|
||||
+--------------------+
|
||||
| 999 |
|
||||
+--------------------+
|
||||
|
||||
select casttovarchar(999.5);
|
||||
+----------------------+
|
||||
| casttovarchar(999.5) |
|
||||
+----------------------+
|
||||
| 999.5 |
|
||||
+----------------------+
|
||||
|
||||
select casttovarchar(now());
|
||||
+-------------------------------+
|
||||
| casttovarchar(now()) |
|
||||
+-------------------------------+
|
||||
| 2015-11-15 21:26:13.528073000 |
|
||||
+-------------------------------+
|
||||
|
||||
select casttovarchar(true);
|
||||
+---------------------+
|
||||
| casttovarchar(true) |
|
||||
+---------------------+
|
||||
| 1 |
|
||||
+---------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
<dlentry rev="2.3.0" id="typeof">
|
||||
<dt>
|
||||
<codeph>typeof(type value)</codeph>
|
||||
</dt>
|
||||
<dd>
|
||||
<indexterm audience="Cloudera">typeof() function</indexterm>
|
||||
<b>Purpose:</b> Returns the name of the data type corresponding to an expression. For types with
|
||||
extra attributes, such as length for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph>,
|
||||
or precision and scale for <codeph>DECIMAL</codeph>, includes the full specification of the type.
|
||||
<!-- To do: How about for columns of complex types? Or fields within complex types? -->
|
||||
<p><b>Return type:</b> <codeph>string</codeph></p>
|
||||
<p><b>Usage notes:</b> Typically used in interactive exploration of a schema, or in application code that programmatically generates schema definitions such as <codeph>CREATE TABLE</codeph> statements.
|
||||
For example, previously, to understand the type of an expression such as
|
||||
<codeph>col1 / col2</codeph> or <codeph>concat(col1, col2, col3)</codeph>,
|
||||
you might have created a dummy table with a single row, using syntax such as <codeph>CREATE TABLE foo AS SELECT 5 / 3.0</codeph>,
|
||||
and then doing a <codeph>DESCRIBE</codeph> to see the type of the row.
|
||||
Or you might have done a <codeph>CREATE TABLE AS SELECT</codeph> operation to create a table and
|
||||
copy data into it, only learning the types of the columns by doing a <codeph>DESCRIBE</codeph> afterward.
|
||||
This technique is especially useful for arithmetic expressions involving <codeph>DECIMAL</codeph> types,
|
||||
because the precision and scale of the result is typically different than that of the operands.
|
||||
</p>
|
||||
<p conref="../shared/impala_common.xml#common/added_in_230"/>
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<p>
|
||||
These examples show how to check the type of a simple literal or function value.
|
||||
Notice how adding even tiny integers together changes the data type of the result to
|
||||
avoid overflow, and how the results of arithmetic operations on <codeph>DECIMAL</codeph> values
|
||||
have specific precision and scale attributes.
|
||||
</p>
|
||||
<codeblock>select typeof(2)
|
||||
+-----------+
|
||||
| typeof(2) |
|
||||
+-----------+
|
||||
| TINYINT |
|
||||
+-----------+
|
||||
|
||||
select typeof(2+2)
|
||||
+---------------+
|
||||
| typeof(2 + 2) |
|
||||
+---------------+
|
||||
| SMALLINT |
|
||||
+---------------+
|
||||
|
||||
select typeof('xyz')
|
||||
+---------------+
|
||||
| typeof('xyz') |
|
||||
+---------------+
|
||||
| STRING |
|
||||
+---------------+
|
||||
|
||||
select typeof(now())
|
||||
+---------------+
|
||||
| typeof(now()) |
|
||||
+---------------+
|
||||
| TIMESTAMP |
|
||||
+---------------+
|
||||
|
||||
select typeof(5.3 / 2.1)
|
||||
+-------------------+
|
||||
| typeof(5.3 / 2.1) |
|
||||
+-------------------+
|
||||
| DECIMAL(6,4) |
|
||||
+-------------------+
|
||||
|
||||
select typeof(5.30001 / 2342.1);
|
||||
+--------------------------+
|
||||
| typeof(5.30001 / 2342.1) |
|
||||
+--------------------------+
|
||||
| DECIMAL(13,11) |
|
||||
+--------------------------+
|
||||
|
||||
select typeof(typeof(2+2))
|
||||
+-----------------------+
|
||||
| typeof(typeof(2 + 2)) |
|
||||
+-----------------------+
|
||||
| STRING |
|
||||
+-----------------------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
This example shows how even if you do not have a record of the type of a column,
|
||||
for example because the type was changed by <codeph>ALTER TABLE</codeph> after the
|
||||
original <codeph>CREATE TABLE</codeph>, you can still find out the type in a
|
||||
more compact form than examining the full <codeph>DESCRIBE</codeph> output.
|
||||
Remember to use <codeph>LIMIT 1</codeph> in such cases, to avoid an identical
|
||||
result value for every row in the table.
|
||||
</p>
|
||||
<codeblock>create table typeof_example (a int, b tinyint, c smallint, d bigint);
|
||||
|
||||
/* Empty result set if there is no data in the table. */
|
||||
select typeof(a) from typeof_example;
|
||||
|
||||
/* OK, now we have some data but the type of column A is being changed. */
|
||||
insert into typeof_example values (1, 2, 3, 4);
|
||||
alter table typeof_example change a a bigint;
|
||||
|
||||
/* We can always find out the current type of that column without doing a full DESCRIBE. */
|
||||
select typeof(a) from typeof_example limit 1;
|
||||
+-----------+
|
||||
| typeof(a) |
|
||||
+-----------+
|
||||
| BIGINT |
|
||||
+-----------+
|
||||
</codeblock>
|
||||
<p>
|
||||
This example shows how you might programmatically generate a <codeph>CREATE TABLE</codeph> statement
|
||||
with the appropriate column definitions to hold the result values of arbitrary expressions.
|
||||
The <codeph>typeof()</codeph> function lets you construct a detailed <codeph>CREATE TABLE</codeph> statement
|
||||
without actually creating the table, as opposed to <codeph>CREATE TABLE AS SELECT</codeph> operations
|
||||
where you create the destination table but only learn the column data types afterward through <codeph>DESCRIBE</codeph>.
|
||||
</p>
|
||||
<codeblock>describe typeof_example;
|
||||
+------+----------+---------+
|
||||
| name | type | comment |
|
||||
+------+----------+---------+
|
||||
| a | bigint | |
|
||||
| b | tinyint | |
|
||||
| c | smallint | |
|
||||
| d | bigint | |
|
||||
+------+----------+---------+
|
||||
|
||||
/* An ETL or business intelligence tool might create variations on a table with different file formats,
|
||||
different sets of columns, and so on. TYPEOF() lets an application introspect the types of the original columns. */
|
||||
select concat('create table derived_table (a ', typeof(a), ', b ', typeof(b), ', c ',
|
||||
typeof(c), ', d ', typeof(d), ') stored as parquet;')
|
||||
as 'create table statement'
|
||||
from typeof_example limit 1;
|
||||
+-------------------------------------------------------------------------------------------+
|
||||
| create table statement |
|
||||
+-------------------------------------------------------------------------------------------+
|
||||
| create table derived_table (a BIGINT, b TINYINT, c SMALLINT, d BIGINT) stored as parquet; |
|
||||
+-------------------------------------------------------------------------------------------+
|
||||
</codeblock>
|
||||
</dd>
|
||||
</dlentry>
|
||||
|
||||
</dl>
|
||||
|
||||
</conbody>
|
||||
</concept>
|
||||
236
docs/topics/impala_count.xml
Normal file
@@ -0,0 +1,236 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="count">

<title>COUNT Function</title>
<titlealts audience="PDF"><navtitle>COUNT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Analytic Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">count() function</indexterm>
An aggregate function that returns the number of rows, or the number of non-<codeph>NULL</codeph> rows.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>COUNT([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock>

<p>
Depending on the argument, <codeph>COUNT()</codeph> considers rows that meet certain conditions:
</p>

<ul>
<li>
The notation <codeph>COUNT(*)</codeph> includes <codeph>NULL</codeph> values in the total.
</li>

<li>
The notation <codeph>COUNT(<varname>column_name</varname>)</codeph> only considers rows where the column
contains a non-<codeph>NULL</codeph> value.
</li>

<li>
You can also combine <codeph>COUNT</codeph> with the <codeph>DISTINCT</codeph> operator to eliminate
duplicates before counting, and to count the combinations of values across multiple columns.
</li>
</ul>

<p>
When the query contains a <codeph>GROUP BY</codeph> clause, <codeph>COUNT()</codeph> returns one value for each combination of
grouping values.
</p>

<p>
<b>Return type:</b> <codeph>BIGINT</codeph>
</p>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p conref="../shared/impala_common.xml#common/partition_key_optimization"/>

<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

<p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>

<p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>-- How many rows total are in the table, regardless of NULL values?
select count(*) from t1;
-- How many rows are in the table with non-NULL values for a column?
select count(c1) from t1;
-- Count the rows that meet certain conditions.
-- Again, * includes NULLs, so COUNT(*) might be greater than COUNT(col).
select count(*) from t1 where x > 10;
select count(c1) from t1 where x > 10;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Combine COUNT and DISTINCT to find the number of unique values.
-- Must use column names rather than * with COUNT(DISTINCT ...) syntax.
-- Rows with NULL values are not counted.
select count(distinct c1) from t1;
-- Rows with a NULL value in _either_ column are not counted.
select count(distinct c1, c2) from t1;
-- Return more than one result.
select month, year, count(distinct visitor_id) from web_stats group by month, year;
</codeblock>

<p rev="2.0.0">
The following examples show how to use <codeph>COUNT()</codeph> in an analytic context. They use a table
containing integers from 1 to 10. Notice how the <codeph>COUNT()</codeph> is reported for each input value, as
opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set.
<codeblock>select x, property, count(x) over (partition by property) as count from int_t where property in ('odd','even');
+----+----------+-------+
| x  | property | count |
+----+----------+-------+
| 2  | even     | 5     |
| 4  | even     | 5     |
| 6  | even     | 5     |
| 8  | even     | 5     |
| 10 | even     | 5     |
| 1  | odd      | 5     |
| 3  | odd      | 5     |
| 5  | odd      | 5     |
| 7  | odd      | 5     |
| 9  | odd      | 5     |
+----+----------+-------+
</codeblock>

Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving
set of rows (the <q>window</q>). The following examples use <codeph>COUNT()</codeph> in an analytic context
(that is, with an <codeph>OVER()</codeph> clause) to produce a running count of all the even values,
then a running count of all the odd values. The basic <codeph>ORDER BY x</codeph> clause implicitly
activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
therefore all of these examples produce the same results:
<codeblock>select x, property,
  count(x) over (partition by property <b>order by x</b>) as 'cumulative count'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x  | property | cumulative count |
+----+----------+------------------+
| 2  | even     | 1                |
| 4  | even     | 2                |
| 6  | even     | 3                |
| 8  | even     | 4                |
| 10 | even     | 5                |
| 1  | odd      | 1                |
| 3  | odd      | 2                |
| 5  | odd      | 3                |
| 7  | odd      | 4                |
| 9  | odd      | 5                |
+----+----------+------------------+

select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>range between unbounded preceding and current row</b>
  ) as 'cumulative count'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x  | property | cumulative count |
+----+----------+------------------+
| 2  | even     | 1                |
| 4  | even     | 2                |
| 6  | even     | 3                |
| 8  | even     | 4                |
| 10 | even     | 5                |
| 1  | odd      | 1                |
| 3  | odd      | 2                |
| 5  | odd      | 3                |
| 7  | odd      | 4                |
| 9  | odd      | 5                |
+----+----------+------------------+

select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>rows between unbounded preceding and current row</b>
  ) as 'cumulative count'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x  | property | cumulative count |
+----+----------+------------------+
| 2  | even     | 1                |
| 4  | even     | 2                |
| 6  | even     | 3                |
| 8  | even     | 4                |
| 10 | even     | 5                |
| 1  | odd      | 1                |
| 3  | odd      | 2                |
| 5  | odd      | 3                |
| 7  | odd      | 4                |
| 9  | odd      | 5                |
+----+----------+------------------+
</codeblock>

The following examples show how to construct a moving window, with a running count taking into account 1 row before
and 1 row after the current row, within the same partition (all the even values or all the odd values).
Therefore, the count is consistently 3 for rows in the middle of the window, and 2 for
rows near the ends of the window, where there is no preceding or no following row in the partition.
Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of
moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph>
clause:
<codeblock>select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>rows between 1 preceding and 1 following</b>
  ) as 'moving total'
from int_t where property in ('odd','even');
+----+----------+--------------+
| x  | property | moving total |
+----+----------+--------------+
| 2  | even     | 2            |
| 4  | even     | 3            |
| 6  | even     | 3            |
| 8  | even     | 3            |
| 10 | even     | 2            |
| 1  | odd      | 2            |
| 3  | odd      | 3            |
| 5  | odd      | 3            |
| 7  | odd      | 3            |
| 9  | odd      | 2            |
+----+----------+--------------+

-- Doesn't work because of syntax restriction on RANGE clause.
select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>range between 1 preceding and 1 following</b>
  ) as 'moving total'
from int_t where property in ('odd','even');
ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
</codeblock>
</p>

<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_analytic_functions.xml#analytic_functions"/>
</p>

</conbody>
</concept>

35
docs/topics/impala_create_data_source.xml
Normal file
@@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" rev="1.4.0" id="create_data_source">

<title>CREATE DATA SOURCE Statement</title>
<titlealts audience="PDF"><navtitle>CREATE DATA SOURCE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">CREATE DATA SOURCE statement</indexterm>
</p>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/related_info"/>
</conbody>
</concept>

137
docs/topics/impala_create_database.xml
Normal file
@@ -0,0 +1,137 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="create_database">

<title>CREATE DATABASE Statement</title>
<titlealts audience="PDF"><navtitle>CREATE DATABASE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Databases"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="DDL"/>
<data name="Category" value="S3"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">CREATE DATABASE statement</indexterm>
Creates a new database.
</p>

<p>
In Impala, a database is both:
</p>

<ul>
<li>
A logical construct for grouping together related tables, views, and functions within their own namespace.
You might use a separate database for each application, set of related tables, or round of experimentation.
</li>

<li>
A physical construct represented by a directory tree in HDFS. Tables (internal tables), partitions, and
data files are all located under this directory. You can perform HDFS-level operations such as backing it up and measuring space usage,
or remove it with a <codeph>DROP DATABASE</codeph> statement.
</li>
</ul>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <varname>database_name</varname> [COMMENT '<varname>database_comment</varname>']
  [LOCATION <varname>hdfs_path</varname>];</codeblock>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
A database is physically represented as a directory in HDFS, with a filename extension <codeph>.db</codeph>,
under the main Impala data directory. If the associated HDFS directory does not exist, it is created for you.
All databases and their associated directories are top-level objects, with no physical or logical nesting.
</p>

<p>
After creating a database, to make it the current database within an <cmdname>impala-shell</cmdname> session,
use the <codeph>USE</codeph> statement. You can refer to tables in the current database without prepending
any qualifier to their names.
</p>

<p>
When you first connect to Impala through <cmdname>impala-shell</cmdname>, the database you start in (before
issuing any <codeph>CREATE DATABASE</codeph> or <codeph>USE</codeph> statements) is named
<codeph>default</codeph>.
</p>

<p conref="../shared/impala_common.xml#common/builtins_db"/>

<p>
After creating a database, your <cmdname>impala-shell</cmdname> session or another
<cmdname>impala-shell</cmdname> connected to the same node can immediately access that database. To access
the database through the Impala daemon on a different node, issue the <codeph>INVALIDATE METADATA</codeph>
statement first while connected to that other node.
</p>

<p>
Setting the <codeph>LOCATION</codeph> attribute for a new database is a way to work with sets of files in an
HDFS directory structure outside the default Impala data directory, as opposed to setting the
<codeph>LOCATION</codeph> attribute for each individual table.
</p>
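
<p>
As an illustration of this technique, the following sketch creates a database under
a hypothetical HDFS path (the path and names here are examples, not required values):
</p>

<codeblock>-- Keep everything for this application under one directory tree,
-- outside the default Impala data directory.
create database external_app_db location '/user/etl/external_app_db';

-- Tables created in this database are placed under that directory,
-- without needing a LOCATION clause on each CREATE TABLE statement.
use external_app_db;
create table raw_events (event_id bigint, payload string);
</codeblock>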

<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/hive_blurb"/>

<p>
When you create a database in Impala, the database can also be used by Hive.
When you create a database in Hive, issue an <codeph>INVALIDATE METADATA</codeph>
statement in Impala to make Impala permanently aware of the new database.
</p>
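
<p>
For example, the following sketch (with a hypothetical database name) shows the
sequence of statements issued on each side:
</p>

<codeblock>-- In Hive:
CREATE DATABASE hive_created_db;

-- In impala-shell, make Impala permanently aware of the new database:
INVALIDATE METADATA;
SHOW DATABASES LIKE 'hive_created_db';
</codeblock>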

<p>
The <codeph>SHOW DATABASES</codeph> statement lists all databases, or the databases whose name
matches a wildcard pattern. <ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, the
<codeph>SHOW DATABASES</codeph> output includes a second column that displays the associated
comment, if any, for each database.</ph>
</p>

<p conref="../shared/impala_common.xml#common/s3_blurb"/>

<p rev="2.6.0 CDH-39913 IMPALA-1878">
To specify that any tables created within a database reside on the Amazon S3 system,
you can include an <codeph>s3a://</codeph> prefix on the <codeph>LOCATION</codeph>
attribute. In <keyword keyref="impala26_full"/> and higher, Impala automatically creates any
required folders as the databases, tables, and partitions are created, and removes
them when they are dropped.
</p>
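
<p>
For example, the following sketch (the bucket name is hypothetical) creates a database
whose tables reside on S3:
</p>

<codeblock>create database s3_staging location 's3a://example-bucket/impala/s3_staging.db';

-- Data files for this table go under the database directory on S3.
create table s3_staging.clickstream (ts timestamp, url string);
</codeblock>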

<p conref="../shared/impala_common.xml#common/s3_ddl"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have write
permission for the parent HDFS directory under which the database
is located.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock conref="../shared/impala_common.xml#common/create_drop_db_example"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_databases.xml#databases"/>, <xref href="impala_drop_database.xml#drop_database"/>,
<xref href="impala_use.xml#use"/>, <xref href="impala_show.xml#show_databases"/>,
<xref href="impala_tables.xml#tables"/>
</p>
</conbody>
</concept>

492
docs/topics/impala_create_function.xml
Normal file
@@ -0,0 +1,492 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="create_function">

<title>CREATE FUNCTION Statement</title>
<titlealts audience="PDF"><navtitle>CREATE FUNCTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="UDFs"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">CREATE FUNCTION statement</indexterm>
Creates a user-defined function (UDF), which you can use to implement custom logic during
<codeph>SELECT</codeph> or <codeph>INSERT</codeph> operations.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p>
The syntax is different depending on whether you create a scalar UDF, which is called once for each row and
implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by
multiple functions that compute intermediate results across sets of rows.
</p>

<p rev="2.5.0 IMPALA-2843 CDH-39148">
In <keyword keyref="impala25_full"/> and higher, the syntax is also different for creating or dropping scalar Java-based UDFs.
The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs
created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently
between Impala and Hive.
</p>

<p>
To create a persistent scalar C++ UDF with <codeph>CREATE FUNCTION</codeph>:
</p>

<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]])
  RETURNS <varname>return_type</varname>
  LOCATION '<varname>hdfs_path_to_dot_so</varname>'
  SYMBOL='<varname>symbol_name</varname>'</codeblock>
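
<p>
For example, the following statement (with a hypothetical library path, symbol name,
and function signature) registers a scalar C++ UDF:
</p>

<codeblock>-- The .so file and the MyLower entry point are examples only;
-- substitute the library and symbol from your own UDF build.
create function my_lower(string) returns string
  location '/user/impala/udfs/libmyudfs.so'
  symbol='MyLower';
</codeblock>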
|
||||
|
||||
<p rev="2.5.0 IMPALA-2843 CDH-39148">
|
||||
To create a persistent Java UDF with <codeph>CREATE FUNCTION</codeph>:
|
||||
<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>
|
||||
LOCATION '<varname>hdfs_path_to_jar</varname>'
|
||||
SYMBOL='<varname>class_name</varname>'</codeblock>
|
||||
</p>
|
||||
|
||||
<!--
|
||||
Examples:
|
||||
CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
|
||||
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
|
||||
DROP FUNCTION foo;
|
||||
DROP FUNCTION IF EXISTS bar;
|
||||
-->
|
||||
|
||||
<p>
|
||||
To create a persistent UDA, which must be written in C++, issue a <codeph>CREATE AGGREGATE FUNCTION</codeph> statement:
|
||||
</p>
|
||||
|
||||
<codeblock>CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...])
|
||||
RETURNS <varname>return_type</varname>
|
||||
LOCATION '<varname>hdfs_path</varname>'
|
||||
[INIT_FN='<varname>function</varname>]
|
||||
UPDATE_FN='<varname>function</varname>
|
||||
MERGE_FN='<varname>function</varname>
|
||||
[PREPARE_FN='<varname>function</varname>]
|
||||
[CLOSEFN='<varname>function</varname>]
|
||||
<ph rev="2.0.0">[SERIALIZE_FN='<varname>function</varname>]</ph>
|
||||
[FINALIZE_FN='<varname>function</varname>]
|
||||
<ph rev="2.3.0 IMPALA-1829 CDH-30572">[INTERMEDIATE <varname>type_spec</varname>]</ph></codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
|
||||
|
||||
<p>
|
||||
<b>Varargs notation:</b>
|
||||
</p>
|
||||
|
||||
<note rev="CDH-39271 CDH-38572">
|
||||
<p rev="CDH-39271 CDH-38572">
|
||||
Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs.
|
||||
</p>
|
||||
</note>
|
||||
|
||||
<p>
|
||||
If the underlying implementation of your function accepts a variable number of arguments:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
The variable arguments must go last in the argument list.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
The variable arguments must all be of the same type.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
You must include at least one instance of the variable arguments in every function call invoked from SQL.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
You designate the variable portion of the argument list in the <codeph>CREATE FUNCTION</codeph> statement
|
||||
by including <codeph>...</codeph> immediately after the type name of the first variable argument. For
|
||||
example, to create a function that accepts an <codeph>INT</codeph> argument, followed by a
|
||||
<codeph>BOOLEAN</codeph>, followed by one or more <codeph>STRING</codeph> arguments, your <codeph>CREATE
|
||||
FUNCTION</codeph> statement would look like:
|
||||
<codeblock>CREATE FUNCTION <varname>func_name</varname> (INT, BOOLEAN, STRING ...)
|
||||
RETURNS <varname>type</varname> LOCATION '<varname>path</varname>' SYMBOL='<varname>entry_point</varname>';
|
||||
</codeblock>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p rev="CDH-39271 CDH-38572">
|
||||
See <xref href="impala_udf.xml#udf_varargs"/> for how to code a C++ UDF to accept
|
||||
variable-length argument lists.
</p>

<p>
<b>Scalar and aggregate functions:</b>
</p>

<p>
The simplest kind of user-defined function returns a single scalar value each time it is called, typically
once for each row in the result set. This general kind of function is what is usually meant by UDF.
User-defined aggregate functions (UDAs) are a specialized kind of UDF that produces a single value based on
the contents of multiple rows. You usually use UDAs in combination with a <codeph>GROUP BY</codeph> clause to
condense a large result set into a smaller one, or even a single row summarizing column values across an
entire table.
</p>

<p>
You create UDAs by using the <codeph>CREATE AGGREGATE FUNCTION</codeph> syntax. The clauses
<codeph>INIT_FN</codeph>, <codeph>UPDATE_FN</codeph>, <codeph>MERGE_FN</codeph>,
<ph rev="2.0.0"><codeph>SERIALIZE_FN</codeph>,</ph> <codeph>FINALIZE_FN</codeph>, and
<codeph>INTERMEDIATE</codeph> only apply when you create a UDA rather than a scalar UDF.
</p>

<p>
The <codeph>*_FN</codeph> clauses specify functions to call at different phases of function processing.
</p>

<ul>
<li>
<b>Initialize:</b> The function you specify with the <codeph>INIT_FN</codeph> clause does any initial
setup, such as initializing member variables in internal data structures. This function is often a stub for
simple UDAs. You can omit this clause and a default (no-op) function will be used.
</li>

<li>
<b>Update:</b> The function you specify with the <codeph>UPDATE_FN</codeph> clause is called once for each
row in the original result set, that is, before any <codeph>GROUP BY</codeph> clause is applied. A separate
instance of the function is called for each different value returned by the <codeph>GROUP BY</codeph>
clause. The final argument passed to this function is a pointer, to which you write an updated value based
on its original value and the value of the first argument.
</li>

<li>
<b>Merge:</b> The function you specify with the <codeph>MERGE_FN</codeph> clause is called an arbitrary
number of times, to combine intermediate values produced by different nodes or different threads as Impala
reads and processes data files in parallel. The final argument passed to this function is a pointer, to
which you write an updated value based on its original value and the value of the first argument.
</li>

<li rev="2.0.0">
<b>Serialize:</b> The function you specify with the <codeph>SERIALIZE_FN</codeph> clause frees memory
allocated to intermediate results. It is required if any memory was allocated by the Allocate function in
the Init, Update, or Merge functions, or if the intermediate type contains any pointers. See
<xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the
UDA code samples</xref> for details.
</li>

<li>
<b>Finalize:</b> The function you specify with the <codeph>FINALIZE_FN</codeph> clause does any required
teardown for resources acquired by your UDF, such as freeing memory, closing file handles if you explicitly
opened any files, and so on. This function is often a stub for simple UDAs. You can omit this clause and a
default (no-op) function will be used. It is required in UDAs where the final return type is different from
the intermediate type, or if any memory was allocated by the Allocate function in the Init, Update, or
Merge functions. See
<xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the
UDA code samples</xref> for details.
</li>
</ul>

<p>
If you use a consistent naming convention for each of the underlying functions, Impala can automatically
determine the names based on the first such clause, so the others are optional.
</p>
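<p>
For illustration, a minimal sketch of a UDA declaration follows. The library path and
C++ symbol names are hypothetical placeholders, not part of any shipped sample; the clause
names come from the syntax described above.
</p>

<codeblock>-- Hypothetical UDA; the .so and symbol names are placeholders.
CREATE AGGREGATE FUNCTION my_sum(BIGINT) RETURNS BIGINT
  LOCATION '/user/impala/udfs/libmyuda.so'
  INIT_FN='MySumInit'
  UPDATE_FN='MySumUpdate'
  MERGE_FN='MySumMerge'
  FINALIZE_FN='MySumFinalize';
</codeblock>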
<p audience="Cloudera">
The <codeph>INTERMEDIATE</codeph> clause specifies the data type of intermediate values passed from the
<q>update</q> phase to the <q>merge</q> phase, and from the <q>merge</q> phase to the <q>finalize</q> phase.
You can use any of the existing Impala data types, or the special notation
<codeph>CHAR(<varname>n</varname>)</codeph> to allocate a scratch area of <varname>n</varname> bytes for the
intermediate result. For example, if the different phases of your UDA pass strings to each other but in the
end the function returns a <codeph>BIGINT</codeph> value, you would specify <codeph>INTERMEDIATE
STRING</codeph>. Likewise, if the different phases of your UDA pass 2 separate <codeph>BIGINT</codeph> values
between them (8 bytes each), you would specify <codeph>INTERMEDIATE CHAR(16)</codeph> so that each function
could read from and write to a 16-byte buffer.
</p>
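<p>
As a sketch of that second case (the function, library, and symbol names are hypothetical):
</p>

<codeblock>-- The phases pass two BIGINT values (16 bytes total) between them,
-- but the final result is a single BIGINT.
CREATE AGGREGATE FUNCTION my_ratio(BIGINT, BIGINT) RETURNS BIGINT
  INTERMEDIATE CHAR(16)
  LOCATION '/user/impala/udfs/libmyuda.so'
  UPDATE_FN='MyRatioUpdate'
  MERGE_FN='MyRatioMerge';
</codeblock>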
<p>
For end-to-end examples of UDAs, see <xref href="impala_udf.xml#udfs"/>.
</p>

<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

<p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<ul>
<li>
You can write Impala UDFs in either C++ or Java. C++ UDFs are new to Impala, and are the recommended format
for high performance utilizing native code. Java-based UDFs are compatible between Impala and Hive, and are
most suited to reusing existing Hive UDFs. (Impala can run Java-based Hive UDFs but not Hive UDAs.)
</li>

<li rev="2.5.0 IMPALA-1748 CDH-38369 IMPALA-2843 CDH-39148">
<keyword keyref="impala25_full"/> introduces UDF improvements to persistence for both C++ and Java UDFs,
and better compatibility between Impala and Hive for Java UDFs.
See <xref href="impala_udf.xml#udfs"/> for details.
</li>

<li>
The body of the UDF is represented by a <codeph>.so</codeph> or <codeph>.jar</codeph> file, which you store
in HDFS and the <codeph>CREATE FUNCTION</codeph> statement distributes to each Impala node.
</li>

<li>
Impala calls the underlying code during SQL statement evaluation, as many times as needed to process all
the rows from the result set. All UDFs are assumed to be deterministic, that is, to always return the same
result when passed the same argument values. Impala might or might not skip some invocations of a UDF if
the result value is already known from a previous call. Therefore, do not rely on the UDF being called a
specific number of times, and do not return different result values based on some external factor such as
the current time, a random number function, or an external data source that could be updated while an
Impala query is in progress.
</li>

<li>
The names of the function arguments in the UDF are not significant, only their number, positions, and data
types.
</li>

<li>
You can overload the same function name by creating multiple versions of the function, each with a
different argument signature. For security reasons, you cannot make a UDF with the same name as any
built-in function.
</li>

<li>
In the UDF code, you represent the function return result as a <codeph>struct</codeph>. This
<codeph>struct</codeph> contains two fields. The first field is a <codeph>boolean</codeph> representing
whether the value is <codeph>NULL</codeph> or not. (When this field is <codeph>true</codeph>, the return
value is interpreted as <codeph>NULL</codeph>.) The second field is the same type as the specified function
return type, and holds the return value when the function returns something other than
<codeph>NULL</codeph>.
</li>

<li>
In the UDF code, you represent the function arguments as an initial pointer to a UDF context structure,
followed by references to zero or more <codeph>struct</codeph>s, corresponding to each of the arguments.
Each <codeph>struct</codeph> has the same two fields as with the return value, a <codeph>boolean</codeph>
field representing whether the argument is <codeph>NULL</codeph>, and a field of the appropriate type
holding any non-<codeph>NULL</codeph> argument value.
</li>

<li>
For sample code and build instructions for UDFs,
see <xref href="https://github.com/cloudera/impala/tree/master/be/src/udf_samples" scope="external" format="html">the sample UDFs in the Impala github repo</xref>.
</li>

<li>
Because the file representing the body of the UDF is stored in HDFS, it is automatically available to all
the Impala nodes. You do not need to manually copy any UDF-related files between servers.
</li>

<li>
Because Impala currently does not have any <codeph>ALTER FUNCTION</codeph> statement, if you need to rename
a function, move it to a different database, or change its signature or other properties, issue a
<codeph>DROP FUNCTION</codeph> statement for the original function followed by a <codeph>CREATE
FUNCTION</codeph> with the desired properties, as shown in the sketch after this list.
</li>

<li>
Because each UDF is associated with a particular database, either issue a <codeph>USE</codeph> statement
before doing any <codeph>CREATE FUNCTION</codeph> statements, or specify the name of the function as
<codeph><varname>db_name</varname>.<varname>function_name</varname></codeph>.
</li>
</ul>
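<p>
For example, a minimal sketch of the rename workaround for a C++ UDF; the database, function,
library, and symbol names here are hypothetical:
</p>

<codeblock>-- "Rename" my_db.my_lower to my_db.my_lowercase by dropping and re-creating it.
DROP FUNCTION my_db.my_lower(STRING);
CREATE FUNCTION my_db.my_lowercase(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libmyudfs.so' SYMBOL='MyLower';
</codeblock>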
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>

<p>
Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types
(not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but
not Impala UDFs written in C++.
</p>

<p conref="../shared/impala_common.xml#common/current_user_caveat"/>

<p><b>Persistence:</b></p>

<p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
For additional examples of all kinds of user-defined functions, see <xref href="impala_udf.xml#udfs"/>.
</p>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The following example shows how to take a Java jar file and make all the functions inside one of its classes
into UDFs under a single (overloaded) function name in Impala. Each <codeph>CREATE FUNCTION</codeph> or
<codeph>DROP FUNCTION</codeph> statement applies to all the overloaded Java functions with the same name.
This example uses the signatureless syntax for <codeph>CREATE FUNCTION</codeph> and <codeph>DROP FUNCTION</codeph>,
which is available in <keyword keyref="impala25_full"/> and higher.
</p>

<p rev="2.5.0 IMPALA-2843 CDH-39148">
At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is
available for Impala to reference through the <codeph>CREATE FUNCTION</codeph> statement and
queries that refer to the Impala function name.
</p>

<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
$ jar -tvf udf-examples-cdh570.jar
     0 Mon Feb 22 04:06:50 PST 2016 META-INF/
   122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF
     0 Mon Feb 22 04:06:46 PST 2016 com/
     0 Mon Feb 22 04:06:46 PST 2016 com/cloudera/
     0 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/
  2460 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/IncompatibleUdfTest.class
   541 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/TestUdfException.class
  3438 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/JavaUdfTest.class
  5872 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/TestUdf.class
...
$ hdfs dfs -put udf-examples-cdh570.jar /user/impala/udfs
$ hdfs dfs -ls /user/impala/udfs
Found 2 items
-rw-r--r--   3 jrussell supergroup        853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar
-rw-r--r--   3 jrussell supergroup       7366 2016-06-08 14:25 /user/impala/udfs/udf-examples-cdh570.jar
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
In <cmdname>impala-shell</cmdname>, the <codeph>CREATE FUNCTION</codeph> refers to the HDFS path of the jar file
and the fully qualified class name inside the jar. Each of the functions inside the class becomes an
Impala function, each one overloaded under the specified Impala function name.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
[localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples-cdh570.jar' symbol='com.cloudera.impala.TestUdf';
[localhost:21000] > show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN, BOOLEAN)    | JAVA        | true          |
| DOUBLE      | testudf(DOUBLE)                       | JAVA        | true          |
| DOUBLE      | testudf(DOUBLE, DOUBLE)               | JAVA        | true          |
| DOUBLE      | testudf(DOUBLE, DOUBLE, DOUBLE)       | JAVA        | true          |
| FLOAT       | testudf(FLOAT)                        | JAVA        | true          |
| FLOAT       | testudf(FLOAT, FLOAT)                 | JAVA        | true          |
| FLOAT       | testudf(FLOAT, FLOAT, FLOAT)          | JAVA        | true          |
| INT         | testudf(INT)                          | JAVA        | true          |
| DOUBLE      | testudf(INT, DOUBLE)                  | JAVA        | true          |
| INT         | testudf(INT, INT)                     | JAVA        | true          |
| INT         | testudf(INT, INT, INT)                | JAVA        | true          |
| SMALLINT    | testudf(SMALLINT)                     | JAVA        | true          |
| SMALLINT    | testudf(SMALLINT, SMALLINT)           | JAVA        | true          |
| SMALLINT    | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA        | true          |
| STRING      | testudf(STRING)                       | JAVA        | true          |
| STRING      | testudf(STRING, STRING)               | JAVA        | true          |
| STRING      | testudf(STRING, STRING, STRING)       | JAVA        | true          |
| TINYINT     | testudf(TINYINT)                      | JAVA        | true          |
+-------------+---------------------------------------+-------------+---------------+
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
These are all simple functions that return their single argument, or that
sum, concatenate, and so on, their multiple arguments. Impala determines which
overloaded function to use based on the number and types of the arguments.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
insert into bigint_x values (1), (2), (4), (3);
select testudf(x) from bigint_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| 1               |
| 2               |
| 4               |
| 3               |
+-----------------+

insert into int_x values (1), (2), (4), (3);
select testudf(x, x+1, x*x) from int_x;
+-------------------------------+
| udfs.testudf(x, x + 1, x * x) |
+-------------------------------+
| 4                             |
| 9                             |
| 25                            |
| 16                            |
+-------------------------------+

select testudf(x) from string_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| one             |
| two             |
| four            |
| three           |
+-----------------+
select testudf(x,x) from string_x;
+--------------------+
| udfs.testudf(x, x) |
+--------------------+
| oneone             |
| twotwo             |
| fourfour           |
| threethree         |
+--------------------+
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The previous example used the same Impala function name as the name of the class.
This example shows how the Impala function name is independent of the underlying
Java class or function names. A second <codeph>CREATE FUNCTION</codeph> statement
results in a set of overloaded functions all named <codeph>my_func</codeph>,
to go along with the overloaded functions all named <codeph>testudf</codeph>.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
create function my_func location '/user/impala/udfs/udf-examples-cdh570.jar'
  symbol='com.cloudera.impala.TestUdf';

show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | my_func(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | my_func(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | my_func(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The corresponding <codeph>DROP FUNCTION</codeph> statement with no signature
drops all the overloaded functions with that name.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
drop function my_func;
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The signatureless <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs ensures that
the functions shown in this example remain available after the Impala service
(specifically, the Catalog Server) is restarted.
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_udf.xml#udfs"/> for more background information, usage instructions, and examples for
Impala UDFs; <xref href="impala_drop_function.xml#drop_function"/>
</p>
</conbody>
</concept>
70
docs/topics/impala_create_role.xml
Normal file
@@ -0,0 +1,70 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="create_role">

<title>CREATE ROLE Statement (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>CREATE ROLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="DDL"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Sentry"/>
<data name="Category" value="Security"/>
<data name="Category" value="Roles"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. -->
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">CREATE ROLE statement</indexterm>
<!-- Copied from Sentry docs. Turn into conref. -->
The <codeph>CREATE ROLE</codeph> statement creates a role to which privileges can be granted. Privileges can
be granted to roles, which can then be assigned to users. A user that has been assigned a role will only be
able to exercise the privileges of that role. Only users that have administrative privileges can create or drop
roles. By default, the <codeph>hive</codeph>, <codeph>impala</codeph>, and <codeph>hue</codeph> users have
administrative privileges in Sentry.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>CREATE ROLE <varname>role_name</varname>
</codeblock>

<p conref="../shared/impala_common.xml#common/privileges_blurb"/>

<p>
Only administrative users (those with <codeph>ALL</codeph> privileges on the server, defined in the Sentry
policy file) can use this statement.
</p>
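<p>
As an illustrative sketch, a role is typically created, granted to a group, and then granted
privileges; the role, group, and database names below are hypothetical, and the
<codeph>GRANT</codeph> statements are described separately in <xref href="impala_grant.xml#grant"/>:
</p>

<codeblock>CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
GRANT SELECT ON DATABASE sales_db TO ROLE analyst_role;
</codeblock>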
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>

<p>
Impala makes use of any roles and privileges specified by the <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements in Hive, and Hive makes use of any roles and privileges specified by the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala. The Impala <codeph>GRANT</codeph>
and <codeph>REVOKE</codeph> statements for privileges do not require the <codeph>ROLE</codeph> keyword to be
repeated before each role name, unlike the equivalent Hive statements.
</p>

<!-- To do: nail down the new SHOW syntax, e.g. SHOW ROLES, SHOW CURRENT ROLES, SHOW GROUPS. -->

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>,
<xref href="impala_revoke.xml#revoke"/>, <xref href="impala_drop_role.xml#drop_role"/>,
<xref href="impala_show.xml#show"/>
</p>
</conbody>
</concept>
832
docs/topics/impala_create_table.xml
Normal file
@@ -0,0 +1,832 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="create_table" outputclass="impala sql_statement">

<title outputclass="impala_title sql_statement_title">CREATE TABLE Statement</title>
<titlealts audience="PDF"><navtitle>CREATE TABLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="S3"/>
<!-- <data name="Category" value="Kudu"/> -->
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">CREATE TABLE statement</indexterm>
Creates a new table and specifies its characteristics. While creating a table, you optionally specify aspects
such as:
</p>

<ul>
<li>
Whether the table is internal or external.
</li>

<li>
The columns and associated data types.
</li>

<li>
The columns used for physically partitioning the data.
</li>

<li>
The file format for data files.
</li>

<li>
The HDFS directory where the data files are located.
</li>
</ul>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p>
The general syntax for creating a table and specifying its columns is as follows:
</p>

<p>
<b>Explicit column definitions:</b>
</p>

<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
  (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)
  [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
  [COMMENT '<varname>table_comment</varname>']
  [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
  [
   [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
  ]
  [LOCATION '<varname>hdfs_path</varname>']
  [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
<ph rev="1.4.0">  [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
</codeblock>
<p>
<b>Column definitions inferred from data file:</b>
</p>

<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
  LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'
  [COMMENT '<varname>table_comment</varname>']
  [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
  [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
  [
   [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
  ]
  [LOCATION '<varname>hdfs_path</varname>']
  [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
<ph rev="1.4.0">  [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
data_type:
    <varname>primitive_type</varname>
  | array_type
  | map_type
  | struct_type
</codeblock>
<p>
<b>CREATE TABLE AS SELECT:</b>
</p>

<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
  <ph rev="2.5.0">[PARTITIONED BY (<varname>col_name</varname>[, ...])]</ph>
  [COMMENT '<varname>table_comment</varname>']
  [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
  [
   [ROW FORMAT <varname>row_format</varname>] <ph rev="CDH-41501">[STORED AS <varname>ctas_file_format</varname>]</ph>
  ]
  [LOCATION '<varname>hdfs_path</varname>']
  [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
<ph rev="1.4.0">  [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
AS
  <varname>select_statement</varname></codeblock>
<codeblock>primitive_type:
    TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
<ph rev="1.4.0">  | DECIMAL</ph>
  | STRING
<ph rev="2.0.0">  | CHAR</ph>
<ph rev="2.0.0">  | VARCHAR</ph>
  | TIMESTAMP

<ph rev="2.3.0">complex_type:
    struct_type
  | array_type
  | map_type

struct_type: STRUCT < <varname>name</varname> : <varname>primitive_or_complex_type</varname> [COMMENT '<varname>comment_string</varname>'], ... >

array_type: ARRAY < <varname>primitive_or_complex_type</varname> >

map_type: MAP < <varname>primitive_type</varname>, <varname>primitive_or_complex_type</varname> >
</ph>
row_format:
  DELIMITED [FIELDS TERMINATED BY '<varname>char</varname>' [ESCAPED BY '<varname>char</varname>']]
  [LINES TERMINATED BY '<varname>char</varname>']

file_format:
    PARQUET
  | TEXTFILE
  | AVRO
  | SEQUENCEFILE
  | RCFILE

<ph rev="CDH-41501">ctas_file_format:
    PARQUET
  | TEXTFILE</ph>
</codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<!-- Should really have some info up front about all the data types and file formats.
Consider adding here, or at least making inline links to the relevant keywords
in the syntax spec above. -->

<p>
<b>Column definitions:</b>
</p>

<p>
Depending on the form of the <codeph>CREATE TABLE</codeph> statement, the column definitions are
either required or not allowed.
</p>

<p>
With the <codeph>CREATE TABLE AS SELECT</codeph> and <codeph>CREATE TABLE LIKE</codeph>
syntax, you do not specify the columns at all; the column names and types are derived from the source table, query,
or data file.
</p>

<p>
With the basic <codeph>CREATE TABLE</codeph> syntax, you must list one or more columns, each with its
name, type, and optionally a comment, in addition to any columns used as partitioning keys.
There is one exception where the column list is not required: when creating an Avro table with the
<codeph>STORED AS AVRO</codeph> clause, you can omit the list of columns and specify the same metadata
as part of the <codeph>TBLPROPERTIES</codeph> clause.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

<p rev="2.3.0">
The Impala complex types (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
are available in <keyword keyref="impala23_full"/> and higher.
Because you can nest these types (for example, to make an array of maps or a struct
with an array field), these types are also sometimes referred to as nested types.
See <xref href="impala_complex_types.xml#complex_types"/> for usage details.
</p>

<!-- This is kind of an obscure and rare usage scenario. Consider moving all the complex type stuff further down
after some of the more common clauses. -->
<p rev="2.3.0">
Impala can create tables containing complex type columns, with any supported file format.
Because currently Impala can only query complex type columns in Parquet tables, creating
tables with complex type columns and other file formats such as text is of limited use.
For example, you might create a text table including some columns with complex types with Impala, and use Hive
as part of your ETL process to ingest the nested type data and copy it to an identical Parquet table.
Or you might create a partitioned table containing complex type columns using one file format, and
use <codeph>ALTER TABLE</codeph> to change the file format of individual partitions to Parquet; Impala
can then query only the Parquet-format partitions in that table.
</p>

<p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
<p>
<b>Internal and external tables (EXTERNAL and LOCATION clauses):</b>
</p>

<p>
By default, Impala creates an <q>internal</q> table, where Impala manages the underlying data files for the
table, and physically deletes the data files when you drop the table. If you specify the
<codeph>EXTERNAL</codeph> clause, Impala treats the table as an <q>external</q> table, where the data files
are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves
the data files in place when you drop the table. For details about internal and external tables, see
<xref href="impala_tables.xml#tables"/>.
</p>

<p>
Typically, for an external table you include a <codeph>LOCATION</codeph> clause to specify the path to the
HDFS directory where Impala reads and writes files for the table. For example, if your data pipeline produces
Parquet files in the HDFS directory <filepath>/user/etl/destination</filepath>, you might create an external
table as follows:
</p>

<codeblock>CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP)
  STORED AS PARQUET LOCATION '/user/etl/destination';
</codeblock>

<p>
Although the <codeph>EXTERNAL</codeph> and <codeph>LOCATION</codeph> clauses are often specified together,
<codeph>LOCATION</codeph> is optional for external tables, and you can also specify <codeph>LOCATION</codeph>
for internal tables. The difference is all about whether Impala <q>takes control</q> of the underlying data
files and moves them when you rename the table, or deletes them when you drop the table. For more about
internal and external tables and how they interact with the <codeph>LOCATION</codeph> attribute, see
<xref href="impala_tables.xml#tables"/>.
</p>

<p>
<b>Partitioned tables (PARTITIONED BY clause):</b>
</p>

<p>
The <codeph>PARTITIONED BY</codeph> clause divides the data files based on the values from one or more
specified columns. Impala queries can use the partition metadata to minimize the amount of data that is read
from disk or transmitted across the network, particularly during join queries. For details about
partitioning, see <xref href="impala_partitioning.xml#partitioning"/>.
</p>
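<p>
For example, a minimal sketch of a partitioned table; the table and column names are hypothetical:
</p>

<codeblock>CREATE TABLE census_data (name STRING, address STRING)
  PARTITIONED BY (year SMALLINT, month TINYINT)
  STORED AS PARQUET;
</codeblock>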
<p rev="2.5.0">
Prior to <keyword keyref="impala25_full"/>, you could use a partitioned table
as the source and copy data from it, but could not specify any partitioning clauses for the new table.
In <keyword keyref="impala25_full"/> and higher, you can now use the <codeph>PARTITIONED BY</codeph> clause with a
<codeph>CREATE TABLE AS SELECT</codeph> statement. See the examples under the following discussion of
the <codeph>CREATE TABLE AS SELECT</codeph> syntax variation.
</p>

<!--
<p rev="kudu">
<b>Partitioning for Kudu tables (DISTRIBUTE BY clause)</b>
</p>

<p rev="kudu">
For Kudu tables, you specify logical partitioning across one or more columns using the
<codeph>DISTRIBUTE BY</codeph> clause. In contrast to partitioning for HDFS-based tables,
multiple values for a partition key column can be located in the same partition.
The optional <codeph>HASH</codeph> clause lets you divide one or a set of partition key columns
into a specified number of buckets; you can use more than one <codeph>HASH</codeph>
clause, specifying a distinct set of partition key columns for each.
The optional <codeph>RANGE</codeph> clause further subdivides the partitions, based on
a set of literal values for the partition key columns.
</p>
-->

<p>
<b>Specifying file format (STORED AS and ROW FORMAT clauses):</b>
</p>

<p rev="DOCS-1523">
The <codeph>STORED AS</codeph> clause identifies the format of the underlying data files. Currently, Impala
can query more types of file formats than it can create or insert into. Use Hive to perform any create or
data load operations that are not currently available in Impala. For example, Impala can create an Avro,
SequenceFile, or RCFile table but cannot insert data into it. There are also Impala-specific procedures for using
compression with each kind of file format. For details about working with data files of various formats, see
<xref href="impala_file_formats.xml#file_formats"/>.
</p>

<note>
In Impala 1.4.0 and higher, Impala can create Avro tables, which formerly required doing the <codeph>CREATE
TABLE</codeph> statement in Hive. See <xref href="impala_avro.xml#avro"/> for details and examples.
</note>
<p>
By default (when no <codeph>STORED AS</codeph> clause is specified), data files in Impala tables are created
as text files with Ctrl-A (hex 01) characters as the delimiter.
<!-- Verify if ROW FORMAT is entirely ignored outside of text tables, or does it apply somehow to SequenceFile and/or RCFile too? -->
Specify the <codeph>ROW FORMAT DELIMITED</codeph> clause to produce or ingest data files that use a different
delimiter character such as tab or <codeph>|</codeph>, or a different line end character such as carriage
return or newline. When specifying delimiter and line end characters with the <codeph>FIELDS TERMINATED
BY</codeph> and <codeph>LINES TERMINATED BY</codeph> clauses, use <codeph>'\t'</codeph> for tab,
<codeph>'\n'</codeph> for newline or linefeed, <codeph>'\r'</codeph> for carriage return, and
<codeph>\</codeph><codeph>0</codeph> for ASCII <codeph>nul</codeph> (hex 00). For more examples of text
tables, see <xref href="impala_txtfile.xml#txtfile"/>.
</p>
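<p>
As a sketch of this clause (the table name is hypothetical), a comma-delimited text table
might be declared as follows:
</p>

<codeblock>CREATE TABLE csv_data (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;
</codeblock>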
<p>
The <codeph>ESCAPED BY</codeph> clause applies both to text files that you create through an
<codeph>INSERT</codeph> statement to an Impala <codeph>TEXTFILE</codeph> table, and to existing data files
that you put into an Impala table directory. (You can ingest existing data files either by creating the table
with <codeph>CREATE EXTERNAL TABLE ... LOCATION</codeph>, the <codeph>LOAD DATA</codeph> statement, or
through an HDFS operation such as <codeph>hdfs dfs -put <varname>file</varname>
<varname>hdfs_path</varname></codeph>.) Choose an escape character that is not used anywhere else in the
file, and put it in front of each instance of the delimiter character that occurs within a field value.
Surrounding field values with quotation marks does not help Impala to parse fields with embedded delimiter
characters; the quotation marks are considered to be part of the column value. If you want to use
<codeph>\</codeph> as the escape character, specify the clause in <cmdname>impala-shell</cmdname> as
<codeph>ESCAPED BY '\\'</codeph>.
</p>
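<p>
For instance, a minimal sketch where fields are delimited by <codeph>|</codeph> and any
embedded <codeph>|</codeph> characters in the data are escaped with a backslash (written as
<codeph>'\\'</codeph> in <cmdname>impala-shell</cmdname>); the table name is hypothetical:
</p>

<codeblock>CREATE TABLE escaped_text (c1 STRING, c2 STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' ESCAPED BY '\\'
  STORED AS TEXTFILE;
</codeblock>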
<note conref="../shared/impala_common.xml#common/thorn"/>

<p>
<b>Cloning tables (LIKE clause):</b>
</p>

<p>
To create an empty table with the same columns, comments, and other attributes as another table, use the
following variation. The <codeph>CREATE TABLE ... LIKE</codeph> form allows a restricted set of clauses,
currently only the <codeph>LOCATION</codeph>, <codeph>COMMENT</codeph>, and <codeph>STORED AS</codeph>
clauses.
</p>

<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
  <ph rev="1.4.0">LIKE { [<varname>db_name</varname>.]<varname>table_name</varname> | PARQUET '<varname>hdfs_path_of_parquet_file</varname>' }</ph>
  [COMMENT '<varname>table_comment</varname>']
  [STORED AS <varname>file_format</varname>]
  [LOCATION '<varname>hdfs_path</varname>']</codeblock>
<note rev="1.2.0">
<p rev="1.2.0">
To clone the structure of a table and transfer data into it in a single operation, use the <codeph>CREATE
TABLE AS SELECT</codeph> syntax described in the next subsection.
</p>
</note>

<p>
When you clone the structure of an existing table using the <codeph>CREATE TABLE ... LIKE</codeph> syntax,
the new table keeps the same file format as the original one, so you only need to specify the <codeph>STORED
AS</codeph> clause if you want to use a different file format, or when specifying a view as the original
table. (Creating a table <q>like</q> a view produces a text table by default.)
</p>

<p>
Although normally Impala cannot create an HBase table directly, Impala can clone the structure of an existing
HBase table with the <codeph>CREATE TABLE ... LIKE</codeph> syntax, preserving the file format and metadata
from the original table.
</p>

<p>
There are some exceptions to the ability to use <codeph>CREATE TABLE ... LIKE</codeph> with an Avro table.
For example, you cannot use this technique for an Avro table that is specified with an Avro schema but no
columns. When in doubt, check if a <codeph>CREATE TABLE ... LIKE</codeph> operation works in Hive; if not, it
typically will not work in Impala either.
</p>

<p>
If the original table is partitioned, the new table inherits the same partition key columns. Because the new
table is initially empty, it does not inherit the actual partitions that exist in the original one. To create
partitions in the new table, insert data or issue <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements.
</p>

<p conref="../shared/impala_common.xml#common/create_table_like_view"/>

<p>
Because <codeph>CREATE TABLE ... LIKE</codeph> only manipulates table metadata, not the physical data of the
table, issue <codeph>INSERT INTO TABLE</codeph> statements afterward to copy any data from the original table
into the new one, optionally converting the data to a new file format. (For some file formats, Impala can do
a <codeph>CREATE TABLE ... LIKE</codeph> to create the table, but Impala cannot insert data in that file
format; in these cases, you must load the data in Hive. See
<xref href="impala_file_formats.xml#file_formats"/> for details.)
</p>
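<p>
Putting those two steps together, a sketch of cloning a table's layout into a new Parquet
table and then copying the data; the table names are hypothetical:
</p>

<codeblock>CREATE TABLE t1_parquet LIKE t1 STORED AS PARQUET;
INSERT INTO TABLE t1_parquet SELECT * FROM t1;
</codeblock>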
<p rev="1.2" id="ctas">
<b>CREATE TABLE AS SELECT:</b>
</p>

<p>
The <codeph>CREATE TABLE AS SELECT</codeph> syntax is a shorthand notation to create a table based on column
definitions from another table, and copy data from the source table to the destination table without issuing
any separate <codeph>INSERT</codeph> statement. This idiom is so popular that it has its own acronym,
<q>CTAS</q>.
</p>

<p>
The following examples show how to copy data from a source table <codeph>T1</codeph>
to a variety of destination tables, applying various transformations to the table
properties, table layout, or the data itself as part of the operation:
</p>
<codeblock>
-- Sample table to be the source of CTAS operations.
CREATE TABLE t1 (x INT, y STRING);
INSERT INTO t1 VALUES (1, 'one'), (2, 'two'), (3, 'three');

-- Clone all the columns and data from one table to another.
CREATE TABLE clone_of_t1 AS SELECT * FROM t1;
+-------------------+
| summary           |
+-------------------+
| Inserted 3 row(s) |
+-------------------+

-- Clone the columns and data, and convert the data to a different file format.
CREATE TABLE parquet_version_of_t1 STORED AS PARQUET AS SELECT * FROM t1;
+-------------------+
| summary           |
+-------------------+
| Inserted 3 row(s) |
+-------------------+

-- Copy only some rows to the new table.
CREATE TABLE subset_of_t1 AS SELECT * FROM t1 WHERE x >= 2;
+-------------------+
| summary           |
+-------------------+
| Inserted 2 row(s) |
+-------------------+

-- Same idea as CREATE TABLE LIKE: clone table layout but do not copy any data.
CREATE TABLE empty_clone_of_t1 AS SELECT * FROM t1 WHERE 1=0;
+-------------------+
| summary           |
+-------------------+
| Inserted 0 row(s) |
+-------------------+

-- Reorder and rename columns and transform the data.
CREATE TABLE t5 AS SELECT upper(y) AS s, x+1 AS a, 'Entirely new column' AS n FROM t1;
+-------------------+
| summary           |
+-------------------+
| Inserted 3 row(s) |
+-------------------+
SELECT * FROM t5;
+-------+---+---------------------+
| s     | a | n                   |
+-------+---+---------------------+
| ONE   | 2 | Entirely new column |
| TWO   | 3 | Entirely new column |
| THREE | 4 | Entirely new column |
+-------+---+---------------------+
</codeblock>
<!-- These are a little heavyweight to get into here. Therefore commenting out.
Some overlap with the new column-changing examples in the code listing above.
Create tables with different column order, names, or types than the original.
CREATE TABLE some_columns_from_t1 AS SELECT c1, c3, c5 FROM t1;
CREATE TABLE reordered_columns_from_t1 AS SELECT c4, c3, c1, c2 FROM t1;
CREATE TABLE synthesized_columns AS SELECT upper(c1) AS all_caps, c2+c3 AS total, "California" AS state FROM t1;</codeblock>
-->

<!-- CREATE TABLE AS <select> now incorporated up higher in the original syntax diagram. -->

<p rev="1.2">
See <xref href="impala_select.xml#select"/> for details about query syntax for the <codeph>SELECT</codeph>
portion of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
</p>

<p rev="1.2">
The newly created table inherits the column names that you select from the original table, which you can
override by specifying column aliases in the query. Any column or table comments from the original table are
not carried over to the new table.
</p>

<note rev="DOCS-1523">
When using the <codeph>STORED AS</codeph> clause with a <codeph>CREATE TABLE AS SELECT</codeph>
statement, the destination table must use a file format that Impala can write to: currently,
text or Parquet. You cannot specify an Avro, SequenceFile, or RCFile table as the destination
table for a CTAS operation.
</note>
<p rev="2.5.0">
Prior to <keyword keyref="impala25_full"/>, you could use a partitioned table
as the source and copy data from it, but could not specify any partitioning clauses for the new table.
In <keyword keyref="impala25_full"/> and higher, you can now use the <codeph>PARTITIONED BY</codeph> clause with a
<codeph>CREATE TABLE AS SELECT</codeph> statement. The following example demonstrates how you can copy
data from an unpartitioned table in a <codeph>CREATE TABLE AS SELECT</codeph> operation, creating a new
partitioned table in the process. The main syntax consideration is the column order in the <codeph>PARTITIONED BY</codeph>
clause and the select list: the partition key columns must be listed last in the select list, in the same
order as in the <codeph>PARTITIONED BY</codeph> clause. Therefore, in this case, the column order in the
destination table is different from the source table. You also only specify the column names in the
<codeph>PARTITIONED BY</codeph> clause, not the data types or column comments.
</p>

<codeblock rev="2.5.0">
create table partitions_no (year smallint, month tinyint, s string);
insert into partitions_no values (2016, 1, 'January 2016'),
  (2016, 2, 'February 2016'), (2016, 3, 'March 2016');

-- Prove that the source table is not partitioned.
show partitions partitions_no;
ERROR: AnalysisException: Table is not partitioned: ctas_partition_by.partitions_no

-- Create new table with partitions based on column values from source table.
<b>create table partitions_yes partitioned by (year, month)
  as select s, year, month from partitions_no;</b>
+-------------------+
| summary           |
+-------------------+
| Inserted 3 row(s) |
+-------------------+

-- Prove that the destination table is partitioned.
show partitions partitions_yes;
+-------+-------+-------+--------+------+...
| year  | month | #Rows | #Files | Size |...
+-------+-------+-------+--------+------+...
| 2016  | 1     | -1    | 1      | 13B  |...
| 2016  | 2     | -1    | 1      | 14B  |...
| 2016  | 3     | -1    | 1      | 11B  |...
| Total |       | -1    | 3      | 38B  |...
+-------+-------+-------+--------+------+...
</codeblock>
<p rev="2.5.0">
The most convenient layout for partitioned tables is with all the
partition key columns at the end. The CTAS <codeph>PARTITIONED BY</codeph> syntax
requires that column order in the select list, and produces that same
column order in the destination table.
</p>

<codeblock rev="2.5.0">
describe partitions_no;
+-------+----------+---------+
| name  | type     | comment |
+-------+----------+---------+
| year  | smallint |         |
| month | tinyint  |         |
| s     | string   |         |
+-------+----------+---------+

-- The CTAS operation forced us to put the partition key columns last.
-- Having those columns last works better with idioms such as SELECT *
-- for partitioned tables.
describe partitions_yes;
+-------+----------+---------+
| name  | type     | comment |
+-------+----------+---------+
| s     | string   |         |
| year  | smallint |         |
| month | tinyint  |         |
+-------+----------+---------+
</codeblock>

<p rev="2.5.0">
Attempting to use a select list with the partition key columns
not at the end results in an error due to a column name mismatch:
</p>

<codeblock rev="2.5.0">
-- We expect this CTAS to fail because non-key column S
-- comes after key columns YEAR and MONTH in the select list.
create table partitions_maybe partitioned by (year, month)
  as select year, month, s from partitions_no;
ERROR: AnalysisException: Partition column name mismatch: year != month
</codeblock>
<p rev="1.2">
As the preceding examples show, a CTAS operation can clone all the data in a table, copy just a subset of
the columns or rows, or reorder and rename columns or construct new ones out of expressions.
</p>

<p rev="1.2">
As part of a CTAS operation, you can convert the data to any file format that Impala can write (currently,
<codeph>TEXTFILE</codeph> and <codeph>PARQUET</codeph>). You cannot specify the lower-level properties of a
text table, such as the delimiter.
</p>
<p rev="obwl" conref="../shared/impala_common.xml#common/insert_sort_blurb"/>

<p rev="1.4.0">
<b>CREATE TABLE LIKE PARQUET:</b>
</p>

<p rev="1.4.0">
The variation <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph>
lets you skip the column definitions of the <codeph>CREATE TABLE</codeph> statement. The column names and
data types are automatically configured based on the organization of the specified Parquet data file, which
must already reside in HDFS. You can use a data file located outside the Impala database directories, or a
file from an existing Impala Parquet table; either way, Impala only uses the column definitions from the file
and does not use the HDFS location for the <codeph>LOCATION</codeph> attribute of the new table. (Although
you can also specify the enclosing directory with the <codeph>LOCATION</codeph> attribute, to both use the
same schema as the data file and point the Impala table at the associated directory for querying.)
</p>

<p rev="1.4.0">
The following considerations apply when you use the <codeph>CREATE TABLE LIKE PARQUET</codeph> technique:
</p>
<ul rev="1.4.0">
<li>
Any column comments from the original table are not preserved in the new table. Each column in the new
table has a comment stating the low-level Parquet field type used to deduce the appropriate SQL column
type.
</li>

<li>
If you use a data file from a partitioned Impala table, any partition key columns from the original table
are left out of the new table, because they are represented in HDFS directory names rather than stored in
the data file. To preserve the partition information, repeat the same <codeph>PARTITIONED BY</codeph> clause
as in the original <codeph>CREATE TABLE</codeph> statement.
</li>

<li>
The file format of the new table defaults to text, as with other kinds of <codeph>CREATE TABLE</codeph>
statements. To make the new table also use Parquet format, include the clause <codeph>STORED AS
PARQUET</codeph> in the <codeph>CREATE TABLE LIKE PARQUET</codeph> statement.
</li>

<li>
If the Parquet data file comes from an existing Impala table, currently, any <codeph>TINYINT</codeph> or
<codeph>SMALLINT</codeph> columns are turned into <codeph>INT</codeph> columns in the new table.
Internally, Parquet stores such values as 32-bit integers.
</li>

<li>
When the destination table uses the Parquet file format, the <codeph>CREATE TABLE AS SELECT</codeph> and
<codeph>INSERT ... SELECT</codeph> statements always create at least one data file, even if the
<codeph>SELECT</codeph> part of the statement does not match any rows. You can use such an empty Parquet
data file as a template for subsequent <codeph>CREATE TABLE LIKE PARQUET</codeph> statements.
</li>
</ul>
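<p>
For instance, a sketch of deriving the column definitions from an existing Parquet data file
(the HDFS path and table name are hypothetical), keeping Parquet as the file format of the new table:
</p>

<codeblock>CREATE TABLE new_from_parquet
  LIKE PARQUET '/user/etl/destination/datafile1.parq'
  STORED AS PARQUET;
</codeblock>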
<p>
For more details about creating Parquet tables, and examples of the <codeph>CREATE TABLE LIKE
PARQUET</codeph> syntax, see <xref href="impala_parquet.xml#parquet"/>.
</p>

<p>
<b>Visibility and Metadata (TBLPROPERTIES and WITH SERDEPROPERTIES clauses):</b>
</p>

<p rev="1.2">
You can associate arbitrary items of metadata with a table by specifying the <codeph>TBLPROPERTIES</codeph>
clause. This clause takes a comma-separated list of key-value pairs and stores those items in the metastore
database. You can also change the table properties later with an <codeph>ALTER TABLE</codeph> statement. You
can observe the table properties for different delimiter and escape characters using the <codeph>DESCRIBE
FORMATTED</codeph> command, and change those settings for an existing table with <codeph>ALTER TABLE ... SET
TBLPROPERTIES</codeph>.
</p>

<p rev="1.2">
You can also associate SerDes properties with the table by specifying key-value pairs through the
<codeph>WITH SERDEPROPERTIES</codeph> clause. This metadata is not used by Impala, which has its own built-in
serializer and deserializer for the file formats it supports. Particular property values might be needed for
Hive compatibility with certain variations of file formats, particularly Avro.
</p>
<p>
Some DDL operations that interact with other Hadoop components require specifying particular values in the
<codeph>SERDEPROPERTIES</codeph> or <codeph>TBLPROPERTIES</codeph> fields, such as creating an Avro table or
an HBase table. (You typically create HBase tables in Hive, because they require additional clauses not
currently available in Impala.)
<!-- Haven't got a working example from Lenni, so suppressing this recommendation for now.
The Avro schema properties can be specified through either
<codeph>TBLPROPERTIES</codeph> or <codeph>SERDEPROPERTIES</codeph>;
for best compatibility with future versions of Hive,
use <codeph>SERDEPROPERTIES</codeph> in this case.
-->
</p>

<p>
To see the column definitions and column comments for an existing table, for example before issuing a
<codeph>CREATE TABLE ... LIKE</codeph> or a <codeph>CREATE TABLE ... AS SELECT</codeph> statement, issue the
statement <codeph>DESCRIBE <varname>table_name</varname></codeph>. To see even more detail, such as the
location of data files and the values for clauses such as <codeph>ROW FORMAT</codeph> and <codeph>STORED
AS</codeph>, issue the statement <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>.
<codeph>DESCRIBE FORMATTED</codeph> is also needed to see any overall table comment (as opposed to individual
column comments).
</p>

<p>
After creating a table, your <cmdname>impala-shell</cmdname> session or another
<cmdname>impala-shell</cmdname> connected to the same node can immediately query that table. There might be a
brief interval (one statestore heartbeat) before the table can be queried through a different Impala node. To
make the <codeph>CREATE TABLE</codeph> statement return only when the table is recognized by all Impala nodes
in the cluster, enable the <codeph>SYNC_DDL</codeph> query option.
</p>
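<p>
For example, a short sketch of enabling that option in <cmdname>impala-shell</cmdname> before the
DDL statement (the table name is hypothetical):
</p>

<codeblock>SET SYNC_DDL=1;
-- The statement now returns only after all Impala nodes recognize the new table.
CREATE TABLE t2 (x INT);
</codeblock>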
<p rev="1.4.0">
<b>HDFS caching (CACHED IN clause):</b>
</p>

<p rev="1.4.0">
If you specify the <codeph>CACHED IN</codeph> clause, any existing or future data files in the table
directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching
mechanism. See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details about using the HDFS
caching feature.
</p>
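<p>
A minimal sketch based on the syntax shown earlier; the cache pool name is hypothetical and
must already exist:
</p>

<codeblock>CREATE TABLE cached_t (x INT)
  CACHED IN 'four_gig_pool' WITH REPLICATION = 3;
</codeblock>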
<p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>

<!-- Say something in here about the SHOW statement, e.g. SHOW TABLES, SHOW TABLE/COLUMN STATS, SHOW PARTITIONS. -->

<p>
<b>Column order</b>:
</p>

<p>
If you intend to use the table to hold data files produced by some external source, specify the columns in
the same order as they appear in the data files.
</p>

<p>
If you intend to insert or copy data into the table through Impala, or if you have control over the way
externally produced data files are arranged, use your judgment to specify columns in the most convenient
order:
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
If certain columns are often <codeph>NULL</codeph>, specify those columns last. You might produce data
|
||||
files that omit these trailing columns entirely. Impala automatically fills in the <codeph>NULL</codeph>
|
||||
values if so.
|
||||
</p>
|
||||
</li>

<li>
<p>
If an unpartitioned table will be used as the source for an <codeph>INSERT ... SELECT</codeph> operation
into a partitioned table, specify last in the unpartitioned table any columns that correspond to
partition key columns in the partitioned table, and in the same order as the partition key columns are
declared in the partitioned table. This technique lets you use <codeph>INSERT ... SELECT *</codeph> when
copying data to the partitioned table, rather than specifying each column name individually. (See the
sketch after this list.)
</p>
</li>

<li>
<p>
If you specify columns in an order that you later discover is suboptimal, you can sometimes work around
the problem without recreating the table. You can create a view that selects columns from the original
table in a permuted order, then do a <codeph>SELECT *</codeph> from the view. When inserting data into a
table, you can specify a permuted order for the inserted columns to match the order in the destination
table.
</p>
</li>
</ul>
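
<p>
A minimal sketch of the staging-table technique from the second item above, with hypothetical table names:
</p>

<codeblock>-- The column matching the partition key (year) is declared last in the staging table.
create table staging (id int, s string, year int);
create table dest (id int, s string) partitioned by (year int);
-- Because the column orders line up, no explicit column list is needed.
insert into dest partition (year) select * from staging;</codeblock>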

<p conref="../shared/impala_common.xml#common/hive_blurb"/>

<p>
Impala queries can make use of metadata about the table and columns, such as the number of rows in a table or
the number of different values in a column. Prior to Impala 1.2.2, to create this metadata, you issued the
<codeph>ANALYZE TABLE</codeph> statement in Hive to gather this information, after creating the table and
loading representative data into it. In Impala 1.2.2 and higher, the <codeph>COMPUTE STATS</codeph> statement
produces these statistics within Impala, without needing to use Hive at all.
</p>

<p conref="../shared/impala_common.xml#common/hbase_blurb"/>

<note>
<p>
The Impala <codeph>CREATE TABLE</codeph> statement cannot create an HBase table, because it currently does
not support the <codeph>STORED BY</codeph> clause needed for HBase tables. Create such tables in Hive, then
query them through Impala. For information on using Impala with HBase tables, see
<xref href="impala_hbase.xml#impala_hbase"/>.
</p>
</note>

<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.2.0">
To create a table where the data resides in the Amazon Simple Storage Service (S3),
specify an <codeph>s3a://</codeph> prefix in the <codeph>LOCATION</codeph> attribute pointing to the data files in S3.
</p>

<p rev="2.6.0 CDH-39913 IMPALA-1878">
In <keyword keyref="impala26_full"/> and higher, you can
use this special <codeph>LOCATION</codeph> syntax
as part of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
</p>
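
<p>
For example, a sketch using a hypothetical bucket and path:
</p>

<codeblock>create external table s3_sales (id int, amount decimal(10,2))
  location 's3a://impala-demo-bucket/sales/';</codeblock>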

<p conref="../shared/impala_common.xml#common/s3_ddl"/>

<p conref="../shared/impala_common.xml#common/insert_sort_blurb"/>

<p conref="../shared/impala_common.xml#common/hdfs_blurb"/>

<p>
The <codeph>CREATE TABLE</codeph> statement for an internal table creates a directory in HDFS. The
<codeph>CREATE EXTERNAL TABLE</codeph> statement associates the table with an existing HDFS directory, and
does not create any new directory in HDFS. To locate the HDFS data directory for a table, issue a
<codeph>DESCRIBE FORMATTED <varname>table</varname></codeph> statement. To examine the contents of that HDFS
directory, use an OS command such as <codeph>hdfs dfs -ls hdfs://<varname>path</varname></codeph>, either
from the OS command line or through the <codeph>shell</codeph> or <codeph>!</codeph> commands in
<cmdname>impala-shell</cmdname>.
</p>
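
<p>
A minimal sketch of that sequence from inside <cmdname>impala-shell</cmdname>, with a hypothetical
table and warehouse path:
</p>

<codeblock>describe formatted t1;   -- Note the Location: value in the output.
!hdfs dfs -ls /user/hive/warehouse/db1.db/t1;   -- List the data files in that directory.</codeblock>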

<p>
The <codeph>CREATE TABLE AS SELECT</codeph> syntax creates data files under the table data directory to hold
any data copied by the <codeph>INSERT</codeph> portion of the statement. (Even if no data is copied, Impala
might create one or more empty data files.)
</p>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have both execute and write
permission for the database directory where the table is being created.
</p>

<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_maybe"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_tables.xml#tables"/>,
<xref href="impala_alter_table.xml#alter_table"/>, <xref href="impala_drop_table.xml#drop_table"/>,
<xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
<xref href="impala_tables.xml#external_tables"/>, <xref href="impala_compute_stats.xml#compute_stats"/>,
<xref href="impala_sync_ddl.xml#sync_ddl"/>, <xref href="impala_show.xml#show_tables"/>,
<xref href="impala_show.xml#show_create_table"/>, <xref href="impala_describe.xml#describe"/>
</p>
</conbody>
</concept>
139
docs/topics/impala_create_view.xml
Normal file
@@ -0,0 +1,139 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="create_view">

<title>CREATE VIEW Statement</title>
<titlealts audience="PDF"><navtitle>CREATE VIEW</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Views"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">CREATE VIEW statement</indexterm>
The <codeph>CREATE VIEW</codeph> statement lets you create a shorthand abbreviation for a more complicated
query. The base query can involve joins, expressions, reordered columns, column aliases, and other SQL
features that can make a query hard to understand or maintain.
</p>

<p>
Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
<codeph>ALTER VIEW</codeph> only involves changes to metadata in the metastore database, not any data files
in HDFS.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>CREATE VIEW [IF NOT EXISTS] <varname>view_name</varname> [(<varname>column_list</varname>)]
  AS <varname>select_statement</varname></codeblock>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
The <codeph>CREATE VIEW</codeph> statement can be useful in scenarios such as the following:
</p>

<ul>
<li>
To turn even the most lengthy and complicated SQL query into a one-liner. You can issue simple queries
against the view from applications, scripts, or interactive queries in <cmdname>impala-shell</cmdname>.
For example:
<codeblock>select * from <varname>view_name</varname>;
select * from <varname>view_name</varname> order by c1 desc limit 10;</codeblock>
The more complicated and hard-to-read the original query, the more benefit there is to simplifying the
query using a view.
</li>

<li>
To hide the underlying table and column names, to minimize maintenance problems if those names change. In
that case, you re-create the view using the new names, and all queries that use the view rather than the
underlying tables keep running with no changes. (See the sketch after this list.)
</li>

<li>
To experiment with optimization techniques and make the optimized queries available to all applications.
For example, if you find a combination of <codeph>WHERE</codeph> conditions, join order, join hints, and so
on that works the best for a class of queries, you can establish a view that incorporates the
best-performing techniques. Applications can then make relatively simple queries against the view, without
repeating the complicated and optimized logic over and over. If you later find a better way to optimize the
original query, when you re-create the view, all the applications immediately take advantage of the
optimized base query.
</li>

<li>
To simplify a whole class of related queries, especially complicated queries involving joins between
multiple tables, complicated expressions in the column list, and other SQL syntax that makes the query
difficult to understand and debug. For example, you might create a view that joins several tables, filters
using several <codeph>WHERE</codeph> conditions, and selects several columns from the result set.
Applications might issue queries against this view that only vary in their <codeph>LIMIT</codeph>,
<codeph>ORDER BY</codeph>, and similar simple clauses.
</li>
</ul>
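
<p>
A minimal sketch of the renaming scenario from the second item above, with hypothetical table and view names:
</p>

<codeblock>-- Original definition, pointing at the old table name.
create view report_data as select id, amount from sales_2014;
-- After the underlying table changes, redefine the view;
-- queries against report_data keep working unchanged.
alter view report_data as select id, amount from sales_current;</codeblock>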

<p>
For queries that require repeating complicated clauses over and over again, for example in the select list,
<codeph>ORDER BY</codeph>, and <codeph>GROUP BY</codeph> clauses, you can use the <codeph>WITH</codeph>
clause as an alternative to creating a view.
</p>
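
<p>
For example, a sketch of the <codeph>WITH</codeph> alternative, using a hypothetical table:
</p>

<codeblock>-- The subquery here could equally well be the body of a view.
with t as (select id, upper(trim(name)) as name from raw_names where name is not null)
  select name, count(*) from t group by name order by name;</codeblock>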

<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_views"/>
<p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/>

<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<!-- TK: Elaborate on these, show queries and real output. -->

<codeblock>-- Create a view that is exactly the same as the underlying table.
create view v1 as select * from t1;

-- Create a view that includes only certain columns from the underlying table.
create view v2 as select c1, c3, c7 from t1;

-- Create a view that filters the values from the underlying table.
create view v3 as select distinct c1, c3, c7 from t1 where c1 is not null and c5 > 0;

-- Create a view that reorders and renames columns from the underlying table.
create view v4 as select c4 as last_name, c6 as address, c2 as birth_date from t1;

-- Create a view that runs functions to convert or transform certain columns.
create view v5 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6, "Constant" c8 from t1;

-- Create a view that hides the complexity of a join query.
create view v6 as select t1.c1, t2.c2 from t1 join t2 on t1.id = t2.id;
</codeblock>

<!-- These examples show CREATE VIEW and corresponding DROP VIEW statements, with different combinations
of qualified and unqualified names. -->

<p conref="../shared/impala_common.xml#common/create_drop_view_examples"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_views.xml#views"/>, <xref href="impala_alter_view.xml#alter_view"/>,
<xref href="impala_drop_view.xml#drop_view"/>
</p>
</conbody>
</concept>
22
docs/topics/impala_data_sources.xml
Normal file
@@ -0,0 +1,22 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="data_sources">

<title>Data Sources</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<xref href="impala_create_data_source.xml#create_data_source"/>,
<xref href="impala_drop_data_source.xml#drop_data_source"/>,
<xref href="impala_create_table.xml#create_table"/>
</p>
</conbody>
</concept>
65
docs/topics/impala_databases.xml
Normal file
@@ -0,0 +1,65 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="databases">

<title>Overview of Impala Databases</title>
<titlealts audience="PDF"><navtitle>Databases</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Databases"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>

<conbody>

<p>
In Impala, a database is a logical container for a group of tables. Each database defines a separate
namespace. Within a database, you can refer to the tables inside it using their unqualified names. Different
databases can contain tables with identical names.
</p>

<p>
Creating a database is a lightweight operation. There are minimal database-specific properties to configure,
only <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph>. There is no <codeph>ALTER DATABASE</codeph> statement.
</p>

<p>
Typically, you create a separate database for each project or application, to avoid naming conflicts between
tables and to make clear which tables are related to each other. The <codeph>USE</codeph> statement lets
you switch between databases. Unqualified references to tables, views, and functions refer to objects
within the current database. You can also refer to objects in other databases by using qualified names
of the form <codeph><varname>dbname</varname>.<varname>object_name</varname></codeph>.
</p>
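
<p>
For example, a minimal sketch with hypothetical database and table names:
</p>

<codeblock>create database sales_analysis;
use sales_analysis;
select count(*) from orders;           -- Unqualified name: table in the current database.
select count(*) from archive.orders;   -- Qualified name: table in another database.</codeblock>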

<p>
Each database is physically represented by a directory in HDFS. When you do not specify a <codeph>LOCATION</codeph>
attribute, the directory is located in the Impala data directory with the associated tables managed by Impala.
When you do specify a <codeph>LOCATION</codeph> attribute, any read and write operations for tables in that
database are relative to the specified HDFS directory.
</p>
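
<p>
A sketch of both forms, assuming a writable HDFS path:
</p>

<codeblock>-- Database directory created under the default Impala data directory.
create database project_a;
-- Database directory placed at an explicit HDFS location.
create database project_b location '/user/impala/project_b';</codeblock>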

<p>
There is a special database, named <codeph>default</codeph>, where you begin when you connect to Impala.
Tables created in <codeph>default</codeph> are physically located one level higher in HDFS than all the
user-created databases.
</p>

<p conref="../shared/impala_common.xml#common/builtins_db"/>

<p>
<b>Related statements:</b>
</p>

<p>
<xref href="impala_create_database.xml#create_database"/>,
<xref href="impala_drop_database.xml#drop_database"/>, <xref href="impala_use.xml#use"/>,
<xref href="impala_show.xml#show_databases"/>
</p>
</conbody>
</concept>
43
docs/topics/impala_datatypes.xml
Normal file
@@ -0,0 +1,43 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="datatypes">

<title>Data Types</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">data types</indexterm>
Impala supports a set of data types that you can use for table columns, expression values, and function
arguments and return values.
</p>

<note>
Currently, Impala supports only scalar types, not composite or nested types. Accessing a table containing any
columns with unsupported types causes an error.
</note>

<p outputclass="toc"/>

<p>
For the notation to write literals of each of these data types, see
<xref href="impala_literals.xml#literals"/>.
</p>

<p>
See <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/> for differences between Impala and
Hive data types.
</p>
</conbody>
</concept>
104
docs/topics/impala_date.xml
Normal file
@@ -0,0 +1,104 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" id="date" rev="2.0.0">

<title>DATE Data Type (<keyword keyref="impala21"/> or higher only)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Dates and Times"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DATE data type</indexterm>
A type representing the date (year, month, and day) as a single numeric value. Used to represent a broader
date range than possible with the <codeph>TIMESTAMP</codeph> type, with fewer distinct values than
<codeph>TIMESTAMP</codeph>, and in a more compact and efficient form than using a <codeph>STRING</codeph>
such as <codeph>'2014-12-31'</codeph>.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock><varname>column_name</varname> DATE</codeblock>

<p>
<b>Range:</b> January 1, 4712 BC .. December 31, 9999 AD.
</p>

<p conref="../shared/impala_common.xml#common/hbase_ok"/>

<p conref="../shared/impala_common.xml#common/parquet_blurb"/>

<ul>
<li>
This type can be read from and written to Parquet files.
</li>

<li>
There is no requirement for a particular level of Parquet.
</li>

<li>
Parquet files generated by Impala and containing this type can be freely interchanged with other components
such as Hive and MapReduce.
</li>
</ul>

<p conref="../shared/impala_common.xml#common/hive_blurb"/>

<p>
TK.
</p>

<p conref="../shared/impala_common.xml#common/conversion_blurb"/>

<p>
TK.
</p>

<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>

<p>
This type can be used for partition key columns. Because it has less granularity (and thus fewer distinct
values) than an equivalent <codeph>TIMESTAMP</codeph> column, and numeric columns are more efficient as
partition keys than strings, prefer to partition by a <codeph>DATE</codeph> column rather than a
<codeph>TIMESTAMP</codeph> column or a <codeph>STRING</codeph> representation of a date.
</p>
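
<p>
For example, a sketch of such a table, with hypothetical table and column names:
</p>

<codeblock>create table events (event_id bigint, details string)
  partitioned by (event_date date);</codeblock>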

<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>

<p>
This type is available on CDH 5.2 or higher.
</p>

<p conref="../shared/impala_common.xml#common/internals_2_bytes"/>

<p conref="../shared/impala_common.xml#common/added_in_20"/>

<p conref="../shared/impala_common.xml#common/column_stats_constant"/>

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<p>
Conversions between <codeph>TIMESTAMP</codeph> and <codeph>DATE</codeph> values, in either direction, are
subject to restrictions. TK.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
The <xref href="impala_timestamp.xml#timestamp">TIMESTAMP</xref> data type is closely related. Some functions
from <xref href="impala_datetime_functions.xml#datetime_functions"/> accept and return <codeph>DATE</codeph>
values.
</p>
</conbody>
</concept>
2482
docs/topics/impala_datetime_functions.xml
Normal file
File diff suppressed because it is too large
150
docs/topics/impala_ddl.xml
Normal file
@@ -0,0 +1,150 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="ddl">

<title>DDL Statements</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Databases"/>
</metadata>
</prolog>

<conbody>

<p>
DDL refers to <q>Data Definition Language</q>, a subset of SQL statements that change the structure of the
database schema in some way, typically by creating, deleting, or modifying schema objects such as databases,
tables, and views. Most Impala DDL statements start with the keywords <codeph>CREATE</codeph>,
<codeph>DROP</codeph>, or <codeph>ALTER</codeph>.
</p>

<p>
The Impala DDL statements are:
</p>

<ul>
<li>
<xref href="impala_alter_table.xml#alter_table"/>
</li>

<li>
<xref href="impala_alter_view.xml#alter_view"/>
</li>

<li>
<xref href="impala_compute_stats.xml#compute_stats"/>
</li>

<li>
<xref href="impala_create_database.xml#create_database"/>
</li>

<li>
<xref href="impala_create_function.xml#create_function"/>
</li>

<li rev="2.0.0">
<xref href="impala_create_role.xml#create_role"/>
</li>

<li>
<xref href="impala_create_table.xml#create_table"/>
</li>

<li>
<xref href="impala_create_view.xml#create_view"/>
</li>

<li>
<xref href="impala_drop_database.xml#drop_database"/>
</li>

<li>
<xref href="impala_drop_function.xml#drop_function"/>
</li>

<li rev="2.0.0">
<xref href="impala_drop_role.xml#drop_role"/>
</li>

<li>
<xref href="impala_drop_table.xml#drop_table"/>
</li>

<li>
<xref href="impala_drop_view.xml#drop_view"/>
</li>

<li rev="2.0.0">
<xref href="impala_grant.xml#grant"/>
</li>

<li rev="2.0.0">
<xref href="impala_revoke.xml#revoke"/>
</li>
</ul>

<p>
After Impala executes a DDL command, information about available tables, columns, views, partitions, and so
on is automatically synchronized between all the Impala nodes in a cluster. (Prior to Impala 1.2, you had to
issue a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement manually on the other
nodes to make them aware of the changes.)
</p>

<p>
If the timing of metadata updates is significant, for example if you use round-robin scheduling where each
query could be issued through a different Impala node, you can enable the
<xref href="impala_sync_ddl.xml#sync_ddl">SYNC_DDL</xref> query option to make the DDL statement wait until
all nodes have been notified about the metadata changes.
</p>

<p rev="2.2.0">
See <xref href="impala_s3.xml#s3"/> for details about how Impala DDL statements interact with
tables and partitions stored in the Amazon S3 filesystem.
</p>

<p>
Although the <codeph>INSERT</codeph> statement is officially classified as a DML (data manipulation language)
statement, it also involves metadata changes that must be broadcast to all Impala nodes, and so is also
affected by the <codeph>SYNC_DDL</codeph> query option.
</p>

<p>
Because the <codeph>SYNC_DDL</codeph> query option makes each DDL operation take longer than normal, you
might only enable it before the last DDL operation in a sequence. For example, if you are running a script
that issues multiple DDL operations to set up an entire new schema, add several new partitions, and so on,
you might minimize the performance overhead by enabling the query option only before the last
<codeph>CREATE</codeph>, <codeph>DROP</codeph>, <codeph>ALTER</codeph>, or <codeph>INSERT</codeph> statement.
The script only finishes when all the relevant metadata changes are recognized by all the Impala nodes, so
you could connect to any node and issue queries through it.
</p>
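
<p>
A minimal sketch of that pattern in a setup script (the schema is hypothetical):
</p>

<codeblock>create database new_schema;
create table new_schema.t1 (x int);
-- Enable SYNC_DDL only for the final statement in the sequence.
set sync_ddl=1;
create table new_schema.t2 (y int);
-- When this returns, all nodes recognize the entire new schema.</codeblock>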

<p>
The classification of DDL, DML, and other statements is not necessarily the same between Impala and Hive.
Impala organizes these statements in a way intended to be familiar to users of relational
databases or data warehouse products. Statements that modify the metastore database, such as <codeph>COMPUTE
STATS</codeph>, are classified as DDL. Statements that only query the metastore database, such as
<codeph>SHOW</codeph> or <codeph>DESCRIBE</codeph>, are put into a separate category of utility statements.
</p>

<note>
The query types shown in the Impala debug web user interface might not match exactly the categories listed
here. For example, currently the <codeph>USE</codeph> statement is shown as DDL in the debug web UI. The
query types shown in the debug web UI are subject to change, for improved consistency.
</note>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
The other major classifications of SQL statements are data manipulation language (see
<xref href="impala_dml.xml#dml"/>) and queries (see <xref href="impala_select.xml#select"/>).
</p>
</conbody>
</concept>
33
docs/topics/impala_debug_action.xml
Normal file
@@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="debug_action">

<title>DEBUG_ACTION Query Option</title>
<titlealts audience="PDF"><navtitle>DEBUG_ACTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Troubleshooting"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DEBUG_ACTION query option</indexterm>
Introduces artificial problem conditions within queries. For internal Cloudera debugging and troubleshooting.
</p>

<p>
<b>Type:</b> <codeph>STRING</codeph>
</p>

<p>
<b>Default:</b> empty string
</p>
</conbody>
</concept>
817
docs/topics/impala_decimal.xml
Normal file
@@ -0,0 +1,817 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="decimal">

<title>DECIMAL Data Type (<keyword keyref="impala14"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DECIMAL</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>

<conbody>

<p>
A numeric data type with fixed scale and precision, used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER
TABLE</codeph> statements. Suitable for financial and other arithmetic calculations where the imprecise
representation and rounding behavior of <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> make those types
impractical.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>

<codeblock><varname>column_name</varname> DECIMAL[(<varname>precision</varname>[,<varname>scale</varname>])]</codeblock>

<p>
<codeph>DECIMAL</codeph> with no precision or scale values is equivalent to <codeph>DECIMAL(9,0)</codeph>.
</p>

<p>
<b>Precision and Scale:</b>
</p>

<p>
<varname>precision</varname> represents the total number of digits that can be represented by the column,
regardless of the location of the decimal point. This value must be between 1 and 38. For example,
representing integer values up to 9999, and floating-point values up to 99.99, both require a precision of 4.
You can also represent corresponding negative values, without any change in the precision. For example, the
range -9999 to 9999 still only requires a precision of 4.
</p>

<p>
<varname>scale</varname> represents the number of fractional digits. This value must be less than or equal to
<varname>precision</varname>. A scale of 0 produces integral values, with no fractional part. If precision
and scale are equal, all the digits come after the decimal point, making all the values between 0 and
0.999... or 0 and -0.999...
</p>

<p>
When <varname>precision</varname> and <varname>scale</varname> are omitted, a <codeph>DECIMAL</codeph> value
is treated as <codeph>DECIMAL(9,0)</codeph>, that is, an integer value ranging from
<codeph>-999,999,999</codeph> to <codeph>999,999,999</codeph>. This is the largest <codeph>DECIMAL</codeph>
value that can still be represented in 4 bytes. If precision is specified but scale is omitted, Impala uses a
value of zero for the scale.
</p>

<p>
Both <varname>precision</varname> and <varname>scale</varname> must be specified as integer literals, not any
other kind of constant expressions.
</p>

<p>
To check the precision or scale for arbitrary values, you can call the
<xref href="impala_math_functions.xml#math_functions"><codeph>precision()</codeph> and
<codeph>scale()</codeph> built-in functions</xref>. For example, you might use these values to figure out how
many characters are required for various fields in a report, or to understand the rounding characteristics of
a formula as applied to a particular <codeph>DECIMAL</codeph> column.
</p>
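
<p>
For example, a minimal sketch of checking those properties for an expression:
</p>

<codeblock>select precision(9.9 / 3), scale(9.9 / 3);</codeblock>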

<p>
<b>Range:</b>
</p>

<p>
The maximum precision value is 38. Thus, the largest integral value is represented by
<codeph>DECIMAL(38,0)</codeph> (999... with 9 repeated 38 times). The most precise fractional value (between
0 and 1, or 0 and -1) is represented by <codeph>DECIMAL(38,38)</codeph>, with 38 digits to the right of the
decimal point. The value closest to 0 would be .0000...1 (37 zeros and the final 1). The value closest to 1
would be .999... (9 repeated 38 times).
</p>

<p>
For a given precision and scale, the range of <codeph>DECIMAL</codeph> values is the same in the positive and
negative directions. For example, <codeph>DECIMAL(4,2)</codeph> can represent from -99.99 to 99.99. This is
different from other integral numeric types where the positive and negative bounds differ slightly.
</p>

<p>
When you use <codeph>DECIMAL</codeph> values in arithmetic expressions, the precision and scale of the result
value are determined as follows:
</p>

<ul>
<li>
<p>
For addition and subtraction, the precision and scale are based on the maximum possible result, that is,
if all the digits of the input values were 9s and the absolute values were added together.
</p>
<!-- Seems like buggy output from this first query, so hiding the example for the time being. -->
<codeblock audience="Cloudera"><![CDATA[[localhost:21000] > select 50000.5 + 12.444, precision(50000.5 + 12.444), scale(50000.5 + 12.444);
+------------------+-----------------------------+-------------------------+
| 50000.5 + 12.444 | precision(50000.5 + 12.444) | scale(50000.5 + 12.444) |
+------------------+-----------------------------+-------------------------+
| 50012.944        | 9                           | 3                       |
+------------------+-----------------------------+-------------------------+
[localhost:21000] > select 99999.9 + 99.999, precision(99999.9 + 99.999), scale(99999.9 + 99.999);
+------------------+-----------------------------+-------------------------+
| 99999.9 + 99.999 | precision(99999.9 + 99.999) | scale(99999.9 + 99.999) |
+------------------+-----------------------------+-------------------------+
| 100099.899       | 9                           | 3                       |
+------------------+-----------------------------+-------------------------+
]]>
</codeblock>
</li>

<li>
<p>
For multiplication, the precision is the sum of the precisions of the input values. The scale is the sum
of the scales of the input values. (See the sketch after this list.)
</p>
</li>

<!-- Need to add some specifics to discussion of division. Details here: http://blogs.msdn.com/b/sqlprogrammability/archive/2006/03/29/564110.aspx -->

<li>
<p>
For division, Impala sets the precision and scale to values large enough to represent the whole and
fractional parts of the result.
</p>
</li>

<li>
<p>
For <codeph>UNION</codeph>, the scale is the larger of the scales of the input values, and the precision
is increased if necessary to accommodate any additional fractional digits. If the same input value has
the largest precision and the largest scale, the result value has the same precision and scale. If one
value has a larger precision but smaller scale, the scale of the result value is increased. For example,
<codeph>DECIMAL(20,2) UNION DECIMAL(8,6)</codeph> produces a result of type
<codeph>DECIMAL(24,6)</codeph>. The extra 4 fractional digits of scale (6-2) are accommodated by
extending the precision by the same amount (20+4).
</p>
</li>

<li>
<p>
To double-check, you can always call the <codeph>PRECISION()</codeph> and <codeph>SCALE()</codeph>
functions on the results of an arithmetic expression to see the relevant values, or use a <codeph>CREATE
TABLE AS SELECT</codeph> statement to define a column based on the return type of the expression.
</p>
</li>
</ul>
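
<p>
A minimal sketch of the multiplication rule from the list above: multiplying two
<codeph>DECIMAL(4,2)</codeph> expressions should report precision 8 (4+4) and scale 4 (2+2):
</p>

<codeblock>select precision(cast(1.25 as decimal(4,2)) * cast(2.50 as decimal(4,2))),
       scale(cast(1.25 as decimal(4,2)) * cast(2.50 as decimal(4,2)));</codeblock>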

<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>

<ul>
<li>
Using the <codeph>DECIMAL</codeph> type is only supported under <keyword keyref="impala14_full"/> and higher.
</li>

<li>
Use the <codeph>DECIMAL</codeph> data type in Impala for applications where you used the
<codeph>NUMBER</codeph> data type in Oracle. The Impala <codeph>DECIMAL</codeph> type does not support the
Oracle idioms of <codeph>*</codeph> for scale or negative values for precision.
</li>
</ul>

<p>
<b>Conversions and casting:</b>
</p>

<p>
<ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
</p>

<p>
Impala automatically converts between <codeph>DECIMAL</codeph> and other numeric types where possible. A
<codeph>DECIMAL</codeph> with zero scale is converted to or from the smallest appropriate integral type. A
<codeph>DECIMAL</codeph> with a fractional part is automatically converted to or from the smallest
appropriate floating-point type. If the destination type does not have sufficient precision or scale to hold
all possible values of the source type, Impala raises an error and does not convert the value.
</p>

<p>
For example, these statements show how expressions of <codeph>DECIMAL</codeph> and other types are reconciled
to the same type in the context of <codeph>UNION</codeph> queries and <codeph>INSERT</codeph> statements:
</p>

<codeblock><![CDATA[[localhost:21000] > select cast(1 as int) as x union select cast(1.5 as decimal(9,4)) as x;
+----------------+
| x              |
+----------------+
| 1.5000         |
| 1.0000         |
+----------------+
[localhost:21000] > create table int_vs_decimal as select cast(1 as int) as x union select cast(1.5 as decimal(9,4)) as x;
+-------------------+
| summary           |
+-------------------+
| Inserted 2 row(s) |
+-------------------+
[localhost:21000] > desc int_vs_decimal;
+------+---------------+---------+
| name | type          | comment |
+------+---------------+---------+
| x    | decimal(14,4) |         |
+------+---------------+---------+
]]>
</codeblock>

<p>
To avoid potential conversion errors, you can use <codeph>CAST()</codeph> to convert <codeph>DECIMAL</codeph>
values to <codeph>FLOAT</codeph>, <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>,
<codeph>BIGINT</codeph>, <codeph>STRING</codeph>, <codeph>TIMESTAMP</codeph>, or <codeph>BOOLEAN</codeph>.
You can use exponential notation in <codeph>DECIMAL</codeph> literals or when casting from
<codeph>STRING</codeph>, for example <codeph>1.0e6</codeph> to represent one million.
</p>
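
<p>
For example, a minimal sketch of exponential notation in a cast from <codeph>STRING</codeph>:
</p>

<codeblock>select cast('1.0e6' as decimal(12,2));   -- One million, as 1000000.00.</codeblock>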

<p>
If you cast a value with more fractional digits than the scale of the destination type, any extra fractional
digits are truncated (not rounded). Casting a value to a target type with not enough precision produces a
result of <codeph>NULL</codeph> and displays a runtime warning.
</p>

<codeblock><![CDATA[[localhost:21000] > select cast(1.239 as decimal(3,2));
+-----------------------------+
| cast(1.239 as decimal(3,2)) |
+-----------------------------+
| 1.23                        |
+-----------------------------+
[localhost:21000] > select cast(1234 as decimal(3));
+----------------------------+
| cast(1234 as decimal(3,0)) |
+----------------------------+
| NULL                       |
+----------------------------+
WARNINGS: Expression overflowed, returning NULL
]]>
</codeblock>

<p>
When you specify integer literals, for example in <codeph>INSERT ... VALUES</codeph> statements or arithmetic
expressions, those numbers are interpreted as the smallest applicable integer type. You must use
<codeph>CAST()</codeph> calls for some combinations of integer literals and <codeph>DECIMAL</codeph>
precision. For example, <codeph>INT</codeph> has a maximum value that is 10 digits long,
<codeph>TINYINT</codeph> has a maximum value that is 3 digits long, and so on. If you specify a value such as
123456 to go into a <codeph>DECIMAL</codeph> column, Impala checks if the column has enough precision to
represent the largest value of that integer type, and raises an error if not. Therefore, use an expression
like <codeph>CAST(123456 AS DECIMAL(9,0))</codeph> for <codeph>DECIMAL</codeph> columns with precision 9 or
less, <codeph>CAST(50 AS DECIMAL(2,0))</codeph> for <codeph>DECIMAL</codeph> columns with precision 2 or
less, and so on. For <codeph>DECIMAL</codeph> columns with precision 10 or greater, Impala automatically
interprets the value as the correct <codeph>DECIMAL</codeph> type; however, because
<codeph>DECIMAL(10)</codeph> requires 8 bytes of storage while <codeph>DECIMAL(9)</codeph> requires only 4
bytes, only use precision of 10 or higher when actually needed.
</p>

<codeblock><![CDATA[[localhost:21000] > create table decimals_9_0 (x decimal);
[localhost:21000] > insert into decimals_9_0 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000);
ERROR: AnalysisException: Possible loss of precision for target table 'decimal_testing.decimals_9_0'.
Expression '1' (type: INT) would need to be cast to DECIMAL(9,0) for column 'x'
[localhost:21000] > insert into decimals_9_0 values (cast(1 as decimal)), (cast(2 as decimal)), (cast(4 as decimal)), (cast(8 as decimal)), (cast(16 as decimal)), (cast(1024 as decimal)), (cast(32768 as decimal)), (cast(65536 as decimal)), (cast(1000000 as decimal));

[localhost:21000] > create table decimals_10_0 (x decimal(10,0));
[localhost:21000] > insert into decimals_10_0 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000);
]]>
</codeblock>

<p>
Be aware that in memory and for binary file formats such as Parquet or Avro, <codeph>DECIMAL(10)</codeph> or
higher consumes 8 bytes while <codeph>DECIMAL(9)</codeph> (the default for <codeph>DECIMAL</codeph>) or lower
consumes 4 bytes. Therefore, to conserve space in large tables, use the smallest-precision
<codeph>DECIMAL</codeph> type that is appropriate and <codeph>CAST()</codeph> literal values where necessary,
rather than declaring <codeph>DECIMAL</codeph> columns with high precision for convenience.
</p>

<p>
To represent a very large or precise <codeph>DECIMAL</codeph> value as a literal, for example one that
contains more digits than can be represented by a <codeph>BIGINT</codeph> literal, use a quoted string or a
floating-point value for the number, and <codeph>CAST()</codeph> to the desired <codeph>DECIMAL</codeph>
type:
</p>

<codeblock>insert into decimals_38_5 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000),
  (cast("999999999999999999999999999999" as decimal(38,5))),
  (cast(999999999999999999999999999999. as decimal(38,5)));
</codeblock>

<ul>
<li>
<p> The result of the <codeph>SUM()</codeph> aggregate function on
<codeph>DECIMAL</codeph> values is promoted to a precision of 38,
with the same scale as the underlying column. Thus, the result can
represent the largest possible value at that particular precision. </p>
</li>

<li>
<p>
<codeph>STRING</codeph> columns, literals, or expressions can be converted to <codeph>DECIMAL</codeph> as
long as the overall number of digits and digits to the right of the decimal point fit within the
specified precision and scale for the declared <codeph>DECIMAL</codeph> type. By default, a
<codeph>DECIMAL</codeph> value with no specified scale or precision can hold a maximum of 9 digits of an
integer value. If there are more digits in the string value than are allowed by the
<codeph>DECIMAL</codeph> scale and precision, the result is <codeph>NULL</codeph>.
</p>
<p>
The following examples demonstrate how <codeph>STRING</codeph> values with integer and fractional parts
are represented when converted to <codeph>DECIMAL</codeph>. If the scale is 0, the number is treated
as an integer value with a maximum of <varname>precision</varname> digits. If the scale is greater than
0, the precision must be increased to account for the digits both to the left and right of the decimal point.
As the precision increases, output values are printed with additional trailing zeros after the decimal
point if needed. Any trailing zeros after the decimal point in the <codeph>STRING</codeph> value must fit
within the number of digits specified by the precision.
</p>
<codeblock><![CDATA[[localhost:21000] > select cast('100' as decimal); -- Small integer value fits within 9 digits of precision.
+-----------------------------+
| cast('100' as decimal(9,0)) |
+-----------------------------+
| 100                         |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,0)); -- Small integer value fits within 3 digits of precision.
+-----------------------------+
| cast('100' as decimal(3,0)) |
+-----------------------------+
| 100                         |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(2,0)); -- 2 digits of precision is not enough!
+-----------------------------+
| cast('100' as decimal(2,0)) |
+-----------------------------+
| NULL                        |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,1)); -- (3,1) = 2 digits left of the decimal point, 1 to the right. Not enough.
+-----------------------------+
| cast('100' as decimal(3,1)) |
+-----------------------------+
| NULL                        |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(4,1)); -- 4 digits total, 1 to the right of the decimal point.
+-----------------------------+
| cast('100' as decimal(4,1)) |
+-----------------------------+
| 100.0                       |
+-----------------------------+
[localhost:21000] > select cast('98.6' as decimal(3,1)); -- (3,1) can hold a 3 digit number with 1 fractional digit.
+------------------------------+
| cast('98.6' as decimal(3,1)) |
+------------------------------+
| 98.6                         |
+------------------------------+
[localhost:21000] > select cast('98.6' as decimal(15,1)); -- Larger precision allows bigger numbers but still only 1 fractional digit.
+-------------------------------+
| cast('98.6' as decimal(15,1)) |
+-------------------------------+
| 98.6                          |
+-------------------------------+
[localhost:21000] > select cast('98.6' as decimal(15,5)); -- Larger precision allows more fractional digits, outputs trailing zeros.
+-------------------------------+
| cast('98.6' as decimal(15,5)) |
+-------------------------------+
| 98.60000                      |
+-------------------------------+
[localhost:21000] > select cast('98.60000' as decimal(15,1)); -- Trailing zeros in the string must fit within 'scale' digits (1 in this case).
+-----------------------------------+
| cast('98.60000' as decimal(15,1)) |
+-----------------------------------+
| NULL                              |
+-----------------------------------+
]]>
</codeblock>
</li>

<li>
Most built-in arithmetic functions such as <codeph>SIN()</codeph> and <codeph>COS()</codeph> continue to
accept only <codeph>DOUBLE</codeph> values because they are so commonly used in scientific contexts for
calculations of IEEE 754-compliant values. The built-in functions that accept and return
<codeph>DECIMAL</codeph> are:
<!-- List from Skye: positive, negative, least, greatest, fnv_hash, if, nullif, zeroifnull, isnull, coalesce -->
<!-- Nong had already told me about abs, ceil, floor, round, truncate -->
<ul>
<li>
<codeph>ABS()</codeph>
</li>

<li>
<codeph>CEIL()</codeph>
</li>

<li>
<codeph>COALESCE()</codeph>
</li>

<li>
<codeph>FLOOR()</codeph>
</li>

<li>
<codeph>FNV_HASH()</codeph>
</li>

<li>
<codeph>GREATEST()</codeph>
</li>

<li>
<codeph>IF()</codeph>
</li>

<li>
<codeph>ISNULL()</codeph>
</li>

<li>
<codeph>LEAST()</codeph>
</li>

<li>
<codeph>NEGATIVE()</codeph>
</li>

<li>
<codeph>NULLIF()</codeph>
</li>

<li>
<codeph>POSITIVE()</codeph>
</li>

<li>
<codeph>PRECISION()</codeph>
</li>

<li>
<codeph>ROUND()</codeph>
</li>

<li>
<codeph>SCALE()</codeph>
</li>

<li>
<codeph>TRUNCATE()</codeph>
</li>

<li>
<codeph>ZEROIFNULL()</codeph>
</li>
</ul>
See <xref href="impala_functions.xml#builtins"/> for details.
</li>

<li>
<p>
<codeph>BIGINT</codeph>, <codeph>INT</codeph>, <codeph>SMALLINT</codeph>, and <codeph>TINYINT</codeph>
values can all be cast to <codeph>DECIMAL</codeph>. The number of digits to the left of the decimal point
in the <codeph>DECIMAL</codeph> type must be sufficient to hold the largest value of the corresponding
integer type. Note that integer literals are treated as the smallest appropriate integer type, meaning
there is sometimes a range of values that require one more digit of <codeph>DECIMAL</codeph> scale than
you might expect. For integer values, the scale of the <codeph>DECIMAL</codeph> type can be zero; if
the scale is greater than zero, remember to increase the precision value by an equivalent amount to hold
the required number of digits to the left of the decimal point.
</p>
<p>
The following examples show how different integer types are converted to <codeph>DECIMAL</codeph>.
</p>
<!-- According to Nong, it's a bug that so many integer digits can be converted to a DECIMAL
value with small (s,p) spec. So expect to re-do this example. -->
<codeblock><![CDATA[[localhost:21000] > select cast(1 as decimal(1,0));
+-------------------------+
| cast(1 as decimal(1,0)) |
+-------------------------+
| 1                       |
+-------------------------+
[localhost:21000] > select cast(9 as decimal(1,0));
+-------------------------+
| cast(9 as decimal(1,0)) |
+-------------------------+
| 9                       |
+-------------------------+
[localhost:21000] > select cast(10 as decimal(1,0));
+--------------------------+
| cast(10 as decimal(1,0)) |
+--------------------------+
| 10                       |
+--------------------------+
[localhost:21000] > select cast(10 as decimal(1,1));
+--------------------------+
| cast(10 as decimal(1,1)) |
+--------------------------+
| 10.0                     |
+--------------------------+
[localhost:21000] > select cast(100 as decimal(1,1));
+---------------------------+
| cast(100 as decimal(1,1)) |
+---------------------------+
| 100.0                     |
+---------------------------+
[localhost:21000] > select cast(1000 as decimal(1,1));
+----------------------------+
| cast(1000 as decimal(1,1)) |
+----------------------------+
| 1000.0                     |
+----------------------------+
]]>
</codeblock>
</li>

<li>
<p>
When a <codeph>DECIMAL</codeph> value is converted to any of the integer types, any fractional part is
truncated (that is, rounded towards zero):
</p>
<codeblock><![CDATA[[localhost:21000] > create table num_dec_days (x decimal(4,1));
[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as decimal(4,1)));
[localhost:21000] > insert into num_dec_days values (cast(0.1 as decimal(4,1))), (cast(.9 as decimal(4,1))), (cast(9.1 as decimal(4,1))), (cast(9.9 as decimal(4,1)));
[localhost:21000] > select cast(x as int) from num_dec_days;
+----------------+
| cast(x as int) |
+----------------+
| 1              |
| 2              |
| 4              |
| 0              |
| 0              |
| 9              |
| 9              |
+----------------+
]]>
</codeblock>
</li>

<li>
<p>
You cannot directly cast <codeph>TIMESTAMP</codeph> or <codeph>BOOLEAN</codeph> values to or from
<codeph>DECIMAL</codeph> values. You can turn a <codeph>DECIMAL</codeph> value into a time-related
representation using a two-step process, by converting it to an integer value and then using that result
in a call to a date and time function such as <codeph>from_unixtime()</codeph>.
</p>
<codeblock><![CDATA[[localhost:21000] > select from_unixtime(cast(cast(1000.0 as decimal) as bigint));
+-------------------------------------------------------------+
| from_unixtime(cast(cast(1000.0 as decimal(9,0)) as bigint)) |
+-------------------------------------------------------------+
| 1970-01-01 00:16:40                                         |
+-------------------------------------------------------------+
[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days; -- x is a DECIMAL column.

[localhost:21000] > create table num_dec_days (x decimal(4,1));
[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as decimal(4,1)));
[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days; -- The 4.5 value is truncated to 4 and becomes '4 days'.
+--------------------------------------+
| now() + interval cast(x as int) days |
+--------------------------------------+
| 2014-05-13 23:11:55.163284000        |
| 2014-05-14 23:11:55.163284000        |
| 2014-05-16 23:11:55.163284000        |
+--------------------------------------+
]]>
</codeblock>
</li>

<li>
<p>
Because values in <codeph>INSERT</codeph> statements are checked rigorously for type compatibility, be
prepared to use <codeph>CAST()</codeph> function calls around literals, column references, or other
expressions that you are inserting into a <codeph>DECIMAL</codeph> column.
</p>
</li>
</ul>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/>
|
||||
|
||||
<p>
|
||||
<b>DECIMAL differences from integer and floating-point types:</b>
|
||||
</p>
|
||||
|
||||
<p>
With the <codeph>DECIMAL</codeph> type, you are concerned with the number of overall digits of a number
rather than powers of 2 (as in <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, and so on). Therefore,
the limits with integral values of <codeph>DECIMAL</codeph> types fall around 99, 999, 9999, and so on rather
than 32767, 65535, 2<sup>32</sup>-1, and so on. For fractional values, you do not need to account for
imprecise representation of the fractional part according to the IEEE-754 standard (as in
<codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph>). Therefore, when you insert a fractional value into a
<codeph>DECIMAL</codeph> column, you can compare that column, sum it, query it, <codeph>GROUP BY</codeph>
it, and so on, and get back the original values rather than some <q>close but not identical</q> value.
</p>
|
||||
|
||||
<p>
<codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> can cause problems or unexpected behavior due to their
inability to precisely represent certain fractional values, for example dollar and cents values for
currency. You might find output values slightly different from the ones you inserted, equality tests that
do not match precisely, or unexpected values for <codeph>GROUP BY</codeph> columns. <codeph>DECIMAL</codeph>
can help reduce unexpected behavior and rounding errors, at the expense of some performance overhead for
assignments and comparisons.
</p>
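
<p>
A quick illustration of the difference (a sketch; the exact <codeph>DOUBLE</codeph> result can vary
slightly by platform):
</p>

<codeblock>-- DECIMAL arithmetic is exact.
select cast(0.1 as decimal(4,1)) + cast(0.2 as decimal(4,1));  -- exactly 0.3
-- DOUBLE arithmetic is approximate.
select cast(0.1 as double) + cast(0.2 as double);  -- approximately 0.30000000000000004</codeblock>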
|
||||
|
||||
<p>
|
||||
<b>Literals and expressions:</b>
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
When you use an integer literal such as <codeph>1</codeph> or <codeph>999</codeph> in a SQL statement,
|
||||
depending on the context, Impala will treat it as either the smallest appropriate
|
||||
<codeph>DECIMAL</codeph> type, or the smallest integer type (<codeph>TINYINT</codeph>,
|
||||
<codeph>SMALLINT</codeph>, <codeph>INT</codeph>, or <codeph>BIGINT</codeph>). To minimize memory usage,
|
||||
Impala prefers to treat the literal as the smallest appropriate integer type.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
When you use a floating-point literal such as <codeph>1.1</codeph> or <codeph>999.44</codeph> in a SQL
|
||||
statement, depending on the context, Impala will treat it as either the smallest appropriate
|
||||
<codeph>DECIMAL</codeph> type, or the smallest floating-point type (<codeph>FLOAT</codeph> or
|
||||
<codeph>DOUBLE</codeph>). To avoid loss of accuracy, Impala prefers to treat the literal as a
|
||||
<codeph>DECIMAL</codeph>.
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
</p>
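
<p>
For example, you can check how Impala types a literal with the <codeph>PRECISION()</codeph> and
<codeph>SCALE()</codeph> built-in functions (an illustrative sketch; the exact types chosen can
vary by Impala version):
</p>

<codeblock>-- The literal 1.1 is treated as DECIMAL(2,1) rather than FLOAT or DOUBLE.
select precision(1.1), scale(1.1);</codeblock>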
|
||||
|
||||
<p>
|
||||
<b>Storage considerations:</b>
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
Only the precision determines the storage size for <codeph>DECIMAL</codeph> values; the scale setting has
|
||||
no effect on the storage size.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these text-based file formats,
|
||||
leading zeros are not stored, but trailing zeros are stored. In these tables, each <codeph>DECIMAL</codeph>
|
||||
value takes up as many bytes as there are digits in the value, plus an extra byte if the decimal point is
|
||||
present and an extra byte for negative values. Once the values are loaded into memory, they are represented
|
||||
in 4, 8, or 16 bytes as described in the following list items. The on-disk representation varies depending
|
||||
on the file format of the table.
|
||||
</li>
|
||||
|
||||
<!-- Next couple of points can be conref'ed with identical list bullets farther down under File Format Considerations. -->
|
||||
|
||||
<li>
|
||||
Parquet and Avro tables use binary formats. In these tables, Impala stores each value in as few bytes as
possible
|
||||
<!-- 4, 8, or 16 bytes -->
|
||||
depending on the precision specified for the <codeph>DECIMAL</codeph> column.
|
||||
<ul>
|
||||
<li>
|
||||
In memory, <codeph>DECIMAL</codeph> values with precision of 9 or less are stored in 4 bytes.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
In memory, <codeph>DECIMAL</codeph> values with precision of 10 through 18 are stored in 8 bytes.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
In memory, <codeph>DECIMAL</codeph> values with precision greater than 18 are stored in 16 bytes.
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/file_format_blurb"/>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
The <codeph>DECIMAL</codeph> data type can be stored in any of the file formats supported by Impala, as
|
||||
described in <xref href="impala_file_formats.xml#file_formats"/>. Impala only writes to tables that use the
|
||||
Parquet and text formats, so those formats are the focus for file format compatibility.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Impala can query Avro, RCFile, or SequenceFile tables containing <codeph>DECIMAL</codeph> columns, created
|
||||
by other Hadoop components, on CDH 5 only.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
You can use <codeph>DECIMAL</codeph> columns in Impala tables that are mapped to HBase tables. Impala can
|
||||
query and insert into such tables.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these tables, each
|
||||
<codeph>DECIMAL</codeph> value takes up as many bytes as there are digits in the value, plus an extra byte
|
||||
if the decimal point is present. The binary format of Parquet or Avro files offers more compact storage for
|
||||
<codeph>DECIMAL</codeph> columns.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Parquet and Avro tables use binary formats. In these tables, Impala stores each value in 4, 8, or 16 bytes
|
||||
depending on the precision specified for the <codeph>DECIMAL</codeph> column.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Parquet files containing <codeph>DECIMAL</codeph> columns are not expected to be readable under CDH 4. See
|
||||
the <b>Compatibility</b> section for details.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
<b>UDF considerations:</b> When writing a C++ UDF, use the <codeph>DecimalVal</codeph> data type defined in
|
||||
<filepath>/usr/include/impala_udf/udf.h</filepath>.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
|
||||
|
||||
<p>
|
||||
You can use a <codeph>DECIMAL</codeph> column as a partition key. Doing so provides a better match between
|
||||
the partition key values and the HDFS directory names than using a <codeph>DOUBLE</codeph> or
|
||||
<codeph>FLOAT</codeph> partitioning column:
|
||||
</p>
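
<codeblock>-- Illustrative sketch; the table and column names are hypothetical.
-- The partition directory names use the exact values, such as .../price=9.99/.
create table sales (id bigint) partitioned by (price decimal(5,2));</codeblock>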
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/schema_evolution_blurb"/>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
For text-based formats (text, RCFile, and SequenceFile tables), you can issue an <codeph>ALTER TABLE ...
|
||||
REPLACE COLUMNS</codeph> statement to change the precision and scale of an existing
|
||||
<codeph>DECIMAL</codeph> column. As long as the values in the column fit within the new precision and
|
||||
scale, they are returned correctly by a query. Any values that do not fit within the new precision and
|
||||
scale are returned as <codeph>NULL</codeph>, and Impala reports the conversion error. Leading zeros do not
|
||||
count against the precision value, but trailing zeros after the decimal point do.
|
||||
<codeblock><![CDATA[[localhost:21000] > create table text_decimals (x string);
|
||||
[localhost:21000] > insert into text_decimals values ("1"), ("2"), ("99.99"), ("1.234"), ("000001"), ("1.000000000");
|
||||
[localhost:21000] > select * from text_decimals;
|
||||
+-------------+
|
||||
| x |
|
||||
+-------------+
|
||||
| 1 |
|
||||
| 2 |
|
||||
| 99.99 |
|
||||
| 1.234 |
|
||||
| 000001 |
|
||||
| 1.000000000 |
|
||||
+-------------+
|
||||
[localhost:21000] > alter table text_decimals replace columns (x decimal(4,2));
|
||||
[localhost:21000] > select * from text_decimals;
|
||||
+-------+
|
||||
| x |
|
||||
+-------+
|
||||
| 1.00 |
|
||||
| 2.00 |
|
||||
| 99.99 |
|
||||
| NULL |
|
||||
| 1.00 |
|
||||
| NULL |
|
||||
+-------+
|
||||
ERRORS:
|
||||
Backend 0:Error converting column: 0 TO DECIMAL(4, 2) (Data is: 1.234)
|
||||
file: hdfs://127.0.0.1:8020/user/hive/warehouse/decimal_testing.db/text_decimals/634d4bd3aa0
|
||||
e8420-b4b13bab7f1be787_56794587_data.0
|
||||
record: 1.234
|
||||
Error converting column: 0 TO DECIMAL(4, 2) (Data is: 1.000000000)
|
||||
file: hdfs://127.0.0.1:8020/user/hive/warehouse/decimal_testing.db/text_decimals/cd40dc68e20
|
||||
c565a-cc4bd86c724c96ba_311873428_data.0
|
||||
record: 1.000000000
|
||||
]]>
|
||||
</codeblock>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
For binary formats (Parquet and Avro tables), although an <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph>
|
||||
statement that changes the precision or scale of a <codeph>DECIMAL</codeph> column succeeds, any subsequent
|
||||
attempt to query the changed column results in a fatal error. (The other columns can still be queried
|
||||
successfully.) This is because the metadata about the columns is stored in the data files themselves, and
|
||||
<codeph>ALTER TABLE</codeph> does not actually make any updates to the data files. If the metadata in the
|
||||
data files disagrees with the metadata in the metastore database, Impala cancels the query.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<codeblock>CREATE TABLE t1 (x DECIMAL, y DECIMAL(5,2), z DECIMAL(25,0));
|
||||
INSERT INTO t1 VALUES (5, 99.44, 123456), (300, 6.7, 999999999);
|
||||
SELECT x+y, ROUND(y,1), z/98.6 FROM t1;
|
||||
SELECT CAST(1000.5 AS DECIMAL);
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/decimal_no_stats"/>
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/partitioning_good"/> -->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/parquet_ok"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/text_bulky"/>
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/internals_blurb"/> -->
|
||||
|
||||
<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/related_info"/>
|
||||
|
||||
<p>
|
||||
<xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>,
|
||||
<xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>,
|
||||
<xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>,
|
||||
<xref href="impala_math_functions.xml#math_functions"/> (especially <codeph>PRECISION()</codeph> and
|
||||
<codeph>SCALE()</codeph>)
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
37
docs/topics/impala_default_order_by_limit.xml
Normal file
@@ -0,0 +1,37 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept rev="obwl" id="default_order_by_limit">
|
||||
|
||||
<title>DEFAULT_ORDER_BY_LIMIT Query Option</title>
|
||||
<titlealts audience="PDF"><navtitle>DEFAULT_ORDER_BY_LIMIT</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Impala Query Options"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/obwl_query_options"/>
|
||||
|
||||
<p rev="1.4.0">
|
||||
Prior to Impala 1.4.0, Impala queries that use the <codeph><xref href="impala_order_by.xml#order_by">ORDER
|
||||
BY</xref></codeph> clause must also include a
|
||||
<codeph><xref href="impala_limit.xml#limit">LIMIT</xref></codeph> clause, to avoid accidentally producing
|
||||
huge result sets that must be sorted. Sorting a huge result set is a memory-intensive operation. In Impala
|
||||
1.4.0 and higher, Impala uses a temporary disk work area to perform the sort if that operation would
|
||||
otherwise exceed the Impala memory limit on a particular host.
|
||||
</p>
|
||||
|
||||
<p>
<b>Type:</b> numeric
</p>
|
||||
|
||||
<p>
|
||||
<b>Default:</b> -1 (no default limit)
|
||||
</p>
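
<p>
For example, a session might set a default limit as follows (an illustrative sketch, applicable
only in releases where this option has an effect; the table name is hypothetical):
</p>

<codeblock>set default_order_by_limit=50000;
-- Without an explicit LIMIT clause, the query now returns at most 50,000 rows.
select c1 from t1 order by c1;</codeblock>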
|
||||
</conbody>
|
||||
</concept>
|
||||
88
docs/topics/impala_delegation.xml
Normal file
@@ -0,0 +1,88 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept rev="1.2" id="delegation">
|
||||
|
||||
<title>Configuring Impala Delegation for Hue and BI Tools</title>
|
||||
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Security"/>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Authentication"/>
|
||||
<data name="Category" value="Delegation"/>
|
||||
<data name="Category" value="Hue"/>
|
||||
<data name="Category" value="Administrators"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
<!--
|
||||
When users connect to Impala directly through the <cmdname>impala-shell</cmdname> interpreter, the Sentry
|
||||
authorization framework determines what actions they can take and what data they can see.
|
||||
-->
|
||||
When users submit Impala queries through a separate application, such as Hue or a business intelligence tool,
|
||||
typically all requests are treated as coming from the same user. In Impala 1.2 and higher, authentication is
|
||||
extended by a new feature that allows applications to pass along credentials for the users that connect to
|
||||
them (known as <q>delegation</q>), and issue Impala queries with the privileges for those users. Currently,
|
||||
the delegation feature is available only for Impala queries submitted through application interfaces such as
|
||||
Hue and BI tools; for example, Impala cannot issue queries using the privileges of the HDFS user.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The delegation feature is enabled by a startup option for <cmdname>impalad</cmdname>:
|
||||
<codeph>--authorized_proxy_user_config</codeph>. When you specify this option, users whose names you specify
|
||||
(such as <codeph>hue</codeph>) can delegate the execution of a query to another user. The query runs with the
|
||||
privileges of the delegated user, not the original user such as <codeph>hue</codeph>. The name of the
|
||||
delegated user is passed using the HiveServer2 configuration property <codeph>impala.doas.user</codeph>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
You can specify a list of users that the application user can delegate to, or <codeph>*</codeph> to allow a
|
||||
superuser to delegate to any other user. For example:
|
||||
</p>
|
||||
|
||||
<codeblock>impalad --authorized_proxy_user_config 'hue=user1,user2;admin=*' ...</codeblock>
|
||||
|
||||
<note>
|
||||
Make sure to use single quotes or escape characters to ensure that any <codeph>*</codeph> characters do not
|
||||
undergo wildcard expansion when specified in command-line arguments.
|
||||
</note>
|
||||
|
||||
<p>
|
||||
See <xref href="impala_config_options.xml#config_options"/> for details about adding or changing
|
||||
<cmdname>impalad</cmdname> startup options. See
|
||||
<xref href="http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/" scope="external" format="html">this
|
||||
Cloudera blog post</xref> for background information about the delegation capability in HiveServer2.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
To set up authentication for the delegated users:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
On the server side, configure either user/password authentication through LDAP, or Kerberos
|
||||
authentication, for all the delegated users. See <xref href="impala_ldap.xml#ldap"/> or
|
||||
<xref href="impala_kerberos.xml#kerberos"/> for details.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
On the client side, follow the instructions in the <q>Using User Name and Password</q> section in the
|
||||
<xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/PDF/Cloudera-ODBC-Driver-for-Impala-Install-Guide.pdf" scope="external" format="pdf">ODBC
|
||||
driver installation guide</xref>. Then search for <q>delegation</q> in that same installation guide to
|
||||
learn about the <uicontrol>Delegation UID</uicontrol> field and <codeph>DelegationUID</codeph> configuration keyword to enable the delegation feature for
|
||||
ODBC-based BI tools.
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
65
docs/topics/impala_delete.xml
Normal file
@@ -0,0 +1,65 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="delete">
|
||||
|
||||
<title>DELETE Statement (<keyword keyref="impala28"/> or higher only)</title>
|
||||
<titlealts audience="PDF"><navtitle>DELETE</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Kudu"/>
|
||||
<data name="Category" value="ETL"/>
|
||||
<data name="Category" value="Ingest"/>
|
||||
<data name="Category" value="DML"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
<indexterm audience="Cloudera">DELETE statement</indexterm>
|
||||
Deletes one or more rows from a Kudu table.
|
||||
Although deleting a single row or a range of rows would be inefficient for tables using HDFS
|
||||
data files, Kudu is able to perform this operation efficiently. Therefore, this statement
|
||||
only works for Impala tables that use the Kudu storage engine.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
|
||||
<codeblock>
|
||||
</codeblock>
|
||||
|
||||
<p rev="kudu">
|
||||
Normally, a <codeph>DELETE</codeph> operation for a Kudu table fails if
|
||||
some partition key columns are not found, due to their being deleted or changed
|
||||
by a concurrent <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> operation.
|
||||
Specify <codeph>DELETE IGNORE <varname>rest_of_statement</varname></codeph> to
|
||||
make the <codeph>DELETE</codeph> continue in this case. The rows with the nonexistent
|
||||
duplicate partition key column values are not removed.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/dml_blurb"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
|
||||
|
||||
<note conref="../shared/impala_common.xml#common/compute_stats_next"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
<codeblock>
|
||||
|
||||
</codeblock>
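
<p>
For instance, typical statements look like the following (an illustrative sketch; the table name
is hypothetical, and the table must be stored in Kudu):
</p>

<codeblock><![CDATA[-- Delete a single row identified by its primary key.
DELETE FROM kudu_events WHERE id = 100;
-- Delete a range of rows.
DELETE FROM kudu_events WHERE event_year < 2015;]]>
</codeblock>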
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/related_info"/>
|
||||
|
||||
<p>
|
||||
<xref href="impala_kudu.xml#impala_kudu"/>
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
689
docs/topics/impala_describe.xml
Normal file
@@ -0,0 +1,689 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="describe">
|
||||
|
||||
<title id="desc">DESCRIBE Statement</title>
|
||||
<titlealts audience="PDF"><navtitle>DESCRIBE</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="Impala Data Types"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
<data name="Category" value="Tables"/>
|
||||
<data name="Category" value="Reports"/>
|
||||
<data name="Category" value="Schemas"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
<indexterm audience="Cloudera">DESCRIBE statement</indexterm>
|
||||
The <codeph>DESCRIBE</codeph> statement displays metadata about a table, such as the column names and their
|
||||
data types.
|
||||
<ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, you can specify the name of a complex type column, which takes
|
||||
the form of a dotted path. The path might include multiple components in the case of a nested type definition.</ph>
|
||||
<ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, the <codeph>DESCRIBE DATABASE</codeph> form can display
|
||||
information about a database.</ph>
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
|
||||
<codeblock rev="2.5.0">DESCRIBE [DATABASE] [FORMATTED|EXTENDED] <varname>object_name</varname>
|
||||
|
||||
object_name ::=
|
||||
[<varname>db_name</varname>.]<varname>table_name</varname>[.<varname>complex_col_name</varname> ...]
|
||||
| <varname>db_name</varname>
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
You can use the abbreviation <codeph>DESC</codeph> for the <codeph>DESCRIBE</codeph> statement.
|
||||
</p>
|
||||
|
||||
<p rev="1.1">
|
||||
The <codeph>DESCRIBE FORMATTED</codeph> variation displays additional information, in a format familiar to
|
||||
users of Apache Hive. The extra information includes low-level details such as whether the table is internal
|
||||
or external, when it was created, the file format, the location of the data in HDFS, whether the object is a
|
||||
table or a view, and (for views) the text of the query from the view definition.
|
||||
</p>
|
||||
|
||||
<note>
|
||||
The <codeph>Compressed</codeph> field is not a reliable indicator of whether the table contains compressed
|
||||
data. It typically shows <codeph>No</codeph>, because the compression settings only apply during the
|
||||
session that loads data and are not stored persistently with the table metadata.
|
||||
</note>
|
||||
|
||||
<p rev="2.5.0 IMPALA-2196">
|
||||
<b>Describing databases:</b>
|
||||
</p>
|
||||
|
||||
<p rev="2.5.0">
|
||||
By default, the <codeph>DESCRIBE</codeph> output for a database includes the location
|
||||
and the comment, which can be set by the <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph>
|
||||
clauses on the <codeph>CREATE DATABASE</codeph> statement.
|
||||
</p>
|
||||
|
||||
<p rev="2.5.0">
|
||||
The additional information displayed by the <codeph>FORMATTED</codeph> or <codeph>EXTENDED</codeph>
|
||||
keyword includes the HDFS user ID that is considered the owner of the database, and any
|
||||
optional database properties. The properties could be specified by the <codeph>WITH DBPROPERTIES</codeph>
|
||||
clause if the database is created using a Hive <codeph>CREATE DATABASE</codeph> statement.
|
||||
Impala currently does not set these properties itself or do any special processing based on them.
|
||||
</p>
|
||||
|
||||
<p rev="2.5.0">
|
||||
The following examples show the variations in syntax and output for
|
||||
describing databases. This feature is available in <keyword keyref="impala25_full"/>
|
||||
and higher.
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.5.0">
|
||||
describe database default;
|
||||
+---------+----------------------+-----------------------+
|
||||
| name | location | comment |
|
||||
+---------+----------------------+-----------------------+
|
||||
| default | /user/hive/warehouse | Default Hive database |
|
||||
+---------+----------------------+-----------------------+
|
||||
|
||||
describe database formatted default;
|
||||
+---------+----------------------+-----------------------+
|
||||
| name | location | comment |
|
||||
+---------+----------------------+-----------------------+
|
||||
| default | /user/hive/warehouse | Default Hive database |
|
||||
| Owner: | | |
|
||||
| | public | ROLE |
|
||||
+---------+----------------------+-----------------------+
|
||||
|
||||
describe database extended default;
|
||||
+---------+----------------------+-----------------------+
|
||||
| name | location | comment |
|
||||
+---------+----------------------+-----------------------+
|
||||
| default | /user/hive/warehouse | Default Hive database |
|
||||
| Owner: | | |
|
||||
| | public | ROLE |
|
||||
+---------+----------------------+-----------------------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
<b>Describing tables:</b>
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If the <codeph>DATABASE</codeph> keyword is omitted, the default
|
||||
for the <codeph>DESCRIBE</codeph> statement is to refer to a table.
|
||||
</p>
|
||||
|
||||
<codeblock>
|
||||
-- By default, the table is assumed to be in the current database.
|
||||
describe my_table;
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| x | int | |
|
||||
| s | string | |
|
||||
+------+--------+---------+
|
||||
|
||||
-- Use a fully qualified table name to specify a table in any database.
|
||||
describe my_database.my_table;
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| x | int | |
|
||||
| s | string | |
|
||||
+------+--------+---------+
|
||||
|
||||
-- The formatted or extended output includes additional useful information.
|
||||
-- The LOCATION field is especially useful to know for DDL statements and HDFS commands
|
||||
-- during ETL jobs. (The LOCATION includes a full hdfs:// URL, omitted here for readability.)
|
||||
describe formatted my_table;
|
||||
+------------------------------+----------------------------------------------+----------------------+
|
||||
| name | type | comment |
|
||||
+------------------------------+----------------------------------------------+----------------------+
|
||||
| # col_name | data_type | comment |
|
||||
| | NULL | NULL |
|
||||
| x | int | NULL |
|
||||
| s | string | NULL |
|
||||
| | NULL | NULL |
|
||||
| # Detailed Table Information | NULL | NULL |
|
||||
| Database: | my_database | NULL |
|
||||
| Owner: | jrussell | NULL |
|
||||
| CreateTime: | Fri Mar 18 15:58:00 PDT 2016 | NULL |
|
||||
| LastAccessTime: | UNKNOWN | NULL |
|
||||
| Protect Mode: | None | NULL |
|
||||
| Retention: | 0 | NULL |
|
||||
| Location: | /user/hive/warehouse/my_database.db/my_table | NULL |
|
||||
| Table Type: | MANAGED_TABLE | NULL |
|
||||
| Table Parameters: | NULL | NULL |
|
||||
| | transient_lastDdlTime | 1458341880 |
|
||||
| | NULL | NULL |
|
||||
| # Storage Information | NULL | NULL |
|
||||
| SerDe Library: | org. ... .LazySimpleSerDe | NULL |
|
||||
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
|
||||
| OutputFormat: | org. ... .HiveIgnoreKeyTextOutputFormat | NULL |
|
||||
| Compressed: | No | NULL |
|
||||
| Num Buckets: | 0 | NULL |
|
||||
| Bucket Columns: | [] | NULL |
|
||||
| Sort Columns: | [] | NULL |
|
||||
+------------------------------+----------------------------------------------+----------------------+
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
|
||||
|
||||
<p rev="2.3.0">
|
||||
Because the column definitions for complex types can become long, particularly when such types are nested,
|
||||
the <codeph>DESCRIBE</codeph> statement uses special formatting for complex type columns to make the output readable.
|
||||
</p>
|
||||
|
||||
<p rev="2.3.0">
|
||||
For the <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types available in
|
||||
<keyword keyref="impala23_full"/> and higher, the <codeph>DESCRIBE</codeph> output is formatted to avoid
|
||||
excessively long lines for multiple fields within a <codeph>STRUCT</codeph>, or a nested sequence of
|
||||
complex types.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/complex_types_describe"/>
|
||||
|
||||
<p rev="2.3.0">
|
||||
For example, here is the <codeph>DESCRIBE</codeph> output for a table containing a single top-level column
|
||||
of each complex type:
|
||||
</p>
|
||||
|
||||
<codeblock rev="2.3.0"><![CDATA[create table t1 (x int, a array<int>, s struct<f1: string, f2: bigint>, m map<string,int>) stored as parquet;
|
||||
|
||||
describe t1;
|
||||
+------+-----------------+---------+
|
||||
| name | type | comment |
|
||||
+------+-----------------+---------+
|
||||
| x | int | |
|
||||
| a | array<int> | |
|
||||
| s | struct< | |
|
||||
| | f1:string, | |
|
||||
| | f2:bigint | |
|
||||
| | > | |
|
||||
| m | map<string,int> | |
|
||||
+------+-----------------+---------+
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="2.3.0">
|
||||
Here are examples showing how to <q>drill down</q> into the layouts of complex types, including
|
||||
using multi-part names to examine the definitions of nested types.
|
||||
The <codeph>< ></codeph> delimiters identify the columns with complex types;
|
||||
these are the columns where you can descend another level to see the parts that make up
|
||||
the complex type.
|
||||
This technique helps you to understand the multi-part names you use as table references in queries
|
||||
involving complex types, and the corresponding column names you refer to in the <codeph>SELECT</codeph> list.
|
||||
These tables are from the <q>nested TPC-H</q> schema, shown in detail in
|
||||
<xref href="impala_complex_types.xml#complex_sample_schema"/>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The <codeph>REGION</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>
|
||||
elements:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
The first <codeph>DESCRIBE</codeph> specifies the table name, to display the definition
|
||||
of each top-level column.
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
The second <codeph>DESCRIBE</codeph> specifies the name of a complex
|
||||
column, <codeph>REGION.R_NATIONS</codeph>, showing that when you include the name of an <codeph>ARRAY</codeph>
|
||||
column in a <codeph>FROM</codeph> clause, that table reference acts like a two-column table with
|
||||
columns <codeph>ITEM</codeph> and <codeph>POS</codeph>.
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
The final <codeph>DESCRIBE</codeph> specifies the fully qualified name of the <codeph>ITEM</codeph> field,
|
||||
to display the layout of its underlying <codeph>STRUCT</codeph> type in table format, with the fields
|
||||
mapped to column names.
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<codeblock rev="2.3.0"><![CDATA[
|
||||
-- #1: The overall layout of the entire table.
|
||||
describe region;
|
||||
+-------------+-------------------------+---------+
|
||||
| name | type | comment |
|
||||
+-------------+-------------------------+---------+
|
||||
| r_regionkey | smallint | |
|
||||
| r_name | string | |
|
||||
| r_comment | string | |
|
||||
| r_nations | array<struct< | |
|
||||
| | n_nationkey:smallint, | |
|
||||
| | n_name:string, | |
|
||||
| | n_comment:string | |
|
||||
| | >> | |
|
||||
+-------------+-------------------------+---------+
|
||||
|
||||
-- #2: The ARRAY column within the table.
|
||||
describe region.r_nations;
|
||||
+------+-------------------------+---------+
|
||||
| name | type | comment |
|
||||
+------+-------------------------+---------+
|
||||
| item | struct< | |
|
||||
| | n_nationkey:smallint, | |
|
||||
| | n_name:string, | |
|
||||
| | n_comment:string | |
|
||||
| | > | |
|
||||
| pos | bigint | |
|
||||
+------+-------------------------+---------+
|
||||
|
||||
-- #3: The STRUCT that makes up each ARRAY element.
|
||||
-- The fields of the STRUCT act like columns of a table.
|
||||
describe region.r_nations.item;
|
||||
+-------------+----------+---------+
|
||||
| name | type | comment |
|
||||
+-------------+----------+---------+
|
||||
| n_nationkey | smallint | |
|
||||
| n_name | string | |
|
||||
| n_comment | string | |
|
||||
+-------------+----------+---------+
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
The <codeph>CUSTOMER</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>
|
||||
elements, where one field in the <codeph>STRUCT</codeph> is another <codeph>ARRAY</codeph> of
|
||||
<codeph>STRUCT</codeph> elements:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
Again, the initial <codeph>DESCRIBE</codeph> specifies only the table name.
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
The second <codeph>DESCRIBE</codeph> specifies the qualified name of the complex
|
||||
column, <codeph>CUSTOMER.C_ORDERS</codeph>, showing how an <codeph>ARRAY</codeph>
|
||||
is represented as a two-column table with columns <codeph>ITEM</codeph> and <codeph>POS</codeph>.
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
The third <codeph>DESCRIBE</codeph> specifies the qualified name of the <codeph>ITEM</codeph>
|
||||
of the <codeph>ARRAY</codeph> column, to see the structure of the nested <codeph>ARRAY</codeph>.
|
||||
Again, it has two parts, <codeph>ITEM</codeph> and <codeph>POS</codeph>. Because the
|
||||
<codeph>ARRAY</codeph> contains a <codeph>STRUCT</codeph>, the layout of the <codeph>STRUCT</codeph>
|
||||
is shown.
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
The fourth and fifth <codeph>DESCRIBE</codeph> statements drill down into a <codeph>STRUCT</codeph> field that
|
||||
is itself a complex type, an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>.
|
||||
The <codeph>ITEM</codeph> portion of the qualified name is only required when the <codeph>ARRAY</codeph>
|
||||
elements are anonymous. The fields of the <codeph>STRUCT</codeph> give names to any other complex types
|
||||
nested inside the <codeph>STRUCT</codeph>. Therefore, the <codeph>DESCRIBE</codeph> parameters
|
||||
<codeph>CUSTOMER.C_ORDERS.ITEM.O_LINEITEMS</codeph> and <codeph>CUSTOMER.C_ORDERS.O_LINEITEMS</codeph>
|
||||
are equivalent. (For brevity, leave out the <codeph>ITEM</codeph> portion of
|
||||
a qualified name when it is not required.)
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
The final <codeph>DESCRIBE</codeph> shows the layout of the deeply nested <codeph>STRUCT</codeph> type.
|
||||
Because there are no more complex types nested inside this <codeph>STRUCT</codeph>, this is as far
|
||||
as you can drill down into the layout for this table.
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<codeblock rev="2.3.0"><![CDATA[-- #1: The overall layout of the entire table.
|
||||
describe customer;
|
||||
+--------------+------------------------------------+
|
||||
| name | type |
|
||||
+--------------+------------------------------------+
|
||||
| c_custkey | bigint |
|
||||
... more scalar columns ...
|
||||
| c_orders | array<struct< |
|
||||
| | o_orderkey:bigint, |
|
||||
| | o_orderstatus:string, |
|
||||
| | o_totalprice:decimal(12,2), |
|
||||
| | o_orderdate:string, |
|
||||
| | o_orderpriority:string, |
|
||||
| | o_clerk:string, |
|
||||
| | o_shippriority:int, |
|
||||
| | o_comment:string, |
|
||||
| | o_lineitems:array<struct< |
|
||||
| | l_partkey:bigint, |
|
||||
| | l_suppkey:bigint, |
|
||||
| | l_linenumber:int, |
|
||||
| | l_quantity:decimal(12,2), |
|
||||
| | l_extendedprice:decimal(12,2), |
|
||||
| | l_discount:decimal(12,2), |
|
||||
| | l_tax:decimal(12,2), |
|
||||
| | l_returnflag:string, |
|
||||
| | l_linestatus:string, |
|
||||
| | l_shipdate:string, |
|
||||
| | l_commitdate:string, |
|
||||
| | l_receiptdate:string, |
|
||||
| | l_shipinstruct:string, |
|
||||
| | l_shipmode:string, |
|
||||
| | l_comment:string |
|
||||
| | >> |
|
||||
| | >> |
|
||||
+--------------+------------------------------------+
|
||||
|
||||
-- #2: The ARRAY column within the table.
|
||||
describe customer.c_orders;
|
||||
+------+------------------------------------+
|
||||
| name | type |
|
||||
+------+------------------------------------+
|
||||
| item | struct< |
|
||||
| | o_orderkey:bigint, |
|
||||
| | o_orderstatus:string, |
|
||||
... more struct fields ...
|
||||
| | o_lineitems:array<struct< |
|
||||
| | l_partkey:bigint, |
|
||||
| | l_suppkey:bigint, |
|
||||
... more nested struct fields ...
|
||||
| | l_comment:string |
|
||||
| | >> |
|
||||
| | > |
|
||||
| pos | bigint |
|
||||
+------+------------------------------------+
|
||||
|
||||
-- #3: The STRUCT that makes up each ARRAY element.
|
||||
-- The fields of the STRUCT act like columns of a table.
|
||||
describe customer.c_orders.item;
|
||||
+-----------------+----------------------------------+
|
||||
| name | type |
|
||||
+-----------------+----------------------------------+
|
||||
| o_orderkey | bigint |
|
||||
| o_orderstatus | string |
|
||||
| o_totalprice | decimal(12,2) |
|
||||
| o_orderdate | string |
|
||||
| o_orderpriority | string |
|
||||
| o_clerk | string |
|
||||
| o_shippriority | int |
|
||||
| o_comment | string |
|
||||
| o_lineitems | array<struct< |
|
||||
| | l_partkey:bigint, |
|
||||
| | l_suppkey:bigint, |
|
||||
... more struct fields ...
|
||||
| | l_comment:string |
|
||||
| | >> |
|
||||
+-----------------+----------------------------------+
|
||||
|
||||
-- #4: The ARRAY nested inside the STRUCT elements of the first ARRAY.
|
||||
describe customer.c_orders.item.o_lineitems;
|
||||
+------+----------------------------------+
|
||||
| name | type |
|
||||
+------+----------------------------------+
|
||||
| item | struct< |
|
||||
| | l_partkey:bigint, |
|
||||
| | l_suppkey:bigint, |
|
||||
... more struct fields ...
|
||||
| | l_comment:string |
|
||||
| | > |
|
||||
| pos | bigint |
|
||||
+------+----------------------------------+
|
||||
|
||||
-- #5: Shorter form of the previous DESCRIBE. Omits the .ITEM portion of the name
|
||||
-- because O_LINEITEMS and other field names provide a way to refer to things
|
||||
-- inside the ARRAY element.
|
||||
describe customer.c_orders.o_lineitems;
|
||||
+------+----------------------------------+
|
||||
| name | type |
|
||||
+------+----------------------------------+
|
||||
| item | struct< |
|
||||
| | l_partkey:bigint, |
|
||||
| | l_suppkey:bigint, |
|
||||
... more struct fields ...
|
||||
| | l_comment:string |
|
||||
| | > |
|
||||
| pos | bigint |
|
||||
+------+----------------------------------+
|
||||
|
||||
-- #6: The STRUCT representing ARRAY elements nested inside
|
||||
-- another ARRAY of STRUCTs. The lack of any complex types
|
||||
-- in this output means this is as far as DESCRIBE can
|
||||
-- descend into the table layout.
|
||||
describe customer.c_orders.o_lineitems.item;
|
||||
+-----------------+---------------+
|
||||
| name | type |
|
||||
+-----------------+---------------+
|
||||
| l_partkey | bigint |
|
||||
| l_suppkey | bigint |
|
||||
... more scalar columns ...
|
||||
| l_comment | string |
|
||||
+-----------------+---------------+
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
|
||||
<p>
|
||||
After the <cmdname>impalad</cmdname> daemons are restarted, the first query against a table can take longer
|
||||
than subsequent queries, because the metadata for the table is loaded before the query is processed. This
|
||||
one-time delay for each table can cause misleading results in benchmark tests or cause unnecessary concern.
|
||||
To <q>warm up</q> the Impala metadata cache, you can issue a <codeph>DESCRIBE</codeph> statement in advance
|
||||
for each table you intend to access later.
|
||||
</p>
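
<p>
For example, a warm-up script run after a restart might simply issue (an illustrative sketch;
substitute your own table names):
</p>

<codeblock>describe sales_db.facts;
describe sales_db.dimensions;</codeblock>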
|
||||
|
||||
<p>
|
||||
When you are dealing with data files stored in HDFS, sometimes it is important to know details such as the
|
||||
path of the data files for an Impala table, and the hostname for the namenode. You can get this information
|
||||
from the <codeph>DESCRIBE FORMATTED</codeph> output. You specify HDFS URIs or path specifications with
|
||||
statements such as <codeph>LOAD DATA</codeph> and the <codeph>LOCATION</codeph> clause of <codeph>CREATE
|
||||
TABLE</codeph> or <codeph>ALTER TABLE</codeph>. You might also use HDFS URIs or paths with Linux commands
|
||||
such as <cmdname>hadoop</cmdname> and <cmdname>hdfs</cmdname> to copy, rename, and so on, data files in HDFS.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
|
||||
|
||||
<p rev="1.2.1">
|
||||
Each table can also have associated table statistics and column statistics. To see these categories of
|
||||
information, use the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and <codeph>SHOW COLUMN
|
||||
STATS <varname>table_name</varname></codeph> statements.
|
||||
<!--
|
||||
For example, the table statistics can often show you the number
|
||||
and total size of the files in the table, even if you have not
|
||||
run <codeph>COMPUTE STATS</codeph>.
|
||||
-->
|
||||
See <xref href="impala_show.xml#show"/> for details.
|
||||
</p>
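
<p>
For example (illustrative; substitute any table name):
</p>

<codeblock>show table stats my_table;
show column stats my_table;</codeblock>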
|
||||
|
||||
<note conref="../shared/impala_common.xml#common/compute_stats_next"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<p>
|
||||
The following example shows the results of both a standard <codeph>DESCRIBE</codeph> and <codeph>DESCRIBE
|
||||
FORMATTED</codeph> for different kinds of schema objects:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<codeph>DESCRIBE</codeph> for a table or a view returns the name, type, and comment for each of the
|
||||
columns. For a view, if the column value is computed by an expression, the column name is automatically
|
||||
generated as <codeph>_c0</codeph>, <codeph>_c1</codeph>, and so on depending on the ordinal number of the
|
||||
column.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
A table created with no special format or storage clauses is designated as a <codeph>MANAGED_TABLE</codeph>
|
||||
(an <q>internal table</q> in Impala terminology). Its data files are stored in an HDFS directory under the
|
||||
default Hive data directory. By default, it uses Text data format.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
A view is designated as <codeph>VIRTUAL_VIEW</codeph> in <codeph>DESCRIBE FORMATTED</codeph> output. Some
|
||||
of its properties are <codeph>NULL</codeph> or blank because they are inherited from the base table. The
|
||||
text of the query that defines the view is part of the <codeph>DESCRIBE FORMATTED</codeph> output.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
A table with additional clauses in the <codeph>CREATE TABLE</codeph> statement has differences in
|
||||
<codeph>DESCRIBE FORMATTED</codeph> output. The output for <codeph>T2</codeph> includes the
|
||||
<codeph>EXTERNAL_TABLE</codeph> keyword because of the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, and
|
||||
different <codeph>InputFormat</codeph> and <codeph>OutputFormat</codeph> fields to reflect the Parquet file
|
||||
format.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<codeblock>[localhost:21000] > create table t1 (x int, y int, s string);
|
||||
Query: create table t1 (x int, y int, s string)
|
||||
[localhost:21000] > describe t1;
|
||||
Query: describe t1
|
||||
Query finished, fetching results ...
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| x | int | |
|
||||
| y | int | |
|
||||
| s | string | |
|
||||
+------+--------+---------+
|
||||
Returned 3 row(s) in 0.13s
|
||||
[localhost:21000] > describe formatted t1;
|
||||
Query: describe formatted t1
|
||||
Query finished, fetching results ...
|
||||
+------------------------------+--------------------------------------------+------------+
|
||||
| name | type | comment |
|
||||
+------------------------------+--------------------------------------------+------------+
|
||||
| # col_name | data_type | comment |
|
||||
| | NULL | NULL |
|
||||
| x | int | None |
|
||||
| y | int | None |
|
||||
| s | string | None |
|
||||
| | NULL | NULL |
|
||||
| # Detailed Table Information | NULL | NULL |
|
||||
| Database: | describe_formatted | NULL |
|
||||
| Owner: | cloudera | NULL |
|
||||
| CreateTime: | Mon Jul 22 17:03:16 EDT 2013 | NULL |
|
||||
| LastAccessTime: | UNKNOWN | NULL |
|
||||
| Protect Mode: | None | NULL |
|
||||
| Retention: | 0 | NULL |
|
||||
| Location: | hdfs://127.0.0.1:8020/user/hive/warehouse/ | |
|
||||
| | describe_formatted.db/t1 | NULL |
|
||||
| Table Type: | MANAGED_TABLE | NULL |
|
||||
| Table Parameters: | NULL | NULL |
|
||||
| | transient_lastDdlTime | 1374526996 |
|
||||
| | NULL | NULL |
|
||||
| # Storage Information | NULL | NULL |
|
||||
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy. | |
|
||||
| | LazySimpleSerDe | NULL |
|
||||
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
|
||||
| OutputFormat: | org.apache.hadoop.hive.ql.io. | |
|
||||
| | HiveIgnoreKeyTextOutputFormat | NULL |
|
||||
| Compressed: | No | NULL |
|
||||
| Num Buckets: | 0 | NULL |
|
||||
| Bucket Columns: | [] | NULL |
|
||||
| Sort Columns: | [] | NULL |
|
||||
+------------------------------+--------------------------------------------+------------+
|
||||
Returned 26 row(s) in 0.03s
|
||||
[localhost:21000] > create view v1 as select x, upper(s) from t1;
|
||||
Query: create view v1 as select x, upper(s) from t1
|
||||
[localhost:21000] > describe v1;
|
||||
Query: describe v1
|
||||
Query finished, fetching results ...
|
||||
+------+--------+---------+
|
||||
| name | type | comment |
|
||||
+------+--------+---------+
|
||||
| x | int | |
|
||||
| _c1 | string | |
|
||||
+------+--------+---------+
|
||||
Returned 2 row(s) in 0.10s
|
||||
[localhost:21000] > describe formatted v1;
|
||||
Query: describe formatted v1
|
||||
Query finished, fetching results ...
|
||||
+------------------------------+------------------------------+----------------------+
|
||||
| name | type | comment |
|
||||
+------------------------------+------------------------------+----------------------+
|
||||
| # col_name | data_type | comment |
|
||||
| | NULL | NULL |
|
||||
| x | int | None |
|
||||
| _c1 | string | None |
|
||||
| | NULL | NULL |
|
||||
| # Detailed Table Information | NULL | NULL |
|
||||
| Database: | describe_formatted | NULL |
|
||||
| Owner: | cloudera | NULL |
|
||||
| CreateTime: | Mon Jul 22 16:56:38 EDT 2013 | NULL |
|
||||
| LastAccessTime: | UNKNOWN | NULL |
|
||||
| Protect Mode: | None | NULL |
|
||||
| Retention: | 0 | NULL |
|
||||
| Table Type: | VIRTUAL_VIEW | NULL |
|
||||
| Table Parameters: | NULL | NULL |
|
||||
| | transient_lastDdlTime | 1374526598 |
|
||||
| | NULL | NULL |
|
||||
| # Storage Information | NULL | NULL |
|
||||
| SerDe Library: | null | NULL |
|
||||
| InputFormat: | null | NULL |
|
||||
| OutputFormat: | null | NULL |
|
||||
| Compressed: | No | NULL |
|
||||
| Num Buckets: | 0 | NULL |
|
||||
| Bucket Columns: | [] | NULL |
|
||||
| Sort Columns: | [] | NULL |
|
||||
| | NULL | NULL |
|
||||
| # View Information | NULL | NULL |
|
||||
| View Original Text: | SELECT x, upper(s) FROM t1 | NULL |
|
||||
| View Expanded Text: | SELECT x, upper(s) FROM t1 | NULL |
|
||||
+------------------------------+------------------------------+----------------------+
|
||||
Returned 28 row(s) in 0.03s
|
||||
[localhost:21000] > create external table t2 (x int, y int, s string) stored as parquet location '/user/cloudera/sample_data';
|
||||
[localhost:21000] > describe formatted t2;
|
||||
Query: describe formatted t2
|
||||
Query finished, fetching results ...
|
||||
+------------------------------+----------------------------------------------------+------------+
|
||||
| name | type | comment |
|
||||
+------------------------------+----------------------------------------------------+------------+
|
||||
| # col_name | data_type | comment |
|
||||
| | NULL | NULL |
|
||||
| x | int | None |
|
||||
| y | int | None |
|
||||
| s | string | None |
|
||||
| | NULL | NULL |
|
||||
| # Detailed Table Information | NULL | NULL |
|
||||
| Database: | describe_formatted | NULL |
|
||||
| Owner: | cloudera | NULL |
|
||||
| CreateTime: | Mon Jul 22 17:01:47 EDT 2013 | NULL |
|
||||
| LastAccessTime: | UNKNOWN | NULL |
|
||||
| Protect Mode: | None | NULL |
|
||||
| Retention: | 0 | NULL |
|
||||
| Location: | hdfs://127.0.0.1:8020/user/cloudera/sample_data | NULL |
|
||||
| Table Type: | EXTERNAL_TABLE | NULL |
|
||||
| Table Parameters: | NULL | NULL |
|
||||
| | EXTERNAL | TRUE |
|
||||
| | transient_lastDdlTime | 1374526907 |
|
||||
| | NULL | NULL |
|
||||
| # Storage Information | NULL | NULL |
|
||||
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
|
||||
| InputFormat: | com.cloudera.impala.hive.serde.ParquetInputFormat | NULL |
|
||||
| OutputFormat: | com.cloudera.impala.hive.serde.ParquetOutputFormat | NULL |
|
||||
| Compressed: | No | NULL |
|
||||
| Num Buckets: | 0 | NULL |
|
||||
| Bucket Columns: | [] | NULL |
|
||||
| Sort Columns: | [] | NULL |
|
||||
+------------------------------+----------------------------------------------------+------------+
|
||||
Returned 27 row(s) in 0.17s</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
|
||||
<p rev="CDH-19187">
|
||||
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
|
||||
typically the <codeph>impala</codeph> user, must have read and execute
|
||||
permissions for all directories that are part of the table.
|
||||
(A table could span multiple HDFS directories if it is partitioned.
|
||||
The directories could be widely scattered because a partition can reside
|
||||
in an arbitrary HDFS directory based on its <codeph>LOCATION</codeph> attribute.)
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/related_info"/>
|
||||
|
||||
<p>
|
||||
<xref href="impala_tables.xml#tables"/>, <xref href="impala_create_table.xml#create_table"/>,
|
||||
<xref href="impala_show.xml#show_tables"/>, <xref href="impala_show.xml#show_create_table"/>
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
231
docs/topics/impala_development.xml
Normal file
@@ -0,0 +1,231 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept id="intro_dev">
|
||||
|
||||
<title>Developing Impala Applications</title>
|
||||
<titlealts audience="PDF"><navtitle>Developing Applications</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
<data name="Category" value="Concepts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The core development language with Impala is SQL. You can also use Java or other languages to interact with
|
||||
Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For
|
||||
specialized kinds of analysis, you can supplement the SQL built-in functions by writing
|
||||
<xref href="impala_udf.xml#udfs">user-defined functions (UDFs)</xref> in C++ or Java.
|
||||
</p>
|
||||
|
||||
<p outputclass="toc inpage"/>
|
||||
</conbody>
|
||||
|
||||
<concept id="intro_sql">
|
||||
|
||||
<title>Overview of the Impala SQL Dialect</title>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="Concepts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p>
|
||||
The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As
|
||||
such, it will feel familiar to users who already run SQL queries on the Hadoop
|
||||
infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in
|
||||
functions. Impala also includes additional built-in functions for common industry features, to simplify
|
||||
porting SQL from non-Hadoop systems.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect
|
||||
might seem familiar:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
The <xref href="impala_select.xml#select">SELECT statement</xref> includes familiar clauses such as <codeph>WHERE</codeph>,
|
||||
<codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, and <codeph>WITH</codeph>.
|
||||
You will find familiar notions such as
|
||||
<xref href="impala_joins.xml#joins">joins</xref>, <xref href="impala_functions.xml#builtins">built-in
|
||||
functions</xref> for processing strings, numbers, and dates,
|
||||
<xref href="impala_aggregate_functions.xml#aggregate_functions">aggregate functions</xref>,
|
||||
<xref href="impala_subqueries.xml#subqueries">subqueries</xref>, and
|
||||
<xref href="impala_operators.xml#comparison_operators">comparison operators</xref>
|
||||
such as <codeph>IN()</codeph> and <codeph>BETWEEN</codeph>.
|
||||
The <codeph>SELECT</codeph> statement is the place where SQL standards compliance is most important.
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
From the data warehousing world, you will recognize the notion of
|
||||
<xref href="impala_partitioning.xml#partitioning">partitioned tables</xref>.
|
||||
One or more columns serve as partition keys, and the data is physically arranged so that
|
||||
queries that refer to the partition key columns in the <codeph>WHERE</codeph> clause
|
||||
can skip partitions that do not match the filter conditions. For example, if you have 10
|
||||
years worth of data and use a clause such as <codeph>WHERE year = 2015</codeph>,
|
||||
<codeph>WHERE year > 2010</codeph>, or <codeph>WHERE year IN (2014, 2015)</codeph>,
|
||||
Impala skips all the data for non-matching years, greatly reducing the amount of I/O
for the query. (A short sketch follows this list.)
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li rev="1.2">
|
||||
<p>
|
||||
In Impala 1.2 and higher, <xref href="impala_udf.xml#udfs">UDFs</xref> let you perform custom comparisons
|
||||
and transformation logic during <codeph>SELECT</codeph> and <codeph>INSERT...SELECT</codeph> statements.
|
||||
</p>
|
||||
</li>
|
||||
</ul>
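
<p>
A minimal sketch of partition pruning (the table and column names are hypothetical):
</p>

<codeblock>-- The partition key column YEAR becomes part of the HDFS directory structure.
create table logs (msg string) partitioned by (year int);
-- A query that filters on YEAR reads only the matching partitions.
select count(*) from logs where year = 2015;</codeblock>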
|
||||
|
||||
<p>
|
||||
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect
|
||||
might require some learning and practice for you to become proficient in the Hadoop environment:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
Impala SQL is focused on queries and includes relatively little DML. There is no <codeph>UPDATE</codeph>
|
||||
or <codeph>DELETE</codeph> statement. Stale data is typically discarded (by <codeph>DROP TABLE</codeph>
|
||||
or <codeph>ALTER TABLE ... DROP PARTITION</codeph> statements) or replaced (by <codeph>INSERT
|
||||
OVERWRITE</codeph> statements).
|
||||
</p>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<p>
|
||||
All data creation is done by <codeph>INSERT</codeph> statements, which typically insert data in bulk by
|
||||
querying from other tables. There are two variations, <codeph>INSERT INTO</codeph> which appends to the
|
||||
existing data, and <codeph>INSERT OVERWRITE</codeph> which replaces the entire contents of a table or
|
||||
partition (similar to <codeph>TRUNCATE TABLE</codeph> followed by a new <codeph>INSERT</codeph>).
|
||||
Although there is an <codeph>INSERT ... VALUES</codeph> syntax to create a small number of values in
|
||||
a single statement, it is far more efficient to use <codeph>INSERT ... SELECT</codeph> to copy
and transform large amounts of data from one table to another in a single operation. (A short
sketch follows this list.)
|
||||
</p>
</li>

<li>
<p>
You often construct Impala table definitions and data files in some other environment, and then attach
Impala so that it can run real-time queries. The same data files and table metadata are shared with other
components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data
inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components
can write files in formats such as Parquet and Avro that can then be queried by Impala.
</p>
</li>

<li>
<p>
Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL
includes some idioms that you might find in the import utilities for traditional database systems. For
example, you can create a table that reads comma-separated or tab-separated text files, specifying the
separator in the <codeph>CREATE TABLE</codeph> statement. You can create <b>external tables</b> that read
existing data files but do not move or transform them.
</p>
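
<p>
For example, a sketch of an external table over existing comma-separated files
(the column definitions and HDFS path are hypothetical):
</p>
<codeblock>CREATE EXTERNAL TABLE csv_staging (id INT, name STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/impala/staging/csv_data';</codeblock>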
</li>

<li>
<p>
Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does
not require length constraints on string data types. For example, you can define a database column as
<codeph>STRING</codeph> with unlimited length, rather than <codeph>CHAR(1)</codeph> or
<codeph>VARCHAR(64)</codeph>. <ph rev="2.0.0">(Although in Impala 2.0 and later, you can also use
length-constrained <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types.)</ph>
</p>
</li>

</ul>

<p>
<b>Related information:</b> <xref href="impala_langref.xml#langref"/>, especially
<xref href="impala_langref_sql.xml#langref_sql"/> and <xref href="impala_functions.xml#builtins"/>
</p>
</conbody>
</concept>

<!-- Bunch of potential concept topics for future consideration. Major areas of Impala modelled on areas of discussion for Oracle Database, and distributed databases in general. -->

<concept id="intro_datatypes" audience="Cloudera">

<title>Overview of Impala SQL Data Types</title>

<conbody/>
</concept>

<concept id="intro_network" audience="Cloudera">

<title>Overview of Impala Network Topology</title>

<conbody/>
</concept>

<concept id="intro_cluster" audience="Cloudera">

<title>Overview of Impala Cluster Topology</title>

<conbody/>
</concept>

<concept id="intro_apis">

<title>Overview of Impala Programming Interfaces</title>
<prolog>
<metadata>
<data name="Category" value="JDBC"/>
<data name="Category" value="ODBC"/>
<data name="Category" value="Hue"/>
</metadata>
</prolog>

<conbody>

<p>
You can connect and submit requests to the Impala daemons through:
</p>

<ul>
<li>
The <codeph><xref href="impala_impala_shell.xml#impala_shell">impala-shell</xref></codeph> interactive
command interpreter.
</li>

<li>
The <xref href="http://gethue.com/" scope="external" format="html">Hue</xref> web-based user interface.
</li>

<li>
<xref href="impala_jdbc.xml#impala_jdbc">JDBC</xref>.
</li>

<li>
<xref href="impala_odbc.xml#impala_odbc">ODBC</xref>.
</li>
</ul>

<p>
With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications
running on non-Linux platforms. You can also use Impala in combination with various Business Intelligence
tools that use the JDBC and ODBC interfaces.
</p>

<p>
Each <codeph>impalad</codeph> daemon process, running on separate nodes in a cluster, listens to
<xref href="impala_ports.xml#ports">several ports</xref> for incoming requests. Requests from
<codeph>impala-shell</codeph> and Hue are routed to the <codeph>impalad</codeph> daemons through the same
port. The <codeph>impalad</codeph> daemons listen on separate ports for JDBC and ODBC requests.
</p>
</conbody>
</concept>
</concept>
36
docs/topics/impala_disable_cached_reads.xml
Normal file
@@ -0,0 +1,36 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_cached_reads" rev="1.4.0">

<title>DISABLE_CACHED_READS Query Option</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="HDFS"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DISABLE_CACHED_READS query option</indexterm>
Prevents Impala from reading data files that are <q>pinned</q> in memory
through the HDFS caching feature. Primarily a debugging option for
cases where processing of HDFS cached data is concentrated on a single
host, leading to excessive CPU usage on that host.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>

<p conref="../shared/impala_common.xml#common/default_false"/>

<p conref="../shared/impala_common.xml#common/added_in_140"/>

</conbody>
</concept>
38
docs/topics/impala_disable_codegen.xml
Normal file
@@ -0,0 +1,38 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_codegen">

<title>DISABLE_CODEGEN Query Option</title>
<titlealts audience="PDF"><navtitle>DISABLE_CODEGEN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Performance"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DISABLE_CODEGEN query option</indexterm>
This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query
fails with an <q>illegal instruction</q> or other hardware-specific message, try setting
<codeph>DISABLE_CODEGEN=true</codeph> and running the query again. If the query succeeds only when the
<codeph>DISABLE_CODEGEN</codeph> option is turned on, submit the problem to <keyword keyref="support_org"/> and include that
detail in the problem report. Do not otherwise run with this setting turned on, because it results in lower
overall performance.
</p>

<p>
Because the code generation phase adds a small amount of overhead for each query, you might turn on the
<codeph>DISABLE_CODEGEN</codeph> option to achieve maximum throughput when running many short-lived queries
against small tables.
</p>
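
<p>
For example, in <cmdname>impala-shell</cmdname> (a hedged sketch; substitute the
query that triggered the problem):
</p>
<codeblock>SET DISABLE_CODEGEN=true;
-- Re-run the query that crashed or failed with an "illegal instruction" error.
-- If it now succeeds, report the issue and include this detail.
SET DISABLE_CODEGEN=false;</codeblock>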

<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>

</conbody>
</concept>
29
docs/topics/impala_disable_outermost_topn.xml
Normal file
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_outermost_topn" rev="2.5.0">

<title>DISABLE_OUTERMOST_TOPN Query Option</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.5.0">
<indexterm audience="Cloudera">DISABLE_OUTERMOST_TOPN query option</indexterm>
</p>

<p>
<b>Type:</b> Boolean
</p>

<p>
<b>Default:</b> false
</p>
</conbody>
</concept>
65
docs/topics/impala_disable_row_runtime_filtering.xml
Normal file
@@ -0,0 +1,65 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_row_runtime_filtering" rev="2.5.0">

<title>DISABLE_ROW_RUNTIME_FILTERING Query Option (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DISABLE_ROW_RUNTIME_FILTERING</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.5.0">
<indexterm audience="Cloudera">DISABLE_ROW_RUNTIME_FILTERING query option</indexterm>
The <codeph>DISABLE_ROW_RUNTIME_FILTERING</codeph> query option
reduces the scope of the runtime filtering feature. Queries still dynamically prune
partitions, but do not apply the filtering logic to individual rows within partitions.
</p>

<p>
This option applies only to queries against Parquet tables. For other file formats, Impala
only prunes at the level of partitions, not individual rows.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false"/>

<p conref="../shared/impala_common.xml#common/added_in_250"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
Impala automatically evaluates whether the per-row filters are being
effective at reducing the amount of intermediate data. Therefore,
this option is typically only needed for the rare case where Impala
cannot accurately determine how effective the per-row filtering is
for a query.
</p>

<p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>

<p>
Because this setting only improves query performance in very specific
circumstances, depending on the query characteristics and data distribution,
only use it when you determine through benchmarking that it improves
performance of specific expensive queries.
Consider setting this query option immediately before the expensive query and
unsetting it immediately afterward.
</p>
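
<p>
For example, a hedged sketch of that set-and-unset pattern:
</p>
<codeblock>SET DISABLE_ROW_RUNTIME_FILTERING=true;
-- Run the expensive query that benefits from partition-only filtering here.
SET DISABLE_ROW_RUNTIME_FILTERING=false;</codeblock>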

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/>
<!-- , <xref href="impala_partitioning.xml#dynamic_partition_pruning"/> -->
</p>

</conbody>
</concept>
45
docs/topics/impala_disable_streaming_preaggregations.xml
Normal file
@@ -0,0 +1,45 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_streaming_preaggregations" rev="2.5.0 IMPALA-1305">

<title>DISABLE_STREAMING_PREAGGREGATIONS Query Option (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DISABLE_STREAMING_PREAGGREGATIONS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.5.0 IMPALA-1305">
<indexterm audience="Cloudera">DISABLE_STREAMING_PREAGGREGATIONS query option</indexterm>
Turns off the <q>streaming preaggregation</q> optimization that is available in <keyword keyref="impala25_full"/>
and higher. This optimization reduces unnecessary work performed by queries that perform aggregation
operations on columns with few or no duplicate values, for example <codeph>DISTINCT <varname>id_column</varname></codeph>
or <codeph>GROUP BY <varname>unique_column</varname></codeph>. If the optimization causes regressions in
existing queries that use aggregation functions, you can turn it off as needed by setting this query option.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>

<note conref="../shared/impala_common.xml#common/one_but_not_true"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Typically, queries that would require enabling this option involve very large numbers of
aggregated values, such as a billion or more distinct keys being processed on each
worker node.
</p>
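
<p>
For example (a hedged sketch; per the note above, set this option to <codeph>1</codeph>
rather than <codeph>true</codeph>):
</p>
<codeblock>SET DISABLE_STREAMING_PREAGGREGATIONS=1;
-- Re-run the aggregation query that regressed, then restore the default.
SET DISABLE_STREAMING_PREAGGREGATIONS=0;</codeblock>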

<p conref="../shared/impala_common.xml#common/added_in_250"/>

</conbody>
</concept>
53
docs/topics/impala_disable_unsafe_spills.xml
Normal file
@@ -0,0 +1,53 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="disable_unsafe_spills">

<title>DISABLE_UNSAFE_SPILLS Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DISABLE_UNSAFE_SPILLS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Memory"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.0.0">
<indexterm audience="Cloudera">DISABLE_UNSAFE_SPILLS query option</indexterm>
Enable this option if you prefer to have queries fail when they exceed the Impala memory limit, rather than
write temporary data to disk.
</p>

<p>
Queries that <q>spill</q> to disk typically complete successfully, whereas in earlier Impala releases they would have failed.
However, queries with exorbitant memory requirements due to missing statistics or inefficient join clauses could
become so slow as a result that you would rather have them cancelled automatically and reduce the memory
usage through standard Impala tuning techniques.
</p>

<p>
This option prevents only <q>unsafe</q> spill operations, meaning that one or more tables are missing
statistics or the query does not include a hint to set the most efficient mechanism for a join or
<codeph>INSERT ... SELECT</codeph> into a partitioned table. These are the queries most likely to result in
suboptimal execution plans that could cause unnecessary spilling. Therefore, leaving this option enabled is a
good way to find tables on which to run the <codeph>COMPUTE STATS</codeph> statement.
</p>
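
<p>
A hedged sketch of that workflow (<codeph>t1</codeph> is a hypothetical table name):
</p>
<codeblock>SET DISABLE_UNSAFE_SPILLS=true;
-- If a query against t1 now fails instead of spilling, its tables
-- probably lack statistics; compute them and try again.
COMPUTE STATS t1;</codeblock>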

<p>
See <xref href="impala_scalability.xml#spill_to_disk"/> for information about the <q>spill to disk</q>
feature for queries processing large result sets with joins, <codeph>ORDER BY</codeph>, <codeph>GROUP
BY</codeph>, <codeph>DISTINCT</codeph>, aggregation functions, or analytic functions.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>

<p conref="../shared/impala_common.xml#common/added_in_20"/>
</conbody>
</concept>
129
docs/topics/impala_disk_space.xml
Normal file
@@ -0,0 +1,129 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disk_space">

<title>Managing Disk Space for Impala Data</title>
<titlealts audience="PDF"><navtitle>Managing Disk Space</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Disk Storage"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Compression"/>
</metadata>
</prolog>

<conbody>

<p>
Although Impala typically works with many large files in an HDFS storage system with plenty of capacity,
there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques
to minimize space consumption and file duplication.
</p>

<ul>
<li>
<p>
Use compact binary file formats where practical. Numeric and time-based data in particular can be stored
in more compact form in binary data files. Depending on the file format, various compression and encoding
features can reduce file size even further. You can specify the <codeph>STORED AS</codeph> clause as part
of the <codeph>CREATE TABLE</codeph> statement, or <codeph>ALTER TABLE</codeph> with the <codeph>SET
FILEFORMAT</codeph> clause for an existing table or partition within a partitioned table. See
<xref href="impala_file_formats.xml#file_formats"/> for details about file formats, especially
<xref href="impala_parquet.xml#parquet"/>. See <xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/> for syntax details.
</p>
</li>

<li>
<p>
You manage underlying data files differently depending on whether the corresponding Impala table is
defined as an <xref href="impala_tables.xml#internal_tables">internal</xref> or
<xref href="impala_tables.xml#external_tables">external</xref> table:
</p>
<ul>
<li>
Use the <codeph>DESCRIBE FORMATTED</codeph> statement to check if a particular table is internal
(managed by Impala) or external, and to see the physical location of the data files in HDFS. See
<xref href="impala_describe.xml#describe"/> for details.
</li>

<li>
For Impala-managed (<q>internal</q>) tables, use <codeph>DROP TABLE</codeph> statements to remove
data files. See <xref href="impala_drop_table.xml#drop_table"/> for details.
</li>

<li>
For tables not managed by Impala (<q>external</q> tables), use appropriate HDFS-related commands such
as <codeph>hadoop fs</codeph>, <codeph>hdfs dfs</codeph>, or <codeph>distcp</codeph>, to create, move,
copy, or delete files within HDFS directories that are accessible by the <codeph>impala</codeph> user.
Issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement after adding or removing any
files from the data directory of an external table. See <xref href="impala_refresh.xml#refresh"/> for
details. (A brief example follows this list.)
</li>

<li>
Use external tables to reference HDFS data files in their original location. With this technique, you
avoid copying the files, and you can map more than one Impala table to the same set of data files. When
you drop the Impala table, the data files are left undisturbed. See
<xref href="impala_tables.xml#external_tables"/> for details.
</li>

<li>
Use the <codeph>LOAD DATA</codeph> statement to move HDFS files into the data directory for an Impala
table from inside Impala, without the need to specify the HDFS path of the destination directory. This
technique works for both internal and external tables. See
<xref href="impala_load_data.xml#load_data"/> for details.
</li>
</ul>
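
<p>
For example (a hedged sketch; the table name and HDFS path are hypothetical),
after adding files to an external table's directory outside of Impala:
</p>
<codeblock>-- Files were added outside of Impala, for example:
--   hdfs dfs -put new_data.csv /user/impala/external/t1/
-- Make the new files visible to Impala queries:
REFRESH t1;</codeblock>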
</li>

<li>
<p>
Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space
might not be reclaimed for use by other files until sometime later, when the trashcan is emptied. See
<xref href="impala_drop_table.xml#drop_table"/> and the FAQ entry
<xref href="impala_faq.xml#faq_sql/faq_drop_table_space"/> for details. See
<xref href="impala_prereqs.xml#prereqs_account"/> for permissions needed for the HDFS trashcan to operate
correctly.
</p>
</li>

<li>
<p>
Drop all tables in a database before dropping the database itself. See
<xref href="impala_drop_database.xml#drop_database"/> for details.
</p>
</li>

<li>
<p>
Clean up temporary files after failed <codeph>INSERT</codeph> statements. If an <codeph>INSERT</codeph>
statement encounters an error, and you see a directory named <filepath>.impala_insert_staging</filepath>
or <filepath>_impala_insert_staging</filepath> left behind in the data directory for the table, it might
contain temporary data files taking up space in HDFS. You might be able to salvage these data files, for
example if they are complete but could not be moved into place due to a permission error. Or, you might
delete those files through commands such as <codeph>hadoop fs</codeph> or <codeph>hdfs dfs</codeph>, to
reclaim space before retrying the <codeph>INSERT</codeph>. Issue <codeph>DESCRIBE FORMATTED
<varname>table_name</varname></codeph> to see the HDFS path where you can check for temporary files.
</p>
</li>

<li rev="1.4.0">
<p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
</li>

<li rev="2.2.0">
<p>
If you use the Amazon Simple Storage Service (S3) as a place to offload
data to reduce the volume of local storage, Impala 2.2.0 and higher
can query the data directly from S3.
See <xref href="impala_s3.xml#s3"/> for details.
</p>
</li>
</ul>
</conbody>
</concept>
61
docs/topics/impala_distinct.xml
Normal file
@@ -0,0 +1,61 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="distinct">

<title>DISTINCT Operator</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DISTINCT operator</indexterm>
The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
remove duplicates:
</p>

<codeblock>-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;</codeblock>

<p>
You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
<codeph>COUNT()</codeph>, to find how many different values a column contains:
</p>

<codeblock>-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
select count(distinct c_salutation, c_last_name) from customer;</codeblock>

<p>
One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
aggregation function in the same query. For example, you could not have a single query with both
<codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
<codeph>SELECT</codeph> list.
</p>

<p conref="../shared/impala_common.xml#common/zero_length_strings"/>

<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>

<note>
<p>
In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
</p>
</note>
</conbody>
</concept>
91
docs/topics/impala_dml.xml
Normal file
@@ -0,0 +1,91 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="dml">

<title>DML Statements</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DML"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Tables"/>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
</metadata>
</prolog>

<conbody>

<p>
DML refers to <q>Data Manipulation Language</q>, a subset of SQL statements that modify the data stored in
tables. Because Impala focuses on query performance and leverages the append-only nature of HDFS storage,
currently Impala only supports a small set of DML statements:
</p>

<ul>
<li>
<xref keyref="delete"/>. Works for Kudu tables only.
</li>

<li>
<xref keyref="insert"/>.
</li>

<li>
<xref keyref="load_data"/>. Does not apply for HBase or Kudu tables.
</li>

<li>
<xref keyref="update"/>. Works for Kudu tables only.
</li>

<li>
<xref keyref="upsert"/>. Works for Kudu tables only.
</li>
</ul>

<p>
<codeph>INSERT</codeph> in Impala is primarily optimized for inserting large volumes of data in a single
statement, to make effective use of the multi-megabyte HDFS blocks. This is how you create new
data files in Impala. If you intend to insert one or a few rows at a time, such as using the <codeph>INSERT ...
VALUES</codeph> syntax, that technique is much more efficient for Impala tables stored in HBase. See
<xref href="impala_hbase.xml#impala_hbase"/> for details.
</p>

<p>
<codeph>LOAD DATA</codeph> moves existing data files into the directory for an Impala table, making them
immediately available for Impala queries. This is one way in Impala to work with data files produced by other
Hadoop components. (<codeph>CREATE EXTERNAL TABLE</codeph> is the other alternative; with external tables,
you can query existing data files, while the files remain in their original location.)
</p>

<p>
In <keyword keyref="impala28_full"/> and higher, Impala does support the <codeph>UPDATE</codeph>, <codeph>DELETE</codeph>,
and <codeph>UPSERT</codeph> statements for Kudu tables.
For HDFS or S3 tables, to simulate the effects of an <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> statement
in other database systems, typically you use <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> to copy data
from one table to another, filtering out or changing the appropriate rows during the copy operation.
</p>
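
<p>
For example, a hedged sketch of simulating a <codeph>DELETE</codeph> on an HDFS
table by copying everything except the unwanted rows (the table and column names
are hypothetical):
</p>
<codeblock>CREATE TABLE sales_clean AS
  SELECT * FROM sales WHERE status != 'obsolete';</codeblock>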

<p>
You can also achieve a result similar to <codeph>UPDATE</codeph> by using Impala tables stored in HBase.
When you insert a row into an HBase table, and the table
already contains a row with the same value for the key column, the older row is hidden, effectively the same
as a single-row <codeph>UPDATE</codeph>.
</p>

<p rev="2.6.0">
Impala can perform DML operations for tables or partitions stored in the Amazon S3 filesystem
with <keyword keyref="impala26_full"/> and higher. See <xref href="impala_s3.xml#s3"/> for details.
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
The other major classifications of SQL statements are data definition language (see
<xref href="impala_ddl.xml#ddl"/>) and queries (see <xref href="impala_select.xml#select"/>).
</p>
</conbody>
</concept>
100
docs/topics/impala_double.xml
Normal file
@@ -0,0 +1,100 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="double">

<title>DOUBLE Data Type</title>
<titlealts audience="PDF"><navtitle>DOUBLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>

<conbody>

<p>
A double precision floating-point data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER
TABLE</codeph> statements.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>

<codeblock><varname>column_name</varname> DOUBLE</codeblock>

<p>
<b>Range:</b> 4.94065645841246544e-324d .. 1.79769313486231570e+308, positive or negative
</p>

<p>
<b>Precision:</b> 15 to 17 significant digits, depending on usage. The number of significant digits does
not depend on the position of the decimal point.
</p>

<p>
<b>Representation:</b> The values are stored in 8 bytes, using
<xref href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format" scope="external" format="html">IEEE 754 Double Precision Binary Floating Point</xref> format.
</p>

<p>
<b>Conversions:</b> Impala does not automatically convert <codeph>DOUBLE</codeph> to any other type. You can
use <codeph>CAST()</codeph> to convert <codeph>DOUBLE</codeph> values to <codeph>FLOAT</codeph>,
<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>, <codeph>BIGINT</codeph>,
<codeph>STRING</codeph>, <codeph>TIMESTAMP</codeph>, or <codeph>BOOLEAN</codeph>. You can use exponential
notation in <codeph>DOUBLE</codeph> literals or when casting from <codeph>STRING</codeph>, for example
<codeph>1.0e6</codeph> to represent one million.
<ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
</p>
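
<p>
For example, exponential notation in a literal and in a cast from <codeph>STRING</codeph>:
</p>
<codeblock>-- Both expressions evaluate to one million.
SELECT 1.0e6, CAST('1.0e6' AS DOUBLE);</codeblock>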

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
The data type <codeph>REAL</codeph> is an alias for <codeph>DOUBLE</codeph>.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>CREATE TABLE t1 (x DOUBLE);
SELECT CAST(1000.5 AS DOUBLE);
</codeblock>

<p conref="../shared/impala_common.xml#common/partitioning_imprecise"/>

<p conref="../shared/impala_common.xml#common/hbase_ok"/>

<p conref="../shared/impala_common.xml#common/parquet_ok"/>

<p conref="../shared/impala_common.xml#common/text_bulky"/>

<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->

<p conref="../shared/impala_common.xml#common/internals_8_bytes"/>

<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->

<p conref="../shared/impala_common.xml#common/column_stats_constant"/>

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<!-- This conref appears under SUM(), AVG(), FLOAT, and DOUBLE topics. -->

<p conref="../shared/impala_common.xml#common/sum_double"/>

<p conref="../shared/impala_common.xml#common/float_double_decimal_caveat"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_math_functions.xml#math_functions"/>,
<xref href="impala_float.xml#float"/>
</p>
</conbody>
</concept>
35
docs/topics/impala_drop_data_source.xml
Normal file
@@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" rev="1.4.0" id="drop_data_source">

<title>DROP DATA SOURCE Statement</title>
<titlealts audience="PDF"><navtitle>DROP DATA SOURCE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DROP DATA SOURCE statement</indexterm>
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
</conbody>
</concept>
130
docs/topics/impala_drop_database.xml
Normal file
@@ -0,0 +1,130 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="drop_database">

<title>DROP DATABASE Statement</title>
<titlealts audience="PDF"><navtitle>DROP DATABASE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Databases"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DROP DATABASE statement</indexterm>
Removes a database from the system. The physical operations involve removing the metadata for the database
from the metastore, and deleting the corresponding <codeph>*.db</codeph> directory from HDFS.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>DROP (DATABASE|SCHEMA) [IF EXISTS] <varname>database_name</varname> <ph rev="2.3.0">[RESTRICT | CASCADE]</ph>;</codeblock>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
By default, the database must be empty before it can be dropped, to avoid losing any data.
</p>

<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, you can include the <codeph>CASCADE</codeph>
clause to make Impala drop all tables and other objects in the database before dropping the database itself.
The <codeph>RESTRICT</codeph> clause enforces the original requirement that the database be empty
before being dropped. Because the <codeph>RESTRICT</codeph> behavior is still the default, this
clause is optional.
</p>
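
<p>
For example (the database name is hypothetical):
</p>
<codeblock>-- Fails if my_db still contains any tables, views, or functions.
DROP DATABASE IF EXISTS my_db RESTRICT;
-- Drops all objects in my_db first, then the database itself.
DROP DATABASE IF EXISTS my_db CASCADE;</codeblock>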

<p rev="2.3.0">
The automatic dropping resulting from the <codeph>CASCADE</codeph> clause follows the same rules as the
corresponding <codeph>DROP TABLE</codeph>, <codeph>DROP VIEW</codeph>, and <codeph>DROP FUNCTION</codeph> statements.
In particular, the HDFS directories and data files for any external tables are left behind when the
tables are removed.
</p>

<p>
When you do not use the <codeph>CASCADE</codeph> clause, drop or move all the objects inside the database manually
before dropping the database itself:
</p>

<ul>
<li>
<p>
Use the <codeph>SHOW TABLES</codeph> statement to locate all tables and views in the database,
and issue <codeph>DROP TABLE</codeph> and <codeph>DROP VIEW</codeph> statements to remove them all.
</p>
</li>
<li>
<p>
Use the <codeph>SHOW FUNCTIONS</codeph> and <codeph>SHOW AGGREGATE FUNCTIONS</codeph> statements
to locate all user-defined functions in the database, and issue <codeph>DROP FUNCTION</codeph>
and <codeph>DROP AGGREGATE FUNCTION</codeph> statements to remove them all.
</p>
</li>
<li>
<p>
To keep tables or views contained by a database while removing the database itself, use
<codeph>ALTER TABLE</codeph> and <codeph>ALTER VIEW</codeph> to move the relevant
objects to a different database before dropping the original database.
</p>
</li>
</ul>

<p>
You cannot drop the current database, that is, the database your session connected to
either through the <codeph>USE</codeph> statement or the <codeph>-d</codeph> option of <cmdname>impala-shell</cmdname>.
Issue a <codeph>USE</codeph> statement to switch to a different database first.
Because the <codeph>default</codeph> database is always available, issuing
<codeph>USE default</codeph> is a convenient way to leave the current database
before dropping it.
</p>
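
<p>
For example (assuming <codeph>my_db</codeph> is the current database):
</p>
<codeblock>USE default;
DROP DATABASE my_db;</codeblock>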

<p conref="../shared/impala_common.xml#common/hive_blurb"/>

<p>
When you drop a database in Impala, the database can no longer be used by Hive.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<!-- Better to conref the same examples in both places. -->

<p>
See <xref href="impala_create_database.xml#create_database"/> for examples covering <codeph>CREATE
DATABASE</codeph>, <codeph>USE</codeph>, and <codeph>DROP DATABASE</codeph>.
</p>

<p conref="../shared/impala_common.xml#common/s3_blurb"/>

<p conref="../shared/impala_common.xml#common/s3_ddl"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have write
permission for the directory associated with the database.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock conref="../shared/impala_common.xml#common/create_drop_db_example"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_databases.xml#databases"/>, <xref href="impala_create_database.xml#create_database"/>,
<xref href="impala_use.xml#use"/>, <xref href="impala_show.xml#show_databases"/>, <xref href="impala_drop_table.xml#drop_table"/>
</p>
</conbody>
</concept>
127
docs/topics/impala_drop_function.xml
Normal file
@@ -0,0 +1,127 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="drop_function">

<title>DROP FUNCTION Statement</title>
<titlealts audience="PDF"><navtitle>DROP FUNCTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="UDFs"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DROP FUNCTION statement</indexterm>
Removes a user-defined function (UDF), so that it is not available for execution during Impala
<codeph>SELECT</codeph> or <codeph>INSERT</codeph> operations.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<p>
To drop C++ UDFs and UDAs:
</p>

<codeblock>DROP [AGGREGATE] FUNCTION [IF EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>(<varname>type</varname>[, <varname>type</varname>...])</codeblock>

<note rev="2.5.0 IMPALA-2843 CDH-39148">
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The preceding syntax, which includes the function signature, also applies to Java UDFs that were created
using the corresponding <codeph>CREATE FUNCTION</codeph> syntax that includes the argument and return types.
After upgrading to <keyword keyref="impala25_full"/> or higher, consider re-creating all Java UDFs with the
<codeph>CREATE FUNCTION</codeph> syntax that does not include the function signature. Java UDFs created this
way are now persisted in the metastore database and do not need to be re-created after an Impala restart.
</p>
</note>

<p rev="2.5.0 IMPALA-2843 CDH-39148">
To drop Java UDFs (created using the <codeph>CREATE FUNCTION</codeph> syntax with no function signature):
</p>

<codeblock rev="2.5.0">DROP FUNCTION [IF EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname></codeblock>

<!--
Examples:
CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;
-->

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
Because the same function name could be overloaded with different argument signatures, you specify the
argument types to identify the exact function to drop.
</p>
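
<p>
For example (a hedged sketch with a hypothetical overloaded function):
</p>
<codeblock>-- The argument list identifies which overloaded version to remove.
DROP FUNCTION my_func(STRING);
DROP FUNCTION my_func(STRING, STRING);</codeblock>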

<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

<p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, does not need any
particular HDFS permissions to perform this statement.
All read and write operations are on the metastore database,
not HDFS files and directories.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The following example shows how to drop Java functions created with the signatureless
<codeph>CREATE FUNCTION</codeph> syntax in <keyword keyref="impala25_full"/> and higher.
Issuing <codeph>DROP FUNCTION <varname>function_name</varname></codeph> removes all the
overloaded functions under that name.
(See <xref href="impala_create_function.xml#create_function"/> for a longer example
showing how to set up such functions in the first place.)
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
create function my_func location '/user/impala/udfs/udf-examples-cdh570.jar'
symbol='com.cloudera.impala.TestUdf';

show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | my_func(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | my_func(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | my_func(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...

drop function my_func;
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...
</codeblock>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_udf.xml#udfs"/>, <xref href="impala_create_function.xml#create_function"/>
</p>
</conbody>
</concept>
71
docs/topics/impala_drop_role.xml
Normal file
@@ -0,0 +1,71 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="drop_role">

<title>DROP ROLE Statement (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DROP ROLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="DDL"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Sentry"/>
<data name="Category" value="Security"/>
<data name="Category" value="Roles"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. -->
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DROP ROLE statement</indexterm>
<!-- Copied from Sentry docs. Turn into conref. I did some rewording for clarity. -->
The <codeph>DROP ROLE</codeph> statement removes a role from the metastore database. Once dropped, the role
is revoked for all users to whom it was previously assigned, and all privileges granted to that role are
revoked. Queries that are already executing are not affected. Impala verifies the role information
approximately every 60 seconds, so the effects of <codeph>DROP ROLE</codeph> might not take effect for new
Impala queries for a brief period.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>DROP ROLE <varname>role_name</varname>
</codeblock>

<p conref="../shared/impala_common.xml#common/privileges_blurb"/>

<p>
Only administrative users (initially, a predefined set of users specified in the Sentry service configuration
file) can use this statement.
</p>

<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>

<p>
Impala makes use of any roles and privileges specified by the <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements in Hive, and Hive makes use of any roles and privileges specified by the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala. The Impala <codeph>GRANT</codeph>
and <codeph>REVOKE</codeph> statements for privileges do not require the <codeph>ROLE</codeph> keyword to be
repeated before each role name, unlike the equivalent Hive statements.
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>,
<xref href="impala_revoke.xml#revoke"/>, <xref href="impala_create_role.xml#create_role"/>,
<xref href="impala_show.xml#show"/>
</p>

<!-- To do: nail down the new SHOW syntax, e.g. SHOW ROLES, SHOW CURRENT ROLES, SHOW GROUPS. -->

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>

</conbody>
</concept>
279
docs/topics/impala_drop_stats.xml
Normal file
@@ -0,0 +1,279 @@
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||||
<concept rev="2.1.0" id="drop_stats">
|
||||
|
||||
<title>DROP STATS Statement</title>
|
||||
<titlealts audience="PDF"><navtitle>DROP STATS</navtitle></titlealts>
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
<data name="Category" value="SQL"/>
|
||||
<data name="Category" value="DDL"/>
|
||||
<data name="Category" value="ETL"/>
|
||||
<data name="Category" value="Ingest"/>
|
||||
<data name="Category" value="Tables"/>
|
||||
<data name="Category" value="Performance"/>
|
||||
<data name="Category" value="Scalability"/>
|
||||
<data name="Category" value="Developers"/>
|
||||
<data name="Category" value="Data Analysts"/>
|
||||
</metadata>
|
||||
</prolog>
|
||||
|
||||
<conbody>
|
||||
|
||||
<p rev="2.1.0">
|
||||
<indexterm audience="Cloudera">DROP STATS statement</indexterm>
|
||||
Removes the specified statistics from a table or partition. The statistics were originally created by the
|
||||
<codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> statement.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
|
||||
<codeblock rev="2.1.0">DROP STATS [<varname>database_name</varname>.]<varname>table_name</varname>
|
||||
DROP INCREMENTAL STATS [<varname>database_name</varname>.]<varname>table_name</varname> PARTITION (<varname>partition_spec</varname>)
|
||||
|
||||
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
|
||||
|
||||
<p>
|
||||
<codeph>DROP STATS</codeph> removes all statistics from the table, whether created by <codeph>COMPUTE
|
||||
STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>.
|
||||
</p>
|
||||
|
||||
<p rev="2.1.0">
|
||||
<codeph>DROP INCREMENTAL STATS</codeph> only affects incremental statistics for a single partition, specified
|
||||
through the <codeph>PARTITION</codeph> clause. The incremental stats are marked as outdated, so that they are
|
||||
recomputed by the next <codeph>COMPUTE INCREMENTAL STATS</codeph> statement.
|
||||
</p>
|
||||
|
||||
<!-- To do: what release was this added in? -->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
||||
|
||||
<p>
|
||||
You typically use this statement when the statistics for a table or a partition have become stale due to data
|
||||
files being added to or removed from the associated HDFS data directories, whether by manual HDFS operations
|
||||
or <codeph>INSERT</codeph>, <codeph>INSERT OVERWRITE</codeph>, or <codeph>LOAD DATA</codeph> statements, or
|
||||
adding or dropping partitions.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
When a table or partition has no associated statistics, Impala treats it as essentially zero-sized when
|
||||
constructing the execution plan for a query. In particular, the statistics influence the order in which
|
||||
tables are joined in a join query. To ensure proper query planning and good query performance and
|
||||
scalability, make sure to run <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> on
|
||||
the table or partition after removing any stale statistics.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Dropping the statistics is not required for an unpartitioned table or a partitioned table covered by the
|
||||
original type of statistics. A subsequent <codeph>COMPUTE STATS</codeph> statement replaces any existing
|
||||
statistics with new ones, for all partitions, regardless of whether the old ones were outdated. Therefore,
|
||||
this statement was rarely used before the introduction of incremental statistics.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Dropping the statistics is required for a partitioned table containing incremental statistics, to make a
|
||||
subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> statement rescan an existing partition. See
|
||||
<xref href="impala_perf_stats.xml#perf_stats"/> for information about incremental statistics, a new feature
|
||||
available in Impala 2.1.0 and higher.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
|
||||
<p rev="CDH-19187">
|
||||
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
|
||||
typically the <codeph>impala</codeph> user, does not need any
|
||||
particular HDFS permissions to perform this statement.
|
||||
All read and write operations are on the metastore database,
|
||||
not HDFS files and directories.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<p>
|
||||
The following example shows a partitioned table that has associated statistics produced by the
|
||||
<codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and how the situation evolves as statistics are dropped
|
||||
from specific partitions, then the entire table.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Initially, all table and column statistics are filled in.
|
||||
</p>
|
||||
|
||||
<!-- Note: chopped off any excess characters at position 87 and after,
|
||||
to avoid weird wrapping in PDF.
|
||||
Applies to any subsequent examples with output from SHOW ... STATS too. -->
|
||||
|
||||
<codeblock>show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+-----------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+-----------------
| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
| Total       | 17957 | 10     | 2.25MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+-----------------
show column stats item_partitioned;
+------------------+-----------+------------------+--------+----------+--------------
| Column           | Type      | #Distinct Values | #Nulls | Max Size | Avg Size
+------------------+-----------+------------------+--------+----------+--------------
| i_item_sk        | INT       | 19443            | -1     | 4        | 4
| i_item_id        | STRING    | 9025             | -1     | 16       | 16
| i_rec_start_date | TIMESTAMP | 4                | -1     | 16       | 16
| i_rec_end_date   | TIMESTAMP | 3                | -1     | 16       | 16
| i_item_desc      | STRING    | 13330            | -1     | 200      | 100.302803039
| i_current_price  | FLOAT     | 2807             | -1     | 4        | 4
| i_wholesale_cost | FLOAT     | 2105             | -1     | 4        | 4
| i_brand_id       | INT       | 965              | -1     | 4        | 4
| i_brand          | STRING    | 725              | -1     | 22       | 16.1776008605
| i_class_id       | INT       | 16               | -1     | 4        | 4
| i_class          | STRING    | 101              | -1     | 15       | 7.76749992370
| i_category_id    | INT       | 10               | -1     | 4        | 4
| i_manufact_id    | INT       | 1857             | -1     | 4        | 4
| i_manufact       | STRING    | 1028             | -1     | 15       | 11.3295001983
| i_size           | STRING    | 8                | -1     | 11       | 4.33459997177
| i_formulation    | STRING    | 12884            | -1     | 20       | 19.9799995422
| i_color          | STRING    | 92               | -1     | 10       | 5.38089990615
| i_units          | STRING    | 22               | -1     | 7        | 4.18690013885
| i_container      | STRING    | 2                | -1     | 7        | 6.99259996414
| i_manager_id     | INT       | 105              | -1     | 4        | 4
| i_product_name   | STRING    | 19094            | -1     | 25       | 18.0233001708
| i_category       | STRING    | 10               | 0      | -1       | -1
+------------------+-----------+------------------+--------+----------+--------------
</codeblock>

<p>
To remove statistics for particular partitions, use the <codeph>DROP INCREMENTAL STATS</codeph> statement.
After removing statistics for two partitions, the table-level statistics reflect that change in the
<codeph>#Rows</codeph> and <codeph>Incremental stats</codeph> fields. The counts, maximums, and averages of
the column-level statistics are unaffected.
</p>

<note>
(The row count might be preserved after a <codeph>DROP INCREMENTAL STATS</codeph>
statement in a future release. Check the resolution of the issue
<xref href="https://issues.cloudera.org/browse/IMPALA-1615" scope="external" format="html">IMPALA-1615</xref>.)
</note>

<codeblock>drop incremental stats item_partitioned partition (i_category='Sports');
drop incremental stats item_partitioned partition (i_category='Electronics');

show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+-----------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+-----------------
| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
| Electronics | -1    | 1      | 232.67KB | NOT CACHED   | PARQUET | false
| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
| Total       | 17957 | 10     | 2.25MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+-----------------
show column stats item_partitioned;
+------------------+-----------+------------------+--------+----------+--------------
| Column           | Type      | #Distinct Values | #Nulls | Max Size | Avg Size
+------------------+-----------+------------------+--------+----------+--------------
| i_item_sk        | INT       | 19443            | -1     | 4        | 4
| i_item_id        | STRING    | 9025             | -1     | 16       | 16
| i_rec_start_date | TIMESTAMP | 4                | -1     | 16       | 16
| i_rec_end_date   | TIMESTAMP | 3                | -1     | 16       | 16
| i_item_desc      | STRING    | 13330            | -1     | 200      | 100.302803039
| i_current_price  | FLOAT     | 2807             | -1     | 4        | 4
| i_wholesale_cost | FLOAT     | 2105             | -1     | 4        | 4
| i_brand_id       | INT       | 965              | -1     | 4        | 4
| i_brand          | STRING    | 725              | -1     | 22       | 16.1776008605
| i_class_id       | INT       | 16               | -1     | 4        | 4
| i_class          | STRING    | 101              | -1     | 15       | 7.76749992370
| i_category_id    | INT       | 10               | -1     | 4        | 4
| i_manufact_id    | INT       | 1857             | -1     | 4        | 4
| i_manufact       | STRING    | 1028             | -1     | 15       | 11.3295001983
| i_size           | STRING    | 8                | -1     | 11       | 4.33459997177
| i_formulation    | STRING    | 12884            | -1     | 20       | 19.9799995422
| i_color          | STRING    | 92               | -1     | 10       | 5.38089990615
| i_units          | STRING    | 22               | -1     | 7        | 4.18690013885
| i_container      | STRING    | 2                | -1     | 7        | 6.99259996414
| i_manager_id     | INT       | 105              | -1     | 4        | 4
| i_product_name   | STRING    | 19094            | -1     | 25       | 18.0233001708
| i_category       | STRING    | 10               | 0      | -1       | -1
+------------------+-----------+------------------+--------+----------+--------------
</codeblock>

<p>
To remove all statistics from the table, whether produced by <codeph>COMPUTE STATS</codeph> or
<codeph>COMPUTE INCREMENTAL STATS</codeph>, use the <codeph>DROP STATS</codeph> statement without the
<codeph>INCREMENTAL</codeph> clause. Now, both table-level and column-level statistics are reset.
</p>

<codeblock>drop stats item_partitioned;

show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books       | -1    | 1      | 223.74KB | NOT CACHED   | PARQUET | false
| Children    | -1    | 1      | 230.05KB | NOT CACHED   | PARQUET | false
| Electronics | -1    | 1      | 232.67KB | NOT CACHED   | PARQUET | false
| Home        | -1    | 1      | 232.56KB | NOT CACHED   | PARQUET | false
| Jewelry     | -1    | 1      | 223.72KB | NOT CACHED   | PARQUET | false
| Men         | -1    | 1      | 231.25KB | NOT CACHED   | PARQUET | false
| Music       | -1    | 1      | 237.90KB | NOT CACHED   | PARQUET | false
| Shoes       | -1    | 1      | 234.90KB | NOT CACHED   | PARQUET | false
| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
| Women       | -1    | 1      | 226.27KB | NOT CACHED   | PARQUET | false
| Total       | -1    | 10     | 2.25MB   | 0B           |         |
+-------------+-------+--------+----------+--------------+---------+------------------
show column stats item_partitioned;
+------------------+-----------+------------------+--------+----------+----------+
| Column           | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+------------------+-----------+------------------+--------+----------+----------+
| i_item_sk        | INT       | -1               | -1     | 4        | 4        |
| i_item_id        | STRING    | -1               | -1     | -1       | -1       |
| i_rec_start_date | TIMESTAMP | -1               | -1     | 16       | 16       |
| i_rec_end_date   | TIMESTAMP | -1               | -1     | 16       | 16       |
| i_item_desc      | STRING    | -1               | -1     | -1       | -1       |
| i_current_price  | FLOAT     | -1               | -1     | 4        | 4        |
| i_wholesale_cost | FLOAT     | -1               | -1     | 4        | 4        |
| i_brand_id       | INT       | -1               | -1     | 4        | 4        |
| i_brand          | STRING    | -1               | -1     | -1       | -1       |
| i_class_id       | INT       | -1               | -1     | 4        | 4        |
| i_class          | STRING    | -1               | -1     | -1       | -1       |
| i_category_id    | INT       | -1               | -1     | 4        | 4        |
| i_manufact_id    | INT       | -1               | -1     | 4        | 4        |
| i_manufact       | STRING    | -1               | -1     | -1       | -1       |
| i_size           | STRING    | -1               | -1     | -1       | -1       |
| i_formulation    | STRING    | -1               | -1     | -1       | -1       |
| i_color          | STRING    | -1               | -1     | -1       | -1       |
| i_units          | STRING    | -1               | -1     | -1       | -1       |
| i_container      | STRING    | -1               | -1     | -1       | -1       |
| i_manager_id     | INT       | -1               | -1     | 4        | 4        |
| i_product_name   | STRING    | -1               | -1     | -1       | -1       |
| i_category       | STRING    | 10               | 0      | -1       | -1       |
+------------------+-----------+------------------+--------+----------+----------+
</codeblock>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_compute_stats.xml#compute_stats"/>, <xref href="impala_show.xml#show_table_stats"/>,
<xref href="impala_show.xml#show_column_stats"/>, <xref href="impala_perf_stats.xml#perf_stats"/>
</p>
</conbody>
</concept>
150
docs/topics/impala_drop_table.xml
Normal file
150
docs/topics/impala_drop_table.xml
Normal file
@@ -0,0 +1,150 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="drop_table">

<title>DROP TABLE Statement</title>
<titlealts audience="PDF"><navtitle>DROP TABLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="S3"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DROP TABLE statement</indexterm>
Removes an Impala table. Also removes the underlying HDFS data files for internal tables, although not for
external tables.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>DROP TABLE [IF EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> <ph rev="2.3.0">[PURGE]</ph></codeblock>

<p>
<b>IF EXISTS clause:</b>
</p>

<p>
The optional <codeph>IF EXISTS</codeph> clause makes the statement succeed whether or not the table exists.
If the table does exist, it is dropped; if it does not exist, the statement has no effect. This capability is
useful in standardized setup scripts that remove existing schema objects and create new ones. By using some
combination of <codeph>IF EXISTS</codeph> for the <codeph>DROP</codeph> statements and <codeph>IF NOT
EXISTS</codeph> clauses for the <codeph>CREATE</codeph> statements, the script can run successfully the first
time you run it (when the objects do not exist yet) and subsequent times (when some or all of the objects do
already exist).
</p>
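
<p>
For example, a setup script built from these clauses (hypothetical table name) can be
rerun safely whether or not the table already exists:
</p>

<codeblock>-- Neither statement fails, whether the table is missing or already present.
drop table if exists staging_events;
create table if not exists staging_events (id int, payload string);</codeblock>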

<p rev="2.3.0">
<b>PURGE clause:</b>
</p>

<p rev="2.3.0"> The optional <codeph>PURGE</codeph> keyword, available in
<keyword keyref="impala23_full"/> and higher, causes Impala to remove the associated
HDFS data files immediately, rather than going through the HDFS trashcan
mechanism. Use this keyword when dropping a table if it is crucial to
remove the data as quickly as possible to free up space, or if there is a
problem with the trashcan, such as the trashcan not being configured or
being in a different HDFS encryption zone than the data files. </p>
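
<p>
For example, the following statement (hypothetical table name) bypasses the HDFS trashcan
and frees the space occupied by the data files immediately:
</p>

<codeblock>drop table if exists bulky_staging_data purge;</codeblock>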

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
By default, Impala removes the associated HDFS directory and data files for the table. If you issue a
<codeph>DROP TABLE</codeph> and the data files are not deleted, it might be for the following reasons:
</p>

<ul>
<li>
If the table was created with the
<codeph><xref href="impala_tables.xml#external_tables">EXTERNAL</xref></codeph> clause, Impala leaves all
files and directories untouched. Use external tables when the data is under the control of other Hadoop
components, and Impala is only used to query the data files from their original locations.
</li>

<li>
Impala might leave the data files behind unintentionally, if there is no HDFS location available to hold
the HDFS trashcan for the <codeph>impala</codeph> user. See
<xref href="impala_prereqs.xml#prereqs_account"/> for the procedure to set up the required HDFS home
directory.
</li>
</ul>

<p>
Make sure that you are in the correct database before dropping a table, either by issuing a
<codeph>USE</codeph> statement first or by using a fully qualified name
<codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>.
</p>

<p>
If you intend to issue a <codeph>DROP DATABASE</codeph> statement, first issue <codeph>DROP TABLE</codeph>
statements to remove all the tables in that database.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>create database temporary;
use temporary;
create table unimportant (x int);
create table trivial (s string);
-- Drop a table in the current database.
drop table unimportant;
-- Switch to a different database.
use default;
-- To drop a table in a different database...
drop table trivial;
<i>ERROR: AnalysisException: Table does not exist: default.trivial</i>
-- ...use a fully qualified name.
drop table temporary.trivial;</codeblock>

<p conref="../shared/impala_common.xml#common/disk_space_blurb"/>

<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.6.0 CDH-39913 IMPALA-1878">
The <codeph>DROP TABLE</codeph> statement can remove data files from S3
if the associated S3 table is an internal table.
In <keyword keyref="impala26_full"/> and higher, as part of improved support for writing
to S3, Impala also removes the associated folder when dropping an internal table
that resides on S3.
See <xref href="impala_s3.xml#s3"/> for details about working with S3 tables.
</p>

<p conref="../shared/impala_common.xml#common/s3_drop_table_purge"/>

<p conref="../shared/impala_common.xml#common/s3_ddl"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
For an internal table, the user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have write
permission for all the files and directories that make up the table.
</p>
<p>
For an external table, dropping the table only involves changes to metadata in the metastore database.
Because Impala does not remove any HDFS files or directories when external tables are dropped,
no particular permissions are needed for the associated HDFS files or directories.
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_tables.xml#tables"/>,
<xref href="impala_alter_table.xml#alter_table"/>, <xref href="impala_create_table.xml#create_table"/>,
<xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
<xref href="impala_tables.xml#external_tables"/>
</p>

</conbody>
</concept>
49
docs/topics/impala_drop_view.xml
Normal file
49
docs/topics/impala_drop_view.xml
Normal file
@@ -0,0 +1,49 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="drop_view">

<title>DROP VIEW Statement</title>
<titlealts audience="PDF"><navtitle>DROP VIEW</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Views"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">DROP VIEW statement</indexterm>
Removes the specified view, which was originally created by the <codeph>CREATE VIEW</codeph> statement.
Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
<codeph>DROP VIEW</codeph> only involves changes to metadata in the metastore database, not any data files in
HDFS.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>DROP VIEW [IF EXISTS] [<varname>db_name</varname>.]<varname>view_name</varname></codeblock>

<p conref="../shared/impala_common.xml#common/ddl_blurb"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p conref="../shared/impala_common.xml#common/create_drop_view_examples"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_views.xml#views"/>, <xref href="impala_create_view.xml#create_view"/>,
<xref href="impala_alter_view.xml#alter_view"/>
</p>
</conbody>
</concept>
1378
docs/topics/impala_errata.xml
Normal file
1378
docs/topics/impala_errata.xml
Normal file
File diff suppressed because it is too large
96
docs/topics/impala_exec_single_node_rows_threshold.xml
Normal file
96
docs/topics/impala_exec_single_node_rows_threshold.xml
Normal file
@@ -0,0 +1,96 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="exec_single_node_rows_threshold">

<title>EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (<keyword keyref="impala21"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>EXEC_SINGLE_NODE_ROWS_THRESHOLD</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.0.0">
<indexterm audience="Cloudera">EXEC_SINGLE_NODE_ROWS_THRESHOLD query option</indexterm>
This setting controls the cutoff point (in terms of number of rows scanned) below which Impala treats a query
as a <q>small</q> query, turning off optimizations such as parallel execution and native code generation. The
overhead of these optimizations is worthwhile for queries involving substantial amounts of data, but it
makes sense to skip them for queries involving tiny amounts of data. Reducing the overhead for small queries
allows Impala to complete them more quickly, keeping YARN resources, admission control slots, and so on
available for data-intensive queries.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=<varname>number_of_rows</varname></codeblock>

<p>
<b>Type:</b> numeric
</p>

<p>
<b>Default:</b> 100
</p>

<p>
<b>Usage notes:</b> Typically, you increase the default value to make this optimization apply to more queries.
If incorrect or corrupted table and column statistics cause Impala to apply this optimization
incorrectly to queries that actually involve substantial work, you might see the queries being slower as a
result of remote reads. In that case, recompute statistics with the <codeph>COMPUTE STATS</codeph>
or <codeph>COMPUTE INCREMENTAL STATS</codeph> statement. If there is a problem collecting accurate
statistics, you can turn this feature off by setting the value to -1.
</p>
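
<p>
For example, you might raise the cutoff so that more queries qualify as <q>small</q>, or
disable the optimization entirely while statistics are unreliable:
</p>

<codeblock>-- Treat any query scanning up to 1000 rows as a small query.
SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=1000;
-- Turn the small-query optimization off.
SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=-1;</codeblock>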

<p conref="../shared/impala_common.xml#common/internals_blurb"/>

<p>
This setting applies to query fragments where the amount of data to scan can be accurately determined, either
through table and column statistics, or by the presence of a <codeph>LIMIT</codeph> clause. If Impala cannot
accurately estimate the size of the input data, this setting does not apply.
</p>

<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, where Impala supports the complex data types <codeph>STRUCT</codeph>,
<codeph>ARRAY</codeph>, and <codeph>MAP</codeph>, if a query refers to any column of those types,
the small-query optimization is turned off for that query regardless of the
<codeph>EXEC_SINGLE_NODE_ROWS_THRESHOLD</codeph> setting.
</p>

<p>
For a query that is determined to be <q>small</q>, all work is performed on the coordinator node. This might
cause some I/O to be performed as remote reads. The savings from not distributing the query work and not
generating native code are expected to outweigh any overhead from the remote reads.
</p>

<p conref="../shared/impala_common.xml#common/added_in_210"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
A common use case is to query just a few rows from a table to inspect typical data values. In this example,
Impala does not parallelize the query or perform native code generation because the result set is guaranteed
to be smaller than the threshold value from this query option:
</p>

<codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;
SELECT * FROM enormous_table LIMIT 300;
</codeblock>

<!-- Don't have any other places that tie into this particular optimization technique yet.
Potentially: conceptual topics about code generation, distributed queries

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
</p>
-->

</conbody>

</concept>
228
docs/topics/impala_explain.xml
Normal file
228
docs/topics/impala_explain.xml
Normal file
@@ -0,0 +1,228 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="explain">

<title>EXPLAIN Statement</title>
<titlealts audience="PDF"><navtitle>EXPLAIN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">EXPLAIN statement</indexterm>
Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the
data, divide the work among nodes in the cluster, and transmit intermediate and final results across the
network. Use <codeph>EXPLAIN</codeph> followed by a complete <codeph>SELECT</codeph> query,
<codeph>INSERT</codeph> statement, or <codeph>CREATE TABLE AS SELECT</codeph> statement.
</p>

<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>EXPLAIN { <varname>select_query</varname> | <varname>ctas_stmt</varname> | <varname>insert_stmt</varname> }
</codeblock>

<p>
The <varname>select_query</varname> is a <codeph>SELECT</codeph> statement, optionally prefixed by a
<codeph>WITH</codeph> clause. See <xref href="impala_select.xml#select"/> for details.
</p>

<p>
The <varname>insert_stmt</varname> is an <codeph>INSERT</codeph> statement that inserts into or overwrites an
existing table. It can use either the <codeph>INSERT ... SELECT</codeph> or <codeph>INSERT ...
VALUES</codeph> syntax. See <xref href="impala_insert.xml#insert"/> for details.
</p>

<p>
The <varname>ctas_stmt</varname> is a <codeph>CREATE TABLE</codeph> statement using the <codeph>AS
SELECT</codeph> clause, typically abbreviated as a <q>CTAS</q> operation. See
<xref href="impala_create_table.xml#create_table"/> for details.
</p>
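
<p>
For example, each of the following (hypothetical table names) is a valid argument to
<codeph>EXPLAIN</codeph>:
</p>

<codeblock>explain select count(*) from web_logs;
explain insert into monthly_summary select month, sum(revenue) from sales group by month;
explain create table sales_copy as select * from sales;</codeblock>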

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
You can interpret the output to judge whether the query is performing efficiently, and adjust the query
and/or the schema if not. For example, you might change the tests in the <codeph>WHERE</codeph> clause, add
hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add
or change partitioning for a table, collect column statistics and/or table statistics in Hive, or take any
other performance tuning steps.
</p>

<p>
The <codeph>EXPLAIN</codeph> output reminds you if table or column statistics are missing from any table
involved in the query. These statistics are important for optimizing queries involving large tables or
multi-table joins. See <xref href="impala_compute_stats.xml#compute_stats"/> for how to gather statistics,
and <xref href="impala_perf_stats.xml#perf_stats"/> for how to use this information for query tuning.
</p>

<p conref="../shared/impala_common.xml#common/explain_interpret"/>

<p>
If you come from a traditional database background and are not familiar with data warehousing, keep in mind
that Impala is optimized for full table scans across very large tables. The structure and distribution of
this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP
environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of
an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for
example by using a query that affects only certain partitions within a partitioned table, then you might be
able to optimize a query so that it executes in seconds rather than minutes.
</p>

<p>
For more information and examples to help you interpret <codeph>EXPLAIN</codeph> output, see
<xref href="impala_explain_plan.xml#perf_explain"/>.
</p>

<p rev="1.2">
<b>Extended EXPLAIN output:</b>
</p>

<p rev="1.2">
For performance tuning of complex queries, and capacity planning (such as using the admission control and
resource management features), you can enable more detailed and informative output for the
<codeph>EXPLAIN</codeph> statement. In the <cmdname>impala-shell</cmdname> interpreter, issue the command
<codeph>SET EXPLAIN_LEVEL=<varname>level</varname></codeph>, where <varname>level</varname> is an integer
from 0 to 3 or corresponding mnemonic values <codeph>minimal</codeph>, <codeph>standard</codeph>,
<codeph>extended</codeph>, or <codeph>verbose</codeph>.
</p>

<p rev="1.2">
When extended <codeph>EXPLAIN</codeph> output is enabled, <codeph>EXPLAIN</codeph> statements print
information about estimated memory requirements, minimum number of virtual cores, and so on.
<!--
that you can use to fine-tune the resource management options explained in <xref href="impala_resource_management.xml#rm_options"/>.
(The estimated memory requirements are intentionally on the high side, to allow a margin for error,
to avoid cancelling a query unnecessarily if you set the <codeph>MEM_LIMIT</codeph> option to the estimated memory figure.)
-->
</p>

<p>
See <xref href="impala_explain_level.xml#explain_level"/> for details and examples.
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
This example shows how the standard <codeph>EXPLAIN</codeph> output moves from the lowest (physical) level to
the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an
aggregation operation (evaluating <codeph>COUNT(*)</codeph>) on some subset of data that is local to that
node; the intermediate results are transmitted back to the coordinator node (labelled here as the
<codeph>EXCHANGE</codeph> node); lastly, the intermediate results are summed to display the final result.
</p>

<codeblock id="explain_plan_simple">[impalad-host:21000] > explain select count(*) from customer_address;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [MERGE FINALIZE]                            |
| |  output: sum(count(*))                                 |
| |                                                        |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(*)                                      |
| |                                                        |
| 00:SCAN HDFS [default.customer_address]                  |
|    partitions=1/1 size=5.25MB                            |
+----------------------------------------------------------+
</codeblock>

<p>
These examples show how the extended <codeph>EXPLAIN</codeph> output becomes more accurate and informative as
statistics are gathered by the <codeph>COMPUTE STATS</codeph> statement. Initially, much of the information
about data size and distribution is marked <q>unavailable</q>. Impala can determine the raw data size, but
not the number of rows or number of distinct values for each column without additional analysis. The
<codeph>COMPUTE STATS</codeph> statement performs this analysis, so a subsequent <codeph>EXPLAIN</codeph>
statement has additional information to use in deciding how to optimize the distributed query.
</p>

<!-- To do:
Re-run these examples with more substantial tables populated with data.
-->

<codeblock rev="1.2">[localhost:21000] > set explain_level=extended;
EXPLAIN_LEVEL set to extended
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 |
|                                                          |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |  hosts=1 per-host-mem=unavailable                      |
<b>| |  tuple-ids=0 row-size=4B cardinality=unavailable       |</b>
| |                                                        |
| 00:SCAN HDFS [default.t1, PARTITION=RANDOM]              |
|    partitions=1/1 size=36B                               |
<b>|    table stats: unavailable                              |</b>
<b>|    column stats: unavailable                             |</b>
|    hosts=1 per-host-mem=32.00MB                          |
<b>|    tuple-ids=0 row-size=4B cardinality=unavailable       |</b>
+----------------------------------------------------------+
</codeblock>

<codeblock rev="1.2">[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 |
|                                                          |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |  hosts=1 per-host-mem=unavailable                      |
| |  tuple-ids=0 row-size=4B cardinality=0                 |
| |                                                        |
| 00:SCAN HDFS [default.t1, PARTITION=RANDOM]              |
|    partitions=1/1 size=36B                               |
<b>|    table stats: 0 rows total                             |</b>
<b>|    column stats: all                                     |</b>
|    hosts=1 per-host-mem=64.00MB                          |
<b>|    tuple-ids=0 row-size=4B cardinality=0                 |</b>
+----------------------------------------------------------+
</codeblock>

<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>

<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
<!-- Doublecheck these details. Does EXPLAIN really need any permissions? -->
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have read
and execute permissions for all applicable directories in all source tables
for the query that is being explained.
(A <codeph>SELECT</codeph> operation could read files from multiple different HDFS directories
if the source table is partitioned.)
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_select.xml#select"/>,
<xref href="impala_insert.xml#insert"/>,
<xref href="impala_create_table.xml#create_table"/>,
<xref href="impala_explain_plan.xml#explain_plan"/>
</p>

</conbody>
</concept>
350
docs/topics/impala_explain_level.xml
Normal file
350
docs/topics/impala_explain_level.xml
Normal file
@@ -0,0 +1,350 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="explain_level">

<title>EXPLAIN_LEVEL Query Option</title>
<titlealts audience="PDF"><navtitle>EXPLAIN_LEVEL</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="Cloudera">EXPLAIN_LEVEL query option</indexterm>
Controls the amount of detail provided in the output of the <codeph>EXPLAIN</codeph> statement. The basic
output can help you identify high-level performance issues such as scanning a higher volume of data or more
partitions than you expect. The higher levels of detail show how intermediate results flow between nodes and
how different SQL operations such as <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, joins, and
<codeph>WHERE</codeph> clauses are implemented within a distributed query.
</p>

<p>
<b>Type:</b> <codeph>STRING</codeph> or <codeph>INT</codeph>
</p>

<p>
<b>Default:</b> <codeph>1</codeph>
</p>

<p>
<b>Arguments:</b>
</p>

<p>
The allowed range of numeric values for this option is 0 to 3; each level also has a mnemonic
equivalent, as shown in the example after this list:
</p>

<ul>
<li>
<codeph>0</codeph> or <codeph>MINIMAL</codeph>: A barebones list, one line per operation. Primarily useful
for checking the join order in very long queries where the regular <codeph>EXPLAIN</codeph> output is too
long to read easily.
</li>

<li>
<codeph>1</codeph> or <codeph>STANDARD</codeph>: The default level of detail, showing the logical way that
work is split up for the distributed query.
</li>

<li>
<codeph>2</codeph> or <codeph>EXTENDED</codeph>: Includes additional detail about how the query planner
uses statistics in its decision-making process, to understand how a query could be tuned by gathering
statistics, using query hints, adding or removing predicates, and so on.
</li>

<li>
<codeph>3</codeph> or <codeph>VERBOSE</codeph>: The maximum level of detail, showing how work is split up
within each node into <q>query fragments</q> that are connected in a pipeline. This extra detail is
primarily useful for low-level performance testing and tuning within Impala itself, rather than for
rewriting the SQL code at the user level.
</li>
</ul>
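
<p>
The numeric and mnemonic forms are interchangeable; for example, the following two
commands request the same level of detail:
</p>

<codeblock>set explain_level=2;
set explain_level=extended;</codeblock>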

<note>
Prior to Impala 1.3, the allowed argument range for <codeph>EXPLAIN_LEVEL</codeph> was 0 to 1: level 0 had
the mnemonic <codeph>NORMAL</codeph>, and level 1 was <codeph>VERBOSE</codeph>. In Impala 1.3 and higher,
<codeph>NORMAL</codeph> is not a valid mnemonic value, and <codeph>VERBOSE</codeph> still applies to the
highest level of detail but now corresponds to level 3. You might need to adjust the values if you have any
older <codeph>impala-shell</codeph> script files that set the <codeph>EXPLAIN_LEVEL</codeph> query option.
</note>

<p>
Changing the value of this option controls the amount of detail in the output of the <codeph>EXPLAIN</codeph>
statement. The extended information from level 2 or 3 is especially useful during performance tuning, when
you need to confirm whether the work for the query is distributed the way you expect, particularly for the
most resource-intensive operations such as join queries against large tables, queries against tables with
large numbers of partitions, and insert operations for Parquet tables. The extended information also helps to
check estimated resource usage when you use the admission control or resource management features explained
in <xref href="impala_resource_management.xml#resource_management"/>. See
<xref href="impala_explain.xml#explain"/> for the syntax of the <codeph>EXPLAIN</codeph> statement, and
<xref href="impala_explain_plan.xml#perf_explain"/> for details about how to use the extended information.
</p>

<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<p>
As always, read the <codeph>EXPLAIN</codeph> output from bottom to top. The lowest lines represent the
initial work of the query (scanning data files), the lines in the middle represent calculations done on each
node and how intermediate results are transmitted from one node to another, and the topmost lines represent
the final results being sent back to the coordinator node.
</p>

<p>
The numbers in the left column are generated internally during the initial planning phase and do not
represent the actual order of operations, so it is not significant if they appear out of order in the
<codeph>EXPLAIN</codeph> output.
</p>

<p>
At all <codeph>EXPLAIN</codeph> levels, the plan contains a warning if any tables in the query are missing
statistics. Use the <codeph>COMPUTE STATS</codeph> statement to gather statistics for each table and suppress
this warning. See <xref href="impala_perf_stats.xml#perf_stats"/> for details about how the statistics help
query performance.
</p>

<p>
The <codeph>PROFILE</codeph> command in <cmdname>impala-shell</cmdname> always starts with an explain plan
showing full detail, the same as with <codeph>EXPLAIN_LEVEL=3</codeph>. <ph rev="1.4.0">After the explain
plan comes the executive summary, the same output as produced by the <codeph>SUMMARY</codeph> command in
<cmdname>impala-shell</cmdname>.</ph>
</p>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
These examples use a trivial, empty table to illustrate how the essential aspects of query planning are shown
in <codeph>EXPLAIN</codeph> output:
</p>

<codeblock>[localhost:21000] > create table t1 (x int, s string);
[localhost:21000] > set explain_level=1;
[localhost:21000] > explain select count(*) from t1;
+------------------------------------------------------------------------+
| Explain String                                                         |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1               |
| WARNING: The following tables are missing relevant table and/or column |
| statistics.                                                            |
| explain_plan.t1                                                        |
|                                                                        |
| 03:AGGREGATE [MERGE FINALIZE]                                          |
| |  output: sum(count(*))                                               |
| |                                                                      |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
| |                                                                      |
| 01:AGGREGATE                                                           |
| |  output: count(*)                                                    |
| |                                                                      |
| 00:SCAN HDFS [explain_plan.t1]                                         |
|    partitions=1/1 size=0B                                              |
+------------------------------------------------------------------------+
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------+
| Explain String                                                         |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
| WARNING: The following tables are missing relevant table and/or column |
| statistics.                                                            |
| explain_plan.t1                                                        |
|                                                                        |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
| |                                                                      |
| 00:SCAN HDFS [explain_plan.t1]                                         |
|    partitions=1/1 size=0B                                              |
+------------------------------------------------------------------------+
[localhost:21000] > set explain_level=2;
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------+
| Explain String                                                         |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
| WARNING: The following tables are missing relevant table and/or column |
| statistics.                                                            |
| explain_plan.t1                                                        |
|                                                                        |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
| |  hosts=0 per-host-mem=unavailable                                    |
| |  tuple-ids=0 row-size=19B cardinality=unavailable                    |
| |                                                                      |
| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                       |
|    partitions=1/1 size=0B                                              |
|    table stats: unavailable                                            |
|    column stats: unavailable                                           |
|    hosts=0 per-host-mem=0B                                             |
|    tuple-ids=0 row-size=19B cardinality=unavailable                    |
+------------------------------------------------------------------------+
[localhost:21000] > set explain_level=3;
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------+
| Explain String                                                         |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
<b>| WARNING: The following tables are missing relevant table and/or column |</b>
<b>| statistics.                                                            |</b>
<b>| explain_plan.t1                                                        |</b>
|                                                                        |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                            |
|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                |
|      hosts=0 per-host-mem=unavailable                                  |
|      tuple-ids=0 row-size=19B cardinality=unavailable                  |
|                                                                        |
| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                   |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                     |
|      partitions=1/1 size=0B                                            |
<b>|      table stats: unavailable                                          |</b>
<b>|      column stats: unavailable                                         |</b>
|      hosts=0 per-host-mem=0B                                           |
|      tuple-ids=0 row-size=19B cardinality=unavailable                  |
+------------------------------------------------------------------------+
</codeblock>

<p>
As the warning message demonstrates, most of the information needed for Impala to do efficient query
planning, and for you to understand the performance characteristics of the query, requires running the
<codeph>COMPUTE STATS</codeph> statement for the table:
</p>

<codeblock>[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------+
| Explain String                                                         |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
|                                                                        |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                            |
|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                |
|      hosts=0 per-host-mem=unavailable                                  |
|      tuple-ids=0 row-size=20B cardinality=0                            |
|                                                                        |
| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                   |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                     |
|      partitions=1/1 size=0B                                            |
<b>|      table stats: 0 rows total                                         |</b>
<b>|      column stats: all                                                 |</b>
|      hosts=0 per-host-mem=0B                                           |
|      tuple-ids=0 row-size=20B cardinality=0                            |
+------------------------------------------------------------------------+
</codeblock>

<p>
Joins and other complicated, multi-part queries are the ones where you most commonly need to examine the
<codeph>EXPLAIN</codeph> output and customize the amount of detail in the output. This example shows the
default <codeph>EXPLAIN</codeph> output for a three-way join query, then the equivalent output with a
<codeph>[SHUFFLE]</codeph> hint to change the join mechanism between the first two tables from a broadcast
join to a shuffle join.
</p>

<codeblock>[localhost:21000] > set explain_level=1;
[localhost:21000] > explain select one.*, two.*, three.* from t1 one, t1 two, t1 three where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
|                                                         |
| 07:EXCHANGE [PARTITION=UNPARTITIONED]                   |
| |                                                       |
<b>| 04:HASH JOIN [INNER JOIN, BROADCAST]                    |</b>
| |  hash predicates: two.x = three.x                     |
| |                                                       |
<b>| |--06:EXCHANGE [BROADCAST]                              |</b>
| |  |                                                    |
| |  02:SCAN HDFS [explain_plan.t1 three]                 |
| |     partitions=1/1 size=0B                            |
| |                                                       |
<b>| 03:HASH JOIN [INNER JOIN, BROADCAST]                    |</b>
| |  hash predicates: one.x = two.x                       |
| |                                                       |
<b>| |--05:EXCHANGE [BROADCAST]                              |</b>
| |  |                                                    |
| |  01:SCAN HDFS [explain_plan.t1 two]                   |
| |     partitions=1/1 size=0B                            |
| |                                                       |
| 00:SCAN HDFS [explain_plan.t1 one]                      |
|    partitions=1/1 size=0B                               |
+---------------------------------------------------------+
[localhost:21000] > explain select one.*, two.*, three.*
                  > from t1 one join [shuffle] t1 two join t1 three
                  > where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
|                                                         |
| 08:EXCHANGE [PARTITION=UNPARTITIONED]                   |
| |                                                       |
<b>| 04:HASH JOIN [INNER JOIN, BROADCAST]                    |</b>
| |  hash predicates: two.x = three.x                     |
| |                                                       |
<b>| |--07:EXCHANGE [BROADCAST]                              |</b>
| |  |                                                    |
| |  02:SCAN HDFS [explain_plan.t1 three]                 |
| |     partitions=1/1 size=0B                            |
| |                                                       |
<b>| 03:HASH JOIN [INNER JOIN, PARTITIONED]                  |</b>
| |  hash predicates: one.x = two.x                       |
| |                                                       |
<b>| |--06:EXCHANGE [PARTITION=HASH(two.x)]                  |</b>
| |  |                                                    |
| |  01:SCAN HDFS [explain_plan.t1 two]                   |
| |     partitions=1/1 size=0B                            |
| |                                                       |
<b>| 05:EXCHANGE [PARTITION=HASH(one.x)]                     |</b>
| |                                                       |
| 00:SCAN HDFS [explain_plan.t1 one]                      |
|    partitions=1/1 size=0B                               |
+---------------------------------------------------------+
</codeblock>

<p>
For a join involving many different tables, the default <codeph>EXPLAIN</codeph> output might stretch over
several pages, and the only details you care about might be the join order and the mechanism (broadcast or
shuffle) for joining each pair of tables. In that case, you might set <codeph>EXPLAIN_LEVEL</codeph> to its
lowest value of 0, to focus on just the join order and join mechanism for each stage. The following example
shows how the rows from the first and second joined tables are hashed and divided among the nodes of the
cluster for further filtering; then the entire contents of the third table are broadcast to all nodes for the
final stage of join processing.
</p>

<codeblock>[localhost:21000] > set explain_level=0;
[localhost:21000] > explain select one.*, two.*, three.*
                  > from t1 one join [shuffle] t1 two join t1 three
                  > where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
|                                                         |
| 08:EXCHANGE [PARTITION=UNPARTITIONED]                   |
<b>| 04:HASH JOIN [INNER JOIN, BROADCAST]                    |</b>
<b>| |--07:EXCHANGE [BROADCAST]                              |</b>
| |  02:SCAN HDFS [explain_plan.t1 three]                 |
<b>| 03:HASH JOIN [INNER JOIN, PARTITIONED]                  |</b>
<b>| |--06:EXCHANGE [PARTITION=HASH(two.x)]                  |</b>
| |  01:SCAN HDFS [explain_plan.t1 two]                   |
<b>| 05:EXCHANGE [PARTITION=HASH(one.x)]                     |</b>
| 00:SCAN HDFS [explain_plan.t1 one]                      |
+---------------------------------------------------------+
</codeblock>

<!-- Consider adding a related info section to collect the xrefs earlier on this page. -->

</conbody>
</concept>
568
docs/topics/impala_explain_plan.xml
Normal file
568
docs/topics/impala_explain_plan.xml
Normal file
@@ -0,0 +1,568 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="explain_plan">

<title>Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</title>
<titlealts audience="PDF"><navtitle>EXPLAIN Plans and Query Profiles</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
To understand the high-level performance considerations for Impala queries, read the output of the
<codeph>EXPLAIN</codeph> statement for the query. You can get the <codeph>EXPLAIN</codeph> plan without
actually running the query itself.
</p>

<p rev="1.4.0">
For an overview of the physical performance characteristics for a query, issue the <codeph>SUMMARY</codeph>
command in <cmdname>impala-shell</cmdname> immediately after executing a query. This condensed information
shows which phases of execution took the most time, and how the estimates for memory usage and number of rows
at each phase compare to the actual values.
</p>

<p>
To understand the detailed performance characteristics for a query, issue the <codeph>PROFILE</codeph>
command in <cmdname>impala-shell</cmdname> immediately after executing a query. This low-level information
includes physical details about memory, CPU, I/O, and network usage, and thus is only available after the
query is actually run.
</p>
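
<p>
For example, a typical <cmdname>impala-shell</cmdname> session (hypothetical table name)
runs the query first, then requests the condensed and the detailed report:
</p>

<codeblock>[localhost:21000] > select count(*) from web_logs;
[localhost:21000] > summary;
[localhost:21000] > profile;</codeblock>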
|
||||
|
||||
<p outputclass="toc inpage"/>
|
||||
|
||||
<p>
|
||||
Also, see <xref href="impala_hbase.xml#hbase_performance"/>
|
||||
and <xref href="impala_s3.xml#s3_performance"/>
|
||||
for examples of interpreting
|
||||
<codeph>EXPLAIN</codeph> plans for queries against HBase tables
|
||||
<ph rev="2.2.0">and data stored in the Amazon Simple Storage System (S3)</ph>.
|
||||
</p>
|
||||
</conbody>
|
||||
|
||||
<concept id="perf_explain">

<title>Using the EXPLAIN Plan for Performance Tuning</title>

<conbody>

<p>
The <codeph><xref href="impala_explain.xml#explain">EXPLAIN</xref></codeph> statement gives you an outline
of the logical steps that a query will perform, such as how the work will be distributed among the nodes
and how intermediate results will be combined to produce the final result set. You can see these details
before actually running the query, and use this information to check that the query will not operate in
an unexpectedly inefficient way.
</p>

<!-- Turn into a conref in ciiu_langref too. Relocate to common.xml. -->

<codeblock conref="impala_explain.xml#explain/explain_plan_simple"/>

<p conref="../shared/impala_common.xml#common/explain_interpret"/>

<p>
The <codeph>EXPLAIN</codeph> plan is also printed at the beginning of the query profile report described in
<xref href="#perf_profile"/>, for convenience in examining both the logical and physical aspects of the
query side-by-side.
</p>

<p rev="1.2">
The amount of detail displayed in the <codeph>EXPLAIN</codeph> output is controlled by the
<xref href="impala_explain_level.xml#explain_level">EXPLAIN_LEVEL</xref> query option. You typically
increase this setting from <codeph>normal</codeph> to <codeph>verbose</codeph> (or from <codeph>0</codeph>
to <codeph>1</codeph>) when double-checking the presence of table and column statistics during performance
tuning, or when estimating query resource usage in conjunction with the resource management features in
CDH 5.
</p>

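<p>
As a minimal sketch (the table name <codeph>sales</codeph> is made up for illustration), you might
toggle the option and compare the two plans:
</p>

<codeblock>[localhost:21000] > set explain_level=verbose;
[localhost:21000] > explain select count(*) from sales;   -- more detailed plan output
...
[localhost:21000] > set explain_level=normal;
[localhost:21000] > explain select count(*) from sales;   -- condensed plan output
...</codeblock>
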
<!-- To do:
This is a good place to have a few examples.
-->
</conbody>
</concept>

<concept id="perf_summary">

<title>Using the SUMMARY Report for Performance Tuning</title>

<conbody>

<p>
The <codeph><xref href="impala_shell_commands.xml#shell_commands">SUMMARY</xref></codeph> command within
the <cmdname>impala-shell</cmdname> interpreter gives you an easy-to-digest overview of the timings for the
different phases of execution for a query. Like the <codeph>EXPLAIN</codeph> plan, it lets you spot
potential performance bottlenecks quickly. Like the <codeph>PROFILE</codeph> output, it is available after
the query is run and so displays actual timing numbers.
</p>

<p>
The <codeph>SUMMARY</codeph> report is also printed at the beginning of the query profile report described
in <xref href="#perf_profile"/>, for convenience in examining high-level and low-level aspects of the query
side-by-side.
</p>

<p>
For example, here is a query involving an aggregate function, on a single-node VM. The different stages of
the query and their timings are shown (rolled up for all nodes), along with estimated and actual values
used in planning the query. In this case, the <codeph>AVG()</codeph> function is computed for a subset of
data on each node (stage 01) and then the aggregated results from all nodes are combined at the end (stage
03). You can see which stages took the most time, and whether any estimates were substantially different
from the actual values. (When examining the time values, be sure to consider the suffixes such as
<codeph>us</codeph> for microseconds and <codeph>ms</codeph> for milliseconds, rather than just looking
for the largest numbers.)
</p>

<codeblock>[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327   |
+---------------------+
[localhost:21000] > summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
</codeblock>

<p>
Notice how the longest initial phase of the query is measured in seconds (s), while later phases working on
smaller intermediate results are measured in milliseconds (ms) or even nanoseconds (ns).
</p>

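<p>
The <codeph>-1</codeph> under <codeph>Est. #Rows</codeph> for the scan stage indicates that the planner had
no statistics for this table. A likely follow-up, sketched here for the same <codeph>store_sales</codeph>
table, is to gather statistics and then confirm that the estimates line up with the actual values:
</p>

<codeblock>[localhost:21000] > compute stats store_sales;
...
[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
...
[localhost:21000] > summary;   -- Est. #Rows for the scan stage should now reflect the table statistics
...</codeblock>
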
<p>
Here is an example from a more complicated query, as it would appear in the <codeph>PROFILE</codeph>
output:
</p>

<!-- This example taken from: https://github.com/cloudera/Impala/commit/af85d3b518089b8840ddea4356947e40d1aca9bd -->

<codeblock>Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE        1   79.738us   79.738us        5           5          0        -1.00 B  UNPARTITIONED
05:TOP-N                   3   84.693us   88.810us        5           5   12.00 KB       120.00 B
04:AGGREGATE               3    5.263ms    6.432ms        5           5   44.00 KB       10.00 MB  MERGE FINALIZE
08:AGGREGATE               3   16.659ms   27.444ms   52.52K     600.12K    3.20 MB       15.11 MB  MERGE
07:EXCHANGE                3    2.644ms      5.1ms   52.52K     600.12K          0              0  HASH(o_orderpriority)
03:AGGREGATE               3  342.913ms  966.291ms   52.52K     600.12K   10.80 MB       15.11 MB
02:HASH JOIN               3    2s165ms    2s171ms  144.87K     600.12K   13.63 MB      941.01 KB  INNER JOIN, BROADCAST
|--06:EXCHANGE             3    8.296ms    8.692ms   57.22K      15.00K          0              0  BROADCAST
|  01:SCAN HDFS            2    1s412ms    1s978ms   57.22K      15.00K   24.21 MB      176.00 MB  tpch.orders o
00:SCAN HDFS               3    8s032ms    8s558ms    3.79M     600.12K   32.29 MB      264.00 MB  tpch.lineitem l
</codeblock>
</conbody>
</concept>

<concept id="perf_profile">

<title>Using the Query Profile for Performance Tuning</title>

<conbody>

<p>
The <codeph>PROFILE</codeph> command, available in the <cmdname>impala-shell</cmdname> interpreter,
produces a detailed low-level report showing how the most recent query was executed. Unlike the
<codeph>EXPLAIN</codeph> plan described in <xref href="#perf_explain"/>, this information is only available
after the query has finished. It shows physical details such as the number of bytes read, maximum memory
usage, and so on for each node. You can use this information to determine whether the query is I/O-bound or
CPU-bound, whether some network condition is imposing a bottleneck, whether a slowdown is affecting some
nodes but not others, and to check that recommended configuration settings such as short-circuit local
reads are in effect.
</p>

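<p>
A few counters from the per-node sections of the report answer the most common questions. The following
lines are excerpted from the sample profile later in this topic; the interpretations in the comments are
rules of thumb rather than hard thresholds:
</p>

<codeblock>- TotalCpuTime: 403.351ms            -- large relative to the wait times below: CPU-bound
- TotalNetworkWaitTime: 34.999ms     -- a large value here suggests a network bottleneck
- TotalStorageWaitTime: 108.675ms    -- a large value here suggests an I/O bottleneck
- BytesReadShortCircuit: 960.00 KB   -- equals BytesRead when short-circuit local reads are in effect</codeblock>
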
<p rev="CDH-29157">
By default, time values in the profile output reflect the wall-clock time taken by an operation.
For values denoting system time or user time, the measurement unit is reflected in the metric
name, such as <codeph>ScannerThreadsSysTime</codeph> or <codeph>ScannerThreadsUserTime</codeph>.
For example, a multi-threaded I/O operation might show a small figure for wall-clock time,
while the corresponding system time is larger, representing the sum of the CPU time taken by each thread.
Or a wall-clock time figure might be larger because it counts time spent waiting, while
the corresponding system and user time figures only measure the time while the operation
is actively using CPU cycles.
</p>

<p>
The <xref href="impala_explain_plan.xml#perf_explain"><codeph>EXPLAIN</codeph> plan</xref> is also printed
at the beginning of the query profile report, for convenience in examining both the logical and physical
aspects of the query side-by-side. The
<xref href="impala_explain_level.xml#explain_level">EXPLAIN_LEVEL</xref> query option also controls the
verbosity of the <codeph>EXPLAIN</codeph> output printed by the <codeph>PROFILE</codeph> command.
</p>

<!-- To do:
This is a good place to have a few more examples.
-->

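<p>
For example, to capture a more detailed plan inside the profile report, you might set the option before
running the query. This is a minimal sketch; the table name <codeph>sales</codeph> is made up for
illustration:
</p>

<codeblock>[localhost:21000] > set explain_level=verbose;
[localhost:21000] > select count(*) from sales;
...
[localhost:21000] > profile;   -- the embedded EXPLAIN plan now shows verbose detail
...</codeblock>
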
<p>
Here is an example of a query profile, from a relatively straightforward query on a single-node
pseudo-distributed cluster, to keep the output brief.
</p>

<codeblock>[localhost:21000] > profile;
Query Runtime Profile:
Query (id=6540a03d4bee0691:4963d6269b210ebd):
  Summary:
    Session ID: ea4a197f1c7bf858:c74e66f72e3a33ba
    Session Type: BEESWAX
    Start Time: 2013-12-02 17:10:30.263067000
    End Time: 2013-12-02 17:10:50.932044000
    Query Type: QUERY
    Query State: FINISHED
    Query Status: OK
    Impala Version: impalad version 1.2.1 RELEASE (build edb5af1bcad63d410bc5d47cc203df3a880e9324)
    User: cloudera
    Network Address: 127.0.0.1:49161
    Default Db: stats_testing
    Sql Statement: select t1.s, t2.s from t1 join t2 on (t1.id = t2.parent)
    Plan:
----------------
Estimated Per-Host Requirements: Memory=2.09GB VCores=2

PLAN FRAGMENT 0
  PARTITION: UNPARTITIONED

  4:EXCHANGE
     cardinality: unavailable
     per-host memory: unavailable
     tuple ids: 0 1

PLAN FRAGMENT 1
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 4
    UNPARTITIONED

  2:HASH JOIN
  |  join op: INNER JOIN (BROADCAST)
  |  hash predicates:
  |    t1.id = t2.parent
  |  cardinality: unavailable
  |  per-host memory: 2.00GB
  |  tuple ids: 0 1
  |
  |----3:EXCHANGE
  |       cardinality: unavailable
  |       per-host memory: 0B
  |       tuple ids: 1
  |
  0:SCAN HDFS
     table=stats_testing.t1 #partitions=1/1 size=33B
     table stats: unavailable
     column stats: unavailable
     cardinality: unavailable
     per-host memory: 32.00MB
     tuple ids: 0

PLAN FRAGMENT 2
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 3
    UNPARTITIONED

  1:SCAN HDFS
     table=stats_testing.t2 #partitions=1/1 size=960.00KB
     table stats: unavailable
     column stats: unavailable
     cardinality: unavailable
     per-host memory: 96.00MB
     tuple ids: 1
----------------
    Query Timeline: 20s670ms
       - Start execution: 2.559ms (2.559ms)
       - Planning finished: 23.587ms (21.27ms)
       - Rows available: 666.199ms (642.612ms)
       - First row fetched: 668.919ms (2.719ms)
       - Unregister query: 20s668ms (20s000ms)
  ImpalaServer:
     - ClientFetchWaitTimer: 19s637ms
     - RowMaterializationTimer: 167.121ms
  Execution Profile 6540a03d4bee0691:4963d6269b210ebd:(Active: 837.815ms, % non-child: 0.00%)
    Per Node Peak Memory Usage: impala-1.example.com:22000(7.42 MB)
     - FinalizationTimer: 0ns
    Coordinator Fragment:(Active: 195.198ms, % non-child: 0.00%)
      MemoryUsage(500.0ms): 16.00 KB, 7.42 MB, 7.33 MB, 7.10 MB, 6.94 MB, 6.71 MB, 6.56 MB, 6.40 MB, 6.17 MB, 6.02 MB, 5.79 MB, 5.63 MB, 5.48 MB, 5.25 MB, 5.09 MB, 4.86 MB, 4.71 MB, 4.47 MB, 4.32 MB, 4.09 MB, 3.93 MB, 3.78 MB, 3.55 MB, 3.39 MB, 3.16 MB, 3.01 MB, 2.78 MB, 2.62 MB, 2.39 MB, 2.24 MB, 2.08 MB, 1.85 MB, 1.70 MB, 1.54 MB, 1.31 MB, 1.16 MB, 948.00 KB, 790.00 KB, 553.00 KB, 395.00 KB, 237.00 KB
      ThreadUsage(500.0ms): 1
       - AverageThreadTokens: 1.00
       - PeakMemoryUsage: 7.42 MB
       - PrepareTime: 36.144us
       - RowsProduced: 98.30K (98304)
       - TotalCpuTime: 20s449ms
       - TotalNetworkWaitTime: 191.630ms
       - TotalStorageWaitTime: 0ns
      CodeGen:(Active: 150.679ms, % non-child: 77.19%)
         - CodegenTime: 0ns
         - CompileTime: 139.503ms
         - LoadTime: 10.7ms
         - ModuleFileSize: 95.27 KB
      EXCHANGE_NODE (id=4):(Active: 194.858ms, % non-child: 99.83%)
         - BytesReceived: 2.33 MB
         - ConvertRowBatchTime: 2.732ms
         - DataArrivalWaitTime: 191.118ms
         - DeserializeRowBatchTimer: 14.943ms
         - FirstBatchArrivalWaitTime: 191.117ms
         - PeakMemoryUsage: 7.41 MB
         - RowsReturned: 98.30K (98304)
         - RowsReturnedRate: 504.49 K/sec
         - SendersBlockedTimer: 0ns
         - SendersBlockedTotalTimer(*): 0ns
    Averaged Fragment 1:(Active: 442.360ms, % non-child: 0.00%)
      split sizes: min: 33.00 B, max: 33.00 B, avg: 33.00 B, stddev: 0.00
      completion times: min:443.720ms max:443.720ms mean: 443.720ms stddev:0ns
      execution rates: min:74.00 B/sec max:74.00 B/sec mean:74.00 B/sec stddev:0.00 /sec
      num instances: 1
       - AverageThreadTokens: 1.00
       - PeakMemoryUsage: 6.06 MB
       - PrepareTime: 7.291ms
       - RowsProduced: 98.30K (98304)
       - TotalCpuTime: 784.259ms
       - TotalNetworkWaitTime: 388.818ms
       - TotalStorageWaitTime: 3.934ms
      CodeGen:(Active: 312.862ms, % non-child: 70.73%)
         - CodegenTime: 2.669ms
         - CompileTime: 302.467ms
         - LoadTime: 9.231ms
         - ModuleFileSize: 95.27 KB
      DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
         - BytesSent: 2.33 MB
         - NetworkThroughput(*): 35.89 MB/sec
         - OverallThroughput: 29.06 MB/sec
         - PeakMemoryUsage: 5.33 KB
         - SerializeBatchTime: 26.487ms
         - ThriftTransmitTime(*): 64.814ms
         - UncompressedRowBatchSize: 6.66 MB
      HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
         - BuildBuckets: 1.02K (1024)
         - BuildRows: 98.30K (98304)
         - BuildTime: 12.622ms
         - LoadFactor: 0.00
         - PeakMemoryUsage: 6.02 MB
         - ProbeRows: 3
         - ProbeTime: 3.579ms
         - RowsReturned: 98.30K (98304)
         - RowsReturnedRate: 271.54 K/sec
        EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
           - BytesReceived: 1.15 MB
           - ConvertRowBatchTime: 2.792ms
           - DataArrivalWaitTime: 339.936ms
           - DeserializeRowBatchTimer: 9.910ms
           - FirstBatchArrivalWaitTime: 199.474ms
           - PeakMemoryUsage: 156.00 KB
           - RowsReturned: 98.30K (98304)
           - RowsReturnedRate: 285.20 K/sec
           - SendersBlockedTimer: 0ns
           - SendersBlockedTotalTimer(*): 0ns
        HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
           - AverageHdfsReadThreadConcurrency: 0.00
           - AverageScannerThreadConcurrency: 0.00
           - BytesRead: 33.00 B
           - BytesReadLocal: 33.00 B
           - BytesReadShortCircuit: 33.00 B
           - NumDisksAccessed: 1
           - NumScannerThreadsStarted: 1
           - PeakMemoryUsage: 46.00 KB
           - PerReadThreadRawHdfsThroughput: 287.52 KB/sec
           - RowsRead: 3
           - RowsReturned: 3
           - RowsReturnedRate: 220.33 K/sec
           - ScanRangesComplete: 1
           - ScannerThreadsInvoluntaryContextSwitches: 26
           - ScannerThreadsTotalWallClockTime: 55.199ms
             - DelimiterParseTime: 2.463us
             - MaterializeTupleTime(*): 1.226us
             - ScannerThreadsSysTime: 0ns
             - ScannerThreadsUserTime: 42.993ms
           - ScannerThreadsVoluntaryContextSwitches: 1
           - TotalRawHdfsReadTime(*): 112.86us
           - TotalReadThroughput: 0.00 /sec
    Averaged Fragment 2:(Active: 190.120ms, % non-child: 0.00%)
      split sizes: min: 960.00 KB, max: 960.00 KB, avg: 960.00 KB, stddev: 0.00
      completion times: min:191.736ms max:191.736ms mean: 191.736ms stddev:0ns
      execution rates: min:4.89 MB/sec max:4.89 MB/sec mean:4.89 MB/sec stddev:0.00 /sec
      num instances: 1
       - AverageThreadTokens: 0.00
       - PeakMemoryUsage: 906.33 KB
       - PrepareTime: 3.67ms
       - RowsProduced: 98.30K (98304)
       - TotalCpuTime: 403.351ms
       - TotalNetworkWaitTime: 34.999ms
       - TotalStorageWaitTime: 108.675ms
      CodeGen:(Active: 162.57ms, % non-child: 85.24%)
         - CodegenTime: 3.133ms
         - CompileTime: 148.316ms
         - LoadTime: 12.317ms
         - ModuleFileSize: 95.27 KB
      DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
         - BytesSent: 1.15 MB
         - NetworkThroughput(*): 23.30 MB/sec
         - OverallThroughput: 16.23 MB/sec
         - PeakMemoryUsage: 5.33 KB
         - SerializeBatchTime: 22.69ms
         - ThriftTransmitTime(*): 49.178ms
         - UncompressedRowBatchSize: 3.28 MB
      HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
         - AverageHdfsReadThreadConcurrency: 0.00
         - AverageScannerThreadConcurrency: 0.00
         - BytesRead: 960.00 KB
         - BytesReadLocal: 960.00 KB
         - BytesReadShortCircuit: 960.00 KB
         - NumDisksAccessed: 1
         - NumScannerThreadsStarted: 1
         - PeakMemoryUsage: 869.00 KB
         - PerReadThreadRawHdfsThroughput: 130.21 MB/sec
         - RowsRead: 98.30K (98304)
         - RowsReturned: 98.30K (98304)
         - RowsReturnedRate: 827.20 K/sec
         - ScanRangesComplete: 15
         - ScannerThreadsInvoluntaryContextSwitches: 34
         - ScannerThreadsTotalWallClockTime: 189.774ms
           - DelimiterParseTime: 15.703ms
           - MaterializeTupleTime(*): 3.419ms
           - ScannerThreadsSysTime: 1.999ms
           - ScannerThreadsUserTime: 44.993ms
         - ScannerThreadsVoluntaryContextSwitches: 118
         - TotalRawHdfsReadTime(*): 7.199ms
         - TotalReadThroughput: 0.00 /sec
    Fragment 1:
      Instance 6540a03d4bee0691:4963d6269b210ebf (host=impala-1.example.com:22000):(Active: 442.360ms, % non-child: 0.00%)
        Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/33.00 B
        MemoryUsage(500.0ms): 69.33 KB
        ThreadUsage(500.0ms): 1
         - AverageThreadTokens: 1.00
         - PeakMemoryUsage: 6.06 MB
         - PrepareTime: 7.291ms
         - RowsProduced: 98.30K (98304)
         - TotalCpuTime: 784.259ms
         - TotalNetworkWaitTime: 388.818ms
         - TotalStorageWaitTime: 3.934ms
        CodeGen:(Active: 312.862ms, % non-child: 70.73%)
           - CodegenTime: 2.669ms
           - CompileTime: 302.467ms
           - LoadTime: 9.231ms
           - ModuleFileSize: 95.27 KB
        DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
           - BytesSent: 2.33 MB
           - NetworkThroughput(*): 35.89 MB/sec
           - OverallThroughput: 29.06 MB/sec
           - PeakMemoryUsage: 5.33 KB
           - SerializeBatchTime: 26.487ms
           - ThriftTransmitTime(*): 64.814ms
           - UncompressedRowBatchSize: 6.66 MB
        HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
          ExecOption: Build Side Codegen Enabled, Probe Side Codegen Enabled, Hash Table Built Asynchronously
           - BuildBuckets: 1.02K (1024)
           - BuildRows: 98.30K (98304)
           - BuildTime: 12.622ms
           - LoadFactor: 0.00
           - PeakMemoryUsage: 6.02 MB
           - ProbeRows: 3
           - ProbeTime: 3.579ms
           - RowsReturned: 98.30K (98304)
           - RowsReturnedRate: 271.54 K/sec
          EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
             - BytesReceived: 1.15 MB
             - ConvertRowBatchTime: 2.792ms
             - DataArrivalWaitTime: 339.936ms
             - DeserializeRowBatchTimer: 9.910ms
             - FirstBatchArrivalWaitTime: 199.474ms
             - PeakMemoryUsage: 156.00 KB
             - RowsReturned: 98.30K (98304)
             - RowsReturnedRate: 285.20 K/sec
             - SendersBlockedTimer: 0ns
             - SendersBlockedTotalTimer(*): 0ns
          HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
            Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/33.00 B
            Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
            File Formats: TEXT/NONE:1
            ExecOption: Codegen enabled: 1 out of 1
             - AverageHdfsReadThreadConcurrency: 0.00
             - AverageScannerThreadConcurrency: 0.00
             - BytesRead: 33.00 B
             - BytesReadLocal: 33.00 B
             - BytesReadShortCircuit: 33.00 B
             - NumDisksAccessed: 1
             - NumScannerThreadsStarted: 1
             - PeakMemoryUsage: 46.00 KB
             - PerReadThreadRawHdfsThroughput: 287.52 KB/sec
             - RowsRead: 3
             - RowsReturned: 3
             - RowsReturnedRate: 220.33 K/sec
             - ScanRangesComplete: 1
             - ScannerThreadsInvoluntaryContextSwitches: 26
             - ScannerThreadsTotalWallClockTime: 55.199ms
               - DelimiterParseTime: 2.463us
               - MaterializeTupleTime(*): 1.226us
               - ScannerThreadsSysTime: 0ns
               - ScannerThreadsUserTime: 42.993ms
             - ScannerThreadsVoluntaryContextSwitches: 1
             - TotalRawHdfsReadTime(*): 112.86us
             - TotalReadThroughput: 0.00 /sec
    Fragment 2:
      Instance 6540a03d4bee0691:4963d6269b210ec0 (host=impala-1.example.com:22000):(Active: 190.120ms, % non-child: 0.00%)
        Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:15/960.00 KB
         - AverageThreadTokens: 0.00
         - PeakMemoryUsage: 906.33 KB
         - PrepareTime: 3.67ms
         - RowsProduced: 98.30K (98304)
         - TotalCpuTime: 403.351ms
         - TotalNetworkWaitTime: 34.999ms
         - TotalStorageWaitTime: 108.675ms
        CodeGen:(Active: 162.57ms, % non-child: 85.24%)
           - CodegenTime: 3.133ms
           - CompileTime: 148.316ms
           - LoadTime: 12.317ms
           - ModuleFileSize: 95.27 KB
        DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
           - BytesSent: 1.15 MB
           - NetworkThroughput(*): 23.30 MB/sec
           - OverallThroughput: 16.23 MB/sec
           - PeakMemoryUsage: 5.33 KB
           - SerializeBatchTime: 22.69ms
           - ThriftTransmitTime(*): 49.178ms
           - UncompressedRowBatchSize: 3.28 MB
        HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
          Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:15/960.00 KB
          Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
          File Formats: TEXT/NONE:15
          ExecOption: Codegen enabled: 15 out of 15
           - AverageHdfsReadThreadConcurrency: 0.00
           - AverageScannerThreadConcurrency: 0.00
           - BytesRead: 960.00 KB
           - BytesReadLocal: 960.00 KB
           - BytesReadShortCircuit: 960.00 KB
           - NumDisksAccessed: 1
           - NumScannerThreadsStarted: 1
           - PeakMemoryUsage: 869.00 KB
           - PerReadThreadRawHdfsThroughput: 130.21 MB/sec
           - RowsRead: 98.30K (98304)
           - RowsReturned: 98.30K (98304)
           - RowsReturnedRate: 827.20 K/sec
           - ScanRangesComplete: 15
           - ScannerThreadsInvoluntaryContextSwitches: 34
           - ScannerThreadsTotalWallClockTime: 189.774ms
             - DelimiterParseTime: 15.703ms
             - MaterializeTupleTime(*): 3.419ms
             - ScannerThreadsSysTime: 1.999ms
             - ScannerThreadsUserTime: 44.993ms
           - ScannerThreadsVoluntaryContextSwitches: 118
           - TotalRawHdfsReadTime(*): 7.199ms
           - TotalReadThroughput: 0.00 /sec</codeblock>
</conbody>
</concept>
</concept>
1877
docs/topics/impala_faq.xml
Normal file
File diff suppressed because it is too large
24
docs/topics/impala_faq_base.xml
Normal file
@@ -0,0 +1,24 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="obsolete_faq">

<title>Impala Frequently Asked Questions</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="FAQs"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<draft-comment translate="no">
Obsolete. Content all moved into impala_faq.xml.
</draft-comment>
</conbody>
</concept>
21
docs/topics/impala_features.xml
Normal file
@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="features">

<title>Primary Impala Features</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p conref="../shared/impala_common.xml#common/feature_list"/>
</conbody>
</concept>
Some files were not shown because too many files have changed in this diff