IMPALA-3398: Add docs to main Impala branch.

These are refugees from doc_prototype. They can be rendered with the
DITA Open Toolkit version 2.3.3 by:

/tmp/dita-ot-2.3.3/bin/dita \
  -i impala.ditamap \
  -f html5 \
  -o $(mktemp -d) \
  -filter impala_html.ditaval

Change-Id: I8861e99adc446f659a04463ca78c79200669484f
Reviewed-on: http://gerrit.cloudera.org:8080/5014
Reviewed-by: John Russell <jrussell@cloudera.com>
Tested-by: John Russell <jrussell@cloudera.com>
Jim Apple
2016-11-08 16:49:18 -08:00
parent 46f5ad48e3
commit 3be0f122a5
264 changed files with 89657 additions and 0 deletions

@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">
<map audience="standalone">
<title>Cloudera Impala Release Notes</title>
<topicref href="topics/impala_relnotes.xml" audience="HTML standalone"/>
<topicref href="topics/impala_new_features.xml"/>
<topicref href="topics/impala_incompatible_changes.xml"/>
<topicref href="topics/impala_known_issues.xml"/>
<topicref href="topics/impala_fixed_issues.xml"/>
</map>

@@ -0,0 +1,73 @@
# Generating HTML or a PDF of Apache Impala (Incubating) Documentation
## Prerequisites:
Make sure that you have a recent version of the Java JDK installed and that your JAVA\_HOME environment variable is set. This procedure has been tested with JDK 1.8.0. See [Setting JAVA\_HOME](#settingjavahome) at the end of these instructions.
* Open a terminal window and run the following commands to get the Impala documentation source files from Git:
<pre><code>git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git \<local\_directory\>
cd \<local\_directory\>
git checkout doc\_prototype</code></pre>
Where <code>doc\_prototype</code> is the branch that contains the Impala documentation source files and <code>\<local\_directory\></code> is the directory into which you clone the repository.
* Download the DITA Open Toolkit version 2.3.3 from the DITA Open Toolkit web site:
[https://github.com/dita-ot/dita-ot/releases/download/2.3.3/dita-ot-2.3.3.zip](https://github.com/dita-ot/dita-ot/releases/download/2.3.3/dita-ot-2.3.3.zip)
**Note:** A DITA-OT 2.3.3 User Guide is included in the toolkit. Look for <code>userguide.pdf</code> in the <code>doc</code> directory of the toolkit after you extract it. For example, if you extract the toolkit package to the <code>/Users/\<_username_\>/DITA-OT</code> directory on Mac OS, you will find the <code>userguide.pdf</code> at the following location:
<code>/Users/\<_username_\>/DITA-OT/doc/userguide.pdf</code>
## To generate HTML or PDF:
1. In the directory where you cloned the Impala documentation files, you will find the following important configuration files in the <code>docs</code> subdirectory. These files are used to convert the XML source you cloned from the Apache Git repository to PDF and HTML:
* <code>impala.ditamap</code>: Tells the DITA Open Toolkit what topics to include in the Impala User/Administration Guide. This guide also includes the Impala SQL Reference.
* <code>impala\_sqlref.ditamap</code>: Tells the DITA Open Toolkit what topics to include in the Impala SQL Reference.
* <code>impala\_html.ditaval</code>: Further defines what topics to include in the Impala HTML output.
* <code>impala\_pdf.ditaval</code>: Further defines what topics to include in the Impala PDF output.
2. Extract the contents of the DITA-OT package into a directory where you want to generate the HTML or the PDF.
3. Open a terminal window and navigate to the directory where you extracted the DITA-OT package.
4. Run one of the following commands, depending on what you want to generate:
* **To generate HTML output of the Impala User and Administration Guide, which includes the Impala SQL Reference, run the following command:**
<code>./bin/dita -input \<path\_to\_impala.ditamap\> -format html5 -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_html.ditaval\></code>
* **To generate PDF output of the Impala User and Administration Guide, which includes the Impala SQL Reference, run the following command:**
<code>./bin/dita -input \<path\_to\_impala.ditamap\> -format pdf -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_pdf.ditaval\></code>
* **To generate HTML output of the Impala SQL Reference, run the following command:**
<code>./bin/dita -input \<path\_to\_impala\_sqlref.ditamap\> -format html5 -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_html.ditaval\></code>
* **To generate PDF output of the Impala SQL Reference, run the following command:**
<code>./bin/dita -input \<path\_to\_impala\_sqlref.ditamap\> -format pdf -output \<path\_to\_build\_output\_directory\> -filter \<path\_to\_impala\_pdf.ditaval\></code>
**Note:** For a description of all command-line options, see the _DITA Open Toolkit User Guide_ in the <code>doc</code> directory of your downloaded DITA Open Toolkit.
5. Go to the output directory that you specified in Step 4 to view the HTML or PDF that you generated. If you generated HTML, open the <code>index.html</code> file with a browser to view the output. A complete worked example follows these steps.
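
For example, assuming you cloned the documentation repository into <code>~/incubator-impala</code> and extracted the toolkit into <code>~/DITA-OT</code> (both paths are illustrative), the following commands generate the HTML guide into a directory named <code>build\_html</code>:
<pre><code>cd ~/DITA-OT
./bin/dita -input ~/incubator-impala/docs/impala.ditamap \
  -format html5 \
  -output ~/incubator-impala/build_html \
  -filter ~/incubator-impala/docs/impala_html.ditaval</code></pre>
Open <code>~/incubator-impala/build\_html/index.html</code> in a browser to check the result.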
<a name="settingjavahome"></a>
# Setting JAVA\_HOME
Set your JAVA\_HOME environment variable to tell your computer where to find the Java executable. For example, to set JAVA\_HOME on Mac OS X when you have version 1.8.0\_101 of the Java Development Kit (JDK) installed and you are using the Bash version 3.2 shell, perform the following steps:
1. Edit your <code>/Users/\<username\>/.bash\_profile</code> file and add the following lines to the end of the file:
<pre><code>#Set JAVA_HOME
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home
export JAVA_HOME;</code></pre>
Where <code>jdk1.8.0\_101.jdk</code> is the version of JDK that you have installed. For example, if you have installed <code>jdk1.8.0\_102.jdk</code>, you would use that value instead.
2. Test to make sure you have set your JAVA\_HOME correctly:
* Open a terminal window and type: <code>$JAVA\_HOME/bin/java -version</code>
* Press return. If you see output similar to the following, with a version that matches the JDK you installed:
<pre><code>java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)</code></pre>
Then you have successfully set your JAVA\_HOME environment variable to the JDK installed in <code>/Library/Java/JavaVirtualMachines/jdk1.8.0\_101.jdk/Contents/Home</code>.

(Six binary image files added; contents not shown.)

1172
docs/impala.ditamap Normal file

File diff suppressed because it is too large

21
docs/impala_html.ditaval Normal file

@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?><val>
<!-- Exclude Cloudera-only content. This is typically material that's permanently hidden,
e.g. obsolete or abandoned. Use pre-release for material being actively worked on
that's not ready for prime time. -->
<prop att="audience" val="Cloudera" action="exclude"/>
<!-- These two are backward: things marked HTML are excluded from the HTML and
things marked PDF are excluded from the PDF. -->
<prop att="audience" val="HTML" action="exclude"/>
<prop att="audience" val="PDF" action="include"/>
<!-- standalone = upstream Impala docs, not part of any larger library
integrated = any xrefs, topicrefs, or other residue from original downstream docs
that don't resolve properly in the upstream context -->
<prop att="audience" val="integrated" action="exclude"/>
<prop att="audience" val="standalone" action="include"/>
<!-- John added this so he can work on Impala_Next in master without fear that
it will show up too early in released docs -->
<prop att="audience" val="impala_next" action="exclude"/>
<!-- This DITAVAL specifically EXCLUDES things marked pre-release -->
<!-- It is safe to use for generating public artifacts. -->
<prop att="audience" val="pre-release" action="exclude"/>
</val>

21
docs/impala_pdf.ditaval Normal file

@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?><val>
<!-- Exclude Cloudera-only content. This is typically material that's permanently hidden,
e.g. obsolete or abandoned. Use pre-release for material being actively worked on
that's not ready for prime time. -->
<prop att="audience" val="Cloudera" action="exclude"/>
<!-- These two are backward: things marked HTML are excluded from the HTML and
things marked PDF are excluded from the PDF. -->
<prop att="audience" val="PDF" action="exclude"/>
<prop att="audience" val="HTML" action="include"/>
<!-- standalone = upstream Impala docs, not part of any larger library
integrated = any xrefs, topicrefs, or other residue from original downstream docs
that don't resolve properly in the upstream context -->
<prop att="audience" val="integrated" action="exclude"/>
<prop att="audience" val="standalone" action="include"/>
<!-- John added this so he can work on Impala_Next in master without fear that
it will show up too early in released docs -->
<prop att="audience" val="impala_next" action="exclude"/>
<!-- This DITAVAL specifically EXCLUDES things marked pre-release -->
<!-- It is safe to use for generating public artifacts. -->
<prop att="audience" val="pre-release" action="exclude"/>
</val>

146
docs/impala_sqlref.ditamap Normal file

@@ -0,0 +1,146 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">
<map id="impala_sqlref">
<title>Impala SQL Reference</title>
<topicmeta>
<prodinfo conref="shared/ImpalaVariables.xml#impala_vars/prodinfo_for_html">
<prodname/>
<vrmlist>
<vrm version="version_dlq_gry_sm"/>
</vrmlist>
</prodinfo>
</topicmeta>
<topicref href="topics/impala_langref.xml"/>
<topicref href="topics/impala_comments.xml"/>
<topicref href="topics/impala_datatypes.xml">
<topicref href="topics/impala_array.xml"/>
<topicref href="topics/impala_bigint.xml"/>
<topicref href="topics/impala_boolean.xml"/>
<topicref href="topics/impala_char.xml"/>
<topicref href="topics/impala_decimal.xml"/>
<topicref href="topics/impala_double.xml"/>
<topicref href="topics/impala_float.xml"/>
<topicref href="topics/impala_int.xml"/>
<topicref href="topics/impala_map.xml"/>
<topicref href="topics/impala_real.xml"/>
<topicref href="topics/impala_smallint.xml"/>
<topicref href="topics/impala_string.xml"/>
<topicref href="topics/impala_struct.xml"/>
<topicref href="topics/impala_timestamp.xml"/>
<topicref href="topics/impala_tinyint.xml"/>
<topicref href="topics/impala_varchar.xml"/>
<topicref href="topics/impala_complex_types.xml"/>
</topicref>
<topicref href="topics/impala_literals.xml"/>
<topicref href="topics/impala_operators.xml"/>
<topicref href="topics/impala_schema_objects.xml">
<topicref href="topics/impala_aliases.xml"/>
<topicref href="topics/impala_databases.xml"/>
<topicref href="topics/impala_functions_overview.xml"/>
<topicref href="topics/impala_identifiers.xml"/>
<topicref href="topics/impala_tables.xml"/>
<topicref href="topics/impala_views.xml"/>
</topicref>
<topicref href="topics/impala_langref_sql.xml">
<topicref href="topics/impala_ddl.xml"/>
<topicref href="topics/impala_dml.xml"/>
<topicref href="topics/impala_alter_table.xml"/>
<topicref href="topics/impala_alter_view.xml"/>
<topicref href="topics/impala_compute_stats.xml"/>
<topicref href="topics/impala_create_database.xml"/>
<topicref href="topics/impala_create_function.xml"/>
<topicref href="topics/impala_create_role.xml"/>
<topicref href="topics/impala_create_table.xml"/>
<topicref href="topics/impala_create_view.xml"/>
<topicref audience="impala_next" href="topics/impala_delete.xml"/>
<topicref href="topics/impala_describe.xml"/>
<topicref href="topics/impala_drop_database.xml"/>
<topicref href="topics/impala_drop_function.xml"/>
<topicref href="topics/impala_drop_role.xml"/>
<topicref href="topics/impala_drop_stats.xml"/>
<topicref href="topics/impala_drop_table.xml"/>
<topicref href="topics/impala_drop_view.xml"/>
<topicref href="topics/impala_explain.xml"/>
<topicref href="topics/impala_grant.xml"/>
<topicref href="topics/impala_insert.xml"/>
<topicref href="topics/impala_invalidate_metadata.xml"/>
<topicref href="topics/impala_load_data.xml"/>
<topicref href="topics/impala_refresh.xml"/>
<topicref href="topics/impala_revoke.xml"/>
<topicref href="topics/impala_select.xml">
<topicref href="topics/impala_joins.xml"/>
<topicref href="topics/impala_order_by.xml"/>
<topicref href="topics/impala_group_by.xml"/>
<topicref href="topics/impala_having.xml"/>
<topicref href="topics/impala_limit.xml"/>
<topicref href="topics/impala_offset.xml"/>
<topicref href="topics/impala_union.xml"/>
<topicref href="topics/impala_subqueries.xml"/>
<topicref href="topics/impala_with.xml"/>
<topicref href="topics/impala_distinct.xml"/>
<topicref href="topics/impala_hints.xml"/>
</topicref>
<topicref href="topics/impala_set.xml"/>
<topicref href="topics/impala_query_options.xml">
<topicref href="topics/impala_abort_on_default_limit_exceeded.xml"/>
<topicref href="topics/impala_abort_on_error.xml"/>
<topicref href="topics/impala_allow_unsupported_formats.xml"/>
<topicref href="topics/impala_appx_count_distinct.xml"/>
<topicref href="topics/impala_batch_size.xml"/>
<topicref href="topics/impala_compression_codec.xml"/>
<topicref href="topics/impala_debug_action.xml"/>
<topicref href="topics/impala_default_order_by_limit.xml"/>
<topicref href="topics/impala_disable_codegen.xml"/>
<topicref href="topics/impala_disable_unsafe_spills.xml"/>
<topicref href="topics/impala_exec_single_node_rows_threshold.xml"/>
<topicref href="topics/impala_explain_level.xml"/>
<topicref href="topics/impala_hbase_cache_blocks.xml"/>
<topicref href="topics/impala_hbase_caching.xml"/>
<topicref href="topics/impala_live_progress.xml"/>
<topicref href="topics/impala_live_summary.xml"/>
<topicref href="topics/impala_max_errors.xml"/>
<topicref href="topics/impala_max_io_buffers.xml"/>
<topicref href="topics/impala_max_scan_range_length.xml"/>
<topicref href="topics/impala_mem_limit.xml"/>
<topicref href="topics/impala_num_nodes.xml"/>
<topicref href="topics/impala_num_scanner_threads.xml"/>
<topicref href="topics/impala_parquet_compression_codec.xml"/>
<topicref href="topics/impala_parquet_file_size.xml"/>
<topicref href="topics/impala_query_timeout_s.xml"/>
<topicref href="topics/impala_request_pool.xml"/>
<topicref href="topics/impala_reservation_request_timeout.xml"/>
<topicref href="topics/impala_support_start_over.xml"/>
<topicref href="topics/impala_sync_ddl.xml"/>
<topicref href="topics/impala_v_cpu_cores.xml"/>
</topicref>
<topicref href="topics/impala_show.xml"/>
<topicref href="topics/impala_truncate_table.xml"/>
<topicref audience="impala_next" href="topics/impala_update.xml"/>
<topicref href="topics/impala_use.xml"/>
</topicref>
<topicref href="topics/impala_functions.xml">
<topicref href="topics/impala_math_functions.xml"/>
<topicref href="topics/impala_bit_functions.xml"/>
<topicref href="topics/impala_conversion_functions.xml"/>
<topicref href="topics/impala_datetime_functions.xml"/>
<topicref href="topics/impala_conditional_functions.xml"/>
<topicref href="topics/impala_string_functions.xml"/>
<topicref href="topics/impala_misc_functions.xml"/>
<topicref href="topics/impala_aggregate_functions.xml">
<topicref href="topics/impala_appx_median.xml"/>
<topicref href="topics/impala_avg.xml"/>
<topicref href="topics/impala_count.xml"/>
<topicref href="topics/impala_group_concat.xml"/>
<topicref href="topics/impala_max.xml"/>
<topicref href="topics/impala_min.xml"/>
<topicref href="topics/impala_ndv.xml"/>
<topicref href="topics/impala_stddev.xml"/>
<topicref href="topics/impala_sum.xml"/>
<topicref href="topics/impala_variance.xml"/>
</topicref>
<topicref href="topics/impala_analytic_functions.xml"/>
<topicref href="topics/impala_udf.xml"/>
</topicref>
<topicref href="topics/impala_langref_unsupported.xml"/>
<topicref href="topics/impala_porting.xml"/>
</map>

@@ -0,0 +1,52 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_vars">
<title>Cloudera Impala Variables</title>
<prolog id="prolog_slg_nmv_km">
<metadata id="metadata_ecq_qmv_km">
<prodinfo audience="PDF" id="prodinfo_for_html">
<prodname>Impala</prodname>
<vrmlist>
<vrm version="Impala 2.7.x / CDH 5.9.x"/>
</vrmlist>
</prodinfo>
<prodinfo audience="HTML" id="prodinfo_for_pdf">
<prodname></prodname>
<vrmlist>
<vrm version="Impala 2.7.x / CDH 5.9.x"/>
</vrmlist>
</prodinfo>
</metadata>
</prolog>
<conbody>
<p>Substitution variables for denoting features available in release X or higher.
The upstream docs can refer to the Impala release number.
The docs included with a distro can refer to the distro release number by
editing the values here.
<ul>
<li><ph id="impala27">CDH 5.9</ph></li>
<li><ph id="impala26">CDH 5.8</ph></li>
<li><ph id="impala25">CDH 5.7</ph></li>
<li><ph id="impala24">CDH 5.6</ph></li>
<li><ph id="impala23">CDH 5.5</ph></li>
<li><ph id="impala22">CDH 5.4</ph></li>
<li><ph id="impala21">CDH 5.3</ph></li>
<li><ph id="impala20">CDH 5.2</ph></li>
<li><ph id="impala14">CDH 5.1</ph></li>
<li><ph id="impala13">CDH 5.0</ph></li>
</ul>
</p>
<p>Release Version Variable - <ph id="ReleaseVersion">Impala 2.7.x / CDH 5.9.x</ph></p>
<p>Banner for examples showing shell version -<ph id="ShellBanner">(Shell
build version: Impala Shell v2.7.x (<varname>hash</varname>) built on
<varname>date</varname>)</ph></p>
<p>Banner for examples showing impalad version -<ph id="ImpaladBanner">Server version: impalad version 2.7.x (build
x.y.z)</ph></p>
<data name="version-message" id="version-message">
<foreign>
<lines xml:space="preserve">This is the documentation for <data name="version"/>.
Documentation for other versions is available at <xref href="http://www.cloudera.com/content/support/en/documentation.html" scope="external" format="html">Cloudera Documentation</xref>.</lines>
</foreign>
</data>
</conbody>
</concept>

File diff suppressed because it is too large

77
docs/topics/impala.xml Normal file

@@ -0,0 +1,77 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="about_impala">
<title>Apache Impala (incubating) - Interactive SQL</title>
<titlealts audience="PDF"><navtitle>Impala Guide</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Components"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="SQL"/>
</metadata>
</prolog>
<conbody>
<p conref="../shared/impala_common.xml#common/impala_mission_statement"/>
<p conref="../shared/impala_common.xml#common/impala_hive_compatibility"/>
<p conref="../shared/impala_common.xml#common/impala_advantages"/>
<p outputclass="toc"/>
<p audience="integrated">
<b>Related information throughout the CDH 5 library:</b>
</p>
<p audience="integrated">
In CDH 5, the Impala documentation for Release Notes, Installation, Upgrading, and Security has been
integrated alongside the corresponding information for other Hadoop components:
</p>
<!-- Same list is in impala.xml and Impala FAQs. Conref in both places. -->
<ul>
<li>
<xref href="impala_new_features.xml#new_features">New features</xref>
</li>
<li>
<xref href="impala_known_issues.xml#known_issues">Known and fixed issues</xref>
</li>
<li>
<xref href="impala_incompatible_changes.xml#incompatible_changes">Incompatible changes</xref>
</li>
<li>
<xref href="impala_install.xml#install">Installing Impala</xref>
</li>
<li>
<xref href="impala_upgrading.xml#upgrading">Upgrading Impala</xref>
</li>
<li>
<xref href="impala_config.xml#config">Configuring Impala</xref>
</li>
<li>
<xref href="impala_processes.xml#processes">Starting Impala</xref>
</li>
<li>
<xref href="impala_security.xml#security">Security for Impala</xref>
</li>
<li>
<xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/CDH-Version-and-Packaging-Information.html" scope="external" format="html">CDH
Version and Packaging Information</xref>
</li>
</ul>
</conbody>
</concept>

@@ -0,0 +1,23 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="obwl" id="abort_on_default_limit_exceeded">
<title>ABORT_ON_DEFAULT_LIMIT_EXCEEDED Query Option</title>
<titlealts audience="PDF"><navtitle>ABORT_ON_DEFAULT_LIMIT_EXCEEDED</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p conref="../shared/impala_common.xml#common/obwl_query_options"/>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
</conbody>
</concept>

@@ -0,0 +1,44 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="abort_on_error">
<title>ABORT_ON_ERROR Query Option</title>
<titlealts audience="PDF"><navtitle>ABORT_ON_ERROR</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">ABORT_ON_ERROR query option</indexterm>
When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error,
rather than continuing and possibly returning incomplete results. This option is disabled by default, to help
gather maximum diagnostic information when an error occurs, for example, whether the same problem occurred on
all nodes or only a single node. Currently, the errors that Impala can skip over involve data corruption,
such as a column that contains a string value when expected to contain an integer value.
</p>
<p>
To control how much logging Impala does for non-fatal errors when <codeph>ABORT_ON_ERROR</codeph> is turned
off, use the <codeph>MAX_ERRORS</codeph> option.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_max_errors.xml#max_errors"/>,
<xref href="impala_logging.xml#logging"/>
</p>
</conbody>
</concept>

@@ -0,0 +1,60 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="admin">
<title>Impala Administration</title>
<titlealts audience="PDF"><navtitle>Administration</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Administrators"/>
<!-- Although there is a reasonable amount of info on the page, it could be better to use wiki-style embedding instead of linking hither and thither. -->
<data name="Category" value="Stub Pages"/>
</metadata>
</prolog>
<conbody>
<p>
As an administrator, you monitor Impala's use of resources and take action when necessary to keep Impala
running smoothly and avoid conflicts with other Hadoop components running on the same cluster. When you
detect that an issue has happened or could happen in the future, you reconfigure Impala or other components
such as HDFS or even the hardware of the cluster itself to resolve or avoid problems.
</p>
<p outputclass="toc"/>
<p>
<b>Related tasks:</b>
</p>
<p>
As an administrator, you can expect to perform installation, upgrade, and configuration tasks for Impala on
all machines in a cluster. See <xref href="impala_install.xml#install"/>,
<xref href="impala_upgrading.xml#upgrading"/>, and <xref href="impala_config.xml#config"/> for details.
</p>
<p>
For security tasks typically performed by administrators, see <xref href="impala_security.xml#security"/>.
</p>
<p>
Administrators also decide how to allocate cluster resources so that all Hadoop components can run smoothly
together. For Impala, this task primarily involves:
<ul>
<li>
Deciding how many Impala queries can run concurrently and with how much memory, through the admission
control feature. See <xref href="impala_admission.xml#admission_control"/> for details.
</li>
<li>
Dividing cluster resources such as memory between Impala and other components, using YARN for overall
resource management, and Llama to mediate resource requests from Impala to YARN. See
<xref href="impala_resource_management.xml#resource_management"/> for details.
</li>
</ul>
</p>
<!-- <p conref="../shared/impala_common.xml#common/impala_mr"/> -->
</conbody>
</concept>

@@ -0,0 +1,947 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.3.0" id="admission_control">
<title>Admission Control and Query Queuing</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Admission Control"/>
<data name="Category" value="Resource Management"/>
</metadata>
</prolog>
<conbody>
<p id="admission_control_intro">
Admission control is an Impala feature that imposes limits on concurrent SQL queries, to avoid resource usage
spikes and out-of-memory conditions on busy CDH clusters.
It is a form of <q>throttling</q>.
New queries are accepted and executed until
certain conditions are met, such as too many queries or too much
total memory used across the cluster.
When one of these thresholds is reached,
incoming queries wait to begin execution. These queries are
queued and are admitted (that is, begin executing) when the resources become available.
</p>
<p>
In addition to the threshold values for currently executing queries,
you can place limits on the maximum number of queries that are
queued (waiting) and on the amount of time they might wait
before returning with an error. These queue settings let you ensure that queries do
not wait indefinitely, so that you can detect and correct <q>starvation</q> scenarios.
</p>
<p>
Enable this feature if your cluster is
underutilized at some times and overutilized at others. Overutilization is indicated by performance
bottlenecks and queries being cancelled due to out-of-memory conditions, when those same queries are
successful and perform well during times with less concurrent load. Admission control works as a safeguard to
avoid out-of-memory conditions during heavy concurrent usage.
</p>
<note conref="../shared/impala_common.xml#common/impala_llama_obsolete"/>
<p outputclass="toc inpage"/>
</conbody>
<concept id="admission_intro">
<title>Overview of Impala Admission Control</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
On a busy CDH cluster, you might find there is an optimal number of Impala queries that run concurrently.
For example, when the I/O capacity is fully utilized by I/O-intensive queries,
you might not find any throughput benefit in running more concurrent queries.
By allowing some queries to run at full speed while others wait, rather than having
all queries contend for resources and run slowly, admission control can result in higher overall throughput.
</p>
<p>
For another example, consider a memory-bound workload such as many large joins or aggregation queries.
Each such query could briefly use many gigabytes of memory to process intermediate results.
Because Impala by default cancels queries that exceed the specified memory limit,
running multiple large-scale queries at once might require
re-running some queries that are cancelled. In this case, admission control improves the
reliability and stability of the overall workload by only allowing as many concurrent queries
as the overall memory of the cluster can accommodate.
</p>
<p>
The admission control feature lets you set an upper limit on the number of concurrent Impala
queries and on the memory used by those queries. Any additional queries are queued until the earlier ones
finish, rather than being cancelled or running slowly and causing contention. As other queries finish, the
queued queries are allowed to proceed.
</p>
<p rev="2.5.0">
In <keyword keyref="impala25_full"/> and higher, you can specify these limits and thresholds for each
pool rather than globally. That way, you can balance the resource usage and throughput
between steady well-defined workloads, rare resource-intensive queries, and ad hoc
exploratory queries.
</p>
<p>
For details on the internal workings of admission control, see
<xref href="impala_admission.xml#admission_architecture"/>.
</p>
</conbody>
</concept>
<concept id="admission_concurrency">
<title>Concurrent Queries and Admission Control</title>
<conbody>
<p>
One way to limit resource usage through admission control is to set an upper limit
on the number of concurrent queries. This is the initial technique you might use
when you do not have extensive information about memory usage for your workload.
This setting can be specified separately for each dynamic resource pool.
</p>
<p>
You can combine this setting with the memory-based approach described in
<xref href="impala_admission.xml#admission_memory"/>. If either the maximum number of
or the expected memory usage of the concurrent queries is exceeded, subsequent queries
are queued until the concurrent workload falls below the threshold again.
</p>
<p>
See
<xref audience="integrated" href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>
for information about all these dynamic resource
pool settings, how to use them together, and how to divide different parts of your workload among
different pools.
</p>
</conbody>
</concept>
<concept id="admission_memory">
<title>Memory Limits and Admission Control</title>
<conbody>
<p>
Each dynamic resource pool can have an upper limit on the cluster-wide memory used by queries executing in that pool.
This is the technique to use once you have a stable workload with well-understood memory requirements.
</p>
<p>
Always specify the <uicontrol>Default Query Memory Limit</uicontrol> for the expected maximum amount of RAM
that a query might require on each host, which is equivalent to setting the <codeph>MEM_LIMIT</codeph>
query option for every query run in that pool. That value affects the execution of each query, preventing it
from overallocating memory on each host, and potentially activating the spill-to-disk mechanism or cancelling
the query when necessary.
</p>
<p>
Optionally, specify the <uicontrol>Max Memory</uicontrol> setting, a cluster-wide limit that determines
how many queries can be safely run concurrently, based on the upper memory limit per host multiplied by the
number of Impala nodes in the cluster.
</p>
<p conref="../shared/impala_common.xml#common/admission_control_mem_limit_interaction"/>
<note conref="../shared/impala_common.xml#common/max_memory_default_limit_caveat"/>
<p>
You can combine the memory-based settings with the upper limit on concurrent queries described in
<xref href="impala_admission.xml#admission_concurrency"/>. If either the maximum number of
or the expected memory usage of the concurrent queries is exceeded, subsequent queries
are queued until the concurrent workload falls below the threshold again.
</p>
<p>
See
<xref audience="integrated" href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>
for information about all these dynamic resource
pool settings, how to use them together, and how to divide different parts of your workload among
different pools.
</p>
</conbody>
</concept>
<concept id="admission_yarn">
<title>How Impala Admission Control Relates to Other Resource Management Tools</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
The admission control feature is similar in some ways to the Cloudera Manager
static partitioning feature, as well as the YARN resource management framework. These features
can be used separately or together. This section describes some similarities and differences, to help you
decide which combination of resource management features to use for Impala.
</p>
<p>
Admission control is a lightweight, decentralized system that is suitable for workloads consisting
primarily of Impala queries and other SQL statements. It sets <q>soft</q> limits that smooth out Impala
memory usage during times of heavy load, rather than taking an all-or-nothing approach that cancels jobs
that are too resource-intensive.
</p>
<p>
Because the admission control system does not interact with other Hadoop workloads such as MapReduce jobs, you
might use YARN with static service pools on CDH 5 clusters where resources are shared between
Impala and other Hadoop components. This configuration is recommended when using Impala in a
<term>multitenant</term> cluster. Devote a percentage of cluster resources to Impala, and allocate another
percentage for MapReduce and other batch-style workloads. Let admission control handle the concurrency and
memory usage for the Impala work within the cluster, and let YARN manage the work for other components within the
cluster. In this scenario, Impala's resources are not managed by YARN.
</p>
<p>
The Impala admission control feature uses the same configuration mechanism as the YARN resource manager to map users to
pools and authenticate them.
</p>
<p rev="DOCS-648">
Although the Impala admission control feature uses a <codeph>fair-scheduler.xml</codeph> configuration file
behind the scenes, this file does not depend on which scheduler is used for YARN. You still use this file,
and Cloudera Manager can generate it for you, even when YARN is using the capacity scheduler.
</p>
</conbody>
</concept>
<concept id="admission_architecture">
<title>How Impala Schedules and Enforces Limits on Concurrent Queries</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Scheduling"/>
</metadata>
</prolog>
<conbody>
<p>
The admission control system is decentralized, embedded in each Impala daemon and communicating through the
statestore mechanism. Although the limits you set for memory usage and number of concurrent queries apply
cluster-wide, each Impala daemon makes its own decisions about whether to allow each query to run
immediately or to queue it for a less-busy time. These decisions are fast, meaning the admission control
mechanism is low-overhead, but might be imprecise during times of heavy load across many coordinators. At such times,
more queries might be queued (in aggregate across the cluster) than the specified limit allows, or the number of admitted
queries might exceed the expected number. Thus, you typically err on the
high side for the size of the queue, because there is not a big penalty for having a large number of queued
queries; and you typically err on the low side for configuring memory resources, to leave some headroom in case more
queries are admitted than expected, without running out of memory and being cancelled as a result.
</p>
<!-- Commenting out as redundant.
<p>
The limit on the number of concurrent queries is a <q>soft</q> one, To achieve high throughput, Impala
makes quick decisions at the host level about which queued queries to dispatch. Therefore, Impala might
slightly exceed the limits from time to time.
</p>
-->
<p>
To avoid a large backlog of queued requests, you can set an upper limit on the size of the queue for
queries that are queued. When the number of queued queries exceeds this limit, further queries are
cancelled rather than being queued. You can also configure a timeout period per pool, after which queued queries are
cancelled, to avoid indefinite waits. If a cluster reaches this state where queries are cancelled due to
too many concurrent requests or long waits for query execution to begin, that is a signal for an
administrator to take action, either by provisioning more resources, scheduling work on the cluster to
smooth out the load, or by doing <xref href="impala_performance.xml#performance">Impala performance
tuning</xref> to enable higher throughput.
</p>
</conbody>
</concept>
<concept id="admission_jdbc_odbc">
<title>How Admission Control works with Impala Clients (JDBC, ODBC, HiveServer2)</title>
<prolog>
<metadata>
<data name="Category" value="JDBC"/>
<data name="Category" value="ODBC"/>
<data name="Category" value="HiveServer2"/>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
Most aspects of admission control work transparently with client interfaces such as JDBC and ODBC:
</p>
<ul>
<li>
If a SQL statement is put into a queue rather than running immediately, the API call blocks until the
statement is dequeued and begins execution. At that point, the client program can request to fetch
results, which might also block until results become available.
</li>
<li>
If a SQL statement is cancelled because it has been queued for too long or because it exceeded the memory
limit during execution, the error is returned to the client program with a descriptive error message.
</li>
</ul>
<p rev="CDH-27667">
In Impala 2.0 and higher, you can submit
a SQL <codeph>SET</codeph> statement from the client application
to change the <codeph>REQUEST_POOL</codeph> query option.
This option lets you submit queries to different resource pools,
as described in <xref href="impala_request_pool.xml#request_pool"/>.
<!-- Commenting out as starting to be too old to mention.
Prior to Impala 2.0, that option was only settable
for a session through the <cmdname>impala-shell</cmdname> <codeph>SET</codeph> command, or cluster-wide through an
<cmdname>impalad</cmdname> startup option.
-->
</p>
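<p>
For example, a JDBC, ODBC, or <cmdname>impala-shell</cmdname> session might direct its statements to a
particular pool before submitting a query. (This is an illustrative sketch; the pool name
<codeph>root.development</codeph> and the table name are placeholders.)
</p>
<codeblock>-- Route subsequent statements in this session to a specific resource pool.
SET REQUEST_POOL=root.development;
SELECT COUNT(*) FROM web_logs;</codeblock>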
<p>
At any time, the set of queued queries could include queries submitted through multiple different Impala
daemon hosts. All the queries submitted through a particular host will be executed in order, so a
<codeph>CREATE TABLE</codeph> followed by an <codeph>INSERT</codeph> on the same table would succeed.
Queries submitted through different hosts are not guaranteed to be executed in the order they were
received. Therefore, if you are using load-balancing or other round-robin scheduling where different
statements are submitted through different hosts, set up all table structures ahead of time so that the
statements controlled by the queuing system are primarily queries, where order is not significant. Or, if a
sequence of statements needs to happen in strict order (such as an <codeph>INSERT</codeph> followed by a
<codeph>SELECT</codeph>), submit all those statements through a single session, while connected to the same
Impala daemon host.
</p>
<p>
Admission control has the following limitations or special behavior when used with JDBC or ODBC
applications:
</p>
<ul>
<li>
The other resource-related query options,
<codeph>RESERVATION_REQUEST_TIMEOUT</codeph> and <codeph>V_CPU_CORES</codeph>, are no longer used. Those query options only
applied to using Impala with Llama, which is no longer supported.
</li>
</ul>
</conbody>
</concept>
<concept id="admission_schema_config">
<title>SQL and Schema Considerations for Admission Control</title>
<conbody>
<p>
When queries complete quickly and are tuned for optimal memory usage, there is less chance of
performance or capacity problems during times of heavy load. Before setting up admission control,
tune your Impala queries to ensure that the query plans are efficient and the memory estimates
are accurate. Understanding the nature of your workload, and which queries are the most
resource-intensive, helps you to plan how to divide the queries into different pools and
decide what limits to define for each pool.
</p>
<p>
For large tables, especially those involved in join queries, keep their statistics up to date
after loading substantial amounts of new data or adding new partitions.
Use the <codeph>COMPUTE STATS</codeph> statement for unpartitioned tables, and
<codeph>COMPUTE INCREMENTAL STATS</codeph> for partitioned tables.
</p>
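<p>
For example (the table names below are placeholders, not tables that ship with Impala):
</p>
<codeblock>-- Unpartitioned table: gather full statistics.
COMPUTE STATS store_sales_flat;

-- Partitioned table: gather statistics only for new or changed partitions.
COMPUTE INCREMENTAL STATS store_sales_by_day;</codeblock>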
<p>
When you use dynamic resource pools with a <uicontrol>Max Memory</uicontrol> setting enabled,
you typically override the memory estimates that Impala makes based on the statistics from the
<codeph>COMPUTE STATS</codeph> statement.
You either set the <codeph>MEM_LIMIT</codeph> query option within a particular session to
set an upper memory limit for queries within that session, or a default <codeph>MEM_LIMIT</codeph>
setting for all queries processed by the <cmdname>impalad</cmdname> instance, or
a default <codeph>MEM_LIMIT</codeph> setting for all queries assigned to a particular
dynamic resource pool. By designating a consistent memory limit for a set of similar queries
that use the same resource pool, you avoid unnecessary query queuing or out-of-memory conditions
that can arise during high-concurrency workloads when memory estimates for some queries are inaccurate.
</p>
<p>
Follow other steps from <xref href="impala_performance.xml#performance"/> to tune your queries.
</p>
</conbody>
</concept>
<concept id="admission_config">
<title>Configuring Admission Control</title>
<prolog>
<metadata>
<data name="Category" value="Configuring"/>
</metadata>
</prolog>
<conbody>
<p>
The configuration options for admission control range from the simple (a single resource pool with a single
set of options) to the complex (multiple resource pools with different options, each pool handling queries
for a different set of users and groups). <ph rev="upstream">Cloudera</ph> recommends configuring the settings through the Cloudera Manager user
interface.
<!--
, or on a system without Cloudera Manager by editing configuration files or through startup
options to the <cmdname>impalad</cmdname> daemon.
-->
</p>
<!-- To do: reconcile the similar notes in impala_admission.xml and admin_impala_admission_control.xml
and make into a conref in both places. -->
<note type="important">
Although the following options are still present in the Cloudera Manager interface under the
<uicontrol>Admission Control</uicontrol> configuration settings dialog,
<ph rev="upstream">Cloudera</ph> recommends you not use them in <keyword keyref="impala25_full"/> and higher.
These settings only apply if you enable admission control but leave dynamic resource pools disabled.
In <keyword keyref="impala25_full"/> and higher, prefer to set up dynamic resource pools and
customize the settings for each pool, as described in
<ph audience="integrated"><xref href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr/section_p15_mhn_2v"/> and <xref href="cm_mc_resource_pools.xml#concept_xkk_l1d_wr/section_gph_tnk_lm"/></ph>
<xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>.
</note>
<section id="admission_flags">
<title>Impala Service Flags for Admission Control (Advanced)</title>
<p>
The following Impala configuration options let you adjust the settings of the admission control feature. When supplying the
options on the <cmdname>impalad</cmdname> command line, prepend the option name with <codeph>--</codeph>.
</p>
<dl id="admission_control_option_list">
<dlentry id="queue_wait_timeout_ms">
<dt>
<codeph>queue_wait_timeout_ms</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--queue_wait_timeout_ms</indexterm>
<b>Purpose:</b> Maximum amount of time (in milliseconds) that a
request waits to be admitted before timing out.
<p>
<b>Type:</b> <codeph>int64</codeph>
</p>
<p>
<b>Default:</b> <codeph>60000</codeph>
</p>
</dd>
</dlentry>
<dlentry id="default_pool_max_requests">
<dt>
<codeph>default_pool_max_requests</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--default_pool_max_requests</indexterm>
<b>Purpose:</b> Maximum number of concurrent outstanding requests
allowed to run before incoming requests are queued. Because this
limit applies cluster-wide, but each Impala node makes independent
decisions to run queries immediately or queue them, it is a soft
limit; the overall number of concurrent queries might be slightly
higher during times of heavy load. A negative value indicates no
limit. Ignored if <codeph>fair_scheduler_config_path</codeph> and
<codeph>llama_site_path</codeph> are set. <p>
<b>Type:</b>
<codeph>int64</codeph>
</p>
<p>
<b>Default:</b>
<ph rev="2.5.0">-1, meaning unlimited (prior to <keyword keyref="impala25_full"/> the default was 200)</ph>
</p>
</dd>
</dlentry>
<dlentry id="default_pool_max_queued">
<dt>
<codeph>default_pool_max_queued</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--default_pool_max_queued</indexterm>
<b>Purpose:</b> Maximum number of requests allowed to be queued
before rejecting requests. Because this limit applies
cluster-wide, but each Impala node makes independent decisions to
run queries immediately or queue them, it is a soft limit; the
overall number of queued queries might be slightly higher during
times of heavy load. A negative value or 0 indicates requests are
always rejected once the maximum concurrent requests are
executing. Ignored if <codeph>fair_scheduler_config_path</codeph>
and <codeph>llama_site_path</codeph> are set. <p>
<b>Type:</b>
<codeph>int64</codeph>
</p>
<p>
<b>Default:</b>
<ph rev="2.5.0">unlimited</ph>
</p>
</dd>
</dlentry>
<dlentry id="default_pool_mem_limit">
<dt>
<codeph>default_pool_mem_limit</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--default_pool_mem_limit</indexterm>
<b>Purpose:</b> Maximum amount of memory (across the entire
cluster) that all outstanding requests in this pool can use before
new requests to this pool are queued. Specified in bytes,
megabytes, or gigabytes by a number followed by the suffix
<codeph>b</codeph> (optional), <codeph>m</codeph>, or
<codeph>g</codeph>, either uppercase or lowercase. You can
specify floating-point values for megabytes and gigabytes, to
represent fractional numbers such as <codeph>1.5</codeph>. You can
also specify it as a percentage of the physical memory by
specifying the suffix <codeph>%</codeph>. 0 or no setting
indicates no limit. Defaults to bytes if no unit is given. Because
this limit applies cluster-wide, but each Impala node makes
independent decisions to run queries immediately or queue them, it
is a soft limit; the overall memory used by concurrent queries
might be slightly higher during times of heavy load. Ignored if
<codeph>fair_scheduler_config_path</codeph> and
<codeph>llama_site_path</codeph> are set. <note
conref="../shared/impala_common.xml#common/admission_compute_stats" />
<p conref="../shared/impala_common.xml#common/type_string" />
<p>
<b>Default:</b>
<codeph>""</codeph> (empty string, meaning unlimited) </p>
</dd>
</dlentry>
<!-- Possibly from here on down, command-line controls not applicable to CM. -->
<dlentry id="disable_admission_control">
<dt>
<codeph>disable_admission_control</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--disable_admission_control</indexterm>
<b>Purpose:</b> Turns off the admission control feature entirely,
regardless of other configuration option settings.
<p>
<b>Type:</b> Boolean </p>
<p>
<b>Default:</b>
<codeph>false</codeph>
</p>
</dd>
</dlentry>
<dlentry id="disable_pool_max_requests">
<dt>
<codeph>disable_pool_max_requests</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--disable_pool_max_requests</indexterm>
<b>Purpose:</b> Disables all per-pool limits on the maximum number
of running requests. <p>
<b>Type:</b> Boolean </p>
<p>
<b>Default:</b>
<codeph>false</codeph>
</p>
</dd>
</dlentry>
<dlentry id="disable_pool_mem_limits">
<dt>
<codeph>disable_pool_mem_limits</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--disable_pool_mem_limits</indexterm>
<b>Purpose:</b> Disables all per-pool mem limits. <p>
<b>Type:</b> Boolean </p>
<p>
<b>Default:</b>
<codeph>false</codeph>
</p>
</dd>
</dlentry>
<dlentry id="fair_scheduler_allocation_path">
<dt>
<codeph>fair_scheduler_allocation_path</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--fair_scheduler_allocation_path</indexterm>
<b>Purpose:</b> Path to the fair scheduler allocation file
(<codeph>fair-scheduler.xml</codeph>). <p
conref="../shared/impala_common.xml#common/type_string" />
<p>
<b>Default:</b>
<codeph>""</codeph> (empty string) </p>
<p>
<b>Usage notes:</b> Admission control only uses a small subset
of the settings that can go in this file, as described below.
For details about all the Fair Scheduler configuration settings,
see the <xref
href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Configuration"
scope="external" format="html">Apache wiki</xref>. </p>
</dd>
</dlentry>
<dlentry id="llama_site_path">
<dt>
<codeph>llama_site_path</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">--llama_site_path</indexterm>
<b>Purpose:</b> Path to the configuration file used by admission control
(<codeph>llama-site.xml</codeph>). If set,
<codeph>fair_scheduler_allocation_path</codeph> must also be set.
<p conref="../shared/impala_common.xml#common/type_string" />
<p>
<b>Default:</b> <codeph>""</codeph> (empty string) </p>
<p>
<b>Usage notes:</b> Admission control only uses a few
of the settings that can go in this file, as described below.
</p>
</dd>
</dlentry>
</dl>
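<p>
For example, a minimal single-pool setup might start each <cmdname>impalad</cmdname> with flags such as
the following (the specific values are illustrative, not recommendations):
</p>
<codeblock>impalad --default_pool_max_requests=20 \
    --default_pool_mem_limit=200g \
    --queue_wait_timeout_ms=30000 \
    &lt;other startup flags&gt;</codeblock>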
</section>
</conbody>
<concept id="admission_config_cm">
<!-- TK: Maybe all this stuff overlaps with admin_impala_admission_control and can be delegated there. -->
<title>Configuring Admission Control Using Cloudera Manager</title>
<prolog>
<metadata>
<data name="Category" value="Cloudera Manager"/>
</metadata>
</prolog>
<conbody>
<p>
In Cloudera Manager, you can configure pools to manage queued Impala queries, and the options for the
limit on number of concurrent queries and how to handle queries that exceed the limit. For details, see
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_managing_resources.html" scope="external" format="html">Managing Resources with Cloudera Manager</xref>.
</p>
<p audience="Cloudera"><!-- Hiding link because that subtopic is now hidden. -->
See <xref href="#admission_examples"/> for a sample setup for admission control under
Cloudera Manager.
</p>
</conbody>
</concept>
<concept id="admission_config_noncm">
<title>Configuring Admission Control Using the Command Line</title>
<conbody>
<p>
If you do not use Cloudera Manager, you use a combination of startup options for the Impala daemon, and
optionally edit or manually construct the configuration files
<filepath>fair-scheduler.xml</filepath> and <filepath>llama-site.xml</filepath>.
</p>
<p>
For a straightforward configuration using a single resource pool named <codeph>default</codeph>, you can
specify configuration options on the command line and skip the <filepath>fair-scheduler.xml</filepath>
and <filepath>llama-site.xml</filepath> configuration files.
</p>
<p>
For an advanced configuration with multiple resource pools using different settings, set up the
<filepath>fair-scheduler.xml</filepath> and <filepath>llama-site.xml</filepath> configuration files
manually. Provide the paths to each one using the <cmdname>impalad</cmdname> command-line options,
<codeph>--fair_scheduler_allocation_path</codeph> and <codeph>--llama_site_path</codeph> respectively.
</p>
<p>
The Impala admission control feature only uses the Fair Scheduler configuration settings to determine how
to map users and groups to different resource pools. For example, you might set up different resource
pools with separate memory limits, and maximum number of concurrent and queued queries, for different
categories of users within your organization. For details about all the Fair Scheduler configuration
settings, see the
<xref href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Configuration" scope="external" format="html">Apache
wiki</xref>.
</p>
<p>
The Impala admission control feature only uses a small subset of possible settings from the
<filepath>llama-site.xml</filepath> configuration file:
</p>
<codeblock>llama.am.throttling.maximum.placed.reservations.<varname>queue_name</varname>
llama.am.throttling.maximum.queued.reservations.<varname>queue_name</varname>
<ph rev="2.5.0 IMPALA-2538">impala.admission-control.pool-default-query-options.<varname>queue_name</varname>
impala.admission-control.pool-queue-timeout-ms.<varname>queue_name</varname></ph>
</codeblock>
<p rev="2.5.0 IMPALA-2538">
The <codeph>impala.admission-control.pool-queue-timeout-ms</codeph>
setting specifies the timeout value for this pool, in milliseconds.
The <codeph>impala.admission-control.pool-default-query-options</codeph>
setting designates the default query options for all queries that run
in this pool. Its argument value is a comma-delimited string of
'key=value' pairs, for example, <codeph>'key1=val1,key2=val2'</codeph>.
For example, this is where you might set a default memory limit
for all queries in the pool, using an argument such as <codeph>MEM_LIMIT=5G</codeph>.
</p>
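<p rev="2.5.0 IMPALA-2538">
A minimal sketch of such a property in <filepath>llama-site.xml</filepath>, here for a pool named
<codeph>root.default</codeph> with illustrative values, might look like the following (see the complete
sample files later in this topic):
</p>
<codeblock rev="2.5.0 IMPALA-2538"><![CDATA[<property>
  <name>impala.admission-control.pool-default-query-options.root.default</name>
  <value>MEM_LIMIT=5G,QUERY_TIMEOUT_S=60</value>
</property>]]></codeblock>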
<p rev="2.5.0 IMPALA-2538">
The <codeph>impala.admission-control.*</codeph> configuration settings are available in
<keyword keyref="impala25_full"/> and higher.
</p>
<p audience="Cloudera"><!-- Hiding link because that subtopic is now hidden. -->
See <xref href="#admission_examples/section_etq_qgb_rq"/> for sample configuration files
for admission control using multiple resource pools, without Cloudera Manager.
</p>
</conbody>
</concept>
<concept id="admission_examples">
<!-- Pruning the CM examples and screenshots because in Impala 2.5 the defaults match up much better with our recommendations. -->
<title>Examples of Admission Control Configurations</title>
<conbody>
<section id="section_fqn_qgb_rq">
<title>Example Admission Control Configurations Using Cloudera Manager</title>
<p>
For full instructions about configuring dynamic resource pools through Cloudera Manager, see
<xref audience="integrated" href="cm_mc_resource_pools.xml#xd_583c10bfdbd326ba--43d5fd93-1410993f8c2--7ff2"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html" scope="external" format="html"/>.
</p>
</section>
<section id="section_etq_qgb_rq">
<title>Example Admission Control Configurations Using Configuration Files</title>
<p>
For clusters not managed by Cloudera Manager, here are sample <filepath>fair-scheduler.xml</filepath>
and <filepath>llama-site.xml</filepath> files that define resource pools <codeph>root.default</codeph>,
<codeph>root.development</codeph>, and <codeph>root.production</codeph>.
These sample files are stripped down: in a real deployment they
might contain other settings for use with various aspects of the YARN component. The
settings shown here are the significant ones for the Impala admission control feature.
</p>
<p>
<b>fair-scheduler.xml:</b>
</p>
<p>
Although Impala does not use the <codeph>vcores</codeph> value, you must still specify it to satisfy
YARN requirements for the file contents.
</p>
<p>
Each <codeph>&lt;aclSubmitApps&gt;</codeph> tag (other than the one for <codeph>root</codeph>) contains
a comma-separated list of users, then a space, then a comma-separated list of groups; these are the
users and groups allowed to submit Impala statements to the corresponding resource pool.
</p>
<p>
If you leave the <codeph>&lt;aclSubmitApps&gt;</codeph> element empty for a pool, nobody can submit
directly to that pool; child pools can specify their own <codeph>&lt;aclSubmitApps&gt;</codeph> values
to authorize users and groups to submit to those pools.
</p>
<codeblock><![CDATA[<allocations>
<queue name="root">
<aclSubmitApps> </aclSubmitApps>
<queue name="default">
<maxResources>50000 mb, 0 vcores</maxResources>
<aclSubmitApps>*</aclSubmitApps>
</queue>
<queue name="development">
<maxResources>200000 mb, 0 vcores</maxResources>
<aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
</queue>
<queue name="production">
<maxResources>1000000 mb, 0 vcores</maxResources>
<aclSubmitApps> ops,admin</aclSubmitApps>
</queue>
</queue>
<queuePlacementPolicy>
<rule name="specified" create="false"/>
<rule name="default" />
</queuePlacementPolicy>
</allocations>
]]>
</codeblock>
<p>
<b>llama-site.xml:</b>
</p>
<codeblock rev="2.5.0 IMPALA-2538"><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.default</name>
<value>10</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.default</name>
<value>50</value>
</property>
<property>
<name>impala.admission-control.pool-default-query-options.root.default</name>
<value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
</property>
<property>
<name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
<value>30000</value>
</property>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.development</name>
<value>50</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.development</name>
<value>100</value>
</property>
<property>
<name>impala.admission-control.pool-default-query-options.root.development</name>
<value>mem_limit=256m,query_timeout_s=30,max_io_buffers=10</value>
</property>
<property>
<name>impala.admission-control.pool-queue-timeout-ms.root.development</name>
<value>15000</value>
</property>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.production</name>
<value>100</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.production</name>
<value>200</value>
</property>
<!--
Default query options for the 'root.production' pool.
THIS IS A NEW PARAMETER in CDH 5.7 / Impala 2.5.
Note that the MEM_LIMIT query option still shows up in here even though it is a
separate box in the UI. We do that because it is the most important query option
that people will need (everything else is somewhat advanced).
MEM_LIMIT takes a per-node memory limit which is specified using one of the following:
- '<int>[bB]?' -> bytes (default if no unit given)
- '<float>[mM(bB)]' -> megabytes
  - '<float>[gG(bB)]' -> gigabytes
E.g. 'MEM_LIMIT=12345' (no unit) means 12345 bytes, and you can append m or g
to specify megabytes or gigabytes, though that is not required.
-->
<property>
<name>impala.admission-control.pool-default-query-options.root.production</name>
<value>mem_limit=386m,query_timeout_s=30,max_io_buffers=10</value>
</property>
<!--
Default queue timeout (ms) for the pool 'root.production'.
    If this isn't set, the process-wide flag is used.
THIS IS A NEW PARAMETER in CDH 5.7 / Impala 2.5.
-->
<property>
<name>impala.admission-control.pool-queue-timeout-ms.root.production</name>
<value>30000</value>
</property>
</configuration>
]]>
</codeblock>
</section>
</conbody>
</concept>
</concept>
<!-- End Config -->
<concept id="admission_guidelines">
<title>Guidelines for Using Admission Control</title>
<prolog>
<metadata>
<data name="Category" value="Planning"/>
<data name="Category" value="Guidelines"/>
<data name="Category" value="Best Practices"/>
</metadata>
</prolog>
<conbody>
<p>
To see how admission control works for particular queries, examine the profile output for the query. This
information is available through the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname>
immediately after running a query in the shell, on the <uicontrol>queries</uicontrol> page of the Impala
debug web UI, or in the Impala log file (basic information at log level 1, more detailed information at log
level 2). The profile output contains details about the admission decision, such as whether the query was
queued or not and which resource pool it was assigned to. It also includes the estimated and actual memory
usage for the query, so you can fine-tune the configuration for the memory limits of the resource pools.
</p>
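      <p>
        As a minimal illustration (the table name here is just a placeholder), you could run a query in
        <cmdname>impala-shell</cmdname> and then request its profile in the same session:
      </p>
<codeblock>-- Run any query, then examine its profile in the same impala-shell session.
select count(*) from any_table;
profile;
-- In the profile output, look for the admission control details, such as the
-- resource pool the query ran in and whether it was admitted immediately or queued.
</codeblock>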
<p>
Where practical, use Cloudera Manager to configure the admission control parameters. The Cloudera Manager
GUI is much simpler than editing the configuration files directly.
</p>
<p>
Remember that the limits imposed by admission control are <q>soft</q> limits.
The decentralized nature of this mechanism means that each Impala node makes its own decisions about whether
to allow queries to run immediately or to queue them. These decisions rely on information passed back and forth
between nodes by the statestore service. If a sudden surge in requests causes more queries than anticipated to run
concurrently, then throughput could decrease due to queries spilling to disk or contending for resources;
or queries could be cancelled if they exceed the <codeph>MEM_LIMIT</codeph> setting while running.
</p>
<!--
<p>
If you have trouble getting a query to run because its estimated memory usage is too high, you can override
the estimate by setting the <codeph>MEM_LIMIT</codeph> query option in <cmdname>impala-shell</cmdname>,
then issuing the query through the shell in the same session. The <codeph>MEM_LIMIT</codeph> value is
treated as the estimated amount of memory, overriding the estimate that Impala would generate based on
table and column statistics. This value is used only for making admission control decisions, and is not
pre-allocated by the query.
</p>
-->
<p>
In <cmdname>impala-shell</cmdname>, you can also specify which resource pool to direct queries to by
setting the <codeph>REQUEST_POOL</codeph> query option.
</p>
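      <p>
        For example, assuming a resource pool named <codeph>root.development</codeph> like the one in the
        sample configuration files, a session might direct its queries to that pool as follows:
      </p>
<codeblock>set request_pool=root.development;
-- Subsequent statements in this session are submitted to the root.development pool.
select count(*) from any_table;
</codeblock>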
<p>
The statements affected by the admission control feature are primarily queries, but also include statements
that write data such as <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>. Most write
operations in Impala are not resource-intensive, but inserting into a Parquet table can require substantial
memory due to buffering intermediate data before writing out each Parquet data block. See
<xref href="impala_parquet.xml#parquet_etl"/> for instructions about inserting data efficiently into
Parquet tables.
</p>
<p>
Although admission control does not scrutinize memory usage for other kinds of DDL statements, if a query
is queued due to a limit on concurrent queries or memory usage, subsequent statements in the same session
are also queued so that they are processed in the correct order:
</p>
<codeblock>-- This query could be queued to avoid out-of-memory at times of heavy load.
select * from huge_table join enormous_table using (id);
-- If so, this subsequent statement in the same session is also queued
-- until the previous statement completes.
drop table huge_table;
</codeblock>
<p>
If you set up different resource pools for different users and groups, consider reusing any classifications
you developed for use with Sentry security. See <xref href="impala_authorization.xml#authorization"/> for details.
</p>
<p>
For details about all the Fair Scheduler configuration settings, see
<xref href="https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Configuration" scope="external" format="html">Fair
Scheduler Configuration</xref>, in particular the tags such as <codeph>&lt;queue&gt;</codeph> and
<codeph>&lt;aclSubmitApps&gt;</codeph> to map users and groups to particular resource pools (queues).
</p>
<!-- Wait a sec. We say admission control doesn't use RESERVATION_REQUEST_TIMEOUT at all.
What's the real story here? Matt did refer to some timeout option that was
available through the shell but not the DB-centric APIs.
<p>
Because you cannot override query options such as
<codeph>RESERVATION_REQUEST_TIMEOUT</codeph>
in a JDBC or ODBC application, consider configuring timeout periods
on the application side to cancel queries that take
too long due to being queued during times of high load.
</p>
-->
</conbody>
</concept>
</concept>
<!-- Admission control -->

View File

@@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="aggregate_functions">
<title>Impala Aggregate Functions</title>
<titlealts audience="PDF"><navtitle>Aggregate Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>
<conbody>
<p conref="../shared/impala_common.xml#common/aggr1"/>
<codeblock conref="../shared/impala_common.xml#common/aggr2"/>
<p conref="../shared/impala_common.xml#common/aggr3"/>
<p>
<indexterm audience="Cloudera">aggregate functions</indexterm>
</p>
<p outputclass="toc"/>
</conbody>
</concept>

View File

@@ -0,0 +1,87 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="aliases">
<title>Overview of Impala Aliases</title>
<titlealts audience="PDF"><navtitle>Aliases</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
When you write the names of tables, columns, or column expressions in a query, you can assign an alias at the
same time. Then you can specify the alias rather than the original name when making other references to the
      table or column in the same statement. You typically specify aliases that are shorter or easier to
      remember than the original names. The aliases are printed in the query header, making them useful for
      self-documenting output.
</p>
<p>
To set up an alias, add the <codeph>AS <varname>alias</varname></codeph> clause immediately after any table,
column, or expression name in the <codeph>SELECT</codeph> list or <codeph>FROM</codeph> list of a query. The
<codeph>AS</codeph> keyword is optional; you can also specify the alias immediately after the original name.
</p>
<codeblock>-- Make the column headers of the result set easier to understand.
SELECT c1 AS name, c2 AS address, c3 AS phone FROM table_with_terse_columns;
SELECT SUM(ss_xyz_dollars_net) AS total_sales FROM table_with_cryptic_columns;
-- The alias can be a quoted string for extra readability.
SELECT c1 AS "Employee ID", c2 AS "Date of hire" FROM t1;
-- The AS keyword is optional.
SELECT c1 "Employee ID", c2 "Date of hire" FROM t1;
-- The table aliases assigned in the FROM clause can be used both earlier
-- in the query (the SELECT list) and later (the WHERE clause).
SELECT one.name, two.address, three.phone
FROM census one, building_directory two, phonebook three
WHERE one.id = two.id and two.id = three.id;
-- The aliases c1 and c2 let the query handle columns with the same names from 2 joined tables.
-- The aliases t1 and t2 let the query abbreviate references to long or cryptically named tables.
SELECT t1.column_n AS c1, t2.column_n AS c2 FROM long_name_table AS t1, very_long_name_table2 AS t2
WHERE c1 = c2;
SELECT t1.column_n c1, t2.column_n c2 FROM table1 t1, table2 t2
WHERE c1 = c2;
</codeblock>
<p>
To use an alias name that matches one of the Impala reserved keywords (listed in
<xref href="impala_reserved_words.xml#reserved_words"/>), surround the identifier with either single or
double quotation marks, or <codeph>``</codeph> characters (backticks).
</p>
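    <p>
      For example, because <codeph>TABLE</codeph> is a reserved word, an alias with that name must be
      quoted (the table and column names below are placeholders):
    </p>
<codeblock>-- Quote the alias with backticks or with single or double quotation marks.
select c1 as `table` from t1;
select c1 as "table" from t1;
</codeblock>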
<p>
<ph conref="../shared/impala_common.xml#common/aliases_vs_identifiers"/>
</p>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p rev="2.3.0">
Queries involving the complex types (<codeph>ARRAY</codeph>,
      <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) typically make
extensive use of table aliases. These queries involve join clauses
where the complex type column is treated as a joined table.
To construct two-part or three-part qualified names for the
complex column elements in the <codeph>FROM</codeph> list,
sometimes it is syntactically required to construct a table
alias for the complex column where it is referenced in the join clause.
See <xref href="impala_complex_types.xml#complex_types"/> for details and examples.
</p>
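    <p>
      As a brief sketch (the table and column names are hypothetical), a query against a table
      <codeph>customers</codeph> with an <codeph>ARRAY&lt;STRING&gt;</codeph> column
      <codeph>interests</codeph> might assign aliases to both the table and the complex column:
    </p>
<codeblock>-- The complex column is referenced like a joined table and is given its own alias.
select c.id, i.item
  from customers c, c.interests i;
</codeblock>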
<p>
<b>Alternatives:</b>
</p>
<p conref="../shared/impala_common.xml#common/views_vs_identifiers"/>
</conbody>
</concept>

View File

@@ -0,0 +1,31 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="allow_unsupported_formats">
<title>ALLOW_UNSUPPORTED_FORMATS Query Option</title>
<titlealts audience="PDF"><navtitle>ALLOW_UNSUPPORTED_FORMATS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Deprecated Features"/>
</metadata>
</prolog>
<conbody>
<!--
The original brief explanation with not enough detail comes from the comments at:
http://github.sf.cloudera.com/CDH/Impala/raw/master/common/thrift/ImpalaService.thrift
Removing that wording from here after discussions with dev team. Just recording the URL for posterity.
-->
<p>
An obsolete query option from early work on support for file formats. Do not use. Might be removed in the
future.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
</conbody>
</concept>

View File

@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" rev="1.x" id="alter_function">
<title>ALTER FUNCTION Statement</title>
<titlealts audience="PDF"><navtitle>ALTER FUNCTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p/>
</conbody>
</concept>

View File

@@ -0,0 +1,806 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="alter_table">
<title>ALTER TABLE Statement</title>
<titlealts audience="PDF"><navtitle>ALTER TABLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="S3"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">ALTER TABLE statement</indexterm>
The <codeph>ALTER TABLE</codeph> statement changes the structure or properties of an existing Impala table.
</p>
<p>
In Impala, this is primarily a logical operation that updates the table metadata in the metastore database that Impala
shares with Hive. Most <codeph>ALTER TABLE</codeph> operations do not actually rewrite, move, and so on the actual data
files. (The <codeph>RENAME TO</codeph> clause is the one exception; it can cause HDFS files to be moved to different paths.)
When you do an <codeph>ALTER TABLE</codeph> operation, you typically need to perform corresponding physical filesystem operations,
such as rewriting the data files to include extra fields, or converting them to a different file format.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>ALTER TABLE [<varname>old_db_name</varname>.]<varname>old_table_name</varname> RENAME TO [<varname>new_db_name</varname>.]<varname>new_table_name</varname>
ALTER TABLE <varname>name</varname> ADD COLUMNS (<varname>col_spec</varname>[, <varname>col_spec</varname> ...])
ALTER TABLE <varname>name</varname> DROP [COLUMN] <varname>column_name</varname>
ALTER TABLE <varname>name</varname> CHANGE <varname>column_name</varname> <varname>new_name</varname> <varname>new_type</varname>
ALTER TABLE <varname>name</varname> REPLACE COLUMNS (<varname>col_spec</varname>[, <varname>col_spec</varname> ...])
ALTER TABLE <varname>name</varname> { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (<varname>partition_spec</varname>) <ph rev="2.3.0">[PURGE]</ph>
<ph rev="2.3.0 IMPALA-1568 CDH-36799">ALTER TABLE <varname>name</varname> RECOVER PARTITIONS</ph>
ALTER TABLE <varname>name</varname> [PARTITION (<varname>partition_spec</varname>)]
SET { FILEFORMAT <varname>file_format</varname>
| LOCATION '<varname>hdfs_path_of_directory</varname>'
| TBLPROPERTIES (<varname>table_properties</varname>)
| SERDEPROPERTIES (<varname>serde_properties</varname>) }
<ph rev="2.6.0 IMPALA-3369">ALTER TABLE <varname>name</varname> <varname>colname</varname>
('<varname>statsKey</varname>'='<varname>val</varname>, ...)
statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
<ph rev="1.4.0">ALTER TABLE <varname>name</varname> [PARTITION (<varname>partition_spec</varname>)] SET { CACHED IN '<varname>pool_name</varname>' <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED }</ph>
<varname>new_name</varname> ::= [<varname>new_database</varname>.]<varname>new_table_name</varname>
<varname>col_spec</varname> ::= <varname>col_name</varname> <varname>type_name</varname>
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
<varname>table_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]
<varname>serde_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]
<varname>file_format</varname> ::= { PARQUET | TEXTFILE | RCFILE | SEQUENCEFILE | AVRO }
</codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, the <codeph>ALTER TABLE</codeph> statement can
change the metadata for tables containing complex types (<codeph>ARRAY</codeph>,
<codeph>STRUCT</codeph>, and <codeph>MAP</codeph>).
For example, you can use an <codeph>ADD COLUMNS</codeph>, <codeph>DROP COLUMN</codeph>, or <codeph>CHANGE</codeph>
clause to modify the table layout for complex type columns.
Although Impala queries only work for complex type columns in Parquet tables, the complex type support in the
<codeph>ALTER TABLE</codeph> statement applies to all file formats.
For example, you can use Impala to update metadata for a staging table in a non-Parquet file format where the
data is populated by Hive. Or you can use <codeph>ALTER TABLE SET FILEFORMAT</codeph> to change the format
of an existing table to Parquet so that Impala can query it. Remember that changing the file format for a table does
not convert the data files within the table; you must prepare any Parquet data files containing complex types
outside Impala, and bring them into the table using <codeph>LOAD DATA</codeph> or updating the table's
<codeph>LOCATION</codeph> property.
See <xref href="impala_complex_types.xml#complex_types"/> for details about using complex types.
</p>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Whenever you specify partitions in an <codeph>ALTER TABLE</codeph> statement, through the <codeph>PARTITION
(<varname>partition_spec</varname>)</codeph> clause, you must include all the partitioning columns in the
specification.
</p>
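    <p>
      For example, for a hypothetical table partitioned by both <codeph>year</codeph> and
      <codeph>month</codeph>, each <codeph>PARTITION</codeph> clause must name both columns:
    </p>
<codeblock>-- The partition spec names every partitioning column, each with a constant value.
alter table sales_data add partition (year=2016, month=1);
alter table sales_data drop partition (year=2016, month=1);
</codeblock>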
<p>
Most of the <codeph>ALTER TABLE</codeph> operations work the same for internal tables (managed by Impala) as
for external tables (with data files located in arbitrary locations). The exception is renaming a table; for
an external table, the underlying data directory is not renamed or moved.
</p>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.6.0 CDH-39913 IMPALA-1878">
You can specify an <codeph>s3a://</codeph> prefix on the <codeph>LOCATION</codeph> attribute of a table or partition
to make Impala query data from the Amazon S3 filesystem. In <keyword keyref="impala26_full"/> and higher, Impala automatically
handles creating or removing the associated folders when you issue <codeph>ALTER TABLE</codeph> statements
with the <codeph>ADD PARTITION</codeph> or <codeph>DROP PARTITION</codeph> clauses.
</p>
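    <p rev="2.6.0">
      As a brief sketch (the bucket name and paths are placeholders), a table whose
      <codeph>LOCATION</codeph> uses the <codeph>s3a://</codeph> prefix might be managed like this:
    </p>
<codeblock rev="2.6.0">create table sales_s3 (amount bigint)
  partitioned by (year int, month int)
  location 's3a://example-bucket/sales';
-- In Impala 2.6 and higher, these statements also create and remove
-- the corresponding folders under the S3 location.
alter table sales_s3 add partition (year=2016, month=1);
alter table sales_s3 drop partition (year=2016, month=1);
</codeblock>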
<p conref="../shared/impala_common.xml#common/s3_ddl"/>
<p rev="1.4.0">
<b>HDFS caching (CACHED IN clause):</b>
</p>
<p rev="1.4.0">
If you specify the <codeph>CACHED IN</codeph> clause, any existing or future data files in the table
directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching
mechanism. See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details about using the HDFS
caching feature.
</p>
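    <p rev="1.4.0">
      For example (the table, partition, and pool names are placeholders), you might cache an entire
      table, cache one partition with extra replication, and later uncache the table:
    </p>
<codeblock rev="1.4.0">alter table census set cached in 'pool1';
-- WITH REPLICATION requires Impala 2.2 or higher.
alter table census partition (year=2016) set cached in 'pool1' with replication = 3;
alter table census set uncached;
</codeblock>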
<p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p>
The following sections show examples of the use cases for various <codeph>ALTER TABLE</codeph> clauses.
</p>
<p>
<b>To rename a table (RENAME TO clause):</b>
</p>
<!-- Beefing up the syntax in its original location up to, don't need to repeat it here.
<codeblock>ALTER TABLE <varname>old_name</varname> RENAME TO <varname>new_name</varname>;</codeblock>
-->
<p>
The <codeph>RENAME TO</codeph> clause lets you change the name of an existing table, and optionally which
database it is located in.
</p>
<p>
For internal tables, this operation physically renames the directory within HDFS that contains the data files;
the original directory name no longer exists. By qualifying the table names with database names, you can use
this technique to move an internal table (and its associated data directory) from one database to another.
For example:
</p>
<codeblock>create database d1;
create database d2;
create database d3;
use d1;
create table mobile (x int);
use d2;
-- Move table from another database to the current one.
alter table d1.mobile rename to mobile;
use d1;
-- Move table from one database to another.
alter table d2.mobile rename to d3.mobile;</codeblock>
<p>
      For external tables, renaming changes only the table name within the metastore database; the
      directory containing the data files is not renamed or moved.
</p>
<p>
<b>To change the physical location where Impala looks for data files associated with a table or
partition:</b>
</p>
<codeblock>ALTER TABLE <varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)] SET LOCATION '<varname>hdfs_path_of_directory</varname>';</codeblock>
<p>
The path you specify is the full HDFS path where the data files reside, or will be created. Impala does not
create any additional subdirectory named after the table. Impala does not move any data files to this new
location or change any data files that might already exist in that directory.
</p>
<p>
To set the location for a single partition, include the <codeph>PARTITION</codeph> clause. Specify all the
same partitioning columns for the table, with a constant value for each, to precisely identify the single
partition affected by the statement:
</p>
<codeblock>create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Redirect queries, INSERT, and LOAD DATA for one partition
-- to a specific different directory.
alter table p1 partition (month=1, day=1) set location '/usr/external_data/new_years_day';
</codeblock>
<note conref="../shared/impala_common.xml#common/add_partition_set_location"/>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
<b>To automatically detect new partition directories added through Hive or HDFS operations:</b>
</p>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
In <keyword keyref="impala23_full"/> and higher, the <codeph>RECOVER PARTITIONS</codeph> clause scans
a partitioned table to detect if any new partition directories were added outside of Impala,
such as by Hive <codeph>ALTER TABLE</codeph> statements or by <cmdname>hdfs dfs</cmdname>
or <cmdname>hadoop fs</cmdname> commands. The <codeph>RECOVER PARTITIONS</codeph> clause
automatically recognizes any data files present in these new directories, the same as
the <codeph>REFRESH</codeph> statement does.
</p>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
For example, here is a sequence of examples showing how you might create a partitioned table in Impala,
create new partitions through Hive, copy data files into the new partitions with the <cmdname>hdfs</cmdname>
command, and have Impala recognize the new partitions and new data:
</p>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
In Impala, create the table, and a single partition for demonstration purposes:
</p>
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
<![CDATA[
create database recover_partitions;
use recover_partitions;
create table t1 (s string) partitioned by (yy int, mm int);
insert into t1 partition (yy = 2016, mm = 1) values ('Partition exists');
show files in t1;
+---------------------------------------------------------------------+------+--------------+
| Path | Size | Partition |
+---------------------------------------------------------------------+------+--------------+
| /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt | 17B | yy=2016/mm=1 |
+---------------------------------------------------------------------+------+--------------+
quit;
]]>
</codeblock>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
In Hive, create some new partitions. In a real use case, you might create the
partitions and populate them with data as the final stages of an ETL pipeline.
</p>
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
<![CDATA[
hive> use recover_partitions;
OK
hive> alter table t1 add partition (yy = 2016, mm = 2);
OK
hive> alter table t1 add partition (yy = 2016, mm = 3);
OK
hive> quit;
]]>
</codeblock>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
For demonstration purposes, manually copy data (a single row) into these
new partitions, using manual HDFS operations:
</p>
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
<![CDATA[
$ hdfs dfs -ls /user/hive/warehouse/recover_partitions.db/t1/yy=2016/
Found 3 items
drwxr-xr-x - impala hive 0 2016-05-09 16:06 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1
drwxr-xr-x - jrussell hive 0 2016-05-09 16:14 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2
drwxr-xr-x - jrussell hive 0 2016-05-09 16:13 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
/user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2/data.txt
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
/user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3/data.txt
]]>
</codeblock>
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
<![CDATA[
hive> select * from t1;
OK
Partition exists 2016 1
Partition exists 2016 2
Partition exists 2016 3
hive> quit;
]]>
</codeblock>
<p rev="2.3.0 IMPALA-1568 CDH-36799">
In Impala, initially the partitions and data are not visible.
Running <codeph>ALTER TABLE</codeph> with the <codeph>RECOVER PARTITIONS</codeph>
clause scans the table data directory to find any new partition directories, and
the data files inside them:
</p>
<codeblock rev="2.3.0 IMPALA-1568 CDH-36799">
<![CDATA[
select * from t1;
+------------------+------+----+
| s | yy | mm |
+------------------+------+----+
| Partition exists | 2016 | 1 |
+------------------+------+----+
alter table t1 recover partitions;
select * from t1;
+------------------+------+----+
| s | yy | mm |
+------------------+------+----+
| Partition exists | 2016 | 1 |
| Partition exists | 2016 | 3 |
| Partition exists | 2016 | 2 |
+------------------+------+----+
]]>
</codeblock>
<p rev="1.2">
<b>To change the key-value pairs of the TBLPROPERTIES and SERDEPROPERTIES fields:</b>
</p>
<codeblock>ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>'[, ...]);
ALTER TABLE <varname>table_name</varname> SET SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>'[, ...]);</codeblock>
<p>
The <codeph>TBLPROPERTIES</codeph> clause is primarily a way to associate arbitrary user-specified data items
with a particular table.
</p>
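    <p>
      For example (the property name and value here are purely illustrative), you might record the
      source of a table's data as a property and inspect it later with <codeph>DESCRIBE FORMATTED</codeph>:
    </p>
<codeblock>alter table t1 set tblproperties ('source_system'='legacy_erp');
describe formatted t1;
</codeblock>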
<p>
The <codeph>SERDEPROPERTIES</codeph> clause sets up metadata defining how tables are read or written, needed
in some cases by Hive but not used extensively by Impala. You would use this clause primarily to change the
delimiter in an existing text table or partition, by setting the <codeph>'serialization.format'</codeph> and
<codeph>'field.delim'</codeph> property values to the new delimiter character:
</p>
<codeblock>-- This table begins life as pipe-separated text format.
create table change_to_csv (s1 string, s2 string) row format delimited fields terminated by '|';
-- Then we change it to a CSV table.
alter table change_to_csv set SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=',');
insert overwrite change_to_csv values ('stop','go'), ('yes','no');
!hdfs dfs -cat 'hdfs://<varname>hostname</varname>:8020/<varname>data_directory</varname>/<varname>dbname</varname>.db/change_to_csv/<varname>data_file</varname>';
stop,go
yes,no</codeblock>
<p>
Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the current values of these properties for an
existing table. See <xref href="impala_create_table.xml#create_table"/> for more details about these clauses.
See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for an example of using table properties to
fine-tune the performance-related table statistics.
</p>
<p>
<b>To manually set or update table or column statistics:</b>
</p>
<p>
Although for most tables the <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
statement is all you need to keep table and column statistics up to date for a table,
sometimes for a very large table or one that is updated frequently, the length of time to recompute
all the statistics might make it impractical to run those statements as often as needed.
As a workaround, you can use the <codeph>ALTER TABLE</codeph> statement to set table statistics
at the level of the entire table or a single partition, or column statistics at the level of
the entire table.
</p>
<p>
You can set the <codeph>numrows</codeph> value for table statistics by changing the
<codeph>TBLPROPERTIES</codeph> setting for a table or partition.
For example:
<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/>
<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/>
See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details.
</p>
<p rev="2.6.0 IMPALA-3369">
In <keyword keyref="impala26_full"/> and higher, you can use the <codeph>SET COLUMN STATS</codeph> clause
to set a specific stats value for a particular column.
</p>
<p conref="../shared/impala_common.xml#common/set_column_stats_example"/>
<p>
<b>To reorganize columns for a table:</b>
</p>
<codeblock>ALTER TABLE <varname>table_name</varname> ADD COLUMNS (<varname>column_defs</varname>);
ALTER TABLE <varname>table_name</varname> REPLACE COLUMNS (<varname>column_defs</varname>);
ALTER TABLE <varname>table_name</varname> CHANGE <varname>column_name</varname> <varname>new_name</varname> <varname>new_type</varname>;
ALTER TABLE <varname>table_name</varname> DROP <varname>column_name</varname>;</codeblock>
<p>
The <varname>column_spec</varname> is the same as in the <codeph>CREATE TABLE</codeph> statement: the column
name, then its data type, then an optional comment. You can add multiple columns at a time. The parentheses
are required whether you add a single column or multiple columns. When you replace columns, all the original
column definitions are discarded. You might use this technique if you receive a new set of data files with
different data types or columns in a different order. (The data files are retained, so if the new columns are
incompatible with the old ones, use <codeph>INSERT OVERWRITE</codeph> or <codeph>LOAD DATA OVERWRITE</codeph>
to replace all the data before issuing any further queries.)
</p>
<p rev="CDH-37178">
For example, here is how you might add columns to an existing table.
The first <codeph>ALTER TABLE</codeph> adds two new columns, and the second
<codeph>ALTER TABLE</codeph> adds one new column.
A single Impala query reads both the old and new data files, containing different numbers of columns.
For any columns not present in a particular data file, all the column values are
considered to be <codeph>NULL</codeph>.
</p>
<codeblock rev="CDH-37178">
create table t1 (x int);
insert into t1 values (1), (2);
alter table t1 add columns (s string, t timestamp);
insert into t1 values (3, 'three', now());
alter table t1 add columns (b boolean);
insert into t1 values (4, 'four', now(), true);
select * from t1 order by x;
+---+-------+-------------------------------+------+
| x | s | t | b |
+---+-------+-------------------------------+------+
| 1 | NULL | NULL | NULL |
| 2 | NULL | NULL | NULL |
| 3 | three | 2016-05-11 11:19:45.054457000 | NULL |
| 4 | four | 2016-05-11 11:20:20.260733000 | true |
+---+-------+-------------------------------+------+
</codeblock>
<p>
You might use the <codeph>CHANGE</codeph> clause to rename a single column, or to treat an existing column as
a different type than before, such as to switch between treating a column as <codeph>STRING</codeph> and
<codeph>TIMESTAMP</codeph>, or between <codeph>INT</codeph> and <codeph>BIGINT</codeph>. You can only drop a
single column at a time; to drop multiple columns, issue multiple <codeph>ALTER TABLE</codeph> statements, or
define the new set of columns with a single <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> statement.
</p>
<p rev="CDH-37178">
      The following examples show some safe operations to drop or change columns. Dropping the final column
      in a table lets Impala ignore the data for that column without causing any disruption to existing data files. Changing the type
of a column works if existing data values can be safely converted to the new type. The type conversion
rules depend on the file format of the underlying table. For example, in a text table, the same value
can be interpreted as a <codeph>STRING</codeph> or a numeric value, while in a binary format such as
Parquet, the rules are stricter and type conversions only work between certain sizes of integers.
</p>
<codeblock rev="CDH-37178">
create table optional_columns (x int, y int, z int, a1 int, a2 int);
insert into optional_columns values (1,2,3,0,0), (2,3,4,100,100);
-- When the last column in the table is dropped, Impala ignores the
-- values that are no longer needed. (Dropping A1 but leaving A2
-- would cause problems, as we will see in a subsequent example.)
alter table optional_columns drop column a2;
alter table optional_columns drop column a1;
select * from optional_columns;
+---+---+---+
| x | y | z |
+---+---+---+
| 1 | 2 | 3 |
| 2 | 3 | 4 |
+---+---+---+
</codeblock>
<codeblock rev="CDH-37178">
create table int_to_string (s string, x int);
insert into int_to_string values ('one', 1), ('two', 2);
-- What was an INT column will now be interpreted as STRING.
-- This technique works for text tables but not other file formats.
-- The second X represents the new name of the column, which we keep the same.
alter table int_to_string change x x string;
-- Once the type is changed, we can insert non-integer values into the X column
-- and treat that column as a string, for example by uppercasing or concatenating.
insert into int_to_string values ('three', 'trois');
select s, upper(x) from int_to_string;
+-------+----------+
| s | upper(x) |
+-------+----------+
| one | 1 |
| two | 2 |
| three | TROIS |
+-------+----------+
</codeblock>
<p rev="CDH-37178">
Remember that Impala does not actually do any conversion for the underlying data files as a result of
<codeph>ALTER TABLE</codeph> statements. If you use <codeph>ALTER TABLE</codeph> to create a table
layout that does not agree with the contents of the underlying files, you must replace the files
yourself, such as using <codeph>LOAD DATA</codeph> to load a new set of data files, or
<codeph>INSERT OVERWRITE</codeph> to copy from another table and replace the original data.
</p>
<p rev="CDH-37178">
The following example shows what happens if you delete the middle column from a Parquet table containing three columns.
The underlying data files still contain three columns of data. Because the columns are interpreted based on their positions in
the data file instead of the specific column names, a <codeph>SELECT *</codeph> query now reads the first and second
columns from the data file, potentially leading to unexpected results or conversion errors.
For this reason, if you expect to someday drop a column, declare it as the last column in the table, where its data
can be ignored by queries after the column is dropped. Or, re-run your ETL process and create new data files
if you drop or change the type of a column in a way that causes problems with existing data files.
</p>
<codeblock rev="CDH-37178">
-- Parquet table showing how dropping a column can produce unexpected results.
create table p1 (s1 string, s2 string, s3 string) stored as parquet;
insert into p1 values ('one', 'un', 'uno'), ('two', 'deux', 'dos'),
('three', 'trois', 'tres');
select * from p1;
+-------+-------+------+
| s1 | s2 | s3 |
+-------+-------+------+
| one | un | uno |
| two | deux | dos |
| three | trois | tres |
+-------+-------+------+
alter table p1 drop column s2;
-- The S3 column contains unexpected results.
-- Because S2 and S3 have compatible types, the query reads
-- values from the dropped S2, because the existing data files
-- still contain those values as the second column.
select * from p1;
+-------+-------+
| s1 | s3 |
+-------+-------+
| one | un |
| two | deux |
| three | trois |
+-------+-------+
</codeblock>
<codeblock rev="CDH-37178">
-- Parquet table showing how dropping a column can produce conversion errors.
create table p2 (s1 string, x int, s3 string) stored as parquet;
insert into p2 values ('one', 1, 'uno'), ('two', 2, 'dos'), ('three', 3, 'tres');
select * from p2;
+-------+---+------+
| s1 | x | s3 |
+-------+---+------+
| one | 1 | uno |
| two | 2 | dos |
| three | 3 | tres |
+-------+---+------+
alter table p2 drop column x;
select * from p2;
WARNINGS:
File '<varname>hdfs_filename</varname>' has an incompatible Parquet schema for column 'add_columns.p2.s3'.
Column type: STRING, Parquet schema:
optional int32 x [i:1 d:1 r:0]
File '<varname>hdfs_filename</varname>' has an incompatible Parquet schema for column 'add_columns.p2.s3'.
Column type: STRING, Parquet schema:
optional int32 x [i:1 d:1 r:0]
</codeblock>
<p rev="IMPALA-3092">
In <keyword keyref="impala26_full"/> and higher, if an Avro table is created without column definitions in the
<codeph>CREATE TABLE</codeph> statement, and columns are later
added through <codeph>ALTER TABLE</codeph>, the resulting
table is now queryable. Missing values from the newly added
columns now default to <codeph>NULL</codeph>.
</p>
<p>
<b>To change the file format that Impala expects data to be in, for a table or partition:</b>
</p>
<p>
Use an <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> clause. You can include an optional <codeph>PARTITION
(<varname>col1</varname>=<varname>val1</varname>, <varname>col2</varname>=<varname>val2</varname>,
      ...)</codeph> clause so that the file format is changed for a specific partition rather than the entire table.
</p>
<p>
Because this operation only changes the table metadata, you must do any conversion of existing data using
regular Hadoop techniques outside of Impala. Any new data created by the Impala <codeph>INSERT</codeph>
statement will be in the new format. You cannot specify the delimiter for Text files; the data files must be
comma-delimited.
<!-- Although Impala can read Avro tables
created through Hive, you cannot specify the Avro file format in an Impala
<codeph>ALTER TABLE</codeph> statement. -->
</p>
<p>
To set the file format for a single partition, include the <codeph>PARTITION</codeph> clause. Specify all the
same partitioning columns for the table, with a constant value for each, to precisely identify the single
partition affected by the statement:
</p>
<codeblock>create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Queries and INSERT statements will read and write files
-- in this format for this specific partition.
alter table p1 partition (month=2, day=2) set fileformat parquet;
</codeblock>
<p>
<b>To add or drop partitions for a table</b>, the table must already be partitioned (that is, created with a
<codeph>PARTITIONED BY</codeph> clause). The partition is a physical directory in HDFS, with a name that
encodes a particular column value (the <b>partition key</b>). The Impala <codeph>INSERT</codeph> statement
already creates the partition if necessary, so the <codeph>ALTER TABLE ... ADD PARTITION</codeph> is
primarily useful for importing data by moving or copying existing data files into the HDFS directory
corresponding to a partition. (You can use the <codeph>LOAD DATA</codeph> statement to move files into the
partition directory, or <codeph>ALTER TABLE ... PARTITION (...) SET LOCATION</codeph> to point a partition at
      a directory that already contains data files.)
</p>
<p>
The <codeph>DROP PARTITION</codeph> clause is used to remove the HDFS directory and associated data files for
      a particular set of partition key values; for example, if you always analyze the last 3 months' worth of data,
at the beginning of each month you might drop the oldest partition that is no longer needed. Removing
partitions reduces the amount of metadata associated with the table and the complexity of calculating the
optimal query plan, which can simplify and speed up queries on partitioned tables, particularly join queries.
Here is an example showing the <codeph>ADD PARTITION</codeph> and <codeph>DROP PARTITION</codeph> clauses.
</p>
<p>
To avoid errors while adding or dropping partitions whose existence is not certain,
add the optional <codeph>IF [NOT] EXISTS</codeph> clause between the <codeph>ADD</codeph> or
<codeph>DROP</codeph> keyword and the <codeph>PARTITION</codeph> keyword. That is, the entire
clause becomes <codeph>ADD IF NOT EXISTS PARTITION</codeph> or <codeph>DROP IF EXISTS PARTITION</codeph>.
The following example shows how partitions can be created automatically through <codeph>INSERT</codeph>
statements, or manually through <codeph>ALTER TABLE</codeph> statements. The <codeph>IF [NOT] EXISTS</codeph>
clauses let the <codeph>ALTER TABLE</codeph> statements succeed even if a new requested partition already
exists, or a partition to be dropped does not exist.
</p>
<p>
Inserting 2 year values creates 2 partitions:
</p>
<codeblock>
create table partition_t (s string) partitioned by (y int);
insert into partition_t (s,y) values ('two thousand',2000), ('nineteen ninety',1990);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2000  | -1    | 1      | 13B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 2      | 29B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
</codeblock>
<p>
Without the <codeph>IF NOT EXISTS</codeph> clause, an attempt to add a new partition might fail:
</p>
<codeblock>
alter table partition_t add partition (y=2000);
ERROR: AnalysisException: Partition spec already exists: (y=2000).
</codeblock>
<p>
The <codeph>IF NOT EXISTS</codeph> clause makes the statement succeed whether or not there was already a
partition with the specified key value:
</p>
<codeblock>
alter table partition_t add if not exists partition (y=2000);
alter table partition_t add if not exists partition (y=2010);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2000  | -1    | 1      | 13B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2010  | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 2      | 29B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
</codeblock>
<p>
Likewise, the <codeph>IF EXISTS</codeph> clause lets <codeph>DROP PARTITION</codeph> succeed whether or not the partition is already
in the table:
</p>
<codeblock>
alter table partition_t drop if exists partition (y=2000);
alter table partition_t drop if exists partition (y=1950);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2010  | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 1      | 16B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
</codeblock>
<p rev="2.3.0"> The optional <codeph>PURGE</codeph> keyword, available in
<keyword keyref="impala23_full"/> and higher, is used with the <codeph>DROP
PARTITION</codeph> clause to remove associated HDFS data files
immediately rather than going through the HDFS trashcan mechanism. Use
this keyword when dropping a partition if it is crucial to remove the data
as quickly as possible to free up space, or if there is a problem with the
      trashcan, such as the trashcan not being configured or being in a
different HDFS encryption zone than the data files. </p>
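    <p rev="2.3.0">
      For example, continuing with the <codeph>partition_t</codeph> table shown above, the following
      statement removes a partition and bypasses the HDFS trashcan so the space is reclaimed immediately:
    </p>
<codeblock rev="2.3.0">alter table partition_t drop partition (y=2010) purge;
</codeblock>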
<!--
To do: Make example more general by partitioning by year/month/day.
Then could show inserting into fixed year, variable month and day;
dropping particular year/month/day partition.
-->
<codeblock>-- Create an empty table and define the partitioning scheme.
create table part_t (x int) partitioned by (month int);
-- Create an empty partition into which you could copy data files from some other source.
alter table part_t add partition (month=1);
-- After changing the underlying data, issue a REFRESH statement to make the data visible in Impala.
refresh part_t;
-- Later, do the same for the next month.
alter table part_t add partition (month=2);
-- Now you no longer need the older data.
alter table part_t drop partition (month=1);
-- If the table was partitioned by month and year, you would issue a statement like:
-- alter table part_t drop partition (year=2003,month=1);
-- which would require 12 ALTER TABLE statements to remove a year's worth of data.
-- If the data files for subsequent months were in a different file format,
-- you could set a different file format for the new partition as you create it.
alter table part_t add partition (month=3) set fileformat=parquet;
</codeblock>
<p>
The value specified for a partition key can be an arbitrary constant expression, without any references to
columns. For example:
</p>
<codeblock>alter table time_data add partition (month=concat('Decem','ber'));
alter table sales_data add partition (zipcode = cast(9021 * 10 as string));</codeblock>
<note>
<p>
An alternative way to reorganize a table and its associated data files is to use <codeph>CREATE
TABLE</codeph> to create a variation of the original table, then use <codeph>INSERT</codeph> to copy the
transformed or reordered data to the new table. The advantage of <codeph>ALTER TABLE</codeph> is that it
avoids making a duplicate copy of the data files, allowing you to reorganize huge volumes of data in a
space-efficient way using familiar Hadoop techniques.
</p>
</note>
<p>
<b>To switch a table between internal and external:</b>
</p>
<p conref="../shared/impala_common.xml#common/switch_internal_external_table"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
Most <codeph>ALTER TABLE</codeph> clauses do not actually
read or write any HDFS files, and so do not depend on
specific HDFS permissions. For example, the <codeph>SET FILEFORMAT</codeph>
      clause does not actually check the file format of existing data files or
convert them to the new format, and the <codeph>SET LOCATION</codeph> clause
does not require any special permissions on the new location.
(Any permission-related failures would come later, when you
actually query or insert into the table.)
</p>
<!-- Haven't rigorously tested all the assertions in the following paragraph. -->
<!-- Most testing so far has been around RENAME TO clause. -->
<p>
In general, <codeph>ALTER TABLE</codeph> clauses that do touch
HDFS files and directories require the same HDFS permissions
as corresponding <codeph>CREATE</codeph>, <codeph>INSERT</codeph>,
or <codeph>SELECT</codeph> statements.
The permissions allow
the user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, to read or write
files or directories, or (in the case of the execute bit) descend into a directory.
The <codeph>RENAME TO</codeph> clause requires read, write, and execute permission in the
source and destination database directories and in the table data directory,
and read and write permission for the data files within the table.
The <codeph>ADD PARTITION</codeph> and <codeph>DROP PARTITION</codeph> clauses
require write and execute permissions for the associated partition directory.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_tables.xml#tables"/>,
<xref href="impala_create_table.xml#create_table"/>, <xref href="impala_drop_table.xml#drop_table"/>,
<xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
<xref href="impala_tables.xml#external_tables"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,86 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="alter_view">
<title>ALTER VIEW Statement</title>
<titlealts audience="PDF"><navtitle>ALTER VIEW</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Views"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">ALTER VIEW statement</indexterm>
Changes the characteristics of a view. The syntax has two forms:
</p>
<ul>
<li>
The <codeph>AS</codeph> clause associates the view with a different query.
</li>
<li>
The <codeph>RENAME TO</codeph> clause changes the name of the view, moves the view to
a different database, or both.
</li>
</ul>
<p>
Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
<codeph>ALTER VIEW</codeph> only involves changes to metadata in the metastore database, not any data files
in HDFS.
</p>
<!-- View _permissions_ don't rely on underlying table. -->
<!-- Could use views to grant access only to certain columns. -->
<!-- Treated like a table for authorization. -->
<!-- ALTER VIEW that queries another view - possibly a runtime error. -->
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>ALTER VIEW [<varname>database_name</varname>.]<varname>view_name</varname> AS <varname>select_statement</varname>
ALTER VIEW [<varname>database_name</varname>.]<varname>view_name</varname> RENAME TO [<varname>database_name</varname>.]<varname>view_name</varname></codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>create table t1 (x int, y int, s string);
create table t2 like t1;
create view v1 as select * from t1;
alter view v1 as select * from t2;
alter view v1 as select x, upper(s) s from t2;</codeblock>
<!-- Repeat the same blurb + example to see the definition of a view, as in CREATE VIEW. -->
<p conref="../shared/impala_common.xml#common/describe_formatted_view"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_views.xml#views"/>, <xref href="impala_create_view.xml#create_view"/>,
<xref href="impala_drop_view.xml#drop_view"/>
</p>
</conbody>
</concept>

File diff suppressed because it is too large

View File

@@ -0,0 +1,81 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="appx_count_distinct">
<title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.0.0">
<indexterm audience="Cloudera">APPX_COUNT_DISTINCT query option</indexterm>
Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
approximate rather than precise.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> lets you work around the restriction
where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
column. By default, you can count the distinct values of one column or another, but not both in a single
query:
</p>
<codeblock>[localhost:21000] &gt; select count(distinct x) from int_t;
+-------------------+
| count(distinct x) |
+-------------------+
| 10 |
+-------------------+
[localhost:21000] &gt; select count(distinct property) from int_t;
+--------------------------+
| count(distinct property) |
+--------------------------+
| 7 |
+--------------------------+
[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
as count(DISTINCT x); deviating function: count(DISTINCT property)
</codeblock>
<p>
When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple
<codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each
<codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
which provides an approximate result rather than a precise count.
</p>
<codeblock>[localhost:21000] &gt; set APPX_COUNT_DISTINCT=true;
[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
+-------------------+--------------------------+
| count(distinct x) | count(distinct property) |
+-------------------+--------------------------+
| 10 | 7 |
+-------------------+--------------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_count.xml#count"/>,
<xref href="impala_distinct.xml#distinct"/>,
<xref href="impala_ndv.xml#ndv"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,124 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2.1" id="appx_median">
<title>APPX_MEDIAN Function</title>
<titlealts audience="PDF"><navtitle>APPX_MEDIAN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">appx_median() function</indexterm>
An aggregate function that returns a value that is approximately the median (midpoint) of the set of
input values.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>APPX_MEDIAN([DISTINCT | ALL] <varname>expression</varname>)
</codeblock>
<p>
This function works with any input type, because the only requirement is that the type supports less-than and
greater-than comparison operators.
</p>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Because the return value represents the estimated midpoint, it might not reflect the precise midpoint value,
especially if the cardinality of the input values is very high. If the cardinality is low (up to
approximately 20,000), the result is more accurate because the sampling considers all or almost all of the
different values.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same_except_string"/>
<p>
The return value is always the same as one of the input values, not an <q>in-between</q> value produced by
averaging.
</p>
<!-- <p conref="../shared/impala_common.xml#common/restrictions_sliding_window"/> -->
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following example uses a table of a million random floating-point numbers ranging up to approximately
50,000. The average is approximately 25,000. Because of the random distribution, we would expect the median
to be close to this same number. Computing the precise median is a more intensive operation than computing
the average, because it requires keeping track of every distinct value and how many times each occurs. The
<codeph>APPX_MEDIAN()</codeph> function uses a sampling algorithm to return an approximate result, which in
this case is close to the expected value. To make sure that the value is not substantially out of range due
to a skewed distribution, subsequent queries confirm that there are approximately 500,000 values higher than
the <codeph>APPX_MEDIAN()</codeph> value, and approximately 500,000 values lower than the
<codeph>APPX_MEDIAN()</codeph> value.
</p>
<codeblock>[localhost:21000] &gt; select min(x), max(x), avg(x) from million_numbers;
+-------------------+-------------------+-------------------+
| min(x) | max(x) | avg(x) |
+-------------------+-------------------+-------------------+
| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
+-------------------+-------------------+-------------------+
[localhost:21000] &gt; select appx_median(x) from million_numbers;
+----------------+
| appx_median(x) |
+----------------+
| 24721.6 |
+----------------+
[localhost:21000] &gt; select count(x) as higher from million_numbers where x &gt; (select appx_median(x) from million_numbers);
+--------+
| higher |
+--------+
| 502013 |
+--------+
[localhost:21000] &gt; select count(x) as lower from million_numbers where x &lt; (select appx_median(x) from million_numbers);
+--------+
| lower |
+--------+
| 497987 |
+--------+
</codeblock>
<p>
The following example computes the approximate median using a subset of the values from the table, and then
confirms that the result is a reasonable estimate for the midpoint.
</p>
<codeblock>[localhost:21000] &gt; select appx_median(x) from million_numbers where x between 1000 and 5000;
+-------------------+
| appx_median(x) |
+-------------------+
| 3013.107787358159 |
+-------------------+
[localhost:21000] &gt; select count(x) as higher from million_numbers where x between 1000 and 5000 and x &gt; 3013.107787358159;
+--------+
| higher |
+--------+
| 37692 |
+--------+
[localhost:21000] &gt; select count(x) as lower from million_numbers where x between 1000 and 5000 and x &lt; 3013.107787358159;
+-------+
| lower |
+-------+
| 37089 |
+-------+
</codeblock>
</conbody>
</concept>


@@ -0,0 +1,269 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="array">
<title>ARRAY Complex Type (<keyword keyref="impala23"/> or higher only)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
A complex data type that can represent an arbitrary number of ordered elements.
The elements can be scalars or another complex type (<codeph>ARRAY</codeph>,
<codeph>STRUCT</codeph>, or <codeph>MAP</codeph>).
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<!-- To do: make sure there is sufficient syntax info under the SELECT statement to understand how to query all the complex types. -->
<codeblock><varname>column_name</varname> ARRAY &lt; <varname>type</varname> &gt;
type ::= <varname>primitive_type</varname> | <varname>complex_type</varname>
</codeblock>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_combo"/>
<p>
The elements of the array have no names. You refer to the value of the array item using the
<codeph>ITEM</codeph> pseudocolumn, or its position in the array with the <codeph>POS</codeph>
pseudocolumn. See <xref href="impala_complex_types.xml#item"/> for information about
these pseudocolumns.
</p>
<!-- Array is a frequently used idiom; don't recommend MAP right up front, since that is more rarely used. STRUCT has all different considerations.
<p>
If it would be logical to have a fixed number of elements and give each one a name, consider using a
<codeph>MAP</codeph> (when all elements are of the same type) or a <codeph>STRUCT</codeph> (if different
elements have different types) instead of an <codeph>ARRAY</codeph>.
</p>
-->
<p>
Each row can have a different number of elements (including none) in the array for that row.
</p>
<!-- Since you don't use numeric indexes, this assertion and advice doesn't make sense.
<p>
If you attempt to refer to a non-existent array element, the result is <codeph>NULL</codeph>. Therefore,
when using operations such as addition or string concatenation involving array elements, you might use
conditional functions to substitute default values such as 0 or <codeph>""</codeph> in the place of missing
array elements.
</p>
-->
<p>
When an array contains items of scalar types, you can use aggregation functions on the array elements without using join notation. For
example, you can find the <codeph>COUNT()</codeph>, <codeph>AVG()</codeph>, <codeph>SUM()</codeph>, and so on of numeric array
elements, or the <codeph>MAX()</codeph> and <codeph>MIN()</codeph> of any scalar array elements by referring to
<codeph><varname>table_name</varname>.<varname>array_column</varname></codeph> in the <codeph>FROM</codeph> clause of the query. When
you need to cross-reference values from the array with scalar values from the same row, such as by including a <codeph>GROUP
BY</codeph> clause to produce a separate aggregated result for each row, then the join clause is required.
</p>
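<p>
For example, assuming the ARRAY_DEMO table created later in this topic, queries along the following lines
aggregate the scalar PETS array either across the whole table or once per row. (This is a sketch to
illustrate the two styles of notation; adapt the table and column names to your own schema.)
</p>
<codeblock>-- Aggregate the array elements across the whole table, without join notation.
SELECT COUNT(item) AS total_pets, MIN(item), MAX(item) FROM array_demo.pets;
-- Cross-reference the array elements with a scalar column from the same row,
-- using join notation plus GROUP BY to get one aggregated result per row.
SELECT name, COUNT(pets.item) AS pets_per_person
FROM array_demo, array_demo.pets
GROUP BY name;
</codeblock>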
<p>
A common usage pattern with complex types is to have an array as the top-level type for the column:
an array of structs, an array of maps, or an array of arrays.
For example, you can model a denormalized table by creating a column that is an <codeph>ARRAY</codeph>
of <codeph>STRUCT</codeph> elements; each item in the array represents a row from a table that would
normally be used in a join query. This kind of data structure lets you essentially denormalize tables by
associating multiple rows from one table with the matching row in another table.
</p>
<p>
You typically do not create more than one top-level <codeph>ARRAY</codeph> column, because if there is
some relationship between the elements of multiple arrays, it is more convenient to model the data as
a single array whose elements are of another complex type (either <codeph>STRUCT</codeph> or <codeph>MAP</codeph>).
</p>
<p conref="../shared/impala_common.xml#common/complex_types_describe"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<ul conref="../shared/impala_common.xml#common/complex_types_restrictions">
<li/>
</ul>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<p>
The following example shows how to construct a table with various kinds of <codeph>ARRAY</codeph> columns,
both at the top level and nested within other complex types.
Whenever the <codeph>ARRAY</codeph> consists of a scalar value, such as in the <codeph>PETS</codeph>
column or the <codeph>CHILDREN</codeph> field, you can see that future expansion is limited.
For example, you could not easily evolve the schema to record the kind of pet or the child's birthday alongside the name.
Therefore, it is more common to use an <codeph>ARRAY</codeph> whose elements are of <codeph>STRUCT</codeph> type,
to associate multiple fields with each array element.
</p>
<note>
Practice the <codeph>CREATE TABLE</codeph> and query notation for complex type columns
using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements reliably.
</note>
<!-- To do: verify and flesh out this example. -->
<codeblock><![CDATA[CREATE TABLE array_demo
(
id BIGINT,
name STRING,
-- An ARRAY of scalar type as a top-level column.
pets ARRAY <STRING>,
-- An ARRAY with elements of complex type (STRUCT).
places_lived ARRAY < STRUCT <
place: STRING,
start_year: INT
>>,
-- An ARRAY as a field (CHILDREN) within a STRUCT.
-- (The STRUCT is inside another ARRAY, because it is rare
-- for a STRUCT to be a top-level column.)
marriages ARRAY < STRUCT <
spouse: STRING,
children: ARRAY <STRING>
>>,
-- An ARRAY as the value part of a MAP.
-- The first MAP field (the key) would be a value such as
-- 'Parent' or 'Grandparent', and the corresponding array would
-- represent 2 parents, 4 grandparents, and so on.
ancestors MAP < STRING, ARRAY <STRING> >
)
STORED AS PARQUET;
]]>
</codeblock>
<p>
The following example shows how to examine the structure of a table containing one or more <codeph>ARRAY</codeph> columns by using the
<codeph>DESCRIBE</codeph> statement. You can visualize each <codeph>ARRAY</codeph> as its own two-column table, with columns
<codeph>ITEM</codeph> and <codeph>POS</codeph>.
</p>
<!-- To do: extend the examples to include MARRIAGES and ANCESTORS columns, or get rid of those columns. -->
<codeblock><![CDATA[DESCRIBE array_demo;
+--------------+---------------------------+
| name | type |
+--------------+---------------------------+
| id | bigint |
| name | string |
| pets | array<string> |
| marriages | array<struct< |
| | spouse:string, |
| | children:array<string> |
| | >> |
| places_lived | array<struct< |
| | place:string, |
| | start_year:int |
| | >> |
| ancestors | map<string,array<string>> |
+--------------+---------------------------+
DESCRIBE array_demo.pets;
+------+--------+
| name | type |
+------+--------+
| item | string |
| pos | bigint |
+------+--------+
DESCRIBE array_demo.marriages;
+------+--------------------------+
| name | type |
+------+--------------------------+
| item | struct< |
| | spouse:string, |
| | children:array<string> |
| | > |
| pos | bigint |
+------+--------------------------+
DESCRIBE array_demo.places_lived;
+------+------------------+
| name | type |
+------+------------------+
| item | struct< |
| | place:string, |
| | start_year:int |
| | > |
| pos | bigint |
+------+------------------+
DESCRIBE array_demo.ancestors;
+-------+---------------+
| name | type |
+-------+---------------+
| key | string |
| value | array<string> |
+-------+---------------+
]]>
</codeblock>
<p>
The following example shows queries involving <codeph>ARRAY</codeph> columns containing elements of scalar or complex types. You
<q>unpack</q> each <codeph>ARRAY</codeph> column by referring to it in a join query, as if it were a separate table with
<codeph>ITEM</codeph> and <codeph>POS</codeph> columns. If the array element is a scalar type, you refer to its value using the
<codeph>ITEM</codeph> pseudocolumn. If the array element is a <codeph>STRUCT</codeph>, you refer to the <codeph>STRUCT</codeph> fields
using dot notation and the field names. If the array element is another <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>, you use
another level of join to unpack the nested collection elements.
</p>
<!-- To do: have some sample output to show for these queries. -->
<codeblock><![CDATA[-- Array of scalar values.
-- Each array element represents a single string, plus we know its position in the array.
SELECT id, name, pets.pos, pets.item FROM array_demo, array_demo.pets;
-- Array of structs.
-- Now each array element has named fields, possibly of different types.
-- You can consider an ARRAY of STRUCT to represent a table inside another table.
SELECT id, name, places_lived.pos, places_lived.item.place, places_lived.item.start_year
FROM array_demo, array_demo.places_lived;
-- The .ITEM name is optional for array elements that are structs.
-- The following query is equivalent to the previous one, with .ITEM
-- removed from the column references.
SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
FROM array_demo, array_demo.places_lived;
-- To filter specific items from the array, do comparisons against the .POS or .ITEM
-- pseudocolumns, or names of struct fields, in the WHERE clause.
SELECT id, name, pets.item FROM array_demo, array_demo.pets
WHERE pets.pos in (0, 1, 3);
SELECT id, name, pets.item FROM array_demo, array_demo.pets
WHERE pets.item LIKE 'Mr. %';
SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
FROM array_demo, array_demo.places_lived
WHERE places_lived.place like '%California%';
]]>
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_complex_types.xml#complex_types"/>,
<!-- <xref href="impala_array.xml#array"/>, -->
<xref href="impala_struct.xml#struct"/>, <xref href="impala_map.xml#map"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,260 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="auditing">
<title>Auditing Impala Operations</title>
<titlealts audience="PDF"><navtitle>Auditing</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Auditing"/>
<data name="Category" value="Governance"/>
<data name="Category" value="Navigator"/>
<data name="Category" value="Security"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p>
To monitor how Impala data is being used within your organization, ensure that your Impala authorization and
authentication policies are effective, and detect attempts at intrusion or unauthorized access to Impala
data, you can use the auditing feature in Impala 1.2.1 and higher:
</p>
<ul>
<li>
Enable auditing by including the option <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
in your <cmdname>impalad</cmdname> startup options for a cluster not managed by Cloudera Manager, or
<xref audience="integrated" href="cn_iu_audit_log.xml#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d6f/section_v25_lmy_bn">configuring Impala Daemon logging in Cloudera Manager</xref><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cn_iu_service_audit.html" scope="external" format="html">configuring Impala Daemon logging in Cloudera Manager</xref>.
The log directory must be a local directory on the
server, not an HDFS directory.
</li>
<li>
Decide how many queries will be represented in each log file. By default, Impala starts a new log file
every 5000 queries. To specify a different number,
<ph audience="standalone">include the option
<codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph> in the
<cmdname>impalad</cmdname> startup options</ph><xref
href="cn_iu_audit_log.xml#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d6f/section_v25_lmy_bn"
audience="integrated">configure Impala Daemon logging in Cloudera Manager</xref>.
</li>
<li> Configure Cloudera Navigator to collect and consolidate the audit
logs from all the hosts in the cluster. </li>
<li>
Use Cloudera Navigator or Cloudera Manager to filter, visualize, and produce reports based on the audit
data. (The Impala auditing feature works with Cloudera Manager 4.7 to 5.1 and Cloudera Navigator 2.1 and
higher.) Check the audit data to ensure that all activity is authorized and detect attempts at
unauthorized access.
</li>
</ul>
<p outputclass="toc inpage"/>
</conbody>
<concept id="auditing_performance">
<title>Durability and Performance Considerations for Impala Auditing</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
</metadata>
</prolog>
<conbody>
<p>
The auditing feature only imposes performance overhead while auditing is enabled.
</p>
<p>
Because any Impala host can process a query, enable auditing on all hosts where the
<ph audience="standalone"><cmdname>impalad</cmdname> daemon</ph>
<ph audience="integrated">Impala Daemon role</ph> runs. Each host stores its own log
files, in a directory in the local filesystem. The log data is periodically flushed to disk (through an
<codeph>fsync()</codeph> system call) to avoid loss of audit data in case of a crash.
</p>
<p> The runtime overhead of auditing applies to whichever host serves as the coordinator for the query, that is, the host you connect to when you issue the query. This might be the same host for all queries, or different applications or users might connect to and issue queries through different hosts. </p>
<p> To avoid excessive I/O overhead on busy coordinator hosts, Impala syncs the audit log data (using the <codeph>fsync()</codeph> system call) periodically rather than after every query. Currently, the <codeph>fsync()</codeph> calls are issued at a fixed interval, every 5 seconds. </p>
<p>
By default, Impala avoids losing any audit log data in the case of an error during a logging operation
(such as a disk full error), by immediately shutting down
<cmdname audience="standalone">impalad</cmdname><ph audience="integrated">the Impala
Daemon role</ph> on the host where the auditing problem occurred.
<ph audience="standalone">You can override this setting by specifying the option
<codeph>-abort_on_failed_audit_event=false</codeph> in the <cmdname>impalad</cmdname> startup options.</ph>
</p>
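<p>
For example, on a cluster not managed by Cloudera Manager, the auditing-related startup options described
in this topic might be combined along the following lines. (This is only a sketch; the log directory path
is a placeholder, and it must be a local directory that exists on each host.)
</p>
<codeblock>impalad -audit_event_log_dir=/var/log/impalad/audit \
  -max_audit_event_log_file_size=5000 \
  -abort_on_failed_audit_event=false \
  <varname>other_startup_options</varname>
</codeblock>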
</conbody>
</concept>
<concept id="auditing_format">
<title>Format of the Audit Log Files</title>
<prolog>
<metadata>
<data name="Category" value="Logs"/>
</metadata>
</prolog>
<conbody>
<p> The audit log files represent the query information in JSON format, one query per line. Typically, rather than looking at the log files themselves, you use the Cloudera Navigator product to consolidate the log data from all Impala hosts and filter and visualize the results in useful ways. (If you do examine the raw log data, you might run the files through a JSON pretty-printer first.) </p>
<p>
All the information about schema objects accessed by the query is encoded in a single nested record on the
same line. For example, the audit log for an <codeph>INSERT ... SELECT</codeph> statement records that a
select operation occurs on the source table and an insert operation occurs on the destination table. The
audit log for a query against a view records the base table accessed by the view, or multiple base tables
in the case of a view that includes a join query. Every Impala operation that corresponds to a SQL
statement is recorded in the audit logs, whether the operation succeeds or fails. Impala records more
information for a successful operation than for a failed one, because an unauthorized query is stopped
immediately, before all the query planning is completed.
</p>
<!-- Opportunity to conref at the phrase level here... the content of this paragraph is the same as part
of a list bullet earlier on. -->
<p>
The information logged for each query includes:
</p>
<ul>
<li>
Client session state:
<ul>
<li>
Session ID
</li>
<li>
User name
</li>
<li>
Network address of the client connection
</li>
</ul>
</li>
<li>
SQL statement details:
<ul>
<li>
Query ID
</li>
<li>
Statement Type - DML, DDL, and so on
</li>
<li>
SQL statement text
</li>
<li>
Execution start time, in local time
</li>
<li>
Execution Status - Details on any errors that were encountered
</li>
<li>
Target Catalog Objects:
<ul>
<li>
Object Type - Table, View, or Database
</li>
<li>
Fully qualified object name
</li>
<li>
Privilege - How the object is being used (<codeph>SELECT</codeph>, <codeph>INSERT</codeph>,
<codeph>CREATE</codeph>, and so on)
</li>
</ul>
</li>
</ul>
</li>
</ul>
<!-- Delegating actual examples to the Cloudera Navigator doc for the moment.
<p>
Here is an excerpt from a sample audit log file:
</p>
<codeblock></codeblock>
-->
</conbody>
</concept>
<concept id="auditing_exceptions">
<title>Which Operations Are Audited</title>
<conbody>
<p>
The kinds of SQL queries represented in the audit log are:
</p>
<ul>
<li>
Queries that are prevented due to lack of authorization.
</li>
<li>
Queries that Impala can analyze and parse to determine that they are authorized. The audit data is
recorded immediately after Impala finishes its analysis, before the query is actually executed.
</li>
</ul>
<p>
The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a
query that fails due to a syntax error is not recorded in the audit log. The audit log also does not
contain queries that fail due to a reference to a table that does not exist, if you would be authorized to
access the table if it did exist.
</p>
<p>
Certain statements in the <cmdname>impala-shell</cmdname> interpreter, such as <codeph>CONNECT</codeph>,
<codeph rev="1.4.0">SUMMARY</codeph>, <codeph>PROFILE</codeph>, <codeph>SET</codeph>, and
<codeph>QUIT</codeph>, do not correspond to actual SQL queries, and these statements are not reflected in
the audit log.
</p>
</conbody>
</concept>
<concept id="auditing_reviewing">
<title>Reviewing the Audit Logs</title>
<prolog>
<metadata>
<data name="Category" value="Logs"/>
</metadata>
</prolog>
<conbody>
<p>
You typically do not review the audit logs in raw form. The Cloudera Manager Agent periodically transfers
the log information into a back-end database where it can be examined in consolidated form. See
<ph audience="standalone">the <xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/Navigator/latest/Cloudera-Navigator-Installation-and-User-Guide/Cloudera-Navigator-Installation-and-User-Guide.html"
scope="external" format="html">Cloudera Navigator documentation</xref> for details</ph>
<xref href="cn_iu_audits.xml#cn_topic_7" audience="integrated" />.
</p>
</conbody>
</concept>
</concept>


@@ -0,0 +1,39 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="authentication">
<title>Impala Authentication</title>
<prolog>
<metadata>
<data name="Category" value="Security"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Authentication"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p>
Authentication is the mechanism to ensure that only specified hosts and users can connect to Impala. It also
verifies that when clients connect to Impala, they are connected to a legitimate server. This feature
prevents spoofing such as <term>impersonation</term> (setting up a phony client system with the same account
and group names as a legitimate user) and <term>man-in-the-middle attacks</term> (intercepting application
requests before they reach Impala and eavesdropping on sensitive information in the requests or the results).
</p>
<p>
Impala supports authentication using either Kerberos or LDAP.
</p>
<note conref="../shared/impala_common.xml#common/authentication_vs_authorization"/>
<p outputclass="toc"/>
<p>
Once you are finished setting up authentication, move on to authorization, which involves specifying what
databases, tables, HDFS directories, and so on can be accessed by particular users when they connect through
Impala. See <xref href="impala_authorization.xml#authorization"/> for details.
</p>
</conbody>
</concept>

File diff suppressed because it is too large.

docs/topics/impala_avg.xml

@@ -0,0 +1,225 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="avg">
<title>AVG Function</title>
<titlealts audience="PDF"><navtitle>AVG</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Analytic Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">avg() function</indexterm>
An aggregate function that returns the average value from a set of numbers or <codeph>TIMESTAMP</codeph> values.
Its single argument can be a numeric column, or the numeric result of a function or expression applied to the
column value. Rows with a <codeph>NULL</codeph> value for the specified column are ignored. If the table is empty,
or all the values supplied to <codeph>AVG</codeph> are <codeph>NULL</codeph>, <codeph>AVG</codeph> returns
<codeph>NULL</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>AVG([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]
</codeblock>
<p>
When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of
grouping values.
</p>
<p>
<b>Return type:</b> <codeph>DOUBLE</codeph> for numeric values; <codeph>TIMESTAMP</codeph> for
<codeph>TIMESTAMP</codeph> values
</p>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>
<p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>-- Average all the non-NULL values in a column.
insert overwrite avg_t values (2),(4),(6),(null),(null);
-- The average of the above values is 4: (2+4+6) / 3. The 2 NULL values are ignored.
select avg(x) from avg_t;
-- Average only certain values from the column.
select avg(x) from t1 where month = 'January' and year = '2013';
-- Apply a calculation to the value of the column before averaging.
select avg(x/3) from t1;
-- Apply a function to the value of the column before averaging.
-- Here we are substituting a value of 0 for all NULLs in the column,
-- so that those rows do factor into the return value.
select avg(isnull(x,0)) from t1;
-- Apply some number-returning function to a string column and average the results.
-- If column s contains any NULLs, length(s) also returns NULL and those rows are ignored.
select avg(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, avg(page_visits) from web_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select avg(distinct x) from t1;
-- Filter the output after performing the calculation.
select avg(x) from t1 group by y having avg(x) between 1 and 20;
</codeblock>
<p rev="2.0.0">
The following examples show how to use <codeph>AVG()</codeph> in an analytic context. They use a table
containing integers from 1 to 10. Notice how the <codeph>AVG()</codeph> is reported for each input value, as
opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set.
<codeblock>select x, property, avg(x) over (partition by property) as avg from int_t where property in ('odd','even');
+----+----------+-----+
| x | property | avg |
+----+----------+-----+
| 2 | even | 6 |
| 4 | even | 6 |
| 6 | even | 6 |
| 8 | even | 6 |
| 10 | even | 6 |
| 1 | odd | 5 |
| 3 | odd | 5 |
| 5 | odd | 5 |
| 7 | odd | 5 |
| 9 | odd | 5 |
+----+----------+-----+
</codeblock>
Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving
set of rows (the <q>window</q>). The following examples use <codeph>AVG()</codeph> in an analytic context
(that is, with an <codeph>OVER()</codeph> clause) to produce a running average of all the even values,
then a running average of all the odd values. The basic <codeph>ORDER BY x</codeph> clause implicitly
activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
therefore all of these examples produce the same results:
<codeblock>select x, property,
avg(x) over (partition by property <b>order by x</b>) as 'cumulative average'
from int_t where property in ('odd','even');
+----+----------+--------------------+
| x | property | cumulative average |
+----+----------+--------------------+
| 2 | even | 2 |
| 4 | even | 3 |
| 6 | even | 4 |
| 8 | even | 5 |
| 10 | even | 6 |
| 1 | odd | 1 |
| 3 | odd | 2 |
| 5 | odd | 3 |
| 7 | odd | 4 |
| 9 | odd | 5 |
+----+----------+--------------------+
select x, property,
avg(x) over
(
partition by property
<b>order by x</b>
<b>range between unbounded preceding and current row</b>
) as 'cumulative average'
from int_t where property in ('odd','even');
+----+----------+--------------------+
| x | property | cumulative average |
+----+----------+--------------------+
| 2 | even | 2 |
| 4 | even | 3 |
| 6 | even | 4 |
| 8 | even | 5 |
| 10 | even | 6 |
| 1 | odd | 1 |
| 3 | odd | 2 |
| 5 | odd | 3 |
| 7 | odd | 4 |
| 9 | odd | 5 |
+----+----------+--------------------+
select x, property,
avg(x) over
(
partition by property
<b>order by x</b>
<b>rows between unbounded preceding and current row</b>
) as 'cumulative average'
from int_t where property in ('odd','even');
+----+----------+--------------------+
| x | property | cumulative average |
+----+----------+--------------------+
| 2 | even | 2 |
| 4 | even | 3 |
| 6 | even | 4 |
| 8 | even | 5 |
| 10 | even | 6 |
| 1 | odd | 1 |
| 3 | odd | 2 |
| 5 | odd | 3 |
| 7 | odd | 4 |
| 9 | odd | 5 |
+----+----------+--------------------+
</codeblock>
The following examples show how to construct a moving window, with a running average taking into account 1 row before
and 1 row after the current row, within the same partition (all the even values or all the odd values).
Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of
moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph>
clause:
<codeblock>select x, property,
avg(x) over
(
partition by property
<b>order by x</b>
<b>rows between 1 preceding and 1 following</b>
) as 'moving average'
from int_t where property in ('odd','even');
+----+----------+----------------+
| x | property | moving average |
+----+----------+----------------+
| 2 | even | 3 |
| 4 | even | 4 |
| 6 | even | 6 |
| 8 | even | 8 |
| 10 | even | 9 |
| 1 | odd | 2 |
| 3 | odd | 3 |
| 5 | odd | 5 |
| 7 | odd | 7 |
| 9 | odd | 8 |
+----+----------+----------------+
-- Doesn't work because of syntax restriction on RANGE clause.
select x, property,
avg(x) over
(
partition by property
<b>order by x</b>
<b>range between 1 preceding and 1 following</b>
) as 'moving average'
from int_t where property in ('odd','even');
ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
</codeblock>
</p>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<!-- This conref appears under SUM(), AVG(), FLOAT, and DOUBLE topics. -->
<p conref="../shared/impala_common.xml#common/sum_double"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_analytic_functions.xml#analytic_functions"/>, <xref href="impala_max.xml#max"/>,
<xref href="impala_min.xml#min"/>
</p>
</conbody>
</concept>

docs/topics/impala_avro.xml

@@ -0,0 +1,556 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="avro">
<title>Using the Avro File Format with Impala Tables</title>
<titlealts audience="PDF"><navtitle>Avro Data Files</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Avro"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="1.4.0">
<indexterm audience="Cloudera">Avro support in Impala</indexterm>
Impala supports using tables whose data files use the Avro file format. Impala can query Avro
tables, and in Impala 1.4.0 and higher can create them, but currently cannot insert data into them. For
insert operations, use Hive, then switch back to Impala to run queries.
</p>
<table>
<title>Avro Format Support in Impala</title>
<tgroup cols="5">
<colspec colname="1" colwidth="10*"/>
<colspec colname="2" colwidth="10*"/>
<colspec colname="3" colwidth="20*"/>
<colspec colname="4" colwidth="30*"/>
<colspec colname="5" colwidth="30*"/>
<thead>
<row>
<entry>
File Type
</entry>
<entry>
Format
</entry>
<entry>
Compression Codecs
</entry>
<entry>
Impala Can CREATE?
</entry>
<entry>
Impala Can INSERT?
</entry>
</row>
</thead>
<tbody>
<row conref="impala_file_formats.xml#file_formats/avro_support">
<entry/>
</row>
</tbody>
</tgroup>
</table>
<p outputclass="toc inpage"/>
</conbody>
<concept id="avro_create_table">
<title>Creating Avro Tables</title>
<conbody>
<p>
To create a new table using the Avro file format, issue the <codeph>CREATE TABLE</codeph> statement through
Impala with the <codeph>STORED AS AVRO</codeph> clause, or through Hive. If you create the table through
Impala, you must include column definitions that match the fields specified in the Avro schema. With Hive,
you can omit the columns and just specify the Avro schema.
</p>
<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, the <codeph>CREATE TABLE</codeph> for Avro tables can include
SQL-style column definitions rather than specifying Avro notation through the <codeph>TBLPROPERTIES</codeph>
clause. Impala issues warning messages if there are any mismatches between the types specified in the
SQL column definitions and the underlying types; for example, any <codeph>TINYINT</codeph> or
<codeph>SMALLINT</codeph> columns are treated as <codeph>INT</codeph> in the underlying Avro files,
and therefore are displayed as <codeph>INT</codeph> in any <codeph>DESCRIBE</codeph> or
<codeph>SHOW CREATE TABLE</codeph> output.
</p>
<note>
<p conref="../shared/impala_common.xml#common/avro_no_timestamp"/>
</note>
<!--
To do: Expand these examples to show switching between impala-shell and Hive, loading some data, and then
doing DESCRIBE and querying the table.
-->
<p>
The following examples demonstrate creating an Avro table in Impala, using either an inline column
specification or one taken from a JSON file stored in HDFS:
</p>
<codeblock><![CDATA[
[localhost:21000] > CREATE TABLE avro_only_sql_columns
> (
> id INT,
> bool_col BOOLEAN,
> tinyint_col TINYINT, /* Gets promoted to INT */
> smallint_col SMALLINT, /* Gets promoted to INT */
> int_col INT,
> bigint_col BIGINT,
> float_col FLOAT,
> double_col DOUBLE,
> date_string_col STRING,
> string_col STRING
> )
> STORED AS AVRO;
[localhost:21000] > CREATE TABLE impala_avro_table
> (bool_col BOOLEAN, int_col INT, long_col BIGINT, float_col FLOAT, double_col DOUBLE, string_col STRING, nullable_int INT)
> STORED AS AVRO
> TBLPROPERTIES ('avro.schema.literal'='{
> "name": "my_record",
> "type": "record",
> "fields": [
> {"name":"bool_col", "type":"boolean"},
> {"name":"int_col", "type":"int"},
> {"name":"long_col", "type":"long"},
> {"name":"float_col", "type":"float"},
> {"name":"double_col", "type":"double"},
> {"name":"string_col", "type":"string"},
> {"name": "nullable_int", "type": ["null", "int"]}]}');
[localhost:21000] > CREATE TABLE avro_examples_of_all_types (
> id INT,
> bool_col BOOLEAN,
> tinyint_col TINYINT,
> smallint_col SMALLINT,
> int_col INT,
> bigint_col BIGINT,
> float_col FLOAT,
> double_col DOUBLE,
> date_string_col STRING,
> string_col STRING
> )
> STORED AS AVRO
> TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/avro_schemas/alltypes.json');
]]>
</codeblock>
<p>
The following example demonstrates creating an Avro table in Hive:
</p>
<codeblock><![CDATA[
hive> CREATE TABLE hive_avro_table
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
> "name": "my_record",
> "type": "record",
> "fields": [
> {"name":"bool_col", "type":"boolean"},
> {"name":"int_col", "type":"int"},
> {"name":"long_col", "type":"long"},
> {"name":"float_col", "type":"float"},
> {"name":"double_col", "type":"double"},
> {"name":"string_col", "type":"string"},
> {"name": "nullable_int", "type": ["null", "int"]}]}');
]]>
</codeblock>
<p>
Each field of the record becomes a column of the table. Note that any other information, such as the record
name, is ignored.
</p>
<!-- Have not got a working example of this syntax yet from Lenni.
<p>
The schema can be specified either through the <codeph>TBLPROPERTIES</codeph> clause or the
<codeph>WITH SERDEPROPERTIES</codeph> clause.
For best compatibility with future versions of Hive, use the <codeph>WITH SERDEPROPERTIES</codeph> clause
for this information.
</p>
-->
<note>
For nullable Avro columns, make sure to put the <codeph>"null"</codeph> entry before the actual type name.
In Impala, all columns are nullable; Impala currently does not have a <codeph>NOT NULL</codeph> clause. Any
non-nullable property is only enforced on the Avro side.
</note>
<p>
Most column types map directly from Avro to Impala under the same names. These are the exceptions and
special cases to consider:
</p>
<ul>
<li>
The <codeph>DECIMAL</codeph> type is defined in Avro as a <codeph>BYTES</codeph> type with the
<codeph>logicalType</codeph> property set to <codeph>"decimal"</codeph> and a specified precision and
scale, as shown in the schema excerpt after this list. Use <codeph>DECIMAL</codeph> in Avro tables only under CDH 5.
The infrastructure and components under CDH 4 do not have reliable <codeph>DECIMAL</codeph> support.
</li>
<li>
The Avro <codeph>long</codeph> type maps to <codeph>BIGINT</codeph> in Impala.
</li>
</ul>
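<p>
For example, a <codeph>DECIMAL(9,2)</codeph> column might be declared in the Avro schema with a field entry
similar to the following excerpt. (The field name is arbitrary; this is an illustrative sketch of the
standard Avro <codeph>decimal</codeph> logical type notation.)
</p>
<codeblock>{"name": "dec_col",
 "type": {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 2}}
</codeblock>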
<p>
If you create the table through Hive, switch back to <cmdname>impala-shell</cmdname> and issue an
<codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement. Then you can run queries for
that table through <cmdname>impala-shell</cmdname>.
</p>
<p rev="2.3.0">
In rare instances, a mismatch could occur between the Avro schema and the column definitions in the
metastore database. In <keyword keyref="impala23_full"/> and higher, Impala checks for such inconsistencies during
a <codeph>CREATE TABLE</codeph> statement and each time it loads the metadata for a table (for example,
after <codeph>INVALIDATE METADATA</codeph>). Impala uses the following rules to determine how to treat
mismatching columns, a process known as <term>schema reconciliation</term>:
<ul>
<li>
If there is a mismatch in the number of columns, Impala uses the column
definitions from the Avro schema.
</li>
<li>
If there is a mismatch in column name or type, Impala uses the column definition from the Avro schema.
Because a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> column in Impala maps to an Avro <codeph>STRING</codeph>,
this case is not considered a mismatch and the column is preserved as <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph>
in the reconciled schema. <ph rev="2.7.0 IMPALA-3687 CDH-43731">Prior to <keyword keyref="impala27_full"/> the column
name and comment for such <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns was also taken from the SQL column definition.
In <keyword keyref="impala27_full"/> and higher, the column name and comment from the Avro schema file take precedence for such columns,
and only the <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> type is preserved from the SQL column definition.</ph>
</li>
<li>
An Impala <codeph>TIMESTAMP</codeph> column definition maps to an Avro <codeph>STRING</codeph> and is presented as a <codeph>STRING</codeph>
in the reconciled schema, because Avro has no binary <codeph>TIMESTAMP</codeph> representation.
As a result, no Avro table can have a <codeph>TIMESTAMP</codeph> column; this restriction is the same as
in earlier CDH and Impala releases.
</li>
</ul>
</p>
<p conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/>
</conbody>
</concept>
<concept id="avro_map_table">
<title>Using a Hive-Created Avro Table in Impala</title>
<conbody>
<p>
If you have an Avro table created through Hive, you can use it in Impala as long as it contains only
Impala-compatible data types. It cannot contain:
<ul>
<li>
Complex types: <codeph>array</codeph>, <codeph>map</codeph>, <codeph>record</codeph>,
<codeph>struct</codeph>, <codeph>union</codeph> other than
<codeph>[<varname>supported_type</varname>,null]</codeph> or
<codeph>[null,<varname>supported_type</varname>]</codeph>
</li>
<li>
The Avro-specific types <codeph>enum</codeph>, <codeph>bytes</codeph>, and <codeph>fixed</codeph>
</li>
<li>
Any scalar type other than those listed in <xref href="impala_datatypes.xml#datatypes"/>
</li>
</ul>
Because Impala and Hive share the same metastore database, Impala can directly access the table definitions
and data for tables that were created in Hive.
</p>
<p>
If you create an Avro table in Hive, issue an <codeph>INVALIDATE METADATA</codeph> the next time you
connect to Impala through <cmdname>impala-shell</cmdname>. This is a one-time operation to make Impala
aware of the new table. You can issue the statement while connected to any Impala node, and the catalog
service broadcasts the change to all other Impala nodes.
</p>
<p>
If you load new data into an Avro table through Hive, either through a Hive <codeph>LOAD DATA</codeph> or
<codeph>INSERT</codeph> statement, or by manually copying or moving files into the data directory for the
table, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement the next time you connect
to Impala through <cmdname>impala-shell</cmdname>. You can issue the statement while connected to any
Impala node, and the catalog service broadcasts the change to all other Impala nodes. If you issue the
<codeph>LOAD DATA</codeph> statement through Impala, you do not need a <codeph>REFRESH</codeph> afterward.
</p>
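<p>
For example, using the HIVE_AVRO_TABLE created in the earlier Hive example, the sequence of statements in
<cmdname>impala-shell</cmdname> might look like the following sketch:
</p>
<codeblock>-- One-time step after the table is first created through Hive.
INVALIDATE METADATA hive_avro_table;
-- After new data files are added through Hive or by copying files into HDFS.
REFRESH hive_avro_table;
SELECT COUNT(*) FROM hive_avro_table;
</codeblock>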
<p>
Impala only supports fields of type <codeph>boolean</codeph>, <codeph>int</codeph>, <codeph>long</codeph>,
<codeph>float</codeph>, <codeph>double</codeph>, and <codeph>string</codeph>, or unions of these types with
null; for example, <codeph>["string", "null"]</codeph>. Unions with <codeph>null</codeph> essentially
create a nullable type.
</p>
</conbody>
</concept>
<concept id="avro_json">
<title>Specifying the Avro Schema through JSON</title>
<conbody>
<p>
While you can embed a schema directly in your <codeph>CREATE TABLE</codeph> statement, as shown above,
column width restrictions in the Hive metastore limit the length of schema you can specify. If you
encounter problems with long schema literals, try storing your schema as a <codeph>JSON</codeph> file in
HDFS instead. Specify your schema in HDFS using table properties similar to the following:
</p>
<codeblock>tblproperties ('avro.schema.url'='hdfs://your-name-node:port/path/to/schema.json');</codeblock>
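<p>
The file at that location contains an ordinary Avro schema definition in JSON format. For example, a minimal
<codeph>schema.json</codeph> along the lines of the earlier <codeph>avro.schema.literal</codeph> examples
might look like the following (illustrative only):
</p>
<codeblock>{
  "name": "my_record",
  "type": "record",
  "fields": [
    {"name": "bool_col", "type": "boolean"},
    {"name": "int_col", "type": "int"},
    {"name": "string_col", "type": "string"},
    {"name": "nullable_int", "type": ["null", "int"]}
  ]
}
</codeblock>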
</conbody>
</concept>
<concept id="avro_load_data">
<title>Loading Data into an Avro Table</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
</metadata>
</prolog>
<conbody>
<p rev="DOCS-1523">
Currently, Impala cannot write Avro data files. Therefore, an Avro table cannot be used as the destination
of an Impala <codeph>INSERT</codeph> statement or <codeph>CREATE TABLE AS SELECT</codeph>.
</p>
<p>
To copy data from another table, issue any <codeph>INSERT</codeph> statements through Hive. For information
about loading data into Avro tables through Hive, see
<xref href="https://cwiki.apache.org/confluence/display/Hive/AvroSerDe" scope="external" format="html">Avro
page on the Hive wiki</xref>.
</p>
<p>
If you already have data files in Avro format, you can also issue <codeph>LOAD DATA</codeph> in either
Impala or Hive. Impala can move existing Avro data files into an Avro table; it just cannot create new
Avro data files.
</p>
</conbody>
</concept>
<concept id="avro_compression">
<title>Enabling Compression for Avro Tables</title>
<prolog>
<metadata>
<data name="Category" value="Compression"/>
<data name="Category" value="Snappy"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">compression</indexterm>
To enable compression for Avro tables, specify settings in the Hive shell to enable compression and to
specify a codec, then issue a <codeph>CREATE TABLE</codeph> statement as in the preceding examples. Impala
supports the <codeph>snappy</codeph> and <codeph>deflate</codeph> codecs for Avro tables.
</p>
<p>
For example:
</p>
<codeblock>hive&gt; set hive.exec.compress.output=true;
hive&gt; set avro.output.codec=snappy;</codeblock>
</conbody>
</concept>
<concept rev="1.1" id="avro_schema_evolution">
<title>How Impala Handles Avro Schema Evolution</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
Starting in Impala 1.1, Impala can deal with Avro data files that employ <term>schema evolution</term>,
where different data files within the same table use slightly different type definitions. (You would
perform the schema evolution operation by issuing an <codeph>ALTER TABLE</codeph> statement in the Hive
shell.) The old and new types for any changed columns must be compatible; for example, a column might start
as an <codeph>int</codeph> and later change to a <codeph>bigint</codeph> or <codeph>float</codeph>.
</p>
<p>
As with any other tables where the definitions are changed or data is added outside of the current
<cmdname>impalad</cmdname> node, ensure that Impala loads the latest metadata for the table if the Avro
schema is modified through Hive. Issue a <codeph>REFRESH <varname>table_name</varname></codeph> or
<codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement. <codeph>REFRESH</codeph>
reloads the metadata immediately, <codeph>INVALIDATE METADATA</codeph> reloads the metadata the next time
the table is accessed.
</p>
<p>
When Avro data files or columns are not consulted during a query, Impala does not check for consistency.
Thus, if you issue <codeph>SELECT c1, c2 FROM t1</codeph>, Impala does not return any error if the column
<codeph>c3</codeph> changed in an incompatible way. If a query retrieves data from some partitions but not
others, Impala does not check the data files for the unused partitions.
</p>
<p>
In the Hive DDL statements, you can specify an <codeph>avro.schema.literal</codeph> table property (if the
schema definition is short) or an <codeph>avro.schema.url</codeph> property (if the schema definition is
long, or to allow convenient editing for the definition).
</p>
<p>
For example, running the following SQL code in the Hive shell creates a table using the Avro file format
and puts some sample data into it:
</p>
<codeblock>CREATE TABLE avro_table (a string, b string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"type": "record",
"name": "my_record",
"fields": [
{"name": "a", "type": "int"},
{"name": "b", "type": "string"}
]}');
INSERT OVERWRITE TABLE avro_table SELECT 1, "avro" FROM functional.alltypes LIMIT 1;
</codeblock>
<p>
Once the Avro table is created and contains data, you can query it through the
<cmdname>impala-shell</cmdname> command:
</p>
<codeblock>[localhost:21000] &gt; select * from avro_table;
+---+------+
| a | b |
+---+------+
| 1 | avro |
+---+------+
</codeblock>
<p>
Now in the Hive shell, you change the type of a column and add a new column with a default value:
</p>
<codeblock>-- Promote column "a" from INT to FLOAT (no need to update Avro schema)
ALTER TABLE avro_table CHANGE A A FLOAT;
-- Add column "c" with default
ALTER TABLE avro_table ADD COLUMNS (c int);
ALTER TABLE avro_table SET TBLPROPERTIES (
'avro.schema.literal'='{
"type": "record",
"name": "my_record",
"fields": [
{"name": "a", "type": "int"},
{"name": "b", "type": "string"},
{"name": "c", "type": "int", "default": 10}
]}');
</codeblock>
<p>
Once again in <cmdname>impala-shell</cmdname>, you can query the Avro table based on its latest schema
definition. Because the table metadata was changed outside of Impala, you issue a <codeph>REFRESH</codeph>
statement first so that Impala has up-to-date metadata for the table.
</p>
<codeblock>[localhost:21000] &gt; refresh avro_table;
[localhost:21000] &gt; select * from avro_table;
+---+------+----+
| a | b | c |
+---+------+----+
| 1 | avro | 10 |
+---+------+----+
</codeblock>
</conbody>
</concept>
<concept id="avro_data_types">
<title>Data Type Considerations for Avro Tables</title>
<conbody>
<p>
The Avro format defines a set of data types whose names differ from the names of the corresponding Impala
data types. If you are preparing Avro files using other Hadoop components such as Pig or MapReduce, you
might need to work with the type names defined by Avro. The following listing shows the Avro-defined types
and the equivalent types in Impala.
</p>
<codeblock><![CDATA[Primitive Types (Avro -> Impala)
--------------------------------
STRING -> STRING
STRING -> CHAR
STRING -> VARCHAR
INT -> INT
BOOLEAN -> BOOLEAN
LONG -> BIGINT
FLOAT -> FLOAT
DOUBLE -> DOUBLE
Logical Types
-------------
BYTES + logicalType = "decimal" -> DECIMAL
Avro Types with No Impala Equivalent
------------------------------------
RECORD, MAP, ARRAY, UNION, ENUM, FIXED, NULL
Impala Types with No Avro Equivalent
------------------------------------
TIMESTAMP
]]>
</codeblock>
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
</conbody>
</concept>
<concept id="avro_performance">
<title>Query Performance for Impala Avro Tables</title>
<conbody>
<p>
In general, expect query performance with Avro tables to be
faster than with tables using text data, but slower than with
Parquet tables. See <xref href="impala_parquet.xml#parquet"/>
for information about using the Parquet file format for
high-performance analytic queries.
</p>
<p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
</conbody>
</concept>
</concept>


@@ -0,0 +1,38 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="batch_size">
<title>BATCH_SIZE Query Option</title>
<titlealts audience="PDF"><navtitle>BATCH_SIZE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">BATCH_SIZE query option</indexterm>
Number of rows evaluated at a time by SQL operators. If this option is unspecified or set to 0, Impala uses a predefined default
size. Using a larger number improves responsiveness, especially for scan operations, at the cost of a higher memory footprint.
</p>
<p>
This option is primarily for testing during Impala development, or for use under the direction of <keyword keyref="support_org"/>.
</p>
<p>
<b>Type:</b> numeric
</p>
<p>
<b>Default:</b> 0 (meaning the predefined default of 1024)
</p>
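<p>
Because this is a query option, you can change it for a session in <cmdname>impala-shell</cmdname> with a
<codeph>SET</codeph> statement. The following sketch uses illustrative values and a hypothetical table name:
</p>
<codeblock>[localhost:21000] &gt; set batch_size=2048;
[localhost:21000] &gt; select count(*) from sample_table;
[localhost:21000] &gt; set batch_size=0;  -- Revert to the predefined default.
</codeblock>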
</conbody>
</concept>


@@ -0,0 +1,102 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="bigint">
<title>BIGINT Data Type</title>
<titlealts audience="PDF"><navtitle>BIGINT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
An 8-byte integer data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
statements.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock><varname>column_name</varname> BIGINT</codeblock>
<p>
<b>Range:</b> -9223372036854775808 .. 9223372036854775807. There is no <codeph>UNSIGNED</codeph> subtype.
</p>
<p>
<b>Conversions:</b> Impala automatically converts to a floating-point type (<codeph>FLOAT</codeph> or
<codeph>DOUBLE</codeph>). Use <codeph>CAST()</codeph> to convert to <codeph>TINYINT</codeph>,
<codeph>SMALLINT</codeph>, <codeph>INT</codeph>, <codeph>STRING</codeph>, or <codeph>TIMESTAMP</codeph>.
<ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>CREATE TABLE t1 (x BIGINT);
SELECT CAST(1000 AS BIGINT);
</codeblock>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
<codeph>BIGINT</codeph> is a convenient type to use for column declarations because you can use any kind of
integer values in <codeph>INSERT</codeph> statements and they are promoted to <codeph>BIGINT</codeph> where
necessary. However, <codeph>BIGINT</codeph> also requires the most bytes of any integer type on disk and in
memory, meaning your queries are not as efficient and scalable as possible if you overuse this type.
Therefore, prefer to use the smallest integer type with sufficient range to hold all input values, and
<codeph>CAST()</codeph> when necessary to the appropriate type.
</p>
<p>
For a convenient and automated way to check the bounds of the <codeph>BIGINT</codeph> type, call the
functions <codeph>MIN_BIGINT()</codeph> and <codeph>MAX_BIGINT()</codeph>.
</p>
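<p>
For example, the following query (which you can run from <cmdname>impala-shell</cmdname> without referencing
any table) displays those bounds:
</p>
<codeblock>SELECT MIN_BIGINT(), MAX_BIGINT();
</codeblock>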
<p>
If an integer value is too large to be represented as a <codeph>BIGINT</codeph>, use a
<codeph>DECIMAL</codeph> instead with sufficient digits of precision.
</p>
<p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/>
<p conref="../shared/impala_common.xml#common/partitioning_good"/>
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
<!-- <p conref="../shared/impala_common.xml#common/parquet_blurb"/> -->
<p conref="../shared/impala_common.xml#common/text_bulky"/>
<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
<p conref="../shared/impala_common.xml#common/internals_8_bytes"/>
<p conref="../shared/impala_common.xml#common/added_forever"/>
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<p conref="../shared/impala_common.xml#common/sqoop_blurb"/>
<p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>,
<xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>,
<xref href="impala_math_functions.xml#math_functions"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,794 @@
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="bit_functions" rev="2.3.0">
<title>Impala Bit Functions</title>
<titlealts audience="PDF"><navtitle>Bit Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>
<conbody>
<p rev="2.3.0">
Bit manipulation functions perform bitwise operations involved in scientific processing or computer science algorithms.
For example, these functions include setting, clearing, or testing bits within an integer value, or changing the
positions of bits with or without wraparound.
</p>
<p>
If a function takes two integer arguments that are required to be of the same type, the smaller argument is promoted
to the type of the larger one if required. For example, <codeph>BITAND(1,4096)</codeph> treats both arguments as
<codeph>SMALLINT</codeph>, because 1 can be represented as a <codeph>TINYINT</codeph> but 4096 requires a <codeph>SMALLINT</codeph>.
</p>
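    <p>
      For example, the following illustrative query ANDs those two promoted <codeph>SMALLINT</codeph> values;
      the result is 0 because 1 and 4096 have no 1 bits in common:
    </p>
    <codeblock>select bitand(1, 4096); /* 0000000000000001 &amp; 0001000000000000 */
</codeblock>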
<p>
Remember that all Impala integer values are signed. Therefore, when dealing with binary values where the most significant
bit is 1, the specified or returned values might be negative when represented in base 10.
</p>
<p>
      If any argument is <codeph>NULL</codeph> (whether the input value, the bit position, or the number of shift
      or rotate positions), the return value from any of these functions is also <codeph>NULL</codeph>.
</p>
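    <p>
      For example (illustrative; the <codeph>NULL</codeph> is cast so that the argument has a definite integer type):
    </p>
    <codeblock>select bitand(cast(null as int), 15);   /* returns NULL */
select shiftleft(1, cast(null as int));  /* returns NULL */
</codeblock>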
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
The bit functions operate on all the integral data types: <xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>, <xref href="impala_smallint.xml#smallint"/>, and
<xref href="impala_tinyint.xml#tinyint"/>.
</p>
<p>
<b>Function reference:</b>
</p>
<p>
Impala supports the following bit functions:
</p>
<!--
bitand
bitnot
bitor
bitxor
countset
getbit
rotateleft
rotateright
setbit
shiftleft
shiftright
-->
<dl>
<dlentry id="bitand">
<dt>
<codeph>bitand(integer_type a, same_type b)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">bitand() function</indexterm>
<b>Purpose:</b> Returns an integer value representing the bits that are set to 1 in both of the arguments.
If the arguments are of different sizes, the smaller is promoted to the type of the larger.
<p>
<b>Usage notes:</b> The <codeph>bitand()</codeph> function is equivalent to the <codeph>&amp;</codeph> binary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show the results of ANDing integer values.
            255 contains all 1 bits in its lowermost 8 bits.
32767 contains all 1 bits in its lowermost 15 bits.
<!--
Negative numbers have a 1 in the sign bit and the value is the
<xref href="https://en.wikipedia.org/wiki/Two%27s_complement" scope="external" format="html">two's complement</xref>
of the positive equivalent.
-->
You can use the <codeph>bin()</codeph> function to check the binary representation of any
integer value, although the result is always represented as a 64-bit value.
If necessary, the smaller argument is promoted to the
type of the larger one.
</p>
<codeblock>select bitand(255, 32767); /* 0000000011111111 &amp; 0111111111111111 */
+--------------------+
| bitand(255, 32767) |
+--------------------+
| 255 |
+--------------------+
select bitand(32767, 1); /* 0111111111111111 &amp; 0000000000000001 */
+------------------+
| bitand(32767, 1) |
+------------------+
| 1 |
+------------------+
select bitand(32, 16); /* 00100000 &amp; 00010000 */
+----------------+
| bitand(32, 16) |
+----------------+
| 0 |
+----------------+
select bitand(12,5); /* 00001100 &amp; 00000101 */
+---------------+
| bitand(12, 5) |
+---------------+
| 4 |
+---------------+
select bitand(-1,15); /* 11111111 &amp; 00001111 */
+----------------+
| bitand(-1, 15) |
+----------------+
| 15 |
+----------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="bitnot">
<dt>
<codeph>bitnot(integer_type a)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">bitnot() function</indexterm>
<b>Purpose:</b> Inverts all the bits of the input argument.
<p>
<b>Usage notes:</b> The <codeph>bitnot()</codeph> function is equivalent to the <codeph>~</codeph> unary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples illustrate what happens when you flip all the bits of an integer value.
            The sign always changes, and the magnitude of the decimal representation differs by one:
            for example, 127 becomes -128 and 16 becomes -17.
<!--
because negative values are represented as the
<xref href="https://en.wikipedia.org/wiki/Two%27s_complement" scope="external" format="html">two's complement</xref>
of the corresponding positive value.
-->
</p>
<codeblock>select bitnot(127); /* 01111111 -> 10000000 */
+-------------+
| bitnot(127) |
+-------------+
| -128 |
+-------------+
select bitnot(16); /* 00010000 -> 11101111 */
+------------+
| bitnot(16) |
+------------+
| -17 |
+------------+
select bitnot(0); /* 00000000 -> 11111111 */
+-----------+
| bitnot(0) |
+-----------+
| -1 |
+-----------+
select bitnot(-128); /* 10000000 -> 01111111 */
+--------------+
| bitnot(-128) |
+--------------+
| 127 |
+--------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="bitor">
<dt>
<codeph>bitor(integer_type a, same_type b)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">bitor() function</indexterm>
<b>Purpose:</b> Returns an integer value representing the bits that are set to 1 in either of the arguments.
If the arguments are of different sizes, the smaller is promoted to the type of the larger.
<p>
<b>Usage notes:</b> The <codeph>bitor()</codeph> function is equivalent to the <codeph>|</codeph> binary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show the results of ORing integer values.
</p>
<codeblock>select bitor(1,4); /* 00000001 | 00000100 */
+-------------+
| bitor(1, 4) |
+-------------+
| 5 |
+-------------+
select bitor(16,48); /* 00010000 | 00110000 */
+---------------+
| bitor(16, 48) |
+---------------+
| 48 |
+---------------+
select bitor(0,7); /* 00000000 | 00000111 */
+-------------+
| bitor(0, 7) |
+-------------+
| 7 |
+-------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="bitxor">
<dt>
<codeph>bitxor(integer_type a, same_type b)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">bitxor() function</indexterm>
<b>Purpose:</b> Returns an integer value representing the bits that are set to 1 in one but not both of the arguments.
If the arguments are of different sizes, the smaller is promoted to the type of the larger.
<p>
<b>Usage notes:</b> The <codeph>bitxor()</codeph> function is equivalent to the <codeph>^</codeph> binary operator.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show the results of XORing integer values.
XORing a non-zero value with zero returns the non-zero value.
XORing two identical values returns zero, because all the 1 bits from the first argument are also 1 bits in the second argument.
XORing different non-zero values turns off some bits and leaves others turned on, based on whether the same bit is set in both arguments.
</p>
<codeblock>select bitxor(0,15); /* 00000000 ^ 00001111 */
+---------------+
| bitxor(0, 15) |
+---------------+
| 15 |
+---------------+
select bitxor(7,7); /* 00000111 ^ 00000111 */
+--------------+
| bitxor(7, 7) |
+--------------+
| 0 |
+--------------+
select bitxor(8,4); /* 00001000 ^ 00000100 */
+--------------+
| bitxor(8, 4) |
+--------------+
| 12 |
+--------------+
select bitxor(3,7); /* 00000011 ^ 00000111 */
+--------------+
| bitxor(3, 7) |
+--------------+
| 4 |
+--------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="countset">
<dt>
<codeph>countset(integer_type a [, int zero_or_one])</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">countset() function</indexterm>
<b>Purpose:</b> By default, returns the number of 1 bits in the specified integer value.
If the optional second argument is set to zero, it returns the number of 0 bits instead.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
In discussions of information theory, this operation is referred to as the
<q><xref href="https://en.wikipedia.org/wiki/Hamming_weight" scope="external" format="html">population count</xref></q>
or <q>popcount</q>.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show how to count the number of 1 bits in an integer value.
</p>
<codeblock>select countset(1); /* 00000001 */
+-------------+
| countset(1) |
+-------------+
| 1 |
+-------------+
select countset(3); /* 00000011 */
+-------------+
| countset(3) |
+-------------+
| 2 |
+-------------+
select countset(16); /* 00010000 */
+--------------+
| countset(16) |
+--------------+
| 1 |
+--------------+
select countset(17); /* 00010001 */
+--------------+
| countset(17) |
+--------------+
| 2 |
+--------------+
select countset(7,1); /* 00000111 = 3 1 bits; the function counts 1 bits by default */
+----------------+
| countset(7, 1) |
+----------------+
| 3 |
+----------------+
select countset(7,0); /* 00000111 = 5 0 bits; second argument can only be 0 or 1 */
+----------------+
| countset(7, 0) |
+----------------+
| 5 |
+----------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="getbit">
<dt>
<codeph>getbit(integer_type a, int position)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">getbit() function</indexterm>
<b>Purpose:</b> Returns a 0 or 1 representing the bit at a
specified position. The positions are numbered right to left, starting at zero.
The position argument cannot be negative.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
            When you use a literal input value, it is treated as the smallest integer type that
            can hold it (8-bit, 16-bit, and so on).
The type of the input value limits the range of the positions.
Cast the input value to the appropriate type if you need to
ensure it is treated as a 64-bit, 32-bit, and so on value.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show how to test a specific bit within an integer value.
</p>
<codeblock>select getbit(1,0); /* 00000001 */
+--------------+
| getbit(1, 0) |
+--------------+
| 1 |
+--------------+
select getbit(16,1); /* 00010000 */
+---------------+
| getbit(16, 1) |
+---------------+
| 0 |
+---------------+
select getbit(16,4); /* 00010000 */
+---------------+
| getbit(16, 4) |
+---------------+
| 1 |
+---------------+
select getbit(16,5); /* 00010000 */
+---------------+
| getbit(16, 5) |
+---------------+
| 0 |
+---------------+
select getbit(-1,3); /* 11111111 */
+---------------+
| getbit(-1, 3) |
+---------------+
| 1 |
+---------------+
select getbit(-1,25); /* 11111111 */
ERROR: Invalid bit position: 25
select getbit(cast(-1 as int),25); /* 11111111111111111111111111111111 */
+-----------------------------+
| getbit(cast(-1 as int), 25) |
+-----------------------------+
| 1 |
+-----------------------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="rotateleft">
<dt>
<codeph>rotateleft(integer_type a, int positions)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">rotateleft() function</indexterm>
<b>Purpose:</b> Rotates an integer value left by a specified number of bits.
As the most significant bit is taken out of the original value,
if it is a 1 bit, it is <q>rotated</q> back to the least significant bit.
Therefore, the final value has the same number of 1 bits as the original value,
just in different positions.
In computer science terms, this operation is a
<q><xref href="https://en.wikipedia.org/wiki/Circular_shift" scope="external" format="html">circular shift</xref></q>.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Specifying a second argument of zero leaves the original value unchanged.
Rotating a -1 value by any number of positions still returns -1,
because the original value has all 1 bits and all the 1 bits are
preserved during rotation.
Similarly, rotating a 0 value by any number of positions still returns 0.
Rotating a value by the same number of bits as in the value returns the same value.
Because this is a circular operation, the number of positions is not limited
to the number of bits in the input value.
For example, rotating an 8-bit value by 1, 9, 17, and so on positions returns an
identical result in each case.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select rotateleft(1,4); /* 00000001 -> 00010000 */
+------------------+
| rotateleft(1, 4) |
+------------------+
| 16 |
+------------------+
select rotateleft(-1,155); /* 11111111 -> 11111111 */
+---------------------+
| rotateleft(-1, 155) |
+---------------------+
| -1 |
+---------------------+
select rotateleft(-128,1); /* 10000000 -> 00000001 */
+---------------------+
| rotateleft(-128, 1) |
+---------------------+
| 1 |
+---------------------+
select rotateleft(-127,3); /* 10000001 -> 00001100 */
+---------------------+
| rotateleft(-127, 3) |
+---------------------+
| 12 |
+---------------------+
</codeblock>
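          <p>
            The following additional sketch illustrates the wraparound behavior described above: rotating
            an 8-bit value by 1, 9, or 17 positions produces the same result.
          </p>
          <codeblock>select rotateleft(1,1), rotateleft(1,9), rotateleft(1,17); /* each result is 2 */
</codeblock>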
</dd>
</dlentry>
<dlentry id="rotateright">
<dt>
<codeph>rotateright(integer_type a, int positions)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">rotateright() function</indexterm>
<b>Purpose:</b> Rotates an integer value right by a specified number of bits.
As the least significant bit is taken out of the original value,
if it is a 1 bit, it is <q>rotated</q> back to the most significant bit.
Therefore, the final value has the same number of 1 bits as the original value,
just in different positions.
In computer science terms, this operation is a
<q><xref href="https://en.wikipedia.org/wiki/Circular_shift" scope="external" format="html">circular shift</xref></q>.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Specifying a second argument of zero leaves the original value unchanged.
Rotating a -1 value by any number of positions still returns -1,
because the original value has all 1 bits and all the 1 bits are
preserved during rotation.
Similarly, rotating a 0 value by any number of positions still returns 0.
Rotating a value by the same number of bits as in the value returns the same value.
Because this is a circular operation, the number of positions is not limited
to the number of bits in the input value.
For example, rotating an 8-bit value by 1, 9, 17, and so on positions returns an
identical result in each case.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select rotateright(16,4); /* 00010000 -> 00000001 */
+--------------------+
| rotateright(16, 4) |
+--------------------+
| 1 |
+--------------------+
select rotateright(-1,155); /* 11111111 -> 11111111 */
+----------------------+
| rotateright(-1, 155) |
+----------------------+
| -1 |
+----------------------+
select rotateright(-128,1); /* 10000000 -> 01000000 */
+----------------------+
| rotateright(-128, 1) |
+----------------------+
| 64 |
+----------------------+
select rotateright(-127,3); /* 10000001 -> 00110000 */
+----------------------+
| rotateright(-127, 3) |
+----------------------+
| 48 |
+----------------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="setbit">
<dt>
<codeph>setbit(integer_type a, int position [, int zero_or_one])</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">setbit() function</indexterm>
<b>Purpose:</b> By default, changes a bit at a specified position to a 1, if it is not already.
If the optional third argument is set to zero, the specified bit is set to 0 instead.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
If the bit at the specified position was already 1 (by default)
or 0 (with a third argument of zero), the return value is
the same as the first argument.
The positions are numbered right to left, starting at zero.
(Therefore, the return value could be different from the first argument
even if the position argument is zero.)
The position argument cannot be negative.
<p>
            When you use a literal input value, it is treated as the smallest integer type that
            can hold it (8-bit, 16-bit, and so on).
The type of the input value limits the range of the positions.
Cast the input value to the appropriate type if you need to
ensure it is treated as a 64-bit, 32-bit, and so on value.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select setbit(0,0); /* 00000000 -> 00000001 */
+--------------+
| setbit(0, 0) |
+--------------+
| 1 |
+--------------+
select setbit(0,3); /* 00000000 -> 00001000 */
+--------------+
| setbit(0, 3) |
+--------------+
| 8 |
+--------------+
select setbit(7,3); /* 00000111 -> 00001111 */
+--------------+
| setbit(7, 3) |
+--------------+
| 15 |
+--------------+
select setbit(15,3); /* 00001111 -> 00001111 */
+---------------+
| setbit(15, 3) |
+---------------+
| 15 |
+---------------+
select setbit(0,32); /* By default, 0 is a TINYINT with only 8 bits. */
ERROR: Invalid bit position: 32
select setbit(cast(0 as bigint),32); /* For BIGINT, the position can be 0..63. */
+-------------------------------+
| setbit(cast(0 as bigint), 32) |
+-------------------------------+
| 4294967296 |
+-------------------------------+
select setbit(7,3,1); /* 00000111 -> 00001111; setting to 1 is the default */
+-----------------+
| setbit(7, 3, 1) |
+-----------------+
| 15 |
+-----------------+
select setbit(7,2,0); /* 00000111 -> 00000011; third argument of 0 clears instead of sets */
+-----------------+
| setbit(7, 2, 0) |
+-----------------+
| 3 |
+-----------------+
</codeblock>
</dd>
</dlentry>
<dlentry id="shiftleft">
<dt>
<codeph>shiftleft(integer_type a, int positions)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">shiftleft() function</indexterm>
<b>Purpose:</b> Shifts an integer value left by a specified number of bits.
As the most significant bit is taken out of the original value,
it is discarded and the least significant bit becomes 0.
In computer science terms, this operation is a <q><xref href="https://en.wikipedia.org/wiki/Logical_shift" scope="external" format="html">logical shift</xref></q>.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
The final value has either the same number of 1 bits as the original value, or fewer.
Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces
a result of zero.
</p>
<p>
Specifying a second argument of zero leaves the original value unchanged.
Shifting any value by 0 returns the original value.
Shifting any value by 1 is the same as multiplying it by 2,
as long as the value is small enough; larger values eventually
become negative when shifted, as the sign bit is set.
Starting with the value 1 and shifting it left by N positions gives
the same result as 2 to the Nth power, or <codeph>pow(2,<varname>N</varname>)</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select shiftleft(1,0); /* 00000001 -> 00000001 */
+-----------------+
| shiftleft(1, 0) |
+-----------------+
| 1 |
+-----------------+
select shiftleft(1,3); /* 00000001 -> 00001000 */
+-----------------+
| shiftleft(1, 3) |
+-----------------+
| 8 |
+-----------------+
select shiftleft(8,2); /* 00001000 -> 00100000 */
+-----------------+
| shiftleft(8, 2) |
+-----------------+
| 32 |
+-----------------+
select shiftleft(127,1); /* 01111111 -> 11111110 */
+-------------------+
| shiftleft(127, 1) |
+-------------------+
| -2 |
+-------------------+
select shiftleft(127,5); /* 01111111 -> 11100000 */
+-------------------+
| shiftleft(127, 5) |
+-------------------+
| -32 |
+-------------------+
select shiftleft(-1,4); /* 11111111 -> 11110000 */
+------------------+
| shiftleft(-1, 4) |
+------------------+
| -16 |
+------------------+
</codeblock>
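          <p>
            The following additional sketch illustrates the power-of-two relationship described above.
            The value is cast to <codeph>INT</codeph> so that it is wide enough to hold the shifted result.
          </p>
          <codeblock>select shiftleft(cast(1 as int), 10), pow(2, 10); /* both represent 1024 */
</codeblock>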
</dd>
</dlentry>
<dlentry id="shiftright">
<dt>
<codeph>shiftright(integer_type a, int positions)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">shiftright() function</indexterm>
<b>Purpose:</b> Shifts an integer value right by a specified number of bits.
As the least significant bit is taken out of the original value,
it is discarded and the most significant bit becomes 0.
In computer science terms, this operation is a <q><xref href="https://en.wikipedia.org/wiki/Logical_shift" scope="external" format="html">logical shift</xref></q>.
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Therefore, the final value has either the same number of 1 bits as the original value, or fewer.
Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces
a result of zero.
</p>
<p>
Specifying a second argument of zero leaves the original value unchanged.
Shifting any value by 0 returns the original value.
Shifting any positive value right by 1 is the same as dividing it by 2.
Negative values become positive when shifted right.
</p>
<p conref="../shared/impala_common.xml#common/return_type_same"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select shiftright(16,0); /* 00010000 -> 00000000 */
+-------------------+
| shiftright(16, 0) |
+-------------------+
| 16 |
+-------------------+
select shiftright(16,4); /* 00010000 -> 00000001 */
+-------------------+
| shiftright(16, 4) |
+-------------------+
| 1 |
+-------------------+
select shiftright(16,5); /* 00010000 -> 00000000 */
+-------------------+
| shiftright(16, 5) |
+-------------------+
| 0 |
+-------------------+
select shiftright(-1,1); /* 11111111 -> 01111111 */
+-------------------+
| shiftright(-1, 1) |
+-------------------+
| 127 |
+-------------------+
select shiftright(-1,5); /* 11111111 -> 00000111 */
+-------------------+
| shiftright(-1, 5) |
+-------------------+
| 7 |
+-------------------+
</codeblock>
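          <p>
            The following additional sketch illustrates the divide-by-two behavior described above for a
            positive value.
          </p>
          <codeblock>select shiftright(32,1); /* 00100000 -> 00010000, that is, 32 / 2 = 16 */
</codeblock>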
</dd>
</dlentry>
</dl>
</conbody>
</concept>


@@ -0,0 +1,154 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="boolean">
<title>BOOLEAN Data Type</title>
<titlealts audience="PDF"><navtitle>BOOLEAN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
A data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements, representing a
single true/false choice.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock><varname>column_name</varname> BOOLEAN</codeblock>
<p>
<b>Range:</b> <codeph>TRUE</codeph> or <codeph>FALSE</codeph>. Do not use quotation marks around the
<codeph>TRUE</codeph> and <codeph>FALSE</codeph> literal values. You can write the literal values in
uppercase, lowercase, or mixed case. The values queried from a table are always returned in lowercase,
<codeph>true</codeph> or <codeph>false</codeph>.
</p>
<p>
<b>Conversions:</b> Impala does not automatically convert any other type to <codeph>BOOLEAN</codeph>. All
conversions must use an explicit call to the <codeph>CAST()</codeph> function.
</p>
<p>
You can use <codeph>CAST()</codeph> to convert
<!--
<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>,
<codeph>INT</codeph>, <codeph>BIGINT</codeph>, <codeph>FLOAT</codeph>, or <codeph>DOUBLE</codeph>
-->
any integer or floating-point type to
<codeph>BOOLEAN</codeph>: a value of 0 represents <codeph>false</codeph>, and any non-zero value is converted
to <codeph>true</codeph>.
</p>
<codeblock>SELECT CAST(42 AS BOOLEAN) AS nonzero_int, CAST(99.44 AS BOOLEAN) AS nonzero_decimal,
CAST(000 AS BOOLEAN) AS zero_int, CAST(0.0 AS BOOLEAN) AS zero_decimal;
+-------------+-----------------+----------+--------------+
| nonzero_int | nonzero_decimal | zero_int | zero_decimal |
+-------------+-----------------+----------+--------------+
| true | true | false | false |
+-------------+-----------------+----------+--------------+
</codeblock>
<p>
When you cast the opposite way, from <codeph>BOOLEAN</codeph> to a numeric type,
the result becomes either 1 or 0:
</p>
<codeblock>SELECT CAST(true AS INT) AS true_int, CAST(true AS DOUBLE) AS true_double,
CAST(false AS INT) AS false_int, CAST(false AS DOUBLE) AS false_double;
+----------+-------------+-----------+--------------+
| true_int | true_double | false_int | false_double |
+----------+-------------+-----------+--------------+
| 1 | 1 | 0 | 0 |
+----------+-------------+-----------+--------------+
</codeblock>
<p rev="1.4.0">
<!-- BOOLEAN-to-DECIMAL casting requested in IMPALA-991. As of Sept. 2014, designated "won't fix". -->
You can cast <codeph>DECIMAL</codeph> values to <codeph>BOOLEAN</codeph>, with the same treatment of zero and
non-zero values as the other numeric types. You cannot cast a <codeph>BOOLEAN</codeph> to a
<codeph>DECIMAL</codeph>.
</p>
<p>
You cannot cast a <codeph>STRING</codeph> value to <codeph>BOOLEAN</codeph>, although you can cast a
<codeph>BOOLEAN</codeph> value to <codeph>STRING</codeph>, returning <codeph>'1'</codeph> for
<codeph>true</codeph> values and <codeph>'0'</codeph> for <codeph>false</codeph> values.
</p>
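    <p>
      For example (illustrative; the string forms follow the rule above):
    </p>
    <codeblock>SELECT CAST(true AS STRING) AS true_string, CAST(false AS STRING) AS false_string;
+-------------+--------------+
| true_string | false_string |
+-------------+--------------+
| 1           | 0            |
+-------------+--------------+
</codeblock>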
<p>
Although you can cast a <codeph>TIMESTAMP</codeph> to a <codeph>BOOLEAN</codeph> or a
<codeph>BOOLEAN</codeph> to a <codeph>TIMESTAMP</codeph>, the results are unlikely to be useful. Any non-zero
<codeph>TIMESTAMP</codeph> (that is, any value other than <codeph>1970-01-01 00:00:00</codeph>) becomes
<codeph>TRUE</codeph> when converted to <codeph>BOOLEAN</codeph>, while <codeph>1970-01-01 00:00:00</codeph>
      becomes <codeph>FALSE</codeph>. A value of <codeph>FALSE</codeph> becomes <codeph>1970-01-01
      00:00:00</codeph> when converted to <codeph>TIMESTAMP</codeph>, and <codeph>TRUE</codeph> becomes one second
past this epoch date, that is, <codeph>1970-01-01 00:00:01</codeph>.
</p>
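    <p>
      For example (illustrative; the results follow the rules above):
    </p>
    <codeblock>SELECT CAST(true AS TIMESTAMP) AS t, CAST(false AS TIMESTAMP) AS f;
+---------------------+---------------------+
| t                   | f                   |
+---------------------+---------------------+
| 1970-01-01 00:00:01 | 1970-01-01 00:00:00 |
+---------------------+---------------------+
</codeblock>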
<p conref="../shared/impala_common.xml#common/null_null_arguments"/>
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
<p>
Do not use a <codeph>BOOLEAN</codeph> column as a partition key. Although you can create such a table,
subsequent operations produce errors:
</p>
<codeblock>[localhost:21000] &gt; create table truth_table (assertion string) partitioned by (truth boolean);
[localhost:21000] &gt; insert into truth_table values ('Pigs can fly',false);
ERROR: AnalysisException: INSERT into table with BOOLEAN partition column (truth) is not supported: partitioning.truth_table
</codeblock>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>SELECT 1 &lt; 2;
SELECT 2 = 5;
SELECT 100 &lt; NULL, 100 &gt; NULL;
CREATE TABLE assertions (claim STRING, really BOOLEAN);
INSERT INTO assertions VALUES
("1 is less than 2", 1 &lt; 2),
("2 is the same as 5", 2 = 5),
("Grass is green", true),
("The moon is made of green cheese", false);
SELECT claim FROM assertions WHERE really = TRUE;
</codeblock>
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
<p conref="../shared/impala_common.xml#common/parquet_ok"/>
<p conref="../shared/impala_common.xml#common/text_bulky"/>
<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
<!-- <p conref="../shared/impala_common.xml#common/internals_blurb"/> -->
<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<!-- <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> -->
<!-- <p conref="../shared/impala_common.xml#common/related_info"/> -->
<p>
<b>Related information:</b> <xref href="impala_literals.xml#boolean_literals"/>,
<xref href="impala_operators.xml#operators"/>,
<xref href="impala_conditional_functions.xml#conditional_functions"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,256 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="breakpad" rev="2.6.0 IMPALA-2686 CDH-40238">
<title>Breakpad Minidumps for Impala (<keyword keyref="impala26"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>Breakpad Minidumps</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Support"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p rev="2.6.0 IMPALA-2686 CDH-40238">
The <xref href="https://chromium.googlesource.com/breakpad/breakpad/" scope="external" format="html">breakpad</xref>
project is an open-source framework for crash reporting.
In <keyword keyref="impala26_full"/> and higher, Impala can use <codeph>breakpad</codeph> to record stack information and
register values when any of the Impala-related daemons crash due to an error such as <codeph>SIGSEGV</codeph>
or unhandled exceptions.
The dump files are much smaller than traditional core dump files. The dump mechanism itself uses very little
memory, which improves reliability if the crash occurs while the system is low on memory.
</p>
<note type="important">
Because of the internal mechanisms involving Impala memory allocation and Linux
signalling for out-of-memory (OOM) errors, if an Impala-related daemon experiences a
crash due to an OOM condition, it does <i>not</i> generate a minidump for that error.
</note>
<p outputclass="toc inpage" audience="PDF"/>
</conbody>
<concept id="breakpad_minidump_enable">
<title>Enabling or Disabling Minidump Generation</title>
<conbody>
<p>
By default, a minidump file is generated when an Impala-related daemon crashes.
To turn off generation of the minidump files, change the
<uicontrol>minidump_path</uicontrol> configuration setting of one or more Impala-related daemons
to the empty string, and restart the corresponding services or daemons.
</p>
<p rev="IMPALA-3677 CDH-43745">
In <keyword keyref="impala27_full"/> and higher,
you can send a <codeph>SIGUSR1</codeph> signal to any Impala-related daemon to write a
Breakpad minidump. For advanced troubleshooting, you can now produce a minidump
without triggering a crash.
</p>
</conbody>
</concept>
<concept id="breakpad_minidump_location" rev="IMPALA-3581">
<title>Specifying the Location for Minidump Files</title>
<conbody>
<p>
By default, all minidump files are written to the following location
on the host where a crash occurs:
<!-- Location stated in IMPALA-3581; overridden by different location from IMPALA-2686?
<filepath><varname>log_directory</varname>/minidumps/<varname>daemon_name</varname></filepath> -->
<ul>
<li>
<p>
Clusters managed by Cloudera Manager: <filepath>/var/log/impala-minidumps/<varname>daemon_name</varname></filepath>
</p>
</li>
<li>
<p>
Clusters not managed by Cloudera Manager:
<filepath><varname>impala_log_dir</varname>/<varname>daemon_name</varname>/minidumps/<varname>daemon_name</varname></filepath>
</p>
</li>
</ul>
The minidump files for <cmdname>impalad</cmdname>, <cmdname>catalogd</cmdname>,
and <cmdname>statestored</cmdname> are each written to a separate directory.
</p>
<p>
To specify a different location, set the
<!-- Again, IMPALA-3581 says one thing and IMPALA-2686 / observation of CM interface says another.
<codeph>log_dir</codeph> -->
<uicontrol>minidump_path</uicontrol>
configuration setting of one or more Impala-related daemons, and restart the corresponding services or daemons.
</p>
<p>
If you specify a relative path for this setting, the value is interpreted relative to
the default <uicontrol>minidump_path</uicontrol> directory.
</p>
</conbody>
</concept>
<concept id="breakpad_minidump_number">
<title>Controlling the Number of Minidump Files</title>
<conbody>
<p>
Like any files used for logging or troubleshooting, consider limiting the number of
minidump files, or removing unneeded ones, depending on the amount of free storage
space on the hosts in the cluster.
</p>
<p>
Because the minidump files are only used for problem resolution, you can remove any such files that
are not needed to debug current issues.
</p>
<p>
To control how many minidump files Impala keeps around at any one time,
        set the <uicontrol>max_minidumps</uicontrol> configuration setting for
        one or more Impala-related daemons, and restart the corresponding services or daemons.
The default for this setting is 9. A zero or negative value is interpreted as
<q>unlimited</q>.
</p>
</conbody>
</concept>
<concept id="breakpad_minidump_logging">
<title>Detecting Crash Events</title>
<conbody>
<p>
You can see in the Impala log files or in the Cloudera Manager charts for Impala
when crash events occur that generate minidump files. Because each restart begins
a new log file, the <q>crashed</q> message is always at or near the bottom of the
log file. (There might be another later message if core dumps are also enabled.)
</p>
</conbody>
</concept>
<concept id="breakpad_support_process" rev="CDH-39818">
<title>Using the Minidump Files for Problem Resolution</title>
<conbody>
<p>
Typically, you provide minidump files to <keyword keyref="support_org"/> as part of problem resolution,
        in the same way that you might provide a core dump. The <uicontrol>Send Diagnostic Data</uicontrol> item
under the <uicontrol>Support</uicontrol> menu in Cloudera Manager guides you through the
process of selecting a time period and volume of diagnostic data, then collects the data
from all hosts and transmits the relevant information for you.
</p>
<fig id="fig_pqw_gvx_pr">
<title>Send Diagnostic Data choice under Support menu</title>
<image href="../images/support_send_diagnostic_data.png" scalefit="yes" placement="break"/>
</fig>
<p>
You might get additional instructions from <keyword keyref="support_org"/> about collecting minidumps to better isolate a specific problem.
Because the information in the minidump files is limited to stack traces and register contents,
the possibility of including sensitive information is much lower than with core dump files.
If any sensitive information is included in the minidump, <keyword keyref="support_org"/> preserves the confidentiality of that information.
</p>
</conbody>
</concept>
<concept id="breakpad_demo">
<title>Demonstration of Breakpad Feature</title>
<conbody>
<p>
The following example uses the command <cmdname>kill -11</cmdname> to
simulate a <codeph>SIGSEGV</codeph> crash for an <cmdname>impalad</cmdname>
process on a single DataNode, then examines the relevant log files and minidump file.
</p>
<p>
First, as root on a worker node, we kill the <cmdname>impalad</cmdname> process with a
<codeph>SIGSEGV</codeph> error. The original process ID was 23114. (Cloudera Manager
restarts the process with a new pid, as shown by the second <cmdname>ps</cmdname> command.)
</p>
<codeblock><![CDATA[
# ps ax | grep impalad
23114 ? Sl 0:18 /opt/cloudera/parcels/<parcel_version>/lib/impala/sbin-retail/impalad --flagfile=/var/run/cloudera-scm-agent/process/114-impala-IMPALAD/impala-conf/impalad_flags
31259 pts/0 S+ 0:00 grep impalad
#
# kill -11 23114
#
# ps ax | grep impalad
31374 ? Rl 0:04 /opt/cloudera/parcels/<parcel_version>/lib/impala/sbin-retail/impalad --flagfile=/var/run/cloudera-scm-agent/process/114-impala-IMPALAD/impala-conf/impalad_flags
31475 pts/0 S+ 0:00 grep impalad
]]>
</codeblock>
<p>
We locate the log directory underneath <filepath>/var/log</filepath>.
There is a <codeph>.INFO</codeph>, <codeph>.WARNING</codeph>, and <codeph>.ERROR</codeph>
log file for the 23114 process ID. The minidump message is written to the
<codeph>.INFO</codeph> file and the <codeph>.ERROR</codeph> file, but not the
<codeph>.WARNING</codeph> file. In this case, a large core file was also produced.
</p>
<codeblock><![CDATA[
# cd /var/log/impalad
# ls -la | grep 23114
-rw------- 1 impala impala 3539079168 Jun 23 15:20 core.23114
-rw-r--r-- 1 impala impala 99057 Jun 23 15:20 hs_err_pid23114.log
-rw-r--r-- 1 impala impala 351 Jun 23 15:20 impalad.worker_node_123.impala.log.ERROR.20160623-140343.23114
-rw-r--r-- 1 impala impala 29101 Jun 23 15:20 impalad.worker_node_123.impala.log.INFO.20160623-140343.23114
-rw-r--r-- 1 impala impala 228 Jun 23 14:03 impalad.worker_node_123.impala.log.WARNING.20160623-140343.23114
]]>
</codeblock>
<p>
The <codeph>.INFO</codeph> log includes the location of the minidump file, followed by
        a report of a core dump. With the breakpad minidump feature enabled, you might now
        disable core dumps or keep fewer of them around.
</p>
<codeblock><![CDATA[
# cat impalad.worker_node_123.impala.log.INFO.20160623-140343.23114
...
Wrote minidump to /var/log/impala-minidumps/impalad/0980da2d-a905-01e1-25ff883a-04ee027a.dmp
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00000030c0e0b68a, pid=23114, tid=139869541455968
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libpthread.so.0+0xb68a] pthread_cond_wait+0xca
#
# Core dump written. Default location: /var/log/impalad/core or core.23114
#
# An error report file with more information is saved as:
# /var/log/impalad/hs_err_pid23114.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
...
# cat impalad.worker_node_123.impala.log.ERROR.20160623-140343.23114
Log file created at: 2016/06/23 14:03:43
Running on machine:.worker_node_123
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0623 14:03:43.911002 23114 logging.cc:118] stderr will be logged to this file.
Wrote minidump to /var/log/impala-minidumps/impalad/0980da2d-a905-01e1-25ff883a-04ee027a.dmp
]]>
</codeblock>
<p>
The resulting minidump file is much smaller than the corresponding core file,
making it much easier to supply diagnostic information to <keyword keyref="support_org"/>.
The transmission process for the minidump files is automated through Cloudera Manager.
</p>
<codeblock><![CDATA[
# pwd
/var/log/impalad
# cd ../impala-minidumps/impalad
# ls
0980da2d-a905-01e1-25ff883a-04ee027a.dmp
# du -kh *
2.4M 0980da2d-a905-01e1-25ff883a-04ee027a.dmp
]]>
</codeblock>
</conbody>
</concept>
</concept>


@@ -0,0 +1,25 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_cdh">
<title>How Impala Works with CDH</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="CDH"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p conref="../shared/impala_common.xml#common/impala_overview_diagram"/>
<p conref="../shared/impala_common.xml#common/component_list"/>
<p conref="../shared/impala_common.xml#common/query_overview"/>
</conbody>
</concept>


@@ -0,0 +1,278 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="char" rev="2.0.0">
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p rev="2.0.0">
<indexterm audience="Cloudera">CHAR data type</indexterm>
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
values are longer than the specified length, Impala truncates any trailing characters.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
<p>
The maximum length you can specify is 255.
</p>
<p>
<b>Semantics of trailing spaces:</b>
</p>
<ul>
<li>
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
the column definition.
</li>
<li>
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
not stored in the data file. When the value is retrieved by a query, the result could have a different
number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
specified length of the column.
</li>
<li>
If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
values are considered identical.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/partitioning_bad"/>
<p conref="../shared/impala_common.xml#common/hbase_no"/>
<p conref="../shared/impala_common.xml#common/parquet_blurb"/>
<ul>
<li>
This type can be read from and written to Parquet files.
</li>
<li>
There is no requirement for a particular level of Parquet.
</li>
<li>
Parquet files generated by Impala and containing this type can be freely interchanged with other components
such as Hive and MapReduce.
</li>
<li>
Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
</li>
<li>
Parquet data files might contain values that are longer than allowed by the
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
it processes those values during a query.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/text_blurb"/>
<p>
Text data files might contain values that are longer than allowed for a particular
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
processes those values during a query. Text data files can also contain values that are shorter than the
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
<codeph>CHAR</codeph> columns.
</p>
<p><b>Avro considerations:</b></p>
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<p>
This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are
no compatibility issues with other components when exchanging data files or running Impala on CDH 4.
</p>
<p>
Some other database systems make the length specification optional. For Impala, the length is required.
</p>
<!--
<p>
The Impala maximum length is larger than for the <codeph>CHAR</codeph> data type in Hive.
If a Hive query encounters a <codeph>CHAR</codeph> value longer than 255 during processing,
it silently treats the value as length 255.
</p>
-->
<p conref="../shared/impala_common.xml#common/internals_max_bytes"/>
<p conref="../shared/impala_common.xml#common/added_in_20"/>
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<!-- Seems like a logical design decision but don't think it's currently implemented like this.
<p>
Because both the maximum and average length are always known and always the same for
any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
<codeph>COMPUTE STATS</codeph> on the table.
</p>
-->
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples show how trailing spaces are not considered significant when comparing or processing
<codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
trailing spaces.
</p>
<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
+--------------------------+
| unpadded equal to padded |
+--------------------------+
| true |
+--------------------------+
create table char_length(c char(3));
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
select concat("[",c,"]") as c, length(c) from char_length;
+-------+-----------+
| c | length(c) |
+-------+-----------+
| [1 ] | 3 |
| [12 ] | 3 |
| [123] | 3 |
| [123] | 3 |
+-------+-----------+
</codeblock>
<p>
This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
is a logical data type to use.
<!--
Because all the <codeph>CHAR</codeph> values have a constant predictable length,
Impala can efficiently analyze how best to use these values in join queries,
aggregation queries, and other contexts where column length is significant.
-->
</p>
<codeblock>create table addresses
(id bigint,
street_name string,
state_abbreviation char(2),
country_abbreviation char(2));
</codeblock>
<p>
The following example shows how values written by Impala do not physically include the trailing spaces. It
creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
same behavior applies to binary-format Parquet data files.
</p>
<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
row format delimited fields terminated by ',';
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
-- Running this Linux command inside impala-shell using the ! shortcut.
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
foo,bar,baz
hello,goodbye,aloha
</codeblock>
<p>
The following example further illustrates the treatment of spaces. It replaces the contents of the previous
table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
</p>
<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
trailing, leading and trailing, leading
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
+------------------------+----------------------------------+--------------------------------------------+
| a | b | c |
+------------------------+----------------------------------+--------------------------------------------+
| [trailing ] | [ leading and trailing ] | [ leading ] |
+------------------------+----------------------------------+--------------------------------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p>
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
that column.
</p>
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
<p>
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
first, with trailing spaces preserved.
</p>
<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| false |
+----------------------+
</codeblock>
<p>
This behavior differs from other popular database systems. To get the expected result of
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
length:
</p>
<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| true |
+----------------------+
</codeblock>
<p>
This behavior is subject to change in future releases.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_string.xml#string"/>, <xref href="impala_varchar.xml#varchar"/>,
<xref href="impala_literals.xml#string_literals"/>,
<xref href="impala_string_functions.xml#string_functions"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,353 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="cluster_sizing">
<title>Cluster Sizing Guidelines for Impala</title>
<titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Clusters"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Sizing"/>
<data name="Category" value="Deploying"/>
<!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
<data name="Category" value="Sectionated Pages"/>
<data name="Category" value="Memory"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Proof of Concept"/>
<data name="Category" value="Requirements"/>
<data name="Category" value="Guidelines"/>
<data name="Category" value="Best Practices"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">cluster sizing</indexterm>
      This topic provides a very rough guideline to estimate the size of a cluster needed for a specific
customer application. You can use this information when planning how much and what type of hardware to
acquire for a new cluster, or when adding Impala workloads to an existing cluster.
</p>
<note>
Before making purchase or deployment decisions, consult your Cloudera representative to verify the
conclusions about hardware requirements based on your data volume and workload.
</note>
<!-- <p outputclass="toc inpage"/> -->
<p>
Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
Because work can be distributed in different ways for different queries, if some hosts are overloaded
compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
      and overall slowness.
</p>
<p>
For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM,
      12 x 2 TB hard drives, 2 x E5-2630L CPUs with 12 cores total, 10 Gbps network), the following table estimates the number of
DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads
similar to TPC-DS benchmark queries:
</p>
<table>
<title>Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</title>
<tgroup cols="6">
<colspec colnum="1" colname="col1"/>
<colspec colnum="2" colname="col2"/>
<colspec colnum="3" colname="col3"/>
<colspec colnum="4" colname="col4"/>
<colspec colnum="5" colname="col5"/>
<colspec colnum="6" colname="col6"/>
<thead>
<row>
<entry>
Data Size
</entry>
<entry>
1 query
</entry>
<entry>
10 queries
</entry>
<entry>
100 queries
</entry>
<entry>
1000 queries
</entry>
<entry>
2000 queries
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<b>250 GB</b>
</entry>
<entry>
2
</entry>
<entry>
2
</entry>
<entry>
5
</entry>
<entry>
35
</entry>
<entry>
70
</entry>
</row>
<row>
<entry>
<b>500 GB</b>
</entry>
<entry>
2
</entry>
<entry>
2
</entry>
<entry>
10
</entry>
<entry>
70
</entry>
<entry>
135
</entry>
</row>
<row>
<entry>
<b>1 TB</b>
</entry>
<entry>
2
</entry>
<entry>
2
</entry>
<entry>
15
</entry>
<entry>
135
</entry>
<entry>
270
</entry>
</row>
<row>
<entry>
<b>15 TB</b>
</entry>
<entry>
2
</entry>
<entry>
20
</entry>
<entry>
200
</entry>
<entry>
N/A
</entry>
<entry>
N/A
</entry>
</row>
<row>
<entry>
<b>30 TB</b>
</entry>
<entry>
4
</entry>
<entry>
40
</entry>
<entry>
400
</entry>
<entry>
N/A
</entry>
<entry>
N/A
</entry>
</row>
<row>
<entry>
<b>60 TB</b>
</entry>
<entry>
8
</entry>
<entry>
80
</entry>
<entry>
800
</entry>
<entry>
N/A
</entry>
<entry>
N/A
</entry>
</row>
</tbody>
</tgroup>
</table>
<section id="sizing_factors">
<title>Factors Affecting Scalability</title>
<p>
A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
memory.
</p>
<p>
        If the workload is already network-bound (on a 10 Gbps network), increasing the cluster size won't reduce
the network load; in fact, a larger cluster could increase network traffic because some queries involve
<q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
throughput in a network-constrained environment.
</p>
<p>
        Let's look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor
network is saturated yet. This can happen because currently Impala uses only a single core per node to
process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system
cannot run more than 2 such queries at the same time.
</p>
<p>
Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound
        workload. It's just that the CPU will not be saturated. Per-node throughput will be lower than 1.6
GB/sec. Consider increasing the memory per node.
</p>
<p>
As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the
throughput estimate.
</p>
</section>
<section id="sizing_details">
<title>A More Precise Approach</title>
<p>
A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:
</p>
<codeblock>Eq 1: N &gt; QPM * D / 100 GB
</codeblock>
<p>
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400. We can
estimate the number of nodes using the equation above.
</p>
<codeblock>N &gt; QPM * D / 100GB
N &gt; 400 * 50GB / 100GB
N &gt; 200
</codeblock>
<p>
Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
</p>
<p>
Depending on the complexity of the query, the processing rate might change. If the query has more
joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans and
filtering on numeric columns, the processing rate can be higher.
</p>
</section>
<section id="sizing_mem_estimate">
<title>Estimating Memory Requirements</title>
<!--
<prolog>
<metadata>
<data name="Category" value="Memory"/>
</metadata>
</prolog>
-->
<p>
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
below to calculate the minimum memory requirement.
</p>
<p>
Suppose you are running the following join:
</p>
<codeblock>select a.*, b.col_1, b.col_2, … b.col_n
from a, b
where a.key = b.key
and b.col_1 in (1,2,4...)
and b.col_4 in (....);
</codeblock>
<p>
And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
</p>
<p>
The memory requirement for the query is that the right-hand table (<codeph>B</codeph>), after decompression,
filtering (<codeph>b.col_n in ...</codeph>), and projection (keeping only the referenced columns), must fit in
the total memory of the entire cluster.
</p>
<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
</codeblock>
<p>
In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
predicate on <codeph>B</codeph> (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>) selects only 10% of
the rows from <codeph>B</codeph>, and the projection uses only 5 of the 200 columns. Snappy compression
typically achieves about 3x compression, so we use a factor of 3 to estimate the decompressed size.
</p>
<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
= 100TB * 10% * 5/200 * 3
= 0.75TB
= 750GB
</codeblock>
<p>
So, if you have a 10-node cluster where each node has 128 GB of RAM and you give 80% of it to Impala, you have
about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle join
queries of this magnitude.
</p>
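      <p>
        Before relying on this estimate, collect statistics on both tables so that Impala can plan the join
        efficiently. A minimal sequence, using the same hypothetical table names as the example above, might
        look like this:
      </p>
<codeblock>compute stats a;
compute stats b;
-- Then run the join query shown earlier.
</codeblock>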
</section>
</conbody>
</concept>


@@ -0,0 +1,56 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="cm_installation">
<title>Installing Impala with Cloudera Manager</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Installing"/>
<data name="Category" value="Cloudera Manager"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p>
Before installing Impala through the Cloudera Manager interface, make sure all applicable nodes have the
appropriate hardware configuration and levels of operating system and CDH. See
<xref href="impala_prereqs.xml#prereqs"/> for details.
</p>
<note rev="1.2.0">
<p rev="1.2.0">
To install the latest Impala under CDH 4, upgrade Cloudera Manager to 4.8 or higher. Cloudera Manager 4.8 is
the first release that can manage the Impala catalog service introduced in Impala 1.2. Cloudera Manager 4.8
requires this service to be present, so if you upgrade to Cloudera Manager 4.8, also upgrade Impala to the
most recent version at the same time.
<!-- Not so relevant now for 1.1.1, but maybe someday we'll capture all this history in a compatibility grid.
Upgrade to Cloudera Manager 4.6.2 or higher to enable Cloudera Manager to
handle access control for the Impala web UI, available by default through
port 25000 on each Impala host.
-->
</p>
</note>
<p>
For information on installing Impala in a Cloudera Manager-managed environment, see
<xref audience="integrated" href="cm_ig_install_impala.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_impala.html" scope="external" format="html">Installing Impala</xref>.
</p>
<p>
Managing your Impala installation through Cloudera Manager has a number of advantages. For example, when you
make configuration changes to CDH components using Cloudera Manager, it automatically applies changes to the
copies of configuration files, such as <codeph>hive-site.xml</codeph>, that Impala keeps under
<filepath>/etc/impala/conf</filepath>. It also sets up the Hive Metastore service that is required for
Impala running under CDH 4.1.
</p>
<p>
In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular
component configuration details in some of the free-form option fields on the Impala configuration pages
within Cloudera Manager. <ph conref="../shared/impala_common.xml#common/safety_valve"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,53 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="comments">
<title>Comments</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">comments (SQL)</indexterm>
Impala supports the familiar styles of SQL comments:
</p>
<ul>
<li>
All text from a <codeph>--</codeph> sequence to the end of the line is considered a comment and ignored.
This type of comment can occur on a single line by itself, or after all or part of a statement.
</li>
<li>
All text from a <codeph>/*</codeph> sequence to the next <codeph>*/</codeph> sequence is considered a
comment and ignored. This type of comment can stretch over multiple lines. This type of comment can occur
on one or more lines by itself, in the middle of a statement, or before or after a statement.
</li>
</ul>
<p>
For example:
</p>
<codeblock>-- This line is a comment about a table.
create table ...;
/*
This is a multi-line comment about a query.
*/
select ...;
select * from t /* This is an embedded comment about a query. */ where ...;
select * from t -- This is a trailing comment within a multi-line command.
where ...;
</codeblock>
</conbody>
</concept>

File diff suppressed because it is too large


@@ -0,0 +1,180 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_components">
<title>Components of the Impala Server</title>
<titlealts audience="PDF"><navtitle>Components</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of
different daemon processes that run on specific hosts within your CDH cluster.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="intro_impalad">
<title>The Impala Daemon</title>
<conbody>
<p>
The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented
by the <codeph>impalad</codeph> process. It reads and writes to data files; accepts queries transmitted
from the <codeph>impala-shell</codeph> command, Hue, JDBC, or ODBC; parallelizes the queries and
distributes work across the cluster; and transmits intermediate query results back to the
central coordinator node.
</p>
<p>
You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the
<term>coordinator node</term> for that query. The other nodes transmit partial results back to the
coordinator, which constructs the final result set for a query. When running experiments with functionality
through the <codeph>impala-shell</codeph> command, you might always connect to the same Impala daemon for
convenience. For clusters running production workloads, you might load-balance by
submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
</p>
<p>
The Impala daemons are in constant communication with the <term>statestore</term>, to confirm which nodes
are healthy and can accept new work.
</p>
<p rev="1.2">
They also receive broadcast messages from the <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2)
whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an
<codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statement is processed through Impala. This
background communication minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
METADATA</codeph> statements that were needed to coordinate metadata across nodes prior to Impala 1.2.
</p>
<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>,
<xref href="impala_ports.xml#ports"/>, <xref href="impala_proxy.xml#proxy"/>
</p>
</conbody>
</concept>
<concept id="intro_statestore">
<title>The Impala Statestore</title>
<conbody>
<p>
The Impala component known as the <term>statestore</term> checks on the health of Impala daemons on all the
DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically
represented by a daemon process named <codeph>statestored</codeph>; you only need such a process on one
host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue,
or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making
requests to the unreachable node.
</p>
<p>
Because the statestore's purpose is to help when things go wrong, it is not critical to the normal
operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons
continue running and distributing work among themselves as usual; the cluster just becomes less robust if
other Impala daemons fail while the statestore is offline. When the statestore comes back online, it re-establishes
communication with the Impala daemons and resumes its monitoring function.
</p>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<p>
<b>Related information:</b>
</p>
<p>
<xref href="impala_scalability.xml#statestore_scalability"/>,
<xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>,
<xref href="impala_timeouts.xml#statestore_timeout"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
<concept rev="1.2" id="intro_catalogd">
<title>The Impala Catalog Service</title>
<conbody>
<p>
The Impala component known as the <term>catalog service</term> relays the metadata changes from Impala SQL
statements to all the DataNodes in a cluster. It is physically represented by a daemon process named
<codeph>catalogd</codeph>; you only need such a process on one host in the cluster. Because the requests
are passed through the statestore daemon, it makes sense to run the <cmdname>statestored</cmdname> and
<cmdname>catalogd</cmdname> services on the same host.
</p>
<p>
The catalog service avoids the need to issue
<codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements when the metadata changes are
performed by statements issued through Impala. When you create a table, load data, and so on through Hive,
you do need to issue <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an Impala node
before executing a query there.
</p>
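      <p>
        For example, after adding data files through Hive or directly in HDFS, a single statement on any one
        Impala node is enough to pick up the change. (The table name here is purely illustrative.)
      </p>
<codeblock>-- After new data files are added to an existing table outside Impala:
refresh sales_data;
-- After a new table is created through Hive:
invalidate metadata sales_data;
</codeblock>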
<p>
This feature touches a number of aspects of Impala:
</p>
<!-- This was formerly a conref, but since the list of links also included a link
to this same topic, materializing the list here and removing that
circular link. (The conref is still used in Incompatible Changes.)
<ul conref="../shared/impala_common.xml#common/catalogd_xrefs">
<li/>
</ul>
-->
<ul id="catalogd_xrefs">
<li>
<p>
See <xref href="impala_install.xml#install"/>, <xref href="impala_upgrading.xml#upgrading"/> and
<xref href="impala_processes.xml#processes"/>, for usage information for the
<cmdname>catalogd</cmdname> daemon.
</p>
</li>
<li>
<p>
The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements are not needed
when the <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other table-changing or
data-changing operation is performed through Impala. These statements are still needed if such
operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the
statements only need to be issued on one Impala node rather than on all nodes. See
<xref href="impala_refresh.xml#refresh"/> and
<xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> for the latest usage information for
those statements.
</p>
</li>
</ul>
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<note>
<p conref="../shared/impala_common.xml#common/catalog_server_124"/>
</note>
<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
</concept>


@@ -0,0 +1,98 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="compression_codec">
<title>COMPRESSION_CODEC Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>COMPRESSION_CODEC</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Compression"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="Snappy"/>
<data name="Category" value="Gzip"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<!-- The initial part of this paragraph is copied straight from the #parquet_compression topic. -->
<!-- Could turn into a conref. -->
<p rev="2.0.0">
<indexterm audience="Cloudera">COMPRESSION_CODEC query option</indexterm>
When Impala writes Parquet data files using the <codeph>INSERT</codeph> statement, the underlying compression
is controlled by the <codeph>COMPRESSION_CODEC</codeph> query option.
</p>
<note>
Prior to Impala 2.0, this option was named <codeph>PARQUET_COMPRESSION_CODEC</codeph>. In Impala 2.0 and
later, the <codeph>PARQUET_COMPRESSION_CODEC</codeph> name is not recognized. Use the more general name
<codeph>COMPRESSION_CODEC</codeph> for new code.
</note>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>SET COMPRESSION_CODEC=<varname>codec_name</varname>;</codeblock>
<p>
The allowed values for this query option are <codeph>SNAPPY</codeph> (the default), <codeph>GZIP</codeph>,
and <codeph>NONE</codeph>.
</p>
<note>
A Parquet file created with <codeph>COMPRESSION_CODEC=NONE</codeph> is still typically smaller than the
original data, due to encoding schemes such as run-length encoding and dictionary encoding that are applied
separately from compression.
</note>
<p>
The option value is not case-sensitive.
</p>
<p>
If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option
setting, not just queries involving Parquet tables. (The value <codeph>BZIP2</codeph> is also recognized, but
is not compatible with Parquet tables.)
</p>
<p>
<b>Type:</b> <codeph>STRING</codeph>
</p>
<p>
<b>Default:</b> <codeph>SNAPPY</codeph>
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>set compression_codec=gzip;
insert into parquet_table_highly_compressed select * from t1;
set compression_codec=snappy;
insert into parquet_table_compression_plus_fast_queries select * from t1;
set compression_codec=none;
insert into parquet_table_no_compression select * from t1;
set compression_codec=foo;
select * from t1 limit 5;
ERROR: Invalid compression codec: foo
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
For information about how compressing Parquet data files affects query performance, see
<xref href="impala_parquet.xml#parquet_compression"/>.
</p>
</conbody>
</concept>


@@ -0,0 +1,432 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2.2" id="compute_stats">
<title>COMPUTE STATS Statement</title>
<titlealts audience="PDF"><navtitle>COMPUTE STATS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">COMPUTE STATS statement</indexterm>
Gathers information about volume and distribution of data in a table and all associated columns and
partitions. The information is stored in the metastore database, and used by Impala to help optimize queries.
For example, if Impala can determine that a table is large or small, or has many or few distinct values, it
can organize and parallelize the work appropriately for a join query or insert operation. For details about the
kinds of information gathered by this statement, see <xref href="impala_perf_stats.xml#perf_stats"/>.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname>
COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)]
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
</codeblock>
<p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Originally, Impala relied on users to run the Hive <codeph>ANALYZE TABLE</codeph> statement, but that method
of gathering statistics proved unreliable and difficult to use. The Impala <codeph>COMPUTE STATS</codeph>
statement is built from the ground up to improve the reliability and user-friendliness of this operation.
<codeph>COMPUTE STATS</codeph> does not require any setup steps or special configuration. You only run a
single Impala <codeph>COMPUTE STATS</codeph> statement to gather both table and column statistics, rather
than separate Hive <codeph>ANALYZE TABLE</codeph> statements for each kind of statistics.
</p>
<p rev="2.1.0">
The <codeph>COMPUTE INCREMENTAL STATS</codeph> variation is a shortcut for partitioned tables that works on a
subset of partitions rather than the entire table. The incremental nature makes it suitable for large tables
with many partitions, where a full <codeph>COMPUTE STATS</codeph> operation takes too long to be practical
each time a partition is added or dropped. See <xref href="impala_perf_stats.xml#perf_stats_incremental"/>
for full usage details.
</p>
<p>
<codeph>COMPUTE INCREMENTAL STATS</codeph> only applies to partitioned tables. If you use the
<codeph>INCREMENTAL</codeph> clause for an unpartitioned table, Impala automatically uses the original
<codeph>COMPUTE STATS</codeph> statement. Such tables display <codeph>false</codeph> under the
<codeph>Incremental stats</codeph> column of the <codeph>SHOW TABLE STATS</codeph> output.
</p>
<note>
Because many of the most performance-critical and resource-intensive operations rely on table and column
statistics to construct accurate and efficient plans, <codeph>COMPUTE STATS</codeph> is an important step at
the end of your ETL process. Run <codeph>COMPUTE STATS</codeph> on all tables as your first step during
performance tuning for slow queries, or troubleshooting for out-of-memory conditions:
<ul>
<li>
Accurate statistics help Impala construct an efficient query plan for join queries, improving performance
and reducing memory usage.
</li>
<li>
Accurate statistics help Impala distribute the work effectively for insert operations into Parquet
tables, improving performance and reducing memory usage.
</li>
<li rev="1.3.0">
Accurate statistics help Impala estimate the memory required for each query, which is important when you
use resource management features, such as admission control and the YARN resource management framework.
The statistics help Impala to achieve high concurrency, full utilization of available memory, and avoid
contention with workloads from other Hadoop components.
</li>
</ul>
</note>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p rev="2.3.0">
Currently, the statistics created by the <codeph>COMPUTE STATS</codeph> statement do not include
information about complex type columns. The column stats metrics for complex columns are always shown
as -1. For queries involving complex type columns, Impala uses
heuristics to estimate the data distribution within such columns.
</p>
<p conref="../shared/impala_common.xml#common/hbase_blurb"/>
<p>
<codeph>COMPUTE STATS</codeph> works for HBase tables also. The statistics gathered for HBase tables are
somewhat different than for HDFS-backed tables, but that metadata is still used for optimization when HBase
tables are involved in join queries.
</p>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.2.0">
<codeph>COMPUTE STATS</codeph> also works for tables where data resides in the Amazon Simple Storage Service (S3).
See <xref href="impala_s3.xml#s3"/> for details.
</p>
<p conref="../shared/impala_common.xml#common/performance_blurb"/>
<p>
The statistics collected by <codeph>COMPUTE STATS</codeph> are used to optimize join queries,
<codeph>INSERT</codeph> operations into Parquet tables, and other resource-intensive kinds of SQL statements.
See <xref href="impala_perf_stats.xml#perf_stats"/> for details.
</p>
<p>
For large tables, the <codeph>COMPUTE STATS</codeph> statement itself might take a long time and you
might need to tune its performance. The <codeph>COMPUTE STATS</codeph> statement does not work with the
<codeph>EXPLAIN</codeph> statement, or the <codeph>SUMMARY</codeph> command in <cmdname>impala-shell</cmdname>.
You can use the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> to examine timing information
for the statement as a whole. If a basic <codeph>COMPUTE STATS</codeph> statement takes a long time for a
partitioned table, consider switching to the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax so that only
newly added partitions are analyzed each time.
</p>
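    <p>
      For example, to see where the time went for a statistics operation that seemed slow, you might run the
      statement and then examine its profile in <cmdname>impala-shell</cmdname>. (The table name here is just
      an illustration.)
    </p>
<codeblock>compute incremental stats web_logs;
profile;
</codeblock>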
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
This example shows two tables, <codeph>T1</codeph> and <codeph>T2</codeph>, with a small number of distinct
values linked by a parent-child relationship between <codeph>T1.ID</codeph> and <codeph>T2.PARENT</codeph>.
<codeph>T1</codeph> is tiny, while <codeph>T2</codeph> has approximately 100K rows. Initially, the statistics
include only physical measurements such as the number of files, the total size, and size measurements for
fixed-length columns such as those of <codeph>INT</codeph> type. Unknown values are represented by -1. After
running <codeph>COMPUTE STATS</codeph> for each table, much more information is available through the
<codeph>SHOW STATS</codeph> statements. If you were running a join query involving both of these tables, you
would need statistics for both tables to get the most effective optimization for the query.
</p>
<!-- Note: chopped off any excess characters at position 87 and after,
to avoid weird wrapping in PDF.
Applies to any subsequent examples with output from SHOW ... STATS too. -->
<codeblock>[localhost:21000] &gt; show table stats t1;
Query: show table stats t1
+-------+--------+------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+------+--------+
| -1 | 1 | 33B | TEXT |
+-------+--------+------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] &gt; show table stats t2;
Query: show table stats t2
+-------+--------+----------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+----------+--------+
| -1 | 28 | 960.00KB | TEXT |
+-------+--------+----------+--------+
Returned 1 row(s) in 0.01s
[localhost:21000] &gt; show column stats t1;
Query: show column stats t1
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| id | INT | -1 | -1 | 4 | 4 |
| s | STRING | -1 | -1 | -1 | -1 |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 1.71s
[localhost:21000] &gt; show column stats t2;
Query: show column stats t2
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| parent | INT | -1 | -1 | 4 | 4 |
| s | STRING | -1 | -1 | -1 | -1 |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.01s
[localhost:21000] &gt; compute stats t1;
Query: compute stats t1
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 5.30s
[localhost:21000] &gt; show table stats t1;
Query: show table stats t1
+-------+--------+------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+------+--------+
| 3 | 1 | 33B | TEXT |
+-------+--------+------+--------+
Returned 1 row(s) in 0.01s
[localhost:21000] &gt; show column stats t1;
Query: show column stats t1
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| id | INT | 3 | -1 | 4 | 4 |
| s | STRING | 3 | -1 | -1 | -1 |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
[localhost:21000] &gt; compute stats t2;
Query: compute stats t2
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 5.70s
[localhost:21000] &gt; show table stats t2;
Query: show table stats t2
+-------+--------+----------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+----------+--------+
| 98304 | 1 | 960.00KB | TEXT |
+-------+--------+----------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] &gt; show column stats t2;
Query: show column stats t2
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| parent | INT | 3 | -1 | 4 | 4 |
| s | STRING | 6 | -1 | 14 | 9.3 |
+--------+--------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.01s</codeblock>
<p rev="2.1.0">
The following example shows how to use the <codeph>INCREMENTAL</codeph> clause, available in Impala 2.1.0 and
higher. The <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax lets you collect statistics for newly added or
changed partitions, without rescanning the entire table.
</p>
<codeblock>-- Initially the table has no incremental stats, as indicated
-- by -1 under #Rows and false under Incremental stats.
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | -1 | 1 | 223.74KB | NOT CACHED | PARQUET | false
| Children | -1 | 1 | 230.05KB | NOT CACHED | PARQUET | false
| Electronics | -1 | 1 | 232.67KB | NOT CACHED | PARQUET | false
| Home | -1 | 1 | 232.56KB | NOT CACHED | PARQUET | false
| Jewelry | -1 | 1 | 223.72KB | NOT CACHED | PARQUET | false
| Men | -1 | 1 | 231.25KB | NOT CACHED | PARQUET | false
| Music | -1 | 1 | 237.90KB | NOT CACHED | PARQUET | false
| Shoes | -1 | 1 | 234.90KB | NOT CACHED | PARQUET | false
| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false
| Women | -1 | 1 | 226.27KB | NOT CACHED | PARQUET | false
| Total | -1 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
-- After the first COMPUTE INCREMENTAL STATS,
-- all partitions have stats.
compute incremental stats item_partitioned;
+-------------------------------------------+
| summary |
+-------------------------------------------+
| Updated 10 partition(s) and 21 column(s). |
+-------------------------------------------+
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
-- Add a new partition...
alter table item_partitioned add partition (i_category='Camping');
-- Add or replace files in HDFS outside of Impala,
-- rendering the stats for a partition obsolete.
!import_data_into_sports_partition.sh
refresh item_partitioned;
drop incremental stats item_partitioned partition (i_category='Sports');
-- Now some partitions have incremental stats
-- and some do not.
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Camping | -1 | 1 | 408.02KB | NOT CACHED | PARQUET | false
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 11 | 2.65MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
-- After another COMPUTE INCREMENTAL STATS,
-- all partitions have incremental stats, and only the 2
-- partitions without incremental stats were scanned.
compute incremental stats item_partitioned;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 2 partition(s) and 21 column(s). |
+------------------------------------------+
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Camping | 5328 | 1 | 408.02KB | NOT CACHED | PARQUET | true
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 11 | 2.65MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
</codeblock>
<p conref="../shared/impala_common.xml#common/file_format_blurb"/>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with tables created with any of the file formats supported
by Impala. See <xref href="impala_file_formats.xml#file_formats"/> for details about working with the
different file formats. The following considerations apply to <codeph>COMPUTE STATS</codeph> depending on the
file format of the table.
</p>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with text tables with no restrictions. These tables can be
created through either Impala or Hive.
</p>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with Parquet tables. These tables can be created through
either Impala or Hive.
</p>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with Avro tables without restriction in CDH 5.4 / Impala 2.2
and higher. In earlier releases, <codeph>COMPUTE STATS</codeph> worked only for Avro tables created through Hive,
and required the <codeph>CREATE TABLE</codeph> statement to use SQL-style column names and types rather than an
Avro-style schema specification.
</p>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with RCFile tables with no restrictions. These tables can
be created through either Impala or Hive.
</p>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with SequenceFile tables with no restrictions. These
tables can be created through either Impala or Hive.
</p>
<p>
The <codeph>COMPUTE STATS</codeph> statement works with partitioned tables, whether all the partitions use
the same file format, or some partitions are defined through <codeph>ALTER TABLE</codeph> to use different
file formats.
</p>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_maybe"/>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p conref="../shared/impala_common.xml#common/decimal_no_stats"/>
<note conref="../shared/impala_common.xml#common/compute_stats_nulls"/>
<p conref="../shared/impala_common.xml#common/internals_blurb"/>
<p>
Behind the scenes, the <codeph>COMPUTE STATS</codeph> statement
executes two statements: one to count the rows of each partition
in the table (or the entire table if unpartitioned) through the
<codeph>COUNT(*)</codeph> function,
and another to count the approximate number of distinct values
in each column through the <codeph>NDV()</codeph> function.
You might see these queries in your monitoring and diagnostic displays.
The same factors that affect the performance, scalability, and
execution of other queries (such as parallel execution, memory usage,
admission control, and timeouts) also apply to the queries run by the
<codeph>COMPUTE STATS</codeph> statement.
</p>
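    <p>
      The internally generated queries are roughly equivalent to statements such as the following, shown here
      only to illustrate what you might notice in monitoring displays; the exact SQL that Impala generates can
      differ:
    </p>
<codeblock>-- Row count for the table (or for each partition):
select count(*) from t1;
-- Approximate number of distinct values in each column:
select ndv(id), ndv(s) from t1;
</codeblock>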
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have read
permission for all affected files in the source directory:
all files in the case of an unpartitioned table or
a partitioned table in the case of <codeph>COMPUTE STATS</codeph>;
or all the files in partitions without incremental stats in
the case of <codeph>COMPUTE INCREMENTAL STATS</codeph>.
It must also have read and execute permissions for all
relevant directories holding the data files.
(Essentially, <codeph>COMPUTE STATS</codeph> requires the
same permissions as the underlying <codeph>SELECT</codeph> queries it runs
against the table.)
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_drop_stats.xml#drop_stats"/>, <xref href="impala_show.xml#show_table_stats"/>,
<xref href="impala_show.xml#show_column_stats"/>, <xref href="impala_perf_stats.xml#perf_stats"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,296 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="concepts">
<title>Impala Concepts and Architecture</title>
<titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Stub Pages"/>
</metadata>
</prolog>
<conbody>
<draft-comment author="-dita-use-conref-target" audience="integrated"
conref="../shared/cdh_cm_common.xml#id_dgz_rhr_kv/draft-comment-test"/>
<p>
The following sections provide background information to help you become productive using Impala and
its features. Where appropriate, the explanations include context to help understand how aspects of Impala
relate to other technologies you might already be familiar with, such as relational database management
systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase.
</p>
<p outputclass="toc"/>
</conbody>
<!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. -->
<concept id="intro_data_lifecycle" audience="Cloudera">
<title>Overview of the Data Lifecycle for Impala</title>
<conbody/>
</concept>
<concept id="intro_etl" audience="Cloudera">
<title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody/>
</concept>
<concept id="intro_hadoop_data" audience="Cloudera">
<title>How Impala Works with Hadoop Data Files</title>
<conbody/>
</concept>
<concept id="intro_web_ui" audience="Cloudera">
<title>Overview of the Impala Web Interface</title>
<conbody/>
</concept>
<concept id="intro_bi" audience="Cloudera">
<title>Using Impala with Business Intelligence Tools</title>
<conbody/>
</concept>
<concept id="intro_ha" audience="Cloudera">
<title>Overview of Impala Availability and Fault Tolerance</title>
<conbody/>
</concept>
<!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance",
and if it should be split out into a separate file, and then take out the audience= attribute
to make it visible.
-->
<concept id="intro_llvm" audience="Cloudera">
<title>Overview of Impala Runtime Code Generation</title>
<conbody>
<!-- Adapted from the CIDR15 paper written by the Impala team. -->
<p>
Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time
(JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation
technique improves query execution times by generating native code optimized for the architecture of each
host in your particular cluster. Performance gains of 5 times or more are typical for representative
workloads.
</p>
<p>
Impala uses runtime code generation to produce query-specific versions of functions that are critical to
performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
parse each record into Impala's in-memory tuple format. For queries scanning large tables, billions of
records could result in billions of function calls. This function must therefore be extremely efficient for
good query performance, and removing even a few instructions from each function call can result in large
query speedups.
</p>
<p>
Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
Inlining is especially valuable for functions used internally to evaluate expressions, where the function
call itself is more expensive than the function body (for example, a function that adds two numbers).
Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
optimizations such as subexpression elimination across expressions.
</p>
<p>
Impala generates runtime query code automatically, so you do not need to do anything special to get this
performance benefit. This technique is most effective for complex and long-running queries that process
large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
feature to avoid the overhead of compilation time for each query. In this case, issue the statement
<codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
current session.
</p>
<!--
<p>
Without code generation,
functions tend to be suboptimal
to handle situations that cannot be predicted in advance.
For example,
a record-parsing function that
only handles integer types will be faster at parsing an integer-only file
than a function that handles other data types
such as strings and floating-point numbers.
However, the schemas of the files to
be scanned are unknown at compile time,
and so a general-purpose function must be used, even if at runtime
it is known that more limited functionality is sufficient.
</p>
<p>
A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance
penalty, particularly when the called function is very simple, as the calls cannot be inlined.
If the type of the object instance is known at runtime, we can use code generation to replace the virtual
function call with a call directly to the correct function, which can then be inlined. This is especially
valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a
tree of individual operators and functions.
</p>
<p>
Each type of expression that can appear in a query is implemented internally by overriding a virtual function.
Many of these expression functions are quite simple, for example, adding two numbers.
The virtual function call can be more expensive than the function body itself. By resolving the virtual
function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions
directly with no function call overhead. Inlining functions also increases
instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression
elimination across expressions.
</p>
-->
</conbody>
</concept>
<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->
<concept audience="Cloudera" id="intro_io">
<title>Overview of Impala I/O</title>
<conbody>
<p>
Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform
data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called
<term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala
can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all
available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/sec.
Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed,
and saves CPU cycles as there is no need to copy or checksum data blocks within memory.
</p>
<p>
The I/O manager component interfaces with storage devices to read and write data. The I/O manager assigns a
fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per
SSD), providing an asynchronous interface to clients (<term>scanner threads</term>).
</p>
</conbody>
</concept>
<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->
<!-- Although good idea to get some answers from Henry first. -->
<concept audience="Cloudera" id="intro_state_distribution">
<title>State distribution</title>
<conbody>
<p>
As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize
its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept
and execute queries, and thus each node needs up-to-date versions of the system catalog and a knowledge of
which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and
remote procedure calls to retrieve metadata during query planning, Impala implements a simple
publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of
subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes).
</p>
<p>
The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>,
<varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term> where
<varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a
64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the
contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not
persisted across service restarts. Processes that receive updates to any topic are called
<term>subscribers</term>, and express their interest by registering with the statestore at startup and
providing a list of topics. The statestore responds to registration by sending the subscriber an initial
topic update for each registered topic, which consists of all the entries currently in that topic.
</p>
<!-- Henry: OK, but in practice, what is in these topic messages for Impala? -->
<p>
After registration, the statestore periodically sends two kinds of messages to each subscriber. The first
kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries
and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a
per-topic most-recent-version identifier which allows the statestore to only send the delta between
updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its
subscribed topics. Those changes are guaranteed to have been applied by the time the next update is
received.
</p>
<p>
The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called
<term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each
subscriber, which would otherwise time out its subscription and attempt to re-register.
</p>
<p>
Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these
messages could be very large in instances with thousands of tables, partitions, data files, and so on,
Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted
and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala
nodes become unavailable.
</p>
<p>
If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it
stops sending updates to that node.
<!-- Henry: what are examples of these transient topic entries? -->
Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are
removed.
</p>
<p>
Although the asynchronous nature of this mechanism means that metadata updates might take some time to
propagate across the entire cluster, that does not affect the consistency of query planning or results.
Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of
the existence of the relevant tables, data files, and so on, it can distribute the query work to other
nodes even if those other nodes have not received the latest metadata updates.
<!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? -->
<!--
For example, query planning is performed on a single node based on the
catalog metadata topic, and once a full plan has been computed, all information required to execute that
plan is distributed directly to the executing nodes.
There is no requirement that an executing node should
know about the same version of the catalog metadata topic.
-->
</p>
<p>
We have found that the statestore process with default settings scales well to medium-sized clusters, and
can serve our largest deployments with some configuration changes.
<!-- Henry: elaborate on the configuration changes. -->
</p>
<p>
<!-- Henry: other examples like load information? How is load information used? -->
The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
its subscribers (for example, load information). Therefore, should a statestore restart, its state can be
recovered during the initial subscriber registration phase. Or if the machine that the statestore is
running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
There is no built-in failover mechanism in Impala, instead deployments commonly use a retargetable DNS
entry to force subscribers to automatically move to the new process instance.
<!-- Henry: translate that last sentence into instructions / guidelines. -->
</p>
</conbody>
</concept>
</concept>


@@ -0,0 +1,443 @@
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="conditional_functions">
<title>Impala Conditional Functions</title>
<titlealts audience="PDF"><navtitle>Conditional Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>
<conbody>
<p>
Impala supports the following conditional functions for testing equality, comparison operators, and nullity:
</p>
<dl>
<dlentry id="case">
<dt>
<codeph>CASE a WHEN b THEN c [WHEN d THEN e]... [ELSE f] END</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">CASE expression</indexterm>
<b>Purpose:</b> Compares an expression to one or more possible values, and returns a corresponding result
when a match is found.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
In this form of the <codeph>CASE</codeph> expression, the initial value <codeph>A</codeph>
being evaluated for each row is typically a column reference, or an expression involving
a column. This form can only compare against a set of specified values, not ranges,
multi-value comparisons such as <codeph>BETWEEN</codeph> or <codeph>IN</codeph>,
regular expressions, or <codeph>NULL</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
Although this example is split across multiple lines, you can put any or all parts of a <codeph>CASE</codeph> expression
on a single line, with no punctuation or other separators between the <codeph>WHEN</codeph>,
<codeph>ELSE</codeph>, and <codeph>END</codeph> clauses.
</p>
<codeblock>select case x
when 1 then 'one'
when 2 then 'two'
when 0 then 'zero'
else 'out of range'
end
from t1;
</codeblock>
</dd>
</dlentry>
<dlentry id="case2">
<dt>
<codeph>CASE WHEN a THEN b [WHEN c THEN d]... [ELSE e] END</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">CASE expression</indexterm>
<b>Purpose:</b> Tests whether any of a sequence of expressions is true, and returns a corresponding
result for the first true expression.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
<codeph>CASE</codeph> expressions without an initial test value have more flexibility.
For example, they can test different columns in different <codeph>WHEN</codeph> clauses,
or use comparison operators such as <codeph>BETWEEN</codeph>, <codeph>IN</codeph> and <codeph>IS NULL</codeph>
rather than comparing against discrete values.
</p>
<p>
<codeph>CASE</codeph> expressions are often the foundation of long queries that
summarize and format results for easy-to-read reports. For example, you might
use a <codeph>CASE</codeph> function call to turn values from a numeric column
into category strings corresponding to integer values, or labels such as <q>Small</q>,
<q>Medium</q> and <q>Large</q> based on ranges. Then subsequent parts of the
query might aggregate based on the transformed values, such as how many
values are classified as small, medium, or large. You can also use <codeph>CASE</codeph>
to signal problems with out-of-bounds values, <codeph>NULL</codeph> values,
and so on.
</p>
<p>
By using operators such as <codeph>OR</codeph>, <codeph>IN</codeph>,
<codeph>REGEXP</codeph>, and so on in <codeph>CASE</codeph> expressions,
you can build extensive tests and transformations into a single query.
Therefore, applications that construct SQL statements often rely heavily on <codeph>CASE</codeph>
calls in the generated SQL code.
</p>
<p>
Because this flexible form of the <codeph>CASE</codeph> expressions allows you to perform
many comparisons and call multiple functions when evaluating each row, be careful applying
elaborate <codeph>CASE</codeph> expressions to queries that process large amounts of data.
For example, when practical, evaluate and transform values through <codeph>CASE</codeph>
after applying operations such as aggregations that reduce the size of the result set;
transform numbers to strings after performing joins with the original numeric values.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
Although this example is split across multiple lines, you can put any or all parts of a <codeph>CASE</codeph> expression
on a single line, with no punctuation or other separators between the <codeph>WHEN</codeph>,
<codeph>ELSE</codeph>, and <codeph>END</codeph> clauses.
</p>
<codeblock>select case
when dayname(now()) in ('Saturday','Sunday') then 'result undefined on weekends'
when x > y then 'x greater than y'
when x = y then 'x and y are equal'
when x is null or y is null then 'one of the columns is null'
else null
end
from t1;
</codeblock>
</dd>
</dlentry>
<dlentry id="coalesce">
<dt>
<codeph>coalesce(type v1, type v2, ...)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">coalesce() function</indexterm>
<b>Purpose:</b> Returns the first specified argument that is not <codeph>NULL</codeph>, or
<codeph>NULL</codeph> if all arguments are <codeph>NULL</codeph>.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
</dd>
</dlentry>
<dlentry rev="2.0.0" id="decode">
<dt>
<codeph>decode(type expression, type search1, type result1 [, type search2, type result2 ...] [, type
default] )</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">decode() function</indexterm>
<b>Purpose:</b> Compares an expression to one or more possible values, and returns a corresponding result
when a match is found.
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Can be used as shorthand for a <codeph>CASE</codeph> expression.
</p>
<p>
            The original expression and the search expressions must be of the same type or convertible types. The
result expression can be a different type, but all result expressions must be of the same type.
</p>
<p>
            Returns a successful match if the original expression is <codeph>NULL</codeph> and a search expression
            is also <codeph>NULL</codeph>.
</p>
<p>
Returns <codeph>NULL</codeph> if the final <codeph>default</codeph> value is omitted and none of the
search expressions match the original expression.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following example translates numeric day values into descriptive names:
</p>
<codeblock>SELECT event, decode(day_of_week, 1, "Monday", 2, "Tuesday", 3, "Wednesday",
4, "Thursday", 5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")
FROM calendar;
</codeblock>
</dd>
</dlentry>
<dlentry id="if">
<dt>
<codeph>if(boolean condition, type ifTrue, type ifFalseOrNull)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">if() function</indexterm>
          <b>Purpose:</b> Tests an expression and returns a corresponding result depending on whether the expression
          is true, false, or <codeph>NULL</codeph>.
<p>
<b>Return type:</b> Same as the <codeph>ifTrue</codeph> argument value
</p>
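          <p conref="../shared/impala_common.xml#common/example_blurb"/>
          <p>
            These queries with literal arguments illustrate the three possible outcomes of the condition:
          </p>
<codeblock>select if(1 = 1, 'same', 'different');                  -- Returns 'same'.
select if(1 = 2, 'same', 'different');                  -- Returns 'different'.
select if(cast(null as boolean), 'same', 'different');  -- Returns 'different' for a NULL condition.
</codeblock>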
</dd>
</dlentry>
<dlentry rev="1.3.0" id="ifnull">
<dt>
<codeph>ifnull(type a, type ifNull)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">isnull() function</indexterm>
<b>Purpose:</b> Alias for the <codeph>isnull()</codeph> function, with the same behavior. To simplify
porting SQL with vendor extensions to Impala.
<p conref="../shared/impala_common.xml#common/added_in_130"/>
</dd>
</dlentry>
<dlentry id="isfalse" rev="2.2.0">
<dt>
<codeph>isfalse(<varname>boolean</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">isfalse() function</indexterm>
<b>Purpose:</b> Tests if a Boolean expression is <codeph>false</codeph> or not.
Returns <codeph>true</codeph> if so.
If the argument is <codeph>NULL</codeph>, returns <codeph>false</codeph>.
Identical to <codeph>isnottrue()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
<p conref="../shared/impala_common.xml#common/added_in_220"/>
</dd>
</dlentry>
<dlentry id="isnotfalse" rev="2.2.0">
<dt>
<codeph>isnotfalse(<varname>boolean</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">isnotfalse() function</indexterm>
<b>Purpose:</b> Tests if a Boolean expression is not <codeph>false</codeph> (that is, either <codeph>true</codeph> or <codeph>NULL</codeph>).
Returns <codeph>true</codeph> if so.
If the argument is <codeph>NULL</codeph>, returns <codeph>true</codeph>.
Identical to <codeph>istrue()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
<p conref="../shared/impala_common.xml#common/added_in_220"/>
</dd>
</dlentry>
<dlentry id="isnottrue" rev="2.2.0">
<dt>
<codeph>isnottrue(<varname>boolean</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">isnottrue() function</indexterm>
<b>Purpose:</b> Tests if a Boolean expression is not <codeph>true</codeph> (that is, either <codeph>false</codeph> or <codeph>NULL</codeph>).
Returns <codeph>true</codeph> if so.
If the argument is <codeph>NULL</codeph>, returns <codeph>true</codeph>.
Identical to <codeph>isfalse()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
<p conref="../shared/impala_common.xml#common/added_in_220"/>
</dd>
</dlentry>
<dlentry id="isnull">
<dt>
<codeph>isnull(type a, type ifNull)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">isnull() function</indexterm>
<b>Purpose:</b> Tests if an expression is <codeph>NULL</codeph>, and returns the expression result value
if not. If the first argument is <codeph>NULL</codeph>, returns the second argument.
<p>
<b>Compatibility notes:</b> Equivalent to the <codeph>nvl()</codeph> function from Oracle Database or
<codeph>ifnull()</codeph> from MySQL. The <codeph>nvl()</codeph> and <codeph>ifnull()</codeph>
functions are also available in Impala.
</p>
<p>
<b>Return type:</b> Same as the first argument value
</p>
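          <p conref="../shared/impala_common.xml#common/example_blurb"/>
          <p>
            These queries illustrate substituting a default value only when the first argument is
            <codeph>NULL</codeph>:
          </p>
<codeblock>select isnull('original', 'substitute');  -- Returns 'original'.
select isnull(null, 'substitute');        -- Returns 'substitute'.
</codeblock>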
</dd>
</dlentry>
<dlentry id="istrue" rev="2.2.0">
<dt>
<codeph>istrue(<varname>boolean</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">istrue() function</indexterm>
<b>Purpose:</b> Tests if a Boolean expression is <codeph>true</codeph> or not.
Returns <codeph>true</codeph> if so.
If the argument is <codeph>NULL</codeph>, returns <codeph>false</codeph>.
Identical to <codeph>isnotfalse()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument.
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
<p conref="../shared/impala_common.xml#common/added_in_220"/>
</dd>
</dlentry>
<dlentry id="nonnullvalue" rev="2.2.0">
<dt>
<codeph>nonnullvalue(<varname>expression</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">function</indexterm>
<b>Purpose:</b> Tests if an expression (of any type) is <codeph>NULL</codeph> or not.
Returns <codeph>false</codeph> if so.
The converse of <codeph>nullvalue()</codeph>.
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
<p conref="../shared/impala_common.xml#common/added_in_220"/>
</dd>
</dlentry>
<dlentry rev="1.3.0" id="nullif">
<dt>
<codeph>nullif(<varname>expr1</varname>,<varname>expr2</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">nullif() function</indexterm>
<b>Purpose:</b> Returns <codeph>NULL</codeph> if the two specified arguments are equal. If the specified
arguments are not equal, returns the value of <varname>expr1</varname>. The data types of the expressions
must be compatible, according to the conversion rules from <xref href="impala_datatypes.xml#datatypes"/>.
You cannot use an expression that evaluates to <codeph>NULL</codeph> for <varname>expr1</varname>; that
way, you can distinguish a return value of <codeph>NULL</codeph> from an argument value of
<codeph>NULL</codeph>, which would never match <varname>expr2</varname>.
<p>
<b>Usage notes:</b> This function is effectively shorthand for a <codeph>CASE</codeph> expression of
the form:
</p>
<codeblock>CASE
WHEN <varname>expr1</varname> = <varname>expr2</varname> THEN NULL
ELSE <varname>expr1</varname>
END</codeblock>
<p>
It is commonly used in division expressions, to produce a <codeph>NULL</codeph> result instead of a
divide-by-zero error when the divisor is equal to zero:
</p>
<codeblock>select 1.0 / nullif(c1,0) as reciprocal from t1;</codeblock>
<p>
You might also use it for compatibility with other database systems that support the same
<codeph>NULLIF()</codeph> function.
</p>
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/added_in_130"/>
</dd>
</dlentry>
<dlentry rev="1.3.0" id="nullifzero">
<dt>
<codeph>nullifzero(<varname>numeric_expr</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">nullifzero() function</indexterm>
<b>Purpose:</b> Returns <codeph>NULL</codeph> if the numeric expression evaluates to 0, otherwise returns
the result of the expression.
<p>
<b>Usage notes:</b> Used to avoid error conditions such as divide-by-zero in numeric calculations.
Serves as shorthand for a more elaborate <codeph>CASE</codeph> expression, to simplify porting SQL with
vendor extensions to Impala.
</p>
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/added_in_130"/>
</dd>
</dlentry>
<dlentry id="nullvalue" rev="2.2.0">
<dt>
<codeph>nullvalue(<varname>expression</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">function</indexterm>
<b>Purpose:</b> Tests if an expression (of any type) is <codeph>NULL</codeph> or not.
Returns <codeph>true</codeph> if so.
The converse of <codeph>nonnullvalue()</codeph>.
<p conref="../shared/impala_common.xml#common/return_type_boolean"/>
<p conref="../shared/impala_common.xml#common/for_compatibility_only"/>
<p conref="../shared/impala_common.xml#common/added_in_220"/>
</dd>
</dlentry>
<dlentry id="nvl" rev="1.1">
<dt>
<codeph>nvl(type a, type ifNull)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">nvl() function</indexterm>
<b>Purpose:</b> Alias for the <codeph>isnull()</codeph> function. Tests if an expression is
<codeph>NULL</codeph>, and returns the expression result value if not. If the first argument is
<codeph>NULL</codeph>, returns the second argument. Equivalent to the <codeph>nvl()</codeph> function
from Oracle Database or <codeph>ifnull()</codeph> from MySQL.
<p>
<b>Return type:</b> Same as the first argument value
</p>
<p conref="../shared/impala_common.xml#common/added_in_11"/>
</dd>
</dlentry>
<dlentry rev="1.3.0" id="zeroifnull">
<dt>
<codeph>zeroifnull(<varname>numeric_expr</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">zeroifnull() function</indexterm>
<b>Purpose:</b> Returns 0 if the numeric expression evaluates to <codeph>NULL</codeph>, otherwise returns
the result of the expression.
<p>
<b>Usage notes:</b> Used to avoid unexpected results due to unexpected propagation of
<codeph>NULL</codeph> values in numeric calculations. Serves as shorthand for a more elaborate
<codeph>CASE</codeph> expression, to simplify porting SQL with vendor extensions to Impala.
</p>
<p conref="../shared/impala_common.xml#common/return_same_type"/>
<p conref="../shared/impala_common.xml#common/added_in_130"/>
</dd>
</dlentry>
</dl>
</conbody>
</concept>


@@ -0,0 +1,57 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="config">
<title>Managing Impala</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="JDBC"/>
<data name="Category" value="ODBC"/>
<data name="Category" value="Stub Pages"/>
</metadata>
</prolog>
<conbody>
<p>
This section explains how to configure Impala to accept connections from applications that use popular
programming APIs:
</p>
<ul>
<li>
<xref href="impala_config_performance.xml#config_performance"/>
</li>
<li>
<xref href="impala_odbc.xml#impala_odbc"/>
</li>
<li>
<xref href="impala_jdbc.xml#impala_jdbc"/>
</li>
</ul>
<p>
This type of configuration is especially useful when using Impala in combination with Business Intelligence
tools, which use these standard interfaces to query different kinds of database and Big Data systems.
</p>
<p>
You can also configure these other aspects of Impala:
</p>
<ul>
<li>
<xref href="impala_security.xml#security"/>
</li>
<li>
<xref href="impala_config_options.xml#config_options"/>
</li>
</ul>
</conbody>
</concept>


@@ -0,0 +1,593 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="config_options">
<title>Modifying Impala Startup Options</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">defaults file</indexterm>
<indexterm audience="Cloudera">configuration file</indexterm>
<indexterm audience="Cloudera">options</indexterm>
<indexterm audience="Cloudera">IMPALA_STATE_STORE_PORT</indexterm>
<indexterm audience="Cloudera">IMPALA_BACKEND_PORT</indexterm>
<indexterm audience="Cloudera">IMPALA_LOG_DIR</indexterm>
<indexterm audience="Cloudera">IMPALA_STATE_STORE_ARGS</indexterm>
<indexterm audience="Cloudera">IMPALA_SERVER_ARGS</indexterm>
<indexterm audience="Cloudera">ENABLE_CORE_DUMPS</indexterm>
<indexterm audience="Cloudera">core dumps</indexterm>
<indexterm audience="Cloudera">restarting services</indexterm>
<indexterm audience="Cloudera">services</indexterm>
The configuration options for the Impala-related daemons let you choose which hosts and
ports to use for the services that run on a single host, specify directories for logging,
control resource usage and security, and specify other aspects of the Impala software.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="config_options_cm">
<title>Configuring Impala Startup Options through Cloudera Manager</title>
<conbody>
<p>
If you manage your cluster through Cloudera Manager, configure the settings for all the
Impala-related daemons by navigating to this page:
<menucascade><uicontrol>Clusters</uicontrol><uicontrol>Impala</uicontrol><uicontrol>Configuration</uicontrol><uicontrol>View
and Edit</uicontrol></menucascade>. See the Cloudera Manager documentation for
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_impala_service.html" scope="external" format="html">instructions
about how to configure Impala through Cloudera Manager</xref>.
</p>
<p>
If the Cloudera Manager interface does not yet have a form field for a newly added
option, or if you need to use special options for debugging and troubleshooting, the
<uicontrol>Advanced</uicontrol> option page for each daemon includes one or more fields
where you can enter option names directly.
<ph conref="../shared/impala_common.xml#common/safety_valve"/> There is also a free-form
field for query options, on the top-level <uicontrol>Impala Daemon</uicontrol> options
page.
</p>
</conbody>
</concept>
<concept id="config_options_noncm">
<title>Configuring Impala Startup Options through the Command Line</title>
<conbody>
<p>
When you run Impala in a non-Cloudera Manager environment, the Impala server,
statestore, and catalog services start up using values provided in a defaults file,
<filepath>/etc/default/impala</filepath>.
</p>
<p>
This file includes information about many resources used by Impala. Most of the defaults
included in this file should be effective in most cases. For example, typically you
would not change the definition of the <codeph>CLASSPATH</codeph> variable, but you
would always set the address used by the statestore server. Some of the content you
might modify includes:
</p>
<!-- Note: Update the following example for each release with the associated lines from /etc/default/impala
from a non-CM-managed system. -->
<codeblock rev="ver">IMPALA_STATE_STORE_HOST=127.0.0.1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_SERVICE_HOST=...
IMPALA_STATE_STORE_HOST=...
export IMPALA_STATE_STORE_ARGS=${IMPALA_STATE_STORE_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}}
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}"
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
<p>
To use alternate values, edit the defaults file, then restart all the Impala-related
services so that the changes take effect. Restart the Impala server using the following
commands:
</p>
<codeblock>$ sudo service impala-server restart
Stopping Impala Server: [ OK ]
Starting Impala Server: [ OK ]</codeblock>
<p>
Restart the Impala statestore using the following commands:
</p>
<codeblock>$ sudo service impala-state-store restart
Stopping Impala State Store Server: [ OK ]
Starting Impala State Store Server: [ OK ]</codeblock>
<p>
Restart the Impala catalog service using the following commands:
</p>
<codeblock>$ sudo service impala-catalog restart
Stopping Impala Catalog Server: [ OK ]
Starting Impala Catalog Server: [ OK ]</codeblock>
<p>
Some common settings to change include:
</p>
<ul>
<li>
<p>
Statestore address. Where practical, put the statestore on a separate host not
running the <cmdname>impalad</cmdname> daemon. In that recommended configuration,
the <cmdname>impalad</cmdname> daemon cannot refer to the statestore server using
the loopback address. If the statestore is hosted on a machine with an IP address of
192.168.0.27, change:
</p>
<codeblock>IMPALA_STATE_STORE_HOST=127.0.0.1</codeblock>
<p>
to:
</p>
<codeblock>IMPALA_STATE_STORE_HOST=192.168.0.27</codeblock>
</li>
<li rev="1.2">
<p>
Catalog server address (including both the hostname and the port number). Update the
value of the <codeph>IMPALA_CATALOG_SERVICE_HOST</codeph> variable. Cloudera
recommends the catalog server be on the same host as the statestore. In that
recommended configuration, the <cmdname>impalad</cmdname> daemon cannot refer to the
catalog server using the loopback address. If the catalog service is hosted on a
machine with an IP address of 192.168.0.27, add the following line:
</p>
<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000</codeblock>
<p>
The <filepath>/etc/default/impala</filepath> defaults file currently does not define
an <codeph>IMPALA_CATALOG_ARGS</codeph> environment variable, but if you add one it
will be recognized by the service startup/shutdown script. Add a definition for this
variable to <filepath>/etc/default/impala</filepath> and add the option
<codeph>-catalog_service_host=<varname>hostname</varname></codeph>. If the port is
different than the default 26000, also add the option
<codeph>-catalog_service_port=<varname>port</varname></codeph>.
</p>
</li>
<li id="mem_limit">
<p>
Memory limits. You can limit the amount of memory available to Impala. For example,
to allow Impala to use no more than 70% of system memory, change:
</p>
<!-- Note: also needs to be updated for each release to reflect latest /etc/default/impala. -->
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}}</codeblock>
<p>
to:
</p>
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}</codeblock>
<p>
You can specify the memory limit using absolute notation such as
<codeph>500m</codeph> or <codeph>2G</codeph>, or as a percentage of physical memory
such as <codeph>60%</codeph>.
</p>
<note>
Queries that exceed the specified memory limit are aborted. Percentage limits are
based on the physical memory of the machine and do not consider cgroups.
</note>
</li>
<li>
<p>
Core dump enablement. To enable core dumps on systems not managed by Cloudera
Manager, change:
</p>
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
<p>
to:
</p>
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}</codeblock>
<p>
On systems managed by Cloudera Manager, enable the <uicontrol>Enable Core
Dump</uicontrol> setting for the Impala service.
</p>
<note conref="../shared/impala_common.xml#common/core_dump_considerations"/>
</li>
<li>
<p>
Authorization using the open source Sentry plugin. Specify the
<codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph>
options as part of the <codeph>IMPALA_SERVER_ARGS</codeph> and
<codeph>IMPALA_STATE_STORE_ARGS</codeph> settings to enable the core Impala support
for authentication. See <xref href="impala_authorization.xml#secure_startup"/> for
details.
</p>
</li>
<li>
<p>
Auditing for successful or blocked Impala queries, another aspect of security.
Specify the <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
option and optionally the
<codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph>
and <codeph>-abort_on_failed_audit_event</codeph> options as part of the
<codeph>IMPALA_SERVER_ARGS</codeph> settings, for each Impala node, to enable and
customize auditing. See <xref href="impala_auditing.xml#auditing"/> for details.
</p>
</li>
<li>
<p>
Password protection for the Impala web UI, which listens on port 25000 by default.
This feature involves adding some or all of the
<codeph>--webserver_password_file</codeph>,
<codeph>--webserver_authentication_domain</codeph>, and
<codeph>--webserver_certificate_file</codeph> options to the
<codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph>
settings. See <xref href="impala_security_guidelines.xml#security_guidelines"/> for
details.
</p>
</li>
<li id="default_query_options">
<p rev="DOCS-677">
Another setting you might add to <codeph>IMPALA_SERVER_ARGS</codeph> is a
comma-separated list of query options and values:
<codeblock>-default_query_options='<varname>option</varname>=<varname>value</varname>,<varname>option</varname>=<varname>value</varname>,...'
</codeblock>
These options control the behavior of queries performed by this
<cmdname>impalad</cmdname> instance. The option values you specify here override the
default values for <xref href="impala_query_options.xml#query_options">Impala query
options</xref>, as shown by the <codeph>SET</codeph> statement in
<cmdname>impala-shell</cmdname>.
</p>
</li>
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
<li rev="1.2">
<p>
Options for resource management, in conjunction with the YARN component. These options include
<codeph>-enable_rm</codeph> and <codeph>-cgroup_hierarchy_path</codeph>.
<ph rev="1.4.0">Additional options to help fine-tune the resource estimates are
<codeph>-—rm_always_use_defaults</codeph>,
<codeph>-—rm_default_memory=<varname>size</varname></codeph>, and
<codeph>-—rm_default_cpu_cores</codeph>.</ph> For details about these options, see
<xref href="impala_resource_management.xml#rm_options"/>. See
<xref href="impala_resource_management.xml#resource_management"/> for information about resource
management in general.
</p>
</li>
-->
<li>
<p>
During troubleshooting, <keyword keyref="support_org"/> might direct you to change other values,
particularly for <codeph>IMPALA_SERVER_ARGS</codeph>, to work around issues or
gather debugging information.
</p>
</li>
</ul>
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
<p conref="impala_resource_management.xml#rm_options/resource_management_impalad_options"/>
-->
<note>
<p>
These startup options for the <cmdname>impalad</cmdname> daemon are different from the
command-line options for the <cmdname>impala-shell</cmdname> command. For the
<cmdname>impala-shell</cmdname> options, see
<xref href="impala_shell_options.xml#shell_options"/>.
</p>
</note>
<p audience="Cloudera" outputclass="toc inpage"/>
</conbody>
<concept audience="Cloudera" id="config_options_impalad_details">
<title>Configuration Options for impalad Daemon</title>
<conbody>
<p>
Some common settings to change include:
</p>
<ul>
<li>
<p>
Statestore address. Where practical, put the statestore on a separate host not
running the <cmdname>impalad</cmdname> daemon. In that recommended configuration,
the <cmdname>impalad</cmdname> daemon cannot refer to the statestore server using
the loopback address. If the statestore is hosted on a machine with an IP address
of 192.168.0.27, change:
</p>
<codeblock>IMPALA_STATE_STORE_HOST=127.0.0.1</codeblock>
<p>
to:
</p>
<codeblock>IMPALA_STATE_STORE_HOST=192.168.0.27</codeblock>
</li>
<li rev="1.2">
<p>
Catalog server address. Update the <codeph>IMPALA_CATALOG_SERVICE_HOST</codeph>
variable, including both the hostname and the port number in the value. Cloudera
recommends the catalog server be on the same host as the statestore. In that
recommended configuration, the <cmdname>impalad</cmdname> daemon cannot refer to
the catalog server using the loopback address. If the catalog service is hosted on
a machine with an IP address of 192.168.0.27, add the following line:
</p>
<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000</codeblock>
<p>
The <filepath>/etc/default/impala</filepath> defaults file currently does not
define an <codeph>IMPALA_CATALOG_ARGS</codeph> environment variable, but if you
add one it will be recognized by the service startup/shutdown script. Add a
definition for this variable to <filepath>/etc/default/impala</filepath> and add
the option <codeph>-catalog_service_host=<varname>hostname</varname></codeph>. If
the port is different than the default 26000, also add the option
<codeph>-catalog_service_port=<varname>port</varname></codeph>.
</p>
</li>
<li id="mem_limit">
Memory limits. You can limit the amount of memory available to Impala. For example,
to allow Impala to use no more than 70% of system memory, change:
<!-- Note: also needs to be updated for each release to reflect latest /etc/default/impala. -->
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}}</codeblock>
<p>
to:
</p>
<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}</codeblock>
<p>
You can specify the memory limit using absolute notation such as
<codeph>500m</codeph> or <codeph>2G</codeph>, or as a percentage of physical
memory such as <codeph>60%</codeph>.
</p>
<note>
Queries that exceed the specified memory limit are aborted. Percentage limits are
based on the physical memory of the machine and do not consider cgroups.
</note>
</li>
<li>
Core dump enablement. To enable core dumps, change:
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
<p>
to:
</p>
<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}</codeblock>
<note>
The location of core dump files may vary according to your operating system
configuration. Other security settings may prevent Impala from writing core dumps
even when this option is enabled.
</note>
</li>
<li>
Authorization using the open source Sentry plugin. Specify the
<codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph>
options as part of the <codeph>IMPALA_SERVER_ARGS</codeph> and
<codeph>IMPALA_STATE_STORE_ARGS</codeph> settings to enable the core Impala support
for authentication. See <xref href="impala_authorization.xml#secure_startup"/> for
details.
</li>
<li>
Auditing for successful or blocked Impala queries, another aspect of security.
Specify the <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph>
option and optionally the
<codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph>
and <codeph>-abort_on_failed_audit_event</codeph> options as part of the
<codeph>IMPALA_SERVER_ARGS</codeph> settings, for each Impala node, to enable and
customize auditing. See <xref href="impala_auditing.xml#auditing"/> for details.
</li>
<li>
Password protection for the Impala web UI, which listens on port 25000 by default.
This feature involves adding some or all of the
<codeph>--webserver_password_file</codeph>,
<codeph>--webserver_authentication_domain</codeph>, and
<codeph>--webserver_certificate_file</codeph> options to the
<codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph>
settings. See <xref href="impala_security_webui.xml"/> for details.
</li>
<li id="default_query_options">
Another setting you might add to <codeph>IMPALA_SERVER_ARGS</codeph> is:
<codeblock>-default_query_options='<varname>option</varname>=<varname>value</varname>,<varname>option</varname>=<varname>value</varname>,...'
</codeblock>
These options control the behavior of queries performed by this
<cmdname>impalad</cmdname> instance. The option values you specify here override the
default values for <xref href="impala_query_options.xml#query_options">Impala query
options</xref>, as shown by the <codeph>SET</codeph> statement in
<cmdname>impala-shell</cmdname>.
</li>
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
<li rev="1.2">
Options for resource management, in conjunction with the YARN component. These options
include <codeph>-enable_rm</codeph> and <codeph>-cgroup_hierarchy_path</codeph>.
<ph rev="1.4.0">Additional options to help fine-tune the resource estimates are
<codeph>-—rm_always_use_defaults</codeph>,
<codeph>-—rm_default_memory=<varname>size</varname></codeph>, and
<codeph>-—rm_default_cpu_cores</codeph>.</ph> For details about these options, see
<xref href="impala_resource_management.xml#rm_options"/>. See
<xref href="impala_resource_management.xml#resource_management"/> for information about resource
management in general.
</li>
-->
<li>
During troubleshooting, <keyword keyref="support_org"/> might direct you to change other values,
particularly for <codeph>IMPALA_SERVER_ARGS</codeph>, to work around issues or
gather debugging information.
</li>
</ul>
<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up.
<p conref="impala_resource_management.xml#rm_options/resource_management_impalad_options"/>
-->
<note>
<p>
These startup options for the <cmdname>impalad</cmdname> daemon are different from
the command-line options for the <cmdname>impala-shell</cmdname> command. For the
<cmdname>impala-shell</cmdname> options, see
<xref href="impala_shell_options.xml#shell_options"/>.
</p>
</note>
</conbody>
</concept>
<concept audience="Cloudera" id="config_options_statestored_details">
<title>Configuration Options for statestored Daemon</title>
<conbody>
<p></p>
</conbody>
</concept>
<concept audience="Cloudera" id="config_options_catalogd_details">
<title>Configuration Options for catalogd Daemon</title>
<conbody>
<p></p>
</conbody>
</concept>
</concept>
<concept id="config_options_checking">
<title>Checking the Values of Impala Configuration Options</title>
<conbody>
<p>
You can check the current runtime value of all these settings through the Impala web
interface, available by default at
<codeph>http://<varname>impala_hostname</varname>:25000/varz</codeph> for the
<cmdname>impalad</cmdname> daemon,
<codeph>http://<varname>impala_hostname</varname>:25010/varz</codeph> for the
<cmdname>statestored</cmdname> daemon, or
<codeph>http://<varname>impala_hostname</varname>:25020/varz</codeph> for the
<cmdname>catalogd</cmdname> daemon. In the Cloudera Manager interface, you can see the
link to the appropriate <uicontrol><varname>service_name</varname> Web UI</uicontrol>
page when you look at the status page for a specific daemon on a specific host.
</p>
</conbody>
</concept>
<concept id="config_options_impalad">
<title>Startup Options for impalad Daemon</title>
<conbody>
<p>
The <codeph>impalad</codeph> daemon implements the main Impala service, which performs
query processing and reads and writes the data files.
</p>
</conbody>
</concept>
<concept id="config_options_statestored">
<title>Startup Options for statestored Daemon</title>
<conbody>
<p>
The <cmdname>statestored</cmdname> daemon implements the Impala statestore service,
which monitors the availability of Impala services across the cluster, and handles
situations such as nodes becoming unavailable or becoming available again.
</p>
</conbody>
</concept>
<concept rev="1.2" id="config_options_catalogd">
<title>Startup Options for catalogd Daemon</title>
<conbody>
<p>
The <cmdname>catalogd</cmdname> daemon implements the Impala catalog service, which
broadcasts metadata changes to all the Impala nodes when Impala creates a table, inserts
data, or performs other kinds of DDL and DML operations.
</p>
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
</conbody>
</concept>
</concept>


@@ -0,0 +1,179 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="config_performance">
<title>Post-Installation Configuration for Impala</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p id="p_24">
This section describes the mandatory and recommended configuration settings for Impala. If Impala is
installed using Cloudera Manager, some of these configurations are completed automatically; you must still
configure short-circuit reads manually. If you installed Impala without Cloudera Manager, or if you want to
customize your environment, consider making the changes described in this topic.
</p>
<p>
<!-- Could conref this paragraph from ciiu_install.xml. -->
In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular
component configuration details in one of the free-form fields on the Impala configuration pages within
Cloudera Manager. <ph conref="../shared/impala_common.xml#common/safety_valve"/>
</p>
<ul>
<li>
You must enable short-circuit reads, whether or not Impala was installed through Cloudera Manager. This
setting goes in the Impala configuration settings, not the Hadoop-wide settings.
</li>
<li>
If you installed Impala in an environment that is not managed by Cloudera Manager, you must enable block
location tracking, and you can optionally enable native checksumming for optimal performance.
</li>
<li>
        If you deployed Impala using Cloudera Manager, see
<xref href="impala_perf_testing.xml#performance_testing"/> to confirm proper configuration.
</li>
</ul>
<section id="section_fhq_wyv_ls">
<title>Mandatory: Short-Circuit Reads</title>
<p> Enabling short-circuit reads allows Impala to read local data directly
from the file system. This removes the need to communicate through the
DataNodes, improving performance. This setting also minimizes the number
of additional copies of data. Short-circuit reads requires
<codeph>libhadoop.so</codeph>
<!-- This link went stale. Not obvious how to keep it in sync with whatever Hadoop CDH is using behind the scenes. So hide the link for now. -->
<!-- (the <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>) -->
(the Hadoop Native Library) to be accessible to both the server and the
client. <codeph>libhadoop.so</codeph> is not available if you have
installed from a tarball. You must install from an
<codeph>.rpm</codeph>, <codeph>.deb</codeph>, or parcel to use
short-circuit local reads. <note> If you use Cloudera Manager, you can
enable short-circuit reads through a checkbox in the user interface
and that setting takes effect for Impala as well. </note>
</p>
<p>
<b>To configure DataNodes for short-circuit reads:</b>
</p>
<ol id="ol_qlq_wyv_ls">
<li id="copy_config_files"> Copy the client
<codeph>core-site.xml</codeph> and <codeph>hdfs-site.xml</codeph>
configuration files from the Hadoop configuration directory to the
Impala configuration directory. The default Impala configuration
location is <codeph>/etc/impala/conf</codeph>. </li>
<li>
<indexterm audience="Cloudera"
>dfs.client.read.shortcircuit</indexterm>
<indexterm audience="Cloudera">dfs.domain.socket.path</indexterm>
<indexterm audience="Cloudera"
>dfs.client.file-block-storage-locations.timeout.millis</indexterm>
On all Impala nodes, configure the following properties in <!-- Exact timing is unclear, since we say farther down to copy /etc/hadoop/conf/hdfs-site.xml to /etc/impala/conf.
Which wouldn't work if we already modified the Impala version of the file here. Not to mention that this
doesn't take the CM interface into account, where these /etc files might not exist in those locations. -->
<!-- <codeph>/etc/impala/conf/hdfs-site.xml</codeph> as shown: -->
Impala's copy of <codeph>hdfs-site.xml</codeph> as shown: <codeblock>&lt;property&gt;
&lt;name&gt;dfs.client.read.shortcircuit&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.domain.socket.path&lt;/name&gt;
&lt;value&gt;/var/run/hdfs-sockets/dn&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.client.file-block-storage-locations.timeout.millis&lt;/name&gt;
&lt;value&gt;10000&lt;/value&gt;
&lt;/property&gt;</codeblock>
<!-- Former socket.path value: &lt;value&gt;/var/run/hadoop-hdfs/dn._PORT&lt;/value&gt; -->
<!--
<note>
The text <codeph>_PORT</codeph> appears just as shown; you do not need to
substitute a number.
</note>
-->
</li>
<li>
<p> If <codeph>/var/run/hadoop-hdfs/</codeph> is group-writable, make
sure its group is <codeph>root</codeph>. </p>
<note> If you are also going to enable block location tracking, you
can skip copying configuration files and restarting DataNodes and go
straight to <xref href="#config_performance/block_location_tracking"
>Optional: Block Location Tracking</xref>.
Configuring short-circuit reads and block location tracking require
the same process of copying files and restarting services, so you
can complete that process once when you have completed all
configuration changes. Whether you copy files and restart services
now or during configuring block location tracking, short-circuit
reads are not enabled until you complete those final steps. </note>
</li>
<li id="restart_all_datanodes"> After applying these changes, restart
all DataNodes. </li>
</ol>
</section>
<section id="block_location_tracking">
<title>Mandatory: Block Location Tracking</title>
<p>
        Enabling block location metadata allows Impala to know on which disks data blocks are located, allowing
        better utilization of the underlying disks. Impala will not start unless this setting is enabled.
</p>
<p>
<b>To enable block location tracking:</b>
</p>
<ol>
<li>
          For each DataNode, add the following to the <codeph>hdfs-site.xml</codeph> file:
<codeblock>&lt;property&gt;
&lt;name&gt;dfs.datanode.hdfs-blocks-metadata.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt; </codeblock>
</li>
<li conref="#config_performance/copy_config_files"/>
<li conref="#config_performance/restart_all_datanodes"/>
</ol>
</section>
<section id="native_checksumming">
<title>Optional: Native Checksumming</title>
<p>
Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if
that library is available.
</p>
<p id="p_29">
<b>To enable native checksumming:</b>
</p>
<p>
        If you installed CDH from packages, the native checksumming library is installed and set up correctly. In
such a case, no additional steps are required. Conversely, if you installed by other means, such as with
tarballs, native checksumming may not be available due to missing shared objects. Finding the message
"<codeph>Unable to load native-hadoop library for your platform... using builtin-java classes where
applicable</codeph>" in the Impala logs indicates native checksumming may be unavailable. To enable native
checksumming, you must build and install <codeph>libhadoop.so</codeph> (the
<!-- Another instance of stale link. -->
<!-- <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>). -->
Hadoop Native Library).
</p>
</section>
</conbody>
</concept>


@@ -0,0 +1,202 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="connecting">
<title>Connecting to impalad through impala-shell</title>
<titlealts audience="PDF"><navtitle>Connecting to impalad</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="impala-shell"/>
<data name="Category" value="Network"/>
<data name="Category" value="DataNode"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<!--
TK: This would be a good theme for a tutorial topic.
Lots of nuances to illustrate through sample code.
-->
<p>
Within an <cmdname>impala-shell</cmdname> session, you can only issue queries while connected to an instance
of the <cmdname>impalad</cmdname> daemon. You can specify the connection information:
<ul>
<li>
Through command-line options when you run the <cmdname>impala-shell</cmdname> command.
</li>
<li>
Through a configuration file that is read when you run the <cmdname>impala-shell</cmdname> command.
</li>
<li>
During an <cmdname>impala-shell</cmdname> session, by issuing a <codeph>CONNECT</codeph> command.
</li>
</ul>
See <xref href="impala_shell_options.xml"/> for the command-line and configuration file options you can use.
</p>
<p>
You can connect to any DataNode where an instance of <cmdname>impalad</cmdname> is running,
and that host coordinates the execution of all queries sent to it.
</p>
<p>
For simplicity during development, you might always connect to the same host, perhaps running <cmdname>impala-shell</cmdname> on
the same host as <cmdname>impalad</cmdname> and specifying the hostname as <codeph>localhost</codeph>.
</p>
<p>
      In a production environment, you might enable load balancing, in which you connect to a specific host/port combination
but queries are forwarded to arbitrary hosts. This technique spreads the overhead of acting as the coordinator
node among all the DataNodes in the cluster. See <xref href="impala_proxy.xml"/> for details.
</p>
<p>
<b>To connect the Impala shell during shell startup:</b>
</p>
<ol>
<li>
Locate the hostname of a DataNode within the cluster that is running an instance of the
<cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something
other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the
port number also.
</li>
<li>
Use the <codeph>-i</codeph> option to the
<cmdname>impala-shell</cmdname> interpreter to specify the connection information for
that instance of <cmdname>impalad</cmdname>:
<codeblock>
# When you are logged into the same machine running impalad.
# The prompt will reflect the current hostname.
$ impala-shell
# When you are logged into the same machine running impalad.
# The host will reflect the hostname 'localhost'.
$ impala-shell -i localhost
# When you are logged onto a different host, perhaps a client machine
# outside the Hadoop cluster.
$ impala-shell -i <varname>some.other.hostname</varname>
# When you are logged onto a different host, and impalad is listening
# on a non-default port. Perhaps a load balancer is forwarding requests
# to a different host/port combination behind the scenes.
$ impala-shell -i <varname>some.other.hostname</varname>:<varname>port_number</varname>
</codeblock>
</li>
</ol>
<p>
<b>To connect the Impala shell after shell startup:</b>
</p>
<ol>
<li>
Start the Impala shell with no connection:
<codeblock>$ impala-shell</codeblock>
<p>
You should see a prompt like the following:
</p>
<codeblock>Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) <varname>year</varname> Cloudera, Inc. All rights reserved.
<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/>
[Not connected] &gt; </codeblock>
</li>
<li>
Locate the hostname of a DataNode within the cluster that is running an instance of the
<cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something
other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the
port number also.
</li>
<li>
Use the <codeph>connect</codeph> command to connect to an Impala instance. Enter a command of the form:
<codeblock>[Not connected] &gt; connect <varname>impalad-host</varname>
[<varname>impalad-host</varname>:21000] &gt;</codeblock>
<note>
Replace <varname>impalad-host</varname> with the hostname you have configured for any DataNode running
Impala in your environment. The changed prompt indicates a successful connection.
</note>
</li>
</ol>
<p>
<b>To start <cmdname>impala-shell</cmdname> in a specific database:</b>
</p>
<p>
You can use all the same connection options as in previous examples.
For simplicity, these examples assume that you are logged into one of
the DataNodes that is running the <cmdname>impalad</cmdname> daemon.
</p>
<ol>
<li>
Find the name of the database containing the relevant tables, views, and so
on that you want to operate on.
</li>
<li>
Use the <codeph>-d</codeph> option to the
<cmdname>impala-shell</cmdname> interpreter to connect and immediately
switch to the specified database, without the need for a <codeph>USE</codeph>
statement or fully qualified names:
<codeblock>
# Subsequent queries with unqualified names operate on
# tables, views, and so on inside the database named 'staging'.
$ impala-shell -i localhost -d staging
# It is common during development, ETL, benchmarking, and so on
# to have different databases containing the same table names
# but with different contents or layouts.
$ impala-shell -i localhost -d parquet_snappy_compression
$ impala-shell -i localhost -d parquet_gzip_compression
</codeblock>
</li>
</ol>
<p>
<b>To run one or several statements in non-interactive mode:</b>
</p>
<p>
You can use all the same connection options as in previous examples.
For simplicity, these examples assume that you are logged into one of
the DataNodes that is running the <cmdname>impalad</cmdname> daemon.
</p>
<ol>
<li>
Construct a statement, or a file containing a sequence of statements,
that you want to run in an automated way, without typing or copying
and pasting each time.
</li>
<li>
Invoke <cmdname>impala-shell</cmdname> with the <codeph>-q</codeph> option to run a single statement, or
the <codeph>-f</codeph> option to run a sequence of statements from a file.
The <cmdname>impala-shell</cmdname> command returns immediately, without going into
the interactive interpreter.
<codeblock>
# A utility command that you might run while developing shell scripts
# to manipulate HDFS files.
$ impala-shell -i localhost -d database_of_interest -q 'show tables'
# A sequence of CREATE TABLE, CREATE VIEW, and similar DDL statements
# can go into a file to make the setup process repeatable.
$ impala-shell -i localhost -d database_of_interest -f recreate_tables.sql
</codeblock>
</li>
</ol>
</conbody>
</concept>


@@ -0,0 +1,758 @@
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="conversion_functions">
<title>Impala Type Conversion Functions</title>
<titlealts audience="PDF"><navtitle>Type Conversion Functions</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
</metadata>
</prolog>
<conbody>
<p>
Conversion functions are usually used in combination with other functions, to explicitly pass the expected
data types. Impala has strict rules regarding data types for function parameters. For example, Impala does
not automatically convert a <codeph>DOUBLE</codeph> value to <codeph>FLOAT</codeph>, a
<codeph>BIGINT</codeph> value to <codeph>INT</codeph>, or other conversion where precision could be lost or
overflow could occur. Also, for reporting or dealing with loosely defined schemas in big data contexts,
you might frequently need to convert values to or from the <codeph>STRING</codeph> type.
</p>
<note>
Although in CDH 5.5.0, the <codeph>SHOW FUNCTIONS</codeph> output for
database <codeph>_IMPALA_BUILTINS</codeph> contains some function signatures
matching the pattern <codeph>castto*</codeph>, these functions are not intended
for public use and are expected to be hidden in future.
</note>
<p>
<b>Function reference:</b>
</p>
<p>
Impala supports the following type conversion functions:
</p>
<dl>
<dlentry id="cast">
<dt>
<codeph>cast(<varname>expr</varname> AS <varname>type</varname>)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">cast() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to any other type.
If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Usage notes:</b>
Use <codeph>CAST</codeph> when passing a column value or literal to a function that
expects a parameter with a different type.
Frequently used in SQL operations such as <codeph>CREATE TABLE AS SELECT</codeph>
and <codeph>INSERT ... VALUES</codeph> to ensure that values from various sources
are of the appropriate type for the destination columns.
Where practical, do a one-time <codeph>CAST()</codeph> operation during the ingestion process
to make each column into the appropriate type, rather than using many <codeph>CAST()</codeph>
operations in each query; doing type conversions for each row during each query can be expensive
for tables with millions or billions of rows.
</p>
<p conref="../shared/impala_common.xml#common/timezone_conversion_caveat"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>select concat('Here are the first ',10,' results.'); -- Fails
select concat('Here are the first ',cast(10 as string),' results.'); -- Succeeds
</codeblock>
<p>
The following example starts with a text table where every column has a type of <codeph>STRING</codeph>,
          which might be how you ingest data of unknown schema until you can verify the cleanliness of the underlying values.
Then it uses <codeph>CAST()</codeph> to create a new Parquet table with the same data, but using specific
numeric data types for the columns with numeric data. Using numeric types of appropriate sizes can result in
substantial space savings on disk and in memory, and performance improvements in queries,
over using strings or larger-than-necessary numeric types.
</p>
<codeblock>create table t1 (name string, x string, y string, z string);
create table t2 stored as parquet
as select
name,
cast(x as bigint) x,
cast(y as timestamp) y,
cast(z as smallint) z
from t1;
describe t2;
+------+-----------+---------+
| name | type      | comment |
+------+-----------+---------+
| name | string    |         |
| x    | bigint    |         |
| y    | timestamp |         |
| z    | smallint  |         |
+------+-----------+---------+
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<!-- TK: Can you cast to or from MAP, ARRAY, STRUCT? -->
For details of casts from each kind of data type, see the description of
the appropriate type:
<xref href="impala_tinyint.xml#tinyint"/>,
<xref href="impala_smallint.xml#smallint"/>,
<xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>,
<xref href="impala_float.xml#float"/>,
<xref href="impala_double.xml#double"/>,
<xref href="impala_decimal.xml#decimal"/>,
<xref href="impala_string.xml#string"/>,
<xref href="impala_char.xml#char"/>,
<xref href="impala_varchar.xml#varchar"/>,
<xref href="impala_timestamp.xml#timestamp"/>,
<xref href="impala_boolean.xml#boolean"/>
</p>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttobigint" audience="Cloudera">
<dt>
<codeph>casttobigint(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttobigint() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>BIGINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>bigint</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>create table small_types (x tinyint, y smallint, z int);
create table big_types as
select casttobigint(x) as x, casttobigint(y) as y, casttobigint(z) as z
from small_types;
describe big_types;
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| x | bigint | |
| y | bigint | |
| z | bigint | |
+------+--------+---------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttoboolean" audience="Cloudera">
<dt>
<codeph>casttoboolean(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttoboolean() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>BOOLEAN</codeph>.
Numeric values of 0 evaluate to <codeph>false</codeph>, and non-zero values evaluate to <codeph>true</codeph>.
If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
In particular, <codeph>STRING</codeph> values (even <codeph>'1'</codeph>, <codeph>'0'</codeph>, <codeph>'true'</codeph>
or <codeph>'false'</codeph>) always return <codeph>NULL</codeph> when converted to <codeph>BOOLEAN</codeph>.
<p><b>Return type:</b> <codeph>boolean</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttoboolean(0);
+------------------+
| casttoboolean(0) |
+------------------+
| false |
+------------------+
select casttoboolean(1);
+------------------+
| casttoboolean(1) |
+------------------+
| true |
+------------------+
select casttoboolean(99);
+-------------------+
| casttoboolean(99) |
+-------------------+
| true |
+-------------------+
select casttoboolean(0.0);
+--------------------+
| casttoboolean(0.0) |
+--------------------+
| false |
+--------------------+
select casttoboolean(0.5);
+--------------------+
| casttoboolean(0.5) |
+--------------------+
| true |
+--------------------+
select casttoboolean('');
+-------------------+
| casttoboolean('') |
+-------------------+
| NULL |
+-------------------+
select casttoboolean('yes');
+----------------------+
| casttoboolean('yes') |
+----------------------+
| NULL |
+----------------------+
select casttoboolean('0');
+--------------------+
| casttoboolean('0') |
+--------------------+
| NULL |
+--------------------+
select casttoboolean('true');
+-----------------------+
| casttoboolean('true') |
+-----------------------+
| NULL |
+-----------------------+
select casttoboolean('false');
+------------------------+
| casttoboolean('false') |
+------------------------+
| NULL |
+------------------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttochar" audience="Cloudera">
<dt>
<codeph>casttochar(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttochar() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>CHAR</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>char</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>create table char_types as select casttochar('hello world') as c1, casttochar('xyz') as c2, casttochar('x') as c3;
+-------------------+
| summary |
+-------------------+
| Inserted 1 row(s) |
+-------------------+
describe char_types;
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| c1 | string | |
| c2 | string | |
| c3 | string | |
+------+--------+---------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttodecimal" audience="Cloudera">
<dt>
<codeph>casttodecimal(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttodecimal() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>DECIMAL</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>decimal</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttodecimal(5.4);
+--------------------+
| casttodecimal(5.4) |
+--------------------+
| 5.4 |
+--------------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttodouble" audience="Cloudera">
<dt>
<codeph>casttodouble(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttodouble() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>DOUBLE</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>double</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttodouble(5);
+-----------------+
| casttodouble(5) |
+-----------------+
| 5 |
+-----------------+
select casttodouble('3.141');
+-----------------------+
| casttodouble('3.141') |
+-----------------------+
| 3.141 |
+-----------------------+
select casttodouble(1e6);
+--------------------+
| casttodouble(1e+6) |
+--------------------+
| 1000000 |
+--------------------+
select casttodouble(true);
+--------------------+
| casttodouble(true) |
+--------------------+
| 1 |
+--------------------+
select casttodouble(now());
+---------------------+
| casttodouble(now()) |
+---------------------+
| 1447622306.031178 |
+---------------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttofloat" audience="Cloudera">
<dt>
<codeph>casttofloat(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttofloat() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>FLOAT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>float</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttofloat(5);
+----------------+
| casttofloat(5) |
+----------------+
| 5 |
+----------------+
select casttofloat('3.141');
+----------------------+
| casttofloat('3.141') |
+----------------------+
| 3.141000032424927 |
+----------------------+
select casttofloat(1e6);
+-------------------+
| casttofloat(1e+6) |
+-------------------+
| 1000000 |
+-------------------+
select casttofloat(true);
+-------------------+
| casttofloat(true) |
+-------------------+
| 1 |
+-------------------+
select casttofloat(now());
+--------------------+
| casttofloat(now()) |
+--------------------+
| 1447622400 |
+--------------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttoint" audience="Cloudera">
<dt>
<codeph>casttoint(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttoint() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>INT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>int</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttoint(5.4);
+----------------+
| casttoint(5.4) |
+----------------+
| 5 |
+----------------+
select casttoint(true);
+-----------------+
| casttoint(true) |
+-----------------+
| 1 |
+-----------------+
select casttoint(now());
+------------------+
| casttoint(now()) |
+------------------+
| 1447622487 |
+------------------+
select casttoint('3.141');
+--------------------+
| casttoint('3.141') |
+--------------------+
| NULL |
+--------------------+
select casttoint('3');
+----------------+
| casttoint('3') |
+----------------+
| 3 |
+----------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttosmallint" audience="Cloudera">
<dt>
<codeph>casttosmallint(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttosmallint() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>SMALLINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>smallint</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>create table big_types (x bigint, y int, z smallint);
create table small_types as
select casttosmallint(x) as x, casttosmallint(y) as y, casttosmallint(z) as z
from big_types;
describe small_types;
+------+----------+---------+
| name | type | comment |
+------+----------+---------+
| x | smallint | |
| y | smallint | |
| z | smallint | |
+------+----------+---------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttostring" audience="Cloudera">
<dt>
<codeph>casttostring(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttostring() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>STRING</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>string</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>create table numeric_types (x int, y bigint, z tinyint);
create table string_types as
select casttostring(x) as x, casttostring(y) as y, casttostring(z) as z
from numeric_types;
describe string_types;
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| x | string | |
| y | string | |
| z | string | |
+------+--------+---------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttotimestamp" audience="Cloudera">
<dt>
<codeph>casttotimestamp(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttotimestamp() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>TIMESTAMP</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>timestamp</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttotimestamp(1000);
+-----------------------+
| casttotimestamp(1000) |
+-----------------------+
| 1970-01-01 00:16:40 |
+-----------------------+
select casttotimestamp(1000.0);
+-------------------------+
| casttotimestamp(1000.0) |
+-------------------------+
| 1970-01-01 00:16:40 |
+-------------------------+
select casttotimestamp('1000');
+-------------------------+
| casttotimestamp('1000') |
+-------------------------+
| NULL |
+-------------------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttotinyint" audience="Cloudera">
<dt>
<codeph>casttotinyint(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttotinyint() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>TINYINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>tinyint</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>create table big_types (x bigint, y int, z smallint);
create table tiny_types as
select casttotinyint(x) as x, casttotinyint(y) as y, casttotinyint(z) as z
from big_types;
describe tiny_types;
+------+---------+---------+
| name | type | comment |
+------+---------+---------+
| x | tinyint | |
| y | tinyint | |
| z | tinyint | |
+------+---------+---------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="casttovarchar" audience="Cloudera">
<dt>
<codeph>casttovarchar(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">casttovarchar() function</indexterm>
<b>Purpose:</b> Converts the value of an expression to <codeph>VARCHAR</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>.
<p><b>Return type:</b> <codeph>varchar</codeph></p>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/>
<codeblock>select casttovarchar('abcd');
+-----------------------+
| casttovarchar('abcd') |
+-----------------------+
| abcd |
+-----------------------+
select casttovarchar(999);
+--------------------+
| casttovarchar(999) |
+--------------------+
| 999 |
+--------------------+
select casttovarchar(999.5);
+----------------------+
| casttovarchar(999.5) |
+----------------------+
| 999.5 |
+----------------------+
select casttovarchar(now());
+-------------------------------+
| casttovarchar(now()) |
+-------------------------------+
| 2015-11-15 21:26:13.528073000 |
+-------------------------------+
select casttovarchar(true);
+---------------------+
| casttovarchar(true) |
+---------------------+
| 1 |
+---------------------+
</codeblock>
</dd>
</dlentry>
<dlentry rev="2.3.0" id="typeof">
<dt>
<codeph>typeof(type value)</codeph>
</dt>
<dd>
<indexterm audience="Cloudera">typeof() function</indexterm>
<b>Purpose:</b> Returns the name of the data type corresponding to an expression. For types with
extra attributes, such as length for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph>,
or precision and scale for <codeph>DECIMAL</codeph>, includes the full specification of the type.
<!-- To do: How about for columns of complex types? Or fields within complex types? -->
<p><b>Return type:</b> <codeph>string</codeph></p>
<p><b>Usage notes:</b> Typically used in interactive exploration of a schema, or in application code that programmatically generates schema definitions such as <codeph>CREATE TABLE</codeph> statements.
For example, previously, to understand the type of an expression such as
<codeph>col1 / col2</codeph> or <codeph>concat(col1, col2, col3)</codeph>,
you might have created a dummy table with a single row, using syntax such as <codeph>CREATE TABLE foo AS SELECT 5 / 3.0</codeph>,
and then issued a <codeph>DESCRIBE</codeph> to see the data type of the resulting column.
Or you might have done a <codeph>CREATE TABLE AS SELECT</codeph> operation to create a table and
copy data into it, only learning the types of the columns by doing a <codeph>DESCRIBE</codeph> afterward.
This technique is especially useful for arithmetic expressions involving <codeph>DECIMAL</codeph> types,
because the precision and scale of the result is typically different than that of the operands.
</p>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples show how to check the type of a simple literal or function value.
Notice how adding even tiny integers together changes the data type of the result to
avoid overflow, and how the results of arithmetic operations on <codeph>DECIMAL</codeph> values
have specific precision and scale attributes.
</p>
<codeblock>select typeof(2);
+-----------+
| typeof(2) |
+-----------+
| TINYINT |
+-----------+
select typeof(2+2);
+---------------+
| typeof(2 + 2) |
+---------------+
| SMALLINT |
+---------------+
select typeof('xyz');
+---------------+
| typeof('xyz') |
+---------------+
| STRING |
+---------------+
select typeof(now());
+---------------+
| typeof(now()) |
+---------------+
| TIMESTAMP |
+---------------+
select typeof(5.3 / 2.1);
+-------------------+
| typeof(5.3 / 2.1) |
+-------------------+
| DECIMAL(6,4) |
+-------------------+
select typeof(5.30001 / 2342.1);
+--------------------------+
| typeof(5.30001 / 2342.1) |
+--------------------------+
| DECIMAL(13,11) |
+--------------------------+
select typeof(typeof(2+2));
+-----------------------+
| typeof(typeof(2 + 2)) |
+-----------------------+
| STRING |
+-----------------------+
</codeblock>
<p>
This example shows how even if you do not have a record of the type of a column,
for example because the type was changed by <codeph>ALTER TABLE</codeph> after the
original <codeph>CREATE TABLE</codeph>, you can still find out the type in a
more compact form than examining the full <codeph>DESCRIBE</codeph> output.
Remember to use <codeph>LIMIT 1</codeph> in such cases, to avoid an identical
result value for every row in the table.
</p>
<codeblock>create table typeof_example (a int, b tinyint, c smallint, d bigint);
/* Empty result set if there is no data in the table. */
select typeof(a) from typeof_example;
/* OK, now we have some data but the type of column A is being changed. */
insert into typeof_example values (1, 2, 3, 4);
alter table typeof_example change a a bigint;
/* We can always find out the current type of that column without doing a full DESCRIBE. */
select typeof(a) from typeof_example limit 1;
+-----------+
| typeof(a) |
+-----------+
| BIGINT |
+-----------+
</codeblock>
<p>
This example shows how you might programmatically generate a <codeph>CREATE TABLE</codeph> statement
with the appropriate column definitions to hold the result values of arbitrary expressions.
The <codeph>typeof()</codeph> function lets you construct a detailed <codeph>CREATE TABLE</codeph> statement
without actually creating the table, as opposed to <codeph>CREATE TABLE AS SELECT</codeph> operations
where you create the destination table but only learn the column data types afterward through <codeph>DESCRIBE</codeph>.
</p>
<codeblock>describe typeof_example;
+------+----------+---------+
| name | type | comment |
+------+----------+---------+
| a | bigint | |
| b | tinyint | |
| c | smallint | |
| d | bigint | |
+------+----------+---------+
/* An ETL or business intelligence tool might create variations on a table with different file formats,
different sets of columns, and so on. TYPEOF() lets an application introspect the types of the original columns. */
select concat('create table derived_table (a ', typeof(a), ', b ', typeof(b), ', c ',
typeof(c), ', d ', typeof(d), ') stored as parquet;')
as 'create table statement'
from typeof_example limit 1;
+-------------------------------------------------------------------------------------------+
| create table statement |
+-------------------------------------------------------------------------------------------+
| create table derived_table (a BIGINT, b TINYINT, c SMALLINT, d BIGINT) stored as parquet; |
+-------------------------------------------------------------------------------------------+
</codeblock>
</dd>
</dlentry>
</dl>
</conbody>
</concept>

View File

@@ -0,0 +1,236 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="count">
<title>COUNT Function</title>
<titlealts audience="PDF"><navtitle>COUNT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="Analytic Functions"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">count() function</indexterm>
An aggregate function that returns the number of rows, or the number of non-<codeph>NULL</codeph> rows.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>COUNT([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock>
<p>
Depending on the argument, <codeph>COUNT()</codeph> considers rows that meet certain conditions:
</p>
<ul>
<li>
The notation <codeph>COUNT(*)</codeph> includes <codeph>NULL</codeph> values in the total.
</li>
<li>
The notation <codeph>COUNT(<varname>column_name</varname>)</codeph> only considers rows where the column
contains a non-<codeph>NULL</codeph> value.
</li>
<li>
You can also combine <codeph>COUNT</codeph> with the <codeph>DISTINCT</codeph> operator to eliminate
duplicates before counting, and to count the combinations of values across multiple columns.
</li>
</ul>
<p>
When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of
grouping values.
</p>
<p>
<b>Return type:</b> <codeph>BIGINT</codeph>
</p>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p conref="../shared/impala_common.xml#common/partition_key_optimization"/>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>
<p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>-- How many rows total are in the table, regardless of NULL values?
select count(*) from t1;
-- How many rows are in the table with non-NULL values for a column?
select count(c1) from t1;
-- Count the rows that meet certain conditions.
-- Again, * includes NULLs, so COUNT(*) might be greater than COUNT(col).
select count(*) from t1 where x &gt; 10;
select count(c1) from t1 where x &gt; 10;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Combine COUNT and DISTINCT to find the number of unique values.
-- Must use column names rather than * with COUNT(DISTINCT ...) syntax.
-- Rows with NULL values are not counted.
select count(distinct c1) from t1;
-- Rows with a NULL value in _either_ column are not counted.
select count(distinct c1, c2) from t1;
-- Return more than one result.
select month, year, count(distinct visitor_id) from web_stats group by month, year;
</codeblock>
<p rev="2.0.0">
The following examples show how to use <codeph>COUNT()</codeph> in an analytic context. They use a table
containing integers from 1 to 10. Notice how the <codeph>COUNT()</codeph> is reported for each input value, as
opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set.
<codeblock>select x, property, count(x) over (partition by property) as count from int_t where property in ('odd','even');
+----+----------+-------+
| x | property | count |
+----+----------+-------+
| 2 | even | 5 |
| 4 | even | 5 |
| 6 | even | 5 |
| 8 | even | 5 |
| 10 | even | 5 |
| 1 | odd | 5 |
| 3 | odd | 5 |
| 5 | odd | 5 |
| 7 | odd | 5 |
| 9 | odd | 5 |
+----+----------+-------+
</codeblock>
Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving
set of rows (the <q>window</q>). The following examples use <codeph>COUNT()</codeph> in an analytic context
(that is, with an <codeph>OVER()</codeph> clause) to produce a running count of all the even values,
then a running count of all the odd values. The basic <codeph>ORDER BY x</codeph> clause implicitly
activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
therefore all of these examples produce the same results:
<codeblock>select x, property,
count(x) over (partition by property <b>order by x</b>) as 'cumulative count'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x | property | cumulative count |
+----+----------+------------------+
| 2 | even | 1 |
| 4 | even | 2 |
| 6 | even | 3 |
| 8 | even | 4 |
| 10 | even | 5 |
| 1 | odd | 1 |
| 3 | odd | 2 |
| 5 | odd | 3 |
| 7 | odd | 4 |
| 9 | odd | 5 |
+----+----------+------------------+
select x, property,
count(x) over
(
partition by property
<b>order by x</b>
<b>range between unbounded preceding and current row</b>
) as 'cumulative count'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x | property | cumulative count |
+----+----------+------------------+
| 2 | even | 1 |
| 4 | even | 2 |
| 6 | even | 3 |
| 8 | even | 4 |
| 10 | even | 5 |
| 1 | odd | 1 |
| 3 | odd | 2 |
| 5 | odd | 3 |
| 7 | odd | 4 |
| 9 | odd | 5 |
+----+----------+------------------+
select x, property,
count(x) over
(
partition by property
<b>order by x</b>
<b>rows between unbounded preceding and current row</b>
) as 'cumulative count'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x | property | cumulative count |
+----+----------+------------------+
| 2 | even | 1 |
| 4 | even | 2 |
| 6 | even | 3 |
| 8 | even | 4 |
| 10 | even | 5 |
| 1 | odd | 1 |
| 3 | odd | 2 |
| 5 | odd | 3 |
| 7 | odd | 4 |
| 9 | odd | 5 |
+----+----------+------------------+
</codeblock>
The following examples show how to construct a moving window, with a running count taking into account 1 row before
and 1 row after the current row, within the same partition (all the even values or all the odd values).
Therefore, the count is consistently 3 for rows in the middle of the window, and 2 for
rows near the ends of the window, where there is no preceding or no following row in the partition.
Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of
moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph>
clause:
<codeblock>select x, property,
count(x) over
(
partition by property
<b>order by x</b>
<b>rows between 1 preceding and 1 following</b>
) as 'moving total'
from int_t where property in ('odd','even');
+----+----------+--------------+
| x | property | moving total |
+----+----------+--------------+
| 2 | even | 2 |
| 4 | even | 3 |
| 6 | even | 3 |
| 8 | even | 3 |
| 10 | even | 2 |
| 1 | odd | 2 |
| 3 | odd | 3 |
| 5 | odd | 3 |
| 7 | odd | 3 |
| 9 | odd | 2 |
+----+----------+--------------+
-- Doesn't work because of syntax restriction on RANGE clause.
select x, property,
count(x) over
(
partition by property
<b>order by x</b>
<b>range between 1 preceding and 1 following</b>
) as 'moving total'
from int_t where property in ('odd','even');
ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
</codeblock>
</p>
<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_analytic_functions.xml#analytic_functions"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" rev="1.4.0" id="create_data_source">
<title>CREATE DATA SOURCE Statement</title>
<titlealts audience="PDF"><navtitle>CREATE DATA SOURCE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">CREATE DATA SOURCE statement</indexterm>
</p>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
</conbody>
</concept>

View File

@@ -0,0 +1,137 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="create_database">
<title>CREATE DATABASE Statement</title>
<titlealts audience="PDF"><navtitle>CREATE DATABASE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Databases"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="DDL"/>
<data name="Category" value="S3"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">CREATE DATABASE statement</indexterm>
Creates a new database.
</p>
<p>
In Impala, a database is both:
</p>
<ul>
<li>
A logical construct for grouping together related tables, views, and functions within their own namespace.
You might use a separate database for each application, set of related tables, or round of experimentation.
</li>
<li>
A physical construct represented by a directory tree in HDFS. Tables (internal tables), partitions, and
data files are all located under this directory. You can perform HDFS-level operations such as backing it up and measuring space usage,
or remove it with a <codeph>DROP DATABASE</codeph> statement.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <varname>database_name</varname> [COMMENT '<varname>database_comment</varname>']
[LOCATION <varname>hdfs_path</varname>];</codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
A database is physically represented as a directory in HDFS, with a filename extension <codeph>.db</codeph>,
under the main Impala data directory. If the associated HDFS directory does not exist, it is created for you.
All databases and their associated directories are top-level objects, with no physical or logical nesting.
</p>
<p>
After creating a database, to make it the current database within an <cmdname>impala-shell</cmdname> session,
use the <codeph>USE</codeph> statement. You can refer to tables in the current database without prepending
any qualifier to their names.
</p>
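<p>
A minimal sketch of this sequence (the database and table names here are hypothetical):
</p>
<codeblock>-- Create a database, switch to it, then create a table without a db_name. qualifier.
create database analytics_sandbox;
use analytics_sandbox;
create table t1 (x int);</codeblock>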
<p>
When you first connect to Impala through <cmdname>impala-shell</cmdname>, the database you start in (before
issuing any <codeph>CREATE DATABASE</codeph> or <codeph>USE</codeph> statements) is named
<codeph>default</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/builtins_db"/>
<p>
After creating a database, your <cmdname>impala-shell</cmdname> session or another
<cmdname>impala-shell</cmdname> connected to the same node can immediately access that database. To access
the database through the Impala daemon on a different node, issue the <codeph>INVALIDATE METADATA</codeph>
statement first while connected to that other node.
</p>
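<p>
For example, a session on a different node might run the following sketch
(reusing the hypothetical database name from the previous example):
</p>
<codeblock>-- Connected to a different Impala daemon than the one where the database was created.
invalidate metadata;
use analytics_sandbox;</codeblock>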
<p>
Setting the <codeph>LOCATION</codeph> attribute for a new database is a way to work with sets of files in an
HDFS directory structure outside the default Impala data directory, as opposed to setting the
<codeph>LOCATION</codeph> attribute for each individual table.
</p>
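<p>
A minimal sketch of this technique, using a hypothetical HDFS path outside the
default Impala data directory:
</p>
<codeblock>create database external_etl
  comment 'Tables whose data files live outside the default Impala data directory'
  location '/user/etl/external_etl.db';</codeblock>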
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/hive_blurb"/>
<p>
When you create a database in Impala, the database can also be used by Hive.
When you create a database in Hive, issue an <codeph>INVALIDATE METADATA</codeph>
statement in Impala to make Impala permanently aware of the new database.
</p>
<p>
The <codeph>SHOW DATABASES</codeph> statement lists all databases, or the databases whose name
matches a wildcard pattern. <ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, the
<codeph>SHOW DATABASES</codeph> output includes a second column that displays the associated
comment, if any, for each database.</ph>
</p>
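<p>
For example, using a hypothetical wildcard pattern:
</p>
<codeblock>-- List every database.
show databases;
-- List only databases whose names match a pattern.
show databases like 'analytics*';</codeblock>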
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.6.0 CDH-39913 IMPALA-1878">
To specify that any tables created within a database reside on the Amazon S3 system,
you can include an <codeph>s3a://</codeph> prefix on the <codeph>LOCATION</codeph>
attribute. In <keyword keyref="impala26_full"/> and higher, Impala automatically creates any
required folders as the databases, tables, and partitions are created, and removes
them when they are dropped.
</p>
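<p>
A minimal sketch, with a hypothetical bucket name and path:
</p>
<codeblock>create database s3_staging location 's3a://impala-demo-bucket/staging/s3_staging.db';</codeblock>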
<p conref="../shared/impala_common.xml#common/s3_ddl"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have write
permission for the parent HDFS directory under which the database
is located.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock conref="../shared/impala_common.xml#common/create_drop_db_example"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_databases.xml#databases"/>, <xref href="impala_drop_database.xml#drop_database"/>,
<xref href="impala_use.xml#use"/>, <xref href="impala_show.xml#show_databases"/>,
<xref href="impala_tables.xml#tables"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,492 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="create_function">
<title>CREATE FUNCTION Statement</title>
<titlealts audience="PDF"><navtitle>CREATE FUNCTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="UDFs"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">CREATE FUNCTION statement</indexterm>
Creates a user-defined function (UDF), which you can use to implement custom logic during
<codeph>SELECT</codeph> or <codeph>INSERT</codeph> operations.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
The syntax is different depending on whether you create a scalar UDF, which is called once for each row and
implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by
multiple functions that compute intermediate results across sets of rows.
</p>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
In <keyword keyref="impala25_full"/> and higher, the syntax is also different for creating or dropping scalar Java-based UDFs.
The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs
created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently
between Impala and Hive.
</p>
<p>
To create a persistent scalar C++ UDF with <codeph>CREATE FUNCTION</codeph>:
</p>
<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]])
RETURNS <varname>return_type</varname>
LOCATION '<varname>hdfs_path_to_dot_so</varname>'
SYMBOL='<varname>symbol_name</varname>'</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
To create a persistent Java UDF with <codeph>CREATE FUNCTION</codeph>:
<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>
LOCATION '<varname>hdfs_path_to_jar</varname>'
SYMBOL='<varname>class_name</varname>'</codeblock>
</p>
<!--
Examples:
CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;
-->
<p>
To create a persistent UDA, which must be written in C++, issue a <codeph>CREATE AGGREGATE FUNCTION</codeph> statement:
</p>
<codeblock>CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]])
RETURNS <varname>return_type</varname>
LOCATION '<varname>hdfs_path</varname>'
[INIT_FN='<varname>function</varname>']
UPDATE_FN='<varname>function</varname>'
MERGE_FN='<varname>function</varname>'
[PREPARE_FN='<varname>function</varname>']
[CLOSE_FN='<varname>function</varname>']
<ph rev="2.0.0">[SERIALIZE_FN='<varname>function</varname>']</ph>
[FINALIZE_FN='<varname>function</varname>']
<ph rev="2.3.0 IMPALA-1829 CDH-30572">[INTERMEDIATE <varname>type_spec</varname>]</ph></codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p>
<b>Varargs notation:</b>
</p>
<note rev="CDH-39271 CDH-38572">
<p rev="CDH-39271 CDH-38572">
Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs.
</p>
</note>
<p>
If the underlying implementation of your function accepts a variable number of arguments:
</p>
<ul>
<li>
The variable arguments must go last in the argument list.
</li>
<li>
The variable arguments must all be of the same type.
</li>
<li>
You must include at least one instance of the variable arguments in every function call invoked from SQL.
</li>
<li>
You designate the variable portion of the argument list in the <codeph>CREATE FUNCTION</codeph> statement
by including <codeph>...</codeph> immediately after the type name of the first variable argument. For
example, to create a function that accepts an <codeph>INT</codeph> argument, followed by a
<codeph>BOOLEAN</codeph>, followed by one or more <codeph>STRING</codeph> arguments, your <codeph>CREATE
FUNCTION</codeph> statement would look like:
<codeblock>CREATE FUNCTION <varname>func_name</varname> (INT, BOOLEAN, STRING ...)
RETURNS <varname>type</varname> LOCATION '<varname>path</varname>' SYMBOL='<varname>entry_point</varname>';
</codeblock>
</li>
</ul>
<p rev="CDH-39271 CDH-38572">
See <xref href="impala_udf.xml#udf_varargs"/> for how to code a C++ UDF to accept
variable-length argument lists.
</p>
<p>
<b>Scalar and aggregate functions:</b>
</p>
<p>
The simplest kind of user-defined function returns a single scalar value each time it is called, typically
once for each row in the result set. This general kind of function is what is usually meant by UDF.
User-defined aggregate functions (UDAs) are a specialized kind of UDF that produce a single value based on
the contents of multiple rows. You usually use UDAs in combination with a <codeph>GROUP BY</codeph> clause to
condense a large result set into a smaller one, or even a single row summarizing column values across an
entire table.
</p>
<p>
You create UDAs by using the <codeph>CREATE AGGREGATE FUNCTION</codeph> syntax. The clauses
<codeph>INIT_FN</codeph>, <codeph>UPDATE_FN</codeph>, <codeph>MERGE_FN</codeph>,
<ph rev="2.0.0"><codeph>SERIALIZE_FN</codeph>,</ph> <codeph>FINALIZE_FN</codeph>, and
<codeph>INTERMEDIATE</codeph> only apply when you create a UDA rather than a scalar UDF.
</p>
<p>
The <codeph>*_FN</codeph> clauses specify functions to call at different phases of function processing.
</p>
<ul>
<li>
<b>Initialize:</b> The function you specify with the <codeph>INIT_FN</codeph> clause does any initial
setup, such as initializing member variables in internal data structures. This function is often a stub for
simple UDAs. You can omit this clause and a default (no-op) function will be used.
</li>
<li>
<b>Update:</b> The function you specify with the <codeph>UPDATE_FN</codeph> clause is called once for each
row in the original result set, that is, before any <codeph>GROUP BY</codeph> clause is applied. A separate
instance of the function is called for each different value returned by the <codeph>GROUP BY</codeph>
clause. The final argument passed to this function is a pointer, to which you write an updated value based
on its original value and the value of the first argument.
</li>
<li>
<b>Merge:</b> The function you specify with the <codeph>MERGE_FN</codeph> clause is called an arbitrary
number of times, to combine intermediate values produced by different nodes or different threads as Impala
reads and processes data files in parallel. The final argument passed to this function is a pointer, to
which you write an updated value based on its original value and the value of the first argument.
</li>
<li rev="2.0.0">
<b>Serialize:</b> The function you specify with the <codeph>SERIALIZE_FN</codeph> clause frees memory
allocated to intermediate results. It is required if any memory was allocated by the Allocate function in
the Init, Update, or Merge functions, or if the intermediate type contains any pointers. See
<xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the
UDA code samples</xref> for details.
</li>
<li>
<b>Finalize:</b> The function you specify with the <codeph>FINALIZE_FN</codeph> clause does any required
teardown for resources acquired by your UDF, such as freeing memory, closing file handles if you explicitly
opened any files, and so on. This function is often a stub for simple UDAs. You can omit this clause and a
default (no-op) function will be used. It is required in UDAs where the final return type is different from
the intermediate type, or if any memory was allocated by the Allocate function in the Init, Update, or
Merge functions. See
<xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the
UDA code samples</xref> for details.
</li>
</ul>
<p>
If you use a consistent naming convention for each of the underlying functions, Impala can automatically
determine the names based on the first such clause, so the others are optional.
</p>
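<p>
For example, the following sketch (with hypothetical function, library, and symbol names)
relies on that naming convention: only the mandatory <codeph>UPDATE_FN</codeph> and
<codeph>MERGE_FN</codeph> clauses are specified, and Impala derives the names of the
remaining phase functions from the same prefix:
</p>
<codeblock>create aggregate function sum_of_squares(double)
  returns double
  location '/user/impala/udfs/my_udas.so'
  update_fn='SumOfSquaresUpdate'
  merge_fn='SumOfSquaresMerge';</codeblock>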
<p audience="Cloudera">
The <codeph>INTERMEDIATE</codeph> clause specifies the data type of intermediate values passed from the
<q>update</q> phase to the <q>merge</q> phase, and from the <q>merge</q> phase to the <q>finalize</q> phase.
You can use any of the existing Impala data types, or the special notation
<codeph>CHAR(<varname>n</varname>)</codeph> to allocate a scratch area of <varname>n</varname> bytes for the
intermediate result. For example, if the different phases of your UDA pass strings to each other but in the
end the function returns a <codeph>BIGINT</codeph> value, you would specify <codeph>INTERMEDIATE
STRING</codeph>. Likewise, if the different phases of your UDA pass 2 separate <codeph>BIGINT</codeph> values
between them (8 bytes each), you would specify <codeph>INTERMEDIATE CHAR(16)</codeph> so that each function
could read from and write to a 16-byte buffer.
</p>
<p>
For end-to-end examples of UDAs, see <xref href="impala_udf.xml#udfs"/>.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<ul>
<li>
You can write Impala UDFs in either C++ or Java. C++ UDFs are new to Impala, and are the recommended format
for high performance utilizing native code. Java-based UDFs are compatible between Impala and Hive, and are
most suited to reusing existing Hive UDFs. (Impala can run Java-based Hive UDFs but not Hive UDAs.)
</li>
<li rev="2.5.0 IMPALA-1748 CDH-38369 IMPALA-2843 CDH-39148">
<keyword keyref="impala25_full"/> introduces UDF improvements to persistence for both C++ and Java UDFs,
and better compatibility between Impala and Hive for Java UDFs.
See <xref href="impala_udf.xml#udfs"/> for details.
</li>
<li>
The body of the UDF is represented by a <codeph>.so</codeph> or <codeph>.jar</codeph> file, which you store
in HDFS and the <codeph>CREATE FUNCTION</codeph> statement distributes to each Impala node.
</li>
<li>
Impala calls the underlying code during SQL statement evaluation, as many times as needed to process all
the rows from the result set. All UDFs are assumed to be deterministic, that is, to always return the same
result when passed the same argument values. Impala might or might not skip some invocations of a UDF if
the result value is already known from a previous call. Therefore, do not rely on the UDF being called a
specific number of times, and do not return different result values based on some external factor such as
the current time, a random number function, or an external data source that could be updated while an
Impala query is in progress.
</li>
<li>
The names of the function arguments in the UDF are not significant, only their number, positions, and data
types.
</li>
<li>
You can overload the same function name by creating multiple versions of the function, each with a
different argument signature. For security reasons, you cannot make a UDF with the same name as any
built-in function.
</li>
<li>
In the UDF code, you represent the function return result as a <codeph>struct</codeph>. This
<codeph>struct</codeph> contains 2 fields. The first field is a <codeph>boolean</codeph> representing
whether the value is <codeph>NULL</codeph> or not. (When this field is <codeph>true</codeph>, the return
value is interpreted as <codeph>NULL</codeph>.) The second field is the same type as the specified function
return type, and holds the return value when the function returns something other than
<codeph>NULL</codeph>.
</li>
<li>
In the UDF code, you represent the function arguments as an initial pointer to a UDF context structure,
followed by references to zero or more <codeph>struct</codeph>s, corresponding to each of the arguments.
Each <codeph>struct</codeph> has the same 2 fields as with the return value, a <codeph>boolean</codeph>
field representing whether the argument is <codeph>NULL</codeph>, and a field of the appropriate type
holding any non-<codeph>NULL</codeph> argument value.
</li>
<li>
For sample code and build instructions for UDFs,
see <xref href="https://github.com/cloudera/impala/tree/master/be/src/udf_samples" scope="external" format="html">the sample UDFs in the Impala github repo</xref>.
</li>
<li>
Because the file representing the body of the UDF is stored in HDFS, it is automatically available to all
the Impala nodes. You do not need to manually copy any UDF-related files between servers.
</li>
<li>
Because Impala currently does not have any <codeph>ALTER FUNCTION</codeph> statement, if you need to rename
a function, move it to a different database, or change its signature or other properties, issue a
<codeph>DROP FUNCTION</codeph> statement for the original function followed by a <codeph>CREATE
FUNCTION</codeph> with the desired properties, as sketched in the example following this list.
</li>
<li>
Because each UDF is associated with a particular database, either issue a <codeph>USE</codeph> statement
before doing any <codeph>CREATE FUNCTION</codeph> statements, or specify the name of the function as
<codeph><varname>db_name</varname>.<varname>function_name</varname></codeph>.
</li>
</ul>
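<p>
For example, the following sketch (with a hypothetical function name, signature, and
library path) shows the drop-and-recreate pattern used in place of the nonexistent
<codeph>ALTER FUNCTION</codeph> statement, here to move a scalar C++ UDF to a different
database:
</p>
<codeblock>drop function if exists staging.my_lower(string);
create function if not exists prod.my_lower(string)
  returns string
  location '/user/impala/udfs/my_udfs.so'
  symbol='MyLower';</codeblock>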
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<p>
Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types
(not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but
not Impala UDFs written in C++.
</p>
<p conref="../shared/impala_common.xml#common/current_user_caveat"/>
<p><b>Persistence:</b></p>
<p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
For additional examples of all kinds of user-defined functions, see <xref href="impala_udf.xml#udfs"/>.
</p>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The following example shows how to take a Java jar file and make all the functions inside one of its classes
into UDFs under a single (overloaded) function name in Impala. Each <codeph>CREATE FUNCTION</codeph> or
<codeph>DROP FUNCTION</codeph> statement applies to all the overloaded Java functions with the same name.
This example uses the signatureless syntax for <codeph>CREATE FUNCTION</codeph> and <codeph>DROP FUNCTION</codeph>,
which is available in <keyword keyref="impala25_full"/> and higher.
</p>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is
available for Impala to reference through the <codeph>CREATE FUNCTION</codeph> statement and
queries that refer to the Impala function name.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
$ jar -tvf udf-examples-cdh570.jar
0 Mon Feb 22 04:06:50 PST 2016 META-INF/
122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF
0 Mon Feb 22 04:06:46 PST 2016 com/
0 Mon Feb 22 04:06:46 PST 2016 com/cloudera/
0 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/
2460 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/IncompatibleUdfTest.class
541 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/TestUdfException.class
3438 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/JavaUdfTest.class
5872 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/TestUdf.class
...
$ hdfs dfs -put udf-examples-cdh570.jar /user/impala/udfs
$ hdfs dfs -ls /user/impala/udfs
Found 2 items
-rw-r--r-- 3 jrussell supergroup 853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar
-rw-r--r-- 3 jrussell supergroup 7366 2016-06-08 14:25 /user/impala/udfs/udf-examples-cdh570.jar
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
In <cmdname>impala-shell</cmdname>, the <codeph>CREATE FUNCTION</codeph> refers to the HDFS path of the jar file
and the fully qualified class name inside the jar. Each of the functions inside the class becomes an
Impala function, each one overloaded under the specified Impala function name.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
[localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples-cdh570.jar' symbol='com.cloudera.impala.TestUdf';
[localhost:21000] > show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN, BOOLEAN) | JAVA | true |
| DOUBLE | testudf(DOUBLE) | JAVA | true |
| DOUBLE | testudf(DOUBLE, DOUBLE) | JAVA | true |
| DOUBLE | testudf(DOUBLE, DOUBLE, DOUBLE) | JAVA | true |
| FLOAT | testudf(FLOAT) | JAVA | true |
| FLOAT | testudf(FLOAT, FLOAT) | JAVA | true |
| FLOAT | testudf(FLOAT, FLOAT, FLOAT) | JAVA | true |
| INT | testudf(INT) | JAVA | true |
| DOUBLE | testudf(INT, DOUBLE) | JAVA | true |
| INT | testudf(INT, INT) | JAVA | true |
| INT | testudf(INT, INT, INT) | JAVA | true |
| SMALLINT | testudf(SMALLINT) | JAVA | true |
| SMALLINT | testudf(SMALLINT, SMALLINT) | JAVA | true |
| SMALLINT | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA | true |
| STRING | testudf(STRING) | JAVA | true |
| STRING | testudf(STRING, STRING) | JAVA | true |
| STRING | testudf(STRING, STRING, STRING) | JAVA | true |
| TINYINT | testudf(TINYINT) | JAVA | true |
+-------------+---------------------------------------+-------------+---------------+
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
These are all simple functions that return their single argument, or that combine
multiple arguments by summing them, concatenating them, and so on. Impala determines which
overloaded function to use based on the number and types of the arguments.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
insert into bigint_x values (1), (2), (4), (3);
select testudf(x) from bigint_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| 1 |
| 2 |
| 4 |
| 3 |
+-----------------+
insert into int_x values (1), (2), (4), (3);
select testudf(x, x+1, x*x) from int_x;
+-------------------------------+
| udfs.testudf(x, x + 1, x * x) |
+-------------------------------+
| 4 |
| 9 |
| 25 |
| 16 |
+-------------------------------+
select testudf(x) from string_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| one |
| two |
| four |
| three |
+-----------------+
select testudf(x,x) from string_x;
+--------------------+
| udfs.testudf(x, x) |
+--------------------+
| oneone |
| twotwo |
| fourfour |
| threethree |
+--------------------+
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The previous example used the same Impala function name as the name of the class.
This example shows how the Impala function name is independent of the underlying
Java class or function names. A second <codeph>CREATE FUNCTION</codeph> statement
results in a set of overloaded functions all named <codeph>my_func</codeph>,
to go along with the overloaded functions all named <codeph>testudf</codeph>.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
create function my_func location '/user/impala/udfs/udf-examples-cdh570.jar'
symbol='com.cloudera.impala.TestUdf';
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | my_func(BIGINT) | JAVA | true |
| BOOLEAN | my_func(BOOLEAN) | JAVA | true |
| BOOLEAN | my_func(BOOLEAN, BOOLEAN) | JAVA | true |
...
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
...
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The corresponding <codeph>DROP FUNCTION</codeph> statement with no signature
drops all the overloaded functions with that name.
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
drop function my_func;
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
...
</codeblock>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The signatureless <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs ensures that
the functions shown in this example remain available after the Impala service
(specifically, the Catalog Server) is restarted.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_udf.xml#udfs"/> for more background information, usage instructions, and examples for
Impala UDFs; <xref href="impala_drop_function.xml#drop_function"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,70 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="create_role">
<title>CREATE ROLE Statement (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>CREATE ROLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="DDL"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Sentry"/>
<data name="Category" value="Security"/>
<data name="Category" value="Roles"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. -->
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">CREATE ROLE statement</indexterm>
<!-- Copied from Sentry docs. Turn into conref. -->
The <codeph>CREATE ROLE</codeph> statement creates a role to which privileges can be granted. Privileges can
be granted to roles, which can then be assigned to users. A user that has been assigned a role will only be
able to exercise the privileges of that role. Only users that have administrative privileges can create/drop
roles. By default, the <codeph>hive</codeph>, <codeph>impala</codeph> and <codeph>hue</codeph> users have
administrative privileges in Sentry.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>CREATE ROLE <varname>role_name</varname>
</codeblock>
<p conref="../shared/impala_common.xml#common/privileges_blurb"/>
<p>
Only administrative users (those with <codeph>ALL</codeph> privileges on the server, defined in the Sentry
policy file) can use this statement.
</p>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<p>
Impala makes use of any roles and privileges specified by the <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements in Hive, and Hive makes use of any roles and privileges specified by the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala. The Impala <codeph>GRANT</codeph>
and <codeph>REVOKE</codeph> statements for privileges do not require the <codeph>ROLE</codeph> keyword to be
repeated before each role name, unlike the equivalent Hive statements.
</p>
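<p>
For example, a minimal sketch using hypothetical role, database, and group names:
</p>
<codeblock>create role report_writers;
grant select on database analytics to role report_writers;
grant role report_writers to group analysts;</codeblock>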
<!-- To do: nail down the new SHOW syntax, e.g. SHOW ROLES, SHOW CURRENT ROLES, SHOW GROUPS. -->
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>,
<xref href="impala_revoke.xml#revoke"/>, <xref href="impala_drop_role.xml#drop_role"/>,
<xref href="impala_show.xml#show"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,832 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="create_table" outputclass="impala sql_statement">
<title outputclass="impala_title sql_statement_title">CREATE TABLE Statement</title>
<titlealts audience="PDF"><navtitle>CREATE TABLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="S3"/>
<!-- <data name="Category" value="Kudu"/> -->
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">CREATE TABLE statement</indexterm>
Creates a new table and specifies its characteristics. While creating a table, you optionally specify aspects
such as:
</p>
<ul>
<li>
Whether the table is internal or external.
</li>
<li>
The columns and associated data types.
</li>
<li>
The columns used for physically partitioning the data.
</li>
<li>
The file format for data files.
</li>
<li>
The HDFS directory where the data files are located.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
The general syntax for creating a table and specifying its columns is as follows:
</p>
<p>
<b>Explicit column definitions:</b>
</p>
<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
(<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)
[PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
[COMMENT '<varname>table_comment</varname>']
[WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
[
[ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
]
[LOCATION '<varname>hdfs_path</varname>']
[TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
<ph rev="1.4.0"> [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
</codeblock>
<p>
<b>Column definitions inferred from data file:</b>
</p>
<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'
[COMMENT '<varname>table_comment</varname>']
[PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
[WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
[
[ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
]
[LOCATION '<varname>hdfs_path</varname>']
[TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
<ph rev="1.4.0"> [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
data_type:
<varname>primitive_type</varname>
| array_type
| map_type
| struct_type
</codeblock>
<p>
<b>CREATE TABLE AS SELECT:</b>
</p>
<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
<ph rev="2.5.0">[PARTITIONED BY (<varname>col_name</varname>[, ...])]</ph>
[COMMENT '<varname>table_comment</varname>']
[WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
[
[ROW FORMAT <varname>row_format</varname>] <ph rev="CDH-41501">[STORED AS <varname>ctas_file_format</varname>]</ph>
]
[LOCATION '<varname>hdfs_path</varname>']
[TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
<ph rev="1.4.0"> [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
AS
<varname>select_statement</varname></codeblock>
<codeblock>primitive_type:
TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
<ph rev="1.4.0">| DECIMAL</ph>
| STRING
<ph rev="2.0.0">| CHAR</ph>
<ph rev="2.0.0">| VARCHAR</ph>
| TIMESTAMP
<ph rev="2.3.0">complex_type:
struct_type
| array_type
| map_type
struct_type: STRUCT &lt; <varname>name</varname> : <varname>primitive_or_complex_type</varname> [COMMENT '<varname>comment_string</varname>'], ... &gt;
array_type: ARRAY &lt; <varname>primitive_or_complex_type</varname> &gt;
map_type: MAP &lt; <varname>primitive_type</varname>, <varname>primitive_or_complex_type</varname> &gt;
</ph>
row_format:
DELIMITED [FIELDS TERMINATED BY '<varname>char</varname>' [ESCAPED BY '<varname>char</varname>']]
[LINES TERMINATED BY '<varname>char</varname>']
file_format:
PARQUET
| TEXTFILE
| AVRO
| SEQUENCEFILE
| RCFILE
<ph rev="CDH-41501">ctas_file_format:
PARQUET
| TEXTFILE</ph>
</codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<!-- Should really have some info up front about all the data types and file formats.
Consider adding here, or at least making inline links to the relevant keywords
in the syntax spec above. -->
<p>
<b>Column definitions:</b>
</p>
<p>
Depending on the form of the <codeph>CREATE TABLE</codeph> statement, the column definitions are
required or not allowed.
</p>
<p>
With the <codeph>CREATE TABLE AS SELECT</codeph> and <codeph>CREATE TABLE LIKE</codeph>
syntax, you do not specify the columns at all; the column names and types are derived from the source table, query,
or data file.
</p>
<p>
With the basic <codeph>CREATE TABLE</codeph> syntax, you must list one or more columns, specifying
for each column its name, type, and optionally a comment, in addition to any columns used as partitioning keys.
There is one exception where the column list is not required: when creating an Avro table with the
<codeph>STORED AS AVRO</codeph> clause, you can omit the list of columns and specify the same metadata
as part of the <codeph>TBLPROPERTIES</codeph> clause.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p rev="2.3.0">
The Impala complex types (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
are available in <keyword keyref="impala23_full"/> and higher.
Because you can nest these types (for example, to make an array of maps or a struct
with an array field), these types are also sometimes referred to as nested types.
See <xref href="impala_complex_types.xml#complex_types"/> for usage details.
</p>
<!-- This is kind of an obscure and rare usage scenario. Consider moving all the complex type stuff further down
after some of the more common clauses. -->
<p rev="2.3.0">
Impala can create tables containing complex type columns, with any supported file format.
Because currently Impala can only query complex type columns in Parquet tables, creating
tables with complex type columns and other file formats such as text is of limited use.
For example, you might create a text table including some columns with complex types with Impala, and use Hive
as part of your ETL pipeline to ingest the nested type data and copy it to an identical Parquet table.
Or you might create a partitioned table containing complex type columns using one file format, and
use <codeph>ALTER TABLE</codeph> to change the file format of individual partitions to Parquet; Impala
can then query only the Parquet-format partitions in that table.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
<p>
<b>Internal and external tables (EXTERNAL and LOCATION clauses):</b>
</p>
<p>
By default, Impala creates an <q>internal</q> table, where Impala manages the underlying data files for the
table, and physically deletes the data files when you drop the table. If you specify the
<codeph>EXTERNAL</codeph> clause, Impala treats the table as an <q>external</q> table, where the data files
are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves
the data files in place when you drop the table. For details about internal and external tables, see
<xref href="impala_tables.xml#tables"/>.
</p>
<p>
Typically, for an external table you include a <codeph>LOCATION</codeph> clause to specify the path to the
HDFS directory where Impala reads and writes files for the table. For example, if your data pipeline produces
Parquet files in the HDFS directory <filepath>/user/etl/destination</filepath>, you might create an external
table as follows:
</p>
<codeblock>CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP)
STORED AS PARQUET LOCATION '/user/etl/destination';
</codeblock>
<p>
Although the <codeph>EXTERNAL</codeph> and <codeph>LOCATION</codeph> clauses are often specified together,
<codeph>LOCATION</codeph> is optional for external tables, and you can also specify <codeph>LOCATION</codeph>
for internal tables. The difference is all about whether Impala <q>takes control</q> of the underlying data
files and moves them when you rename the table, or deletes them when you drop the table. For more about
internal and external tables and how they interact with the <codeph>LOCATION</codeph> attribute, see
<xref href="impala_tables.xml#tables"/>.
</p>
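<p>
For example, the following statement creates an internal table, so Impala manages (and eventually deletes)
the underlying data files, but stores those files under a specific HDFS path rather than the default
warehouse location. (The path shown is illustrative.)
</p>
<codeblock>CREATE TABLE managed_but_relocated (c1 INT, c2 STRING)
  STORED AS PARQUET
  LOCATION '/user/etl/staging/managed_but_relocated';
</codeblock>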
<p>
<b>Partitioned tables (PARTITIONED BY clause):</b>
</p>
<p>
The <codeph>PARTITIONED BY</codeph> clause divides the data files based on the values from one or more
specified columns. Impala queries can use the partition metadata to minimize the amount of data that is read
from disk or transmitted across the network, particularly during join queries. For details about
partitioning, see <xref href="impala_partitioning.xml#partitioning"/>.
</p>
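<p>
For example, the following statement creates a table partitioned by year and month. Each distinct
combination of <codeph>YEAR</codeph> and <codeph>MONTH</codeph> values is stored in its own HDFS
subdirectory, so queries that filter on those columns read only the relevant partitions.
(The table and column names are illustrative.)
</p>
<codeblock>CREATE TABLE web_logs (msg STRING, severity STRING)
  PARTITIONED BY (year SMALLINT, month TINYINT)
  STORED AS PARQUET;
</codeblock>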
<p rev="2.5.0">
Prior to <keyword keyref="impala25_full"/> you could use a partitioned table
as the source and copy data from it, but could not specify any partitioning clauses for the new table.
In <keyword keyref="impala25_full"/> and higher, you can now use the <codeph>PARTITIONED BY</codeph> clause with a
<codeph>CREATE TABLE AS SELECT</codeph> statement. See the examples under the following discussion of
the <codeph>CREATE TABLE AS SELECT</codeph> syntax variation.
</p>
<!--
<p rev="kudu">
<b>Partitioning for Kudu tables (DISTRIBUTE BY clause)</b>
</p>
<p rev="kudu">
For Kudu tables, you specify logical partitioning across one or more columns using the
<codeph>DISTRIBUTE BY</codeph> clause. In contrast to partitioning for HDFS-based tables,
multiple values for a partition key column can be located in the same partition.
The optional <codeph>HASH</codeph> clause lets you divide one or a set of partition key columns
into a specified number of buckets; you can use more than one <codeph>HASH</codeph>
clause, specifying a distinct set of partition key columns for each.
The optional <codeph>RANGE</codeph> clause further subdivides the partitions, based on
a set of literal values for the partition key columns.
</p>
-->
<p>
<b>Specifying file format (STORED AS and ROW FORMAT clauses):</b>
</p>
<p rev="DOCS-1523">
The <codeph>STORED AS</codeph> clause identifies the format of the underlying data files. Currently, Impala
can query more types of file formats than it can create or insert into. Use Hive to perform any create or
data load operations that are not currently available in Impala. For example, Impala can create an Avro,
SequenceFile, or RCFile table but cannot insert data into it. There are also Impala-specific procedures for using
compression with each kind of file format. For details about working with data files of various formats, see
<xref href="impala_file_formats.xml#file_formats"/>.
</p>
<note>
In Impala 1.4.0 and higher, Impala can create Avro tables, which formerly required doing the <codeph>CREATE
TABLE</codeph> statement in Hive. See <xref href="impala_avro.xml#avro"/> for details and examples.
</note>
<p>
By default (when no <codeph>STORED AS</codeph> clause is specified), data files in Impala tables are created
as text files with Ctrl-A (hex 01) characters as the delimiter.
<!-- Verify if ROW FORMAT is entirely ignored outside of text tables, or does it apply somehow to SequenceFile and/or RCFile too? -->
Specify the <codeph>ROW FORMAT DELIMITED</codeph> clause to produce or ingest data files that use a different
delimiter character such as tab or <codeph>|</codeph>, or a different line end character such as carriage
return or newline. When specifying delimiter and line end characters with the <codeph>FIELDS TERMINATED
BY</codeph> and <codeph>LINES TERMINATED BY</codeph> clauses, use <codeph>'\t'</codeph> for tab,
<codeph>'\n'</codeph> for newline or linefeed, <codeph>'\r'</codeph> for carriage return, and
<codeph>\</codeph><codeph>0</codeph> for ASCII <codeph>nul</codeph> (hex 00). For more examples of text
tables, see <xref href="impala_txtfile.xml#txtfile"/>.
</p>
<p>
The <codeph>ESCAPED BY</codeph> clause applies both to text files that you create through an
<codeph>INSERT</codeph> statement to an Impala <codeph>TEXTFILE</codeph> table, and to existing data files
that you put into an Impala table directory. (You can ingest existing data files either by creating the table
with <codeph>CREATE EXTERNAL TABLE ... LOCATION</codeph>, the <codeph>LOAD DATA</codeph> statement, or
through an HDFS operation such as <codeph>hdfs dfs -put <varname>file</varname>
<varname>hdfs_path</varname></codeph>.) Choose an escape character that is not used anywhere else in the
file, and put it in front of each instance of the delimiter character that occurs within a field value.
Surrounding field values with quotation marks does not help Impala to parse fields with embedded delimiter
characters; the quotation marks are considered to be part of the column value. If you want to use
<codeph>\</codeph> as the escape character, specify the clause in <cmdname>impala-shell</cmdname> as
<codeph>ESCAPED BY '\\'</codeph>.
</p>
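<p>
For example, the following statement creates a text table whose fields are separated by the pipe
character, with a backslash as the escape character for literal pipe characters inside field values.
(The table and column names are illustrative.)
</p>
<codeblock>CREATE TABLE pipe_separated (id INT, name STRING, notes STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' ESCAPED BY '\\'
  STORED AS TEXTFILE;
</codeblock>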
<note conref="../shared/impala_common.xml#common/thorn"/>
<p>
<b>Cloning tables (LIKE clause):</b>
</p>
<p>
To create an empty table with the same columns, comments, and other attributes as another table, use the
following variation. The <codeph>CREATE TABLE ... LIKE</codeph> form allows a restricted set of clauses,
currently only the <codeph>LOCATION</codeph>, <codeph>COMMENT</codeph>, and <codeph>STORED AS</codeph>
clauses.
</p>
<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
<ph rev="1.4.0">LIKE { [<varname>db_name</varname>.]<varname>table_name</varname> | PARQUET '<varname>hdfs_path_of_parquet_file</varname>' }</ph>
[COMMENT '<varname>table_comment</varname>']
[STORED AS <varname>file_format</varname>]
[LOCATION '<varname>hdfs_path</varname>']</codeblock>
<note rev="1.2.0">
<p rev="1.2.0">
To clone the structure of a table and transfer data into it in a single operation, use the <codeph>CREATE
TABLE AS SELECT</codeph> syntax described in the next subsection.
</p>
</note>
<p>
When you clone the structure of an existing table using the <codeph>CREATE TABLE ... LIKE</codeph> syntax,
the new table keeps the same file format as the original one, so you only need to specify the <codeph>STORED
AS</codeph> clause if you want to use a different file format, or when specifying a view as the original
table. (Creating a table <q>like</q> a view produces a text table by default.)
</p>
<p>
Although normally Impala cannot create an HBase table directly, Impala can clone the structure of an existing
HBase table with the <codeph>CREATE TABLE ... LIKE</codeph> syntax, preserving the file format and metadata
from the original table.
</p>
<p>
There are some exceptions to the ability to use <codeph>CREATE TABLE ... LIKE</codeph> with an Avro table.
For example, you cannot use this technique for an Avro table that is specified with an Avro schema but no
columns. When in doubt, check if a <codeph>CREATE TABLE ... LIKE</codeph> operation works in Hive; if not, it
typically will not work in Impala either.
</p>
<p>
If the original table is partitioned, the new table inherits the same partition key columns. Because the new
table is initially empty, it does not inherit the actual partitions that exist in the original one. To create
partitions in the new table, insert data or issue <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements.
</p>
<p conref="../shared/impala_common.xml#common/create_table_like_view"/>
<p>
Because <codeph>CREATE TABLE ... LIKE</codeph> only manipulates table metadata, not the physical data of the
table, issue <codeph>INSERT INTO TABLE</codeph> statements afterward to copy any data from the original table
into the new one, optionally converting the data to a new file format. (For some file formats, Impala can do
a <codeph>CREATE TABLE ... LIKE</codeph> to create the table, but Impala cannot insert data in that file
format; in these cases, you must load the data in Hive. See
<xref href="impala_file_formats.xml#file_formats"/> for details.)
</p>
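<p>
For example, the following sequence clones the structure of an existing text table into a new Parquet
table, then copies the data and converts it to Parquet in the same step. (The table names are illustrative.)
</p>
<codeblock>CREATE TABLE sales_parquet LIKE sales_text STORED AS PARQUET;
INSERT INTO TABLE sales_parquet SELECT * FROM sales_text;
</codeblock>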
<p rev="1.2" id="ctas">
<b>CREATE TABLE AS SELECT:</b>
</p>
<p>
The <codeph>CREATE TABLE AS SELECT</codeph> syntax is a shorthand notation to create a table based on column
definitions from another table, and copy data from the source table to the destination table without issuing
any separate <codeph>INSERT</codeph> statement. This idiom is so popular that it has its own acronym,
<q>CTAS</q>.
</p>
<p>
The following examples show how to copy data from a source table <codeph>T1</codeph>
to a variety of destination tables, applying various transformations to the table
properties, table layout, or the data itself as part of the operation:
</p>
<codeblock>
-- Sample table to be the source of CTAS operations.
CREATE TABLE t1 (x INT, y STRING);
INSERT INTO t1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
-- Clone all the columns and data from one table to another.
CREATE TABLE clone_of_t1 AS SELECT * FROM t1;
+-------------------+
| summary |
+-------------------+
| Inserted 3 row(s) |
+-------------------+
-- Clone the columns and data, and convert the data to a different file format.
CREATE TABLE parquet_version_of_t1 STORED AS PARQUET AS SELECT * FROM t1;
+-------------------+
| summary |
+-------------------+
| Inserted 3 row(s) |
+-------------------+
-- Copy only some rows to the new table.
CREATE TABLE subset_of_t1 AS SELECT * FROM t1 WHERE x >= 2;
+-------------------+
| summary |
+-------------------+
| Inserted 2 row(s) |
+-------------------+
-- Same idea as CREATE TABLE LIKE: clone table layout but do not copy any data.
CREATE TABLE empty_clone_of_t1 AS SELECT * FROM t1 WHERE 1=0;
+-------------------+
| summary |
+-------------------+
| Inserted 0 row(s) |
+-------------------+
-- Reorder and rename columns and transform the data.
CREATE TABLE t5 AS SELECT upper(y) AS s, x+1 AS a, 'Entirely new column' AS n FROM t1;
+-------------------+
| summary |
+-------------------+
| Inserted 3 row(s) |
+-------------------+
SELECT * FROM t5;
+-------+---+---------------------+
| s | a | n |
+-------+---+---------------------+
| ONE | 2 | Entirely new column |
| TWO | 3 | Entirely new column |
| THREE | 4 | Entirely new column |
+-------+---+---------------------+
</codeblock>
<!-- These are a little heavyweight to get into here. Therefore commenting out.
Some overlap with the new column-changing examples in the code listing above.
Create tables with different column order, names, or types than the original.
CREATE TABLE some_columns_from_t1 AS SELECT c1, c3, c5 FROM t1;
CREATE TABLE reordered_columns_from_t1 AS SELECT c4, c3, c1, c2 FROM t1;
CREATE TABLE synthesized_columns AS SELECT upper(c1) AS all_caps, c2+c3 AS total, "California" AS state FROM t1;</codeblock>
-->
<!-- CREATE TABLE AS <select> now incorporated up higher in the original syntax diagram. -->
<p rev="1.2">
See <xref href="impala_select.xml#select"/> for details about query syntax for the <codeph>SELECT</codeph>
portion of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
</p>
<p rev="1.2">
The newly created table inherits the column names that you select from the original table, which you can
override by specifying column aliases in the query. Any column or table comments from the original table are
not carried over to the new table.
</p>
<note rev="DOCS-1523">
When using the <codeph>STORED AS</codeph> clause with a <codeph>CREATE TABLE AS SELECT</codeph>
statement, the destination table must be a file format that Impala can write to: currently,
text or Parquet. You cannot specify an Avro, SequenceFile, or RCFile table as the destination
table for a CTAS operation.
</note>
<p rev="2.5.0">
Prior to <keyword keyref="impala25_full"/> you could use a partitioned table
as the source and copy data from it, but could not specify any partitioning clauses for the new table.
In <keyword keyref="impala25_full"/> and higher, you can now use the <codeph>PARTITIONED BY</codeph> clause with a
<codeph>CREATE TABLE AS SELECT</codeph> statement. The following example demonstrates how you can copy
data from an unpartitioned table in a <codeph>CREATE TABLE AS SELECT</codeph> operation, creating a new
partitioned table in the process. The main syntax consideration is the column order in the <codeph>PARTITIONED BY</codeph>
clause and the select list: the partition key columns must be listed last in the select list, in the same
order as in the <codeph>PARTITIONED BY</codeph> clause. Therefore, in this case, the column order in the
destination table is different from the source table. You also only specify the column names in the
<codeph>PARTITIONED BY</codeph> clause, not the data types or column comments.
</p>
<codeblock rev="2.5.0">
create table partitions_no (year smallint, month tinyint, s string);
insert into partitions_no values (2016, 1, 'January 2016'),
(2016, 2, 'February 2016'), (2016, 3, 'March 2016');
-- Prove that the source table is not partitioned.
show partitions partitions_no;
ERROR: AnalysisException: Table is not partitioned: ctas_partition_by.partitions_no
-- Create new table with partitions based on column values from source table.
<b>create table partitions_yes partitioned by (year, month)
as select s, year, month from partitions_no;</b>
+-------------------+
| summary |
+-------------------+
| Inserted 3 row(s) |
+-------------------+
-- Prove that the destination table is partitioned.
show partitions partitions_yes;
+-------+-------+-------+--------+------+...
| year | month | #Rows | #Files | Size |...
+-------+-------+-------+--------+------+...
| 2016 | 1 | -1 | 1 | 13B |...
| 2016 | 2 | -1 | 1 | 14B |...
| 2016 | 3 | -1 | 1 | 11B |...
| Total | | -1 | 3 | 38B |...
+-------+-------+-------+--------+------+...
</codeblock>
<p rev="2.5.0">
The most convenient layout for partitioned tables is with all the
partition key columns at the end. The CTAS <codeph>PARTITIONED BY</codeph> syntax
requires that column order in the select list, resulting in that same
column order in the destination table.
</p>
<codeblock rev="2.5.0">
describe partitions_no;
+-------+----------+---------+
| name | type | comment |
+-------+----------+---------+
| year | smallint | |
| month | tinyint | |
| s | string | |
+-------+----------+---------+
-- The CTAS operation forced us to put the partition key columns last.
-- Having those columns last works better with idioms such as SELECT *
-- for partitioned tables.
describe partitions_yes;
+-------+----------+---------+
| name | type | comment |
+-------+----------+---------+
| s | string | |
| year | smallint | |
| month | tinyint | |
+-------+----------+---------+
</codeblock>
<p rev="2.5.0">
Attempting to use a select list with the partition key columns
not at the end results in an error due to a column name mismatch:
</p>
<codeblock rev="2.5.0">
-- We expect this CTAS to fail because non-key column S
-- comes after key columns YEAR and MONTH in the select list.
create table partitions_maybe partitioned by (year, month)
as select year, month, s from partitions_no;
ERROR: AnalysisException: Partition column name mismatch: year != month
</codeblock>
<p rev="1.2">
The examples earlier in this section show how you can clone all the data in a table, or a subset of the
columns and/or rows, or reorder columns, rename them, or construct them out of expressions.
</p>
<p rev="1.2">
As part of a CTAS operation, you can convert the data to any file format that Impala can write (currently,
<codeph>TEXTFILE</codeph> and <codeph>PARQUET</codeph>). You cannot specify the lower-level properties of a
text table, such as the delimiter.
</p>
<p rev="obwl" conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
<p rev="1.4.0">
<b>CREATE TABLE LIKE PARQUET:</b>
</p>
<p rev="1.4.0">
The variation <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph>
lets you skip the column definitions of the <codeph>CREATE TABLE</codeph> statement. The column names and
data types are automatically configured based on the organization of the specified Parquet data file, which
must already reside in HDFS. You can use a data file located outside the Impala database directories, or a
file from an existing Impala Parquet table; either way, Impala only uses the column definitions from the file
and does not use the HDFS location for the <codeph>LOCATION</codeph> attribute of the new table. (You can,
however, also specify the enclosing directory with the <codeph>LOCATION</codeph> attribute, to both use the
same schema as the data file and point the Impala table at the associated directory for querying.)
</p>
<p rev="1.4.0">
The following considerations apply when you use the <codeph>CREATE TABLE LIKE PARQUET</codeph> technique:
</p>
<ul rev="1.4.0">
<li>
Any column comments from the original table are not preserved in the new table. Each column in the new
table has a comment stating the low-level Parquet field type used to deduce the appropriate SQL column
type.
</li>
<li>
If you use a data file from a partitioned Impala table, any partition key columns from the original table
are left out of the new table, because they are represented in HDFS directory names rather than stored in
the data file. To preserve the partition information, repeat the same <codeph>PARTITION</codeph> clause as
in the original <codeph>CREATE TABLE</codeph> statement.
</li>
<li>
The file format of the new table defaults to text, as with other kinds of <codeph>CREATE TABLE</codeph>
statements. To make the new table also use Parquet format, include the clause <codeph>STORED AS
PARQUET</codeph> in the <codeph>CREATE TABLE LIKE PARQUET</codeph> statement.
</li>
<li>
If the Parquet data file comes from an existing Impala table, currently, any <codeph>TINYINT</codeph> or
<codeph>SMALLINT</codeph> columns are turned into <codeph>INT</codeph> columns in the new table.
Internally, Parquet stores such values as 32-bit integers.
</li>
<li>
When the destination table uses the Parquet file format, the <codeph>CREATE TABLE AS SELECT</codeph> and
<codeph>INSERT ... SELECT</codeph> statements always create at least one data file, even if the
<codeph>SELECT</codeph> part of the statement does not match any rows. You can use such an empty Parquet
data file as a template for subsequent <codeph>CREATE TABLE LIKE PARQUET</codeph> statements.
</li>
</ul>
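<p>
For example, assuming a Parquet data file already exists at the HDFS path shown, the following
statement creates a Parquet table whose column names and types are derived from that file.
(The table name and path are illustrative.)
</p>
<codeblock>CREATE TABLE derived_from_parquet
  LIKE PARQUET '/user/etl/destination/datafile1.parq'
  STORED AS PARQUET;
</codeblock>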
<p>
For more details about creating Parquet tables, and examples of the <codeph>CREATE TABLE LIKE
PARQUET</codeph> syntax, see <xref href="impala_parquet.xml#parquet"/>.
</p>
<p>
<b>Visibility and Metadata (TBLPROPERTIES and WITH SERDEPROPERTIES clauses):</b>
</p>
<p rev="1.2">
You can associate arbitrary items of metadata with a table by specifying the <codeph>TBLPROPERTIES</codeph>
clause. This clause takes a comma-separated list of key-value pairs and stores those items in the metastore
database. You can also change the table properties later with an <codeph>ALTER TABLE</codeph> statement. You
can observe the table properties for different delimiter and escape characters using the <codeph>DESCRIBE
FORMATTED</codeph> command, and change those settings for an existing table with <codeph>ALTER TABLE ... SET
TBLPROPERTIES</codeph>.
</p>
<p rev="1.2">
You can also associate SerDes properties with the table by specifying key-value pairs through the
<codeph>WITH SERDEPROPERTIES</codeph> clause. This metadata is not used by Impala, which has its own built-in
serializer and deserializer for the file formats it supports. Particular property values might be needed for
Hive compatibility with certain variations of file formats, particularly Avro.
</p>
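<p>
For example, the following statements attach arbitrary key-value metadata to a table at creation time,
change one of the values later, and display the stored properties. (The property names and values are
illustrative.)
</p>
<codeblock>CREATE TABLE annotated (x INT)
  TBLPROPERTIES ('source'='clickstream', 'owner'='etl_team');
ALTER TABLE annotated SET TBLPROPERTIES ('owner'='analytics_team');
DESCRIBE FORMATTED annotated;
</codeblock>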
<p>
Some DDL operations that interact with other Hadoop components require specifying particular values in the
<codeph>SERDEPROPERTIES</codeph> or <codeph>TBLPROPERTIES</codeph> fields, such as creating an Avro table or
an HBase table. (You typically create HBase tables in Hive, because they require additional clauses not
currently available in Impala.)
<!-- Haven't got a working example from Lenni, so suppressing this recommendation for now.
The Avro schema properties can be specified through either
<codeph>TBLPROPERTIES</codeph> or <codeph>SERDEPROPERTIES</codeph>;
for best compatibility with future versions of Hive,
use <codeph>SERDEPROPERTIES</codeph> in this case.
-->
</p>
<p>
To see the column definitions and column comments for an existing table, for example before issuing a
<codeph>CREATE TABLE ... LIKE</codeph> or a <codeph>CREATE TABLE ... AS SELECT</codeph> statement, issue the
statement <codeph>DESCRIBE <varname>table_name</varname></codeph>. To see even more detail, such as the
location of data files and the values for clauses such as <codeph>ROW FORMAT</codeph> and <codeph>STORED
AS</codeph>, issue the statement <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>.
<codeph>DESCRIBE FORMATTED</codeph> is also needed to see any overall table comment (as opposed to individual
column comments).
</p>
<p>
After creating a table, your <cmdname>impala-shell</cmdname> session or another
<cmdname>impala-shell</cmdname> connected to the same node can immediately query that table. There might be a
brief interval (one statestore heartbeat) before the table can be queried through a different Impala node. To
make the <codeph>CREATE TABLE</codeph> statement return only when the table is recognized by all Impala nodes
in the cluster, enable the <codeph>SYNC_DDL</codeph> query option.
</p>
<p rev="1.4.0">
<b>HDFS caching (CACHED IN clause):</b>
</p>
<p rev="1.4.0">
If you specify the <codeph>CACHED IN</codeph> clause, any existing or future data files in the table
directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching
mechanism. See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details about using the HDFS
caching feature.
</p>
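<p>
For example, assuming an HDFS cache pool named <codeph>testPool</codeph> was already set up
(typically with the <codeph>hdfs cacheadmin</codeph> command), the following statement creates a table
whose data files are designated to be cached in that pool with 3 cached replicas.
(The pool and table names are illustrative.)
</p>
<codeblock>CREATE TABLE cached_t (x INT, s STRING)
  CACHED IN 'testPool' WITH REPLICATION = 3;
</codeblock>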
<p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>
<!-- Say something in here about the SHOW statement, e.g. SHOW TABLES, SHOW TABLE/COLUMN STATS, SHOW PARTITIONS. -->
<p>
<b>Column order</b>:
</p>
<p>
If you intend to use the table to hold data files produced by some external source, specify the columns in
the same order as they appear in the data files.
</p>
<p>
If you intend to insert or copy data into the table through Impala, or if you have control over the way
externally produced data files are arranged, use your judgment to specify columns in the most convenient
order:
</p>
<ul>
<li>
<p>
If certain columns are often <codeph>NULL</codeph>, specify those columns last. You might produce data
files that omit these trailing columns entirely. Impala automatically fills in the <codeph>NULL</codeph>
values if so.
</p>
</li>
<li>
<p>
If an unpartitioned table will be used as the source for an <codeph>INSERT ... SELECT</codeph> operation
into a partitioned table, specify last in the unpartitioned table any columns that correspond to
partition key columns in the partitioned table, and in the same order as the partition key columns are
declared in the partitioned table. This technique lets you use <codeph>INSERT ... SELECT *</codeph> when
copying data to the partitioned table, rather than specifying each column name individually, as shown
in the example following this list.
</p>
</li>
<li>
<p>
If you specify columns in an order that you later discover is suboptimal, you can sometimes work around
the problem without recreating the table. You can create a view that selects columns from the original
table in a permuted order, then do a <codeph>SELECT *</codeph> from the view. When inserting data into a
table, you can specify a permuted order for the inserted columns to match the order in the destination
table.
</p>
</li>
</ul>
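<p>
The following sketch illustrates the second point in the preceding list: because the staging table
declares the columns corresponding to partition keys last, an <codeph>INSERT ... SELECT *</codeph>
statement can copy the data into the partitioned table without naming each column.
(The table and column names are illustrative.)
</p>
<codeblock>CREATE TABLE staging_events (s STRING, year SMALLINT, month TINYINT);
CREATE TABLE partitioned_events (s STRING)
  PARTITIONED BY (year SMALLINT, month TINYINT);
INSERT INTO TABLE partitioned_events PARTITION (year, month)
  SELECT * FROM staging_events;
</codeblock>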
<p conref="../shared/impala_common.xml#common/hive_blurb"/>
<p>
Impala queries can make use of metadata about the table and columns, such as the number of rows in a table or
the number of different values in a column. Prior to Impala 1.2.2, to create this metadata, you issued the
<codeph>ANALYZE TABLE</codeph> statement in Hive to gather this information, after creating the table and
loading representative data into it. In Impala 1.2.2 and higher, the <codeph>COMPUTE STATS</codeph> statement
produces these statistics within Impala, without needing to use Hive at all.
</p>
<p conref="../shared/impala_common.xml#common/hbase_blurb"/>
<note>
<p>
The Impala <codeph>CREATE TABLE</codeph> statement cannot create an HBase table, because it currently does
not support the <codeph>STORED BY</codeph> clause needed for HBase tables. Create such tables in Hive, then
query them through Impala. For information on using Impala with HBase tables, see
<xref href="impala_hbase.xml#impala_hbase"/>.
</p>
</note>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.2.0">
To create a table where the data resides in the Amazon Simple Storage Service (S3),
specify a <codeph>LOCATION</codeph> attribute with an <codeph>s3a://</codeph> prefix, pointing to the data files in S3.
</p>
<p rev="2.6.0 CDH-39913 IMPALA-1878">
In <keyword keyref="impala26_full"/> and higher, you can
use this special <codeph>LOCATION</codeph> syntax
as part of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
</p>
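<p>
For example, the following statement creates an external table whose data files reside in S3.
(The bucket name and path are illustrative.)
</p>
<codeblock>CREATE EXTERNAL TABLE s3_events (id BIGINT, payload STRING)
  STORED AS PARQUET
  LOCATION 's3a://example-bucket/events/';
</codeblock>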
<p conref="../shared/impala_common.xml#common/s3_ddl"/>
<p conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
<p conref="../shared/impala_common.xml#common/hdfs_blurb"/>
<p>
The <codeph>CREATE TABLE</codeph> statement for an internal table creates a directory in HDFS. The
<codeph>CREATE EXTERNAL TABLE</codeph> statement associates the table with an existing HDFS directory, and
does not create any new directory in HDFS. To locate the HDFS data directory for a table, issue a
<codeph>DESCRIBE FORMATTED <varname>table</varname></codeph> statement. To examine the contents of that HDFS
directory, use an OS command such as <codeph>hdfs dfs -ls hdfs://<varname>path</varname></codeph>, either
from the OS command line or through the <codeph>shell</codeph> or <codeph>!</codeph> commands in
<cmdname>impala-shell</cmdname>.
</p>
<p>
The <codeph>CREATE TABLE AS SELECT</codeph> syntax creates data files under the table data directory to hold
any data copied by the <codeph>INSERT</codeph> portion of the statement. (Even if no data is copied, Impala
might create one or more empty data files.)
</p>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have both execute and write
permission for the database directory where the table is being created.
</p>
<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_maybe"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_tables.xml#tables"/>,
<xref href="impala_alter_table.xml#alter_table"/>, <xref href="impala_drop_table.xml#drop_table"/>,
<xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
<xref href="impala_tables.xml#external_tables"/>, <xref href="impala_compute_stats.xml#compute_stats"/>,
<xref href="impala_sync_ddl.xml#sync_ddl"/>, <xref href="impala_show.xml#show_tables"/>,
<xref href="impala_show.xml#show_create_table"/>, <xref href="impala_describe.xml#describe"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,139 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="create_view">
<title>CREATE VIEW Statement</title>
<titlealts audience="PDF"><navtitle>CREATE VIEW</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Views"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">CREATE VIEW statement</indexterm>
The <codeph>CREATE VIEW</codeph> statement lets you create a shorthand abbreviation for a more complicated
query. The base query can involve joins, expressions, reordered columns, column aliases, and other SQL
features that can make a query hard to understand or maintain.
</p>
<p>
Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
<codeph>CREATE VIEW</codeph> only involves changes to metadata in the metastore database, not any data files
in HDFS.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>CREATE VIEW [IF NOT EXISTS] <varname>view_name</varname> [(<varname>column_list</varname>)]
AS <varname>select_statement</varname></codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
The <codeph>CREATE VIEW</codeph> statement can be useful in scenarios such as the following:
</p>
<ul>
<li>
To turn even the most lengthy and complicated SQL query into a one-liner. You can issue simple queries
against the view from applications, scripts, or interactive queries in <cmdname>impala-shell</cmdname>.
For example:
<codeblock>select * from <varname>view_name</varname>;
select * from <varname>view_name</varname> order by c1 desc limit 10;</codeblock>
The more complicated and hard-to-read the original query, the more benefit there is to simplifying the
query using a view.
</li>
<li>
To hide the underlying table and column names, to minimize maintenance problems if those names change. In
that case, you re-create the view using the new names, and all queries that use the view rather than the
underlying tables keep running with no changes.
</li>
<li>
To experiment with optimization techniques and make the optimized queries available to all applications.
For example, if you find a combination of <codeph>WHERE</codeph> conditions, join order, join hints, and so
on that works the best for a class of queries, you can establish a view that incorporates the
best-performing techniques. Applications can then make relatively simple queries against the view, without
repeating the complicated and optimized logic over and over. If you later find a better way to optimize the
original query, when you re-create the view, all the applications immediately take advantage of the
optimized base query.
</li>
<li>
To simplify a whole class of related queries, especially complicated queries involving joins between
multiple tables, complicated expressions in the column list, and other SQL syntax that makes the query
difficult to understand and debug. For example, you might create a view that joins several tables, filters
using several <codeph>WHERE</codeph> conditions, and selects several columns from the result set.
Applications might issue queries against this view that only vary in their <codeph>LIMIT</codeph>,
<codeph>ORDER BY</codeph>, and similar simple clauses.
</li>
</ul>
<p>
For queries that require repeating complicated clauses over and over again, for example in the select list,
<codeph>ORDER BY</codeph>, and <codeph>GROUP BY</codeph> clauses, you can use the <codeph>WITH</codeph>
clause as an alternative to creating a view.
</p>
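<p>
For example, the following hypothetical query defines a reusable query block inline with a
<codeph>WITH</codeph> clause, rather than creating a permanent view. (The table and column names are
illustrative.)
</p>
<codeblock>WITH recent_orders AS
  (SELECT customer_id, total FROM orders WHERE order_date >= '2016-01-01')
SELECT customer_id, SUM(total) FROM recent_orders GROUP BY customer_id;
</codeblock>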
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_views"/>
<p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/>
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<!-- TK: Elaborate on these, show queries and real output. -->
<codeblock>-- Create a view that is exactly the same as the underlying table.
create view v1 as select * from t1;
-- Create a view that includes only certain columns from the underlying table.
create view v2 as select c1, c3, c7 from t1;
-- Create a view that filters the values from the underlying table.
create view v3 as select distinct c1, c3, c7 from t1 where c1 is not null and c5 &gt; 0;
-- Create a view that reorders and renames columns from the underlying table.
create view v4 as select c4 as last_name, c6 as address, c2 as birth_date from t1;
-- Create a view that runs functions to convert or transform certain columns.
create view v5 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6, "Constant" c8 from t1;
-- Create a view that hides the complexity of a join query.
create view v6 as select t1.c1, t2.c2 from t1 join t2 on t1.id = t2.id;
</codeblock>
<!-- These examples show CREATE VIEW and corresponding DROP VIEW statements, with different combinations
of qualified and unqualified names. -->
<p conref="../shared/impala_common.xml#common/create_drop_view_examples"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_views.xml#views"/>, <xref href="impala_alter_view.xml#alter_view"/>,
<xref href="impala_drop_view.xml#drop_view"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,22 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="data_sources">
<title>Data Sources</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<xref href="impala_create_data_source.xml#create_data_source"/>,
<xref href="impala_drop_data_source.xml#drop_data_source"/>,
<xref href="impala_create_table.xml#create_table"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,65 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="databases">
<title>Overview of Impala Databases</title>
<titlealts audience="PDF"><navtitle>Databases</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Databases"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
In Impala, a database is a logical container for a group of tables. Each database defines a separate
namespace. Within a database, you can refer to the tables inside it using their unqualified names. Different
databases can contain tables with identical names.
</p>
<p>
Creating a database is a lightweight operation. There are minimal database-specific properties to configure,
only <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph>. There is no <codeph>ALTER DATABASE</codeph> statement.
</p>
<p>
Typically, you create a separate database for each project or application, to avoid naming conflicts between
tables and to make clear which tables are related to each other. The <codeph>USE</codeph> statement lets
you switch between databases. Unqualified references to tables, views, and functions refer to objects
within the current database. You can also refer to objects in other databases by using qualified names
of the form <codeph><varname>dbname</varname>.<varname>object_name</varname></codeph>.
</p>
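<p>
For example, the following statements show how the current database determines which table an
unqualified name refers to, and how a qualified name reaches into another database. (The database and
table names are illustrative, and the <codeph>CUSTOMERS</codeph> tables are assumed to exist in both
databases.)
</p>
<codeblock>USE sales;
SELECT COUNT(*) FROM customers;           -- Refers to sales.customers.
SELECT COUNT(*) FROM marketing.customers; -- Qualified name refers to a table in another database.
</codeblock>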
<p>
Each database is physically represented by a directory in HDFS. When you do not specify a <codeph>LOCATION</codeph>
attribute, the directory is located in the Impala data directory with the associated tables managed by Impala.
When you do specify a <codeph>LOCATION</codeph> attribute, any read and write operations for tables in that
database are relative to the specified HDFS directory.
</p>
<p>
There is a special database, named <codeph>default</codeph>, where you begin when you connect to Impala.
Tables created in <codeph>default</codeph> are physically located one level higher in HDFS than all the
user-created databases.
</p>
<p conref="../shared/impala_common.xml#common/builtins_db"/>
<p>
<b>Related statements:</b>
</p>
<p>
<xref href="impala_create_database.xml#create_database"/>,
<xref href="impala_drop_database.xml#drop_database"/>, <xref href="impala_use.xml#use"/>,
<xref href="impala_show.xml#show_databases"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,43 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="datatypes">
<title>Data Types</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">data types</indexterm>
Impala supports a set of data types that you can use for table columns, expression values, and function
arguments and return values.
</p>
<note>
Prior to <keyword keyref="impala23_full"/>, Impala supported only scalar types, not composite or nested types.
In <keyword keyref="impala23_full"/> and higher, Impala also supports the complex types
<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>. Accessing a table containing any
columns with unsupported types causes an error.
</note>
<p outputclass="toc"/>
<p>
For the notation to write literals of each of these data types, see
<xref href="impala_literals.xml#literals"/>.
</p>
<p>
See <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/> for differences between Impala and
Hive data types.
</p>
</conbody>
</concept>

docs/topics/impala_date.xml Normal file

@@ -0,0 +1,104 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" id="date" rev="2.0.0">
<title>DATE Data Type (<keyword keyref="impala21"/> or higher only)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Dates and Times"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DATE data type</indexterm>
A type representing the date (year, month, and day) as a single numeric value. Used to represent a broader
date range than possible with the <codeph>TIMESTAMP</codeph> type, with fewer distinct values than
<codeph>TIMESTAMP</codeph>, and in a more compact and efficient form than using a <codeph>STRING</codeph>
such as <codeph>'2014-12-31'</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock><varname>column_name</varname> DATE</codeblock>
<p>
<b>Range:</b> January 1, -4712 BC .. December 31, 9999 AD.
</p>
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
<p conref="../shared/impala_common.xml#common/parquet_blurb"/>
<ul>
<li>
This type can be read from and written to Parquet files.
</li>
<li>
There is no requirement for a particular level of Parquet.
</li>
<li>
Parquet files generated by Impala and containing this type can be freely interchanged with other components
such as Hive and MapReduce.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/hive_blurb"/>
<p>
TK.
</p>
<p conref="../shared/impala_common.xml#common/conversion_blurb"/>
<p>
TK.
</p>
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
<p>
This type can be used for partition key columns. Because it has less granularity (and thus fewer distinct
values) than an equivalent <codeph>TIMESTAMP</codeph> column, and numeric columns are more efficient as
partition keys than strings, prefer to partition by a <codeph>DATE</codeph> column rather than a
<codeph>TIMESTAMP</codeph> column or a <codeph>STRING</codeph> representation of a date.
</p>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<p>
This type is available on CDH 5.2 or higher.
</p>
<p conref="../shared/impala_common.xml#common/internals_2_bytes"/>
<p conref="../shared/impala_common.xml#common/added_in_20"/>
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p>
Things happen when converting <codeph>TIMESTAMP</codeph> to <codeph>DATE</codeph> or <codeph>DATE</codeph> to
<codeph>TIMESTAMP</codeph>. TK.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
The <xref href="impala_timestamp.xml#timestamp">TIMESTAMP</xref> data type is closely related. Some functions
from <xref href="impala_datetime_functions.xml#datetime_functions"/> accept and return <codeph>DATE</codeph>
values.
</p>
</conbody>
</concept>

File diff suppressed because it is too large

docs/topics/impala_ddl.xml Normal file

@@ -0,0 +1,150 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="ddl">
<title>DDL Statements</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Databases"/>
</metadata>
</prolog>
<conbody>
<p>
DDL refers to <q>Data Definition Language</q>, a subset of SQL statements that change the structure of the
database schema in some way, typically by creating, deleting, or modifying schema objects such as databases,
tables, and views. Most Impala DDL statements start with the keywords <codeph>CREATE</codeph>,
<codeph>DROP</codeph>, or <codeph>ALTER</codeph>.
</p>
<p>
The Impala DDL statements are:
</p>
<ul>
<li>
<xref href="impala_alter_table.xml#alter_table"/>
</li>
<li>
<xref href="impala_alter_view.xml#alter_view"/>
</li>
<li>
<xref href="impala_compute_stats.xml#compute_stats"/>
</li>
<li>
<xref href="impala_create_database.xml#create_database"/>
</li>
<li>
<xref href="impala_create_function.xml#create_function"/>
</li>
<li rev="2.0.0">
<xref href="impala_create_role.xml#create_role"/>
</li>
<li>
<xref href="impala_create_table.xml#create_table"/>
</li>
<li>
<xref href="impala_create_view.xml#create_view"/>
</li>
<li>
<xref href="impala_drop_database.xml#drop_database"/>
</li>
<li>
<xref href="impala_drop_function.xml#drop_function"/>
</li>
<li rev="2.0.0">
<xref href="impala_drop_role.xml#drop_role"/>
</li>
<li>
<xref href="impala_drop_table.xml#drop_table"/>
</li>
<li>
<xref href="impala_drop_view.xml#drop_view"/>
</li>
<li rev="2.0.0">
<xref href="impala_grant.xml#grant"/>
</li>
<li rev="2.0.0">
<xref href="impala_revoke.xml#revoke"/>
</li>
</ul>
<p>
After Impala executes a DDL command, information about available tables, columns, views, partitions, and so
on is automatically synchronized between all the Impala nodes in a cluster. (Prior to Impala 1.2, you had to
issue a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement manually on the other
nodes to make them aware of the changes.)
</p>
<p>
If the timing of metadata updates is significant, for example if you use round-robin scheduling where each
query could be issued through a different Impala node, you can enable the
<xref href="impala_sync_ddl.xml#sync_ddl">SYNC_DDL</xref> query option to make the DDL statement wait until
all nodes have been notified about the metadata changes.
</p>
<p rev="2.2.0">
See <xref href="impala_s3.xml#s3"/> for details about how Impala DDL statements interact with
tables and partitions stored in the Amazon S3 filesystem.
</p>
<p>
Although the <codeph>INSERT</codeph> statement is officially classified as a DML (data manipulation language)
statement, it also involves metadata changes that must be broadcast to all Impala nodes, and so is also
affected by the <codeph>SYNC_DDL</codeph> query option.
</p>
<p>
Because the <codeph>SYNC_DDL</codeph> query option makes each DDL operation take longer than normal, you
might only enable it before the last DDL operation in a sequence. For example, if you are running a script
that issues multiple DDL operations to set up an entire new schema, add several new partitions, and so on,
you might minimize the performance overhead by enabling the query option only before the last
<codeph>CREATE</codeph>, <codeph>DROP</codeph>, <codeph>ALTER</codeph>, or <codeph>INSERT</codeph> statement.
The script only finishes when all the relevant metadata changes are recognized by all the Impala nodes, so
you could connect to any node and issue queries through it.
</p>
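<p>
For example, a setup script might leave <codeph>SYNC_DDL</codeph> disabled for most of its statements
and enable it only for the final one, so that the script returns only after every node recognizes the
complete set of changes. (The statements shown are an illustrative sketch.)
</p>
<codeblock>SET SYNC_DDL=0;
CREATE DATABASE staging_db;
CREATE TABLE staging_db.t1 (x INT);
ALTER TABLE staging_db.t1 ADD COLUMNS (y STRING);
SET SYNC_DDL=1;
-- The script finishes only after all nodes see the metadata for this last statement.
CREATE TABLE staging_db.t2 (z STRING);
</codeblock>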
<p>
The classification of DDL, DML, and other statements is not necessarily the same between Impala and Hive.
Impala organizes these statements in a way intended to be familiar to people familiar with relational
databases or data warehouse products. Statements that modify the metastore database, such as <codeph>COMPUTE
STATS</codeph>, are classified as DDL. Statements that only query the metastore database, such as
<codeph>SHOW</codeph> or <codeph>DESCRIBE</codeph>, are put into a separate category of utility statements.
</p>
<note>
The query types shown in the Impala debug web user interface might not match exactly the categories listed
here. For example, currently the <codeph>USE</codeph> statement is shown as DDL in the debug web UI. The
query types shown in the debug web UI are subject to change, for improved consistency.
</note>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
The other major classifications of SQL statements are data manipulation language (see
<xref href="impala_dml.xml#dml"/>) and queries (see <xref href="impala_select.xml#select"/>).
</p>
</conbody>
</concept>


@@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="debug_action">
<title>DEBUG_ACTION Query Option</title>
<titlealts audience="PDF"><navtitle>DEBUG_ACTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Troubleshooting"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DEBUG_ACTION query option</indexterm>
Introduces artificial problem conditions within queries. For internal Cloudera debugging and troubleshooting.
</p>
<p>
<b>Type:</b> <codeph>STRING</codeph>
</p>
<p>
<b>Default:</b> empty string
</p>
</conbody>
</concept>


@@ -0,0 +1,817 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="decimal">
<title>DECIMAL Data Type (<keyword keyref="impala14"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DECIMAL</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
A numeric data type with fixed scale and precision, used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER
TABLE</codeph> statements. Suitable for financial and other arithmetic calculations where the imprecise
representation and rounding behavior of <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> make those types
impractical.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock><varname>column_name</varname> DECIMAL[(<varname>precision</varname>[,<varname>scale</varname>])]</codeblock>
<p>
<codeph>DECIMAL</codeph> with no precision or scale values is equivalent to <codeph>DECIMAL(9,0)</codeph>.
</p>
<p>
<b>Precision and Scale:</b>
</p>
<p>
<varname>precision</varname> represents the total number of digits that can be represented by the column,
regardless of the location of the decimal point. This value must be between 1 and 38. For example,
representing integer values up to 9999, and floating-point values up to 99.99, both require a precision of 4.
You can also represent corresponding negative values, without any change in the precision. For example, the
range -9999 to 9999 still only requires a precision of 4.
</p>
<p>
<varname>scale</varname> represents the number of fractional digits. This value must be less than or equal to
<varname>precision</varname>. A scale of 0 produces integral values, with no fractional part. If precision
and scale are equal, all the digits come after the decimal point, making all the values between 0 and
0.999... or 0 and -0.999...
</p>
<p>
When <varname>precision</varname> and <varname>scale</varname> are omitted, a <codeph>DECIMAL</codeph> value
is treated as <codeph>DECIMAL(9,0)</codeph>, that is, an integer value ranging from
<codeph>-999,999,999</codeph> to <codeph>999,999,999</codeph>. This is the largest <codeph>DECIMAL</codeph>
value that can still be represented in 4 bytes. If precision is specified but scale is omitted, Impala uses a
value of zero for the scale.
</p>
<p>
Both <varname>precision</varname> and <varname>scale</varname> must be specified as integer literals, not any
other kind of constant expressions.
</p>
<p>
To check the precision or scale for arbitrary values, you can call the
<xref href="impala_math_functions.xml#math_functions"><codeph>precision()</codeph> and
<codeph>scale()</codeph> built-in functions</xref>. For example, you might use these values to figure out how
many characters are required for various fields in a report, or to understand the rounding characteristics of
a formula as applied to a particular <codeph>DECIMAL</codeph> column.
</p>
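    <p>
      For example, here is a brief sketch of checking the precision and scale of an expression
      (the literal value is arbitrary):
    </p>
<codeblock>-- Check the precision and scale of a DECIMAL expression.
select precision(cast(99.44 as decimal(5,2))), scale(cast(99.44 as decimal(5,2)));
</codeblock>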
<p>
<b>Range:</b>
</p>
<p>
The maximum precision value is 38. Thus, the largest integral value is represented by
<codeph>DECIMAL(38,0)</codeph> (999... with 9 repeated 38 times). The most precise fractional value (between
0 and 1, or 0 and -1) is represented by <codeph>DECIMAL(38,38)</codeph>, with 38 digits to the right of the
decimal point. The value closest to 0 would be .0000...1 (37 zeros and the final 1). The value closest to 1
would be .999... (9 repeated 38 times).
</p>
<p>
For a given precision and scale, the range of <codeph>DECIMAL</codeph> values is the same in the positive and
negative directions. For example, <codeph>DECIMAL(4,2)</codeph> can represent from -99.99 to 99.99. This is
different from other integral numeric types where the positive and negative bounds differ slightly.
</p>
<p>
When you use <codeph>DECIMAL</codeph> values in arithmetic expressions, the precision and scale of the result
value are determined as follows:
</p>
<ul>
<li>
<p>
For addition and subtraction, the precision and scale are based on the maximum possible result, that is,
if all the digits of the input values were 9s and the absolute values were added together.
</p>
<!-- Seems like buggy output from this first query, so hiding the example for the time being. -->
<codeblock audience="Cloudera"><![CDATA[[localhost:21000] > select 50000.5 + 12.444, precision(50000.5 + 12.444), scale(50000.5 + 12.444);
+------------------+-----------------------------+-------------------------+
| 50000.5 + 12.444 | precision(50000.5 + 12.444) | scale(50000.5 + 12.444) |
+------------------+-----------------------------+-------------------------+
| 50012.944 | 9 | 3 |
+------------------+-----------------------------+-------------------------+
[localhost:21000] > select 99999.9 + 99.999, precision(99999.9 + 99.999), scale(99999.9 + 99.999);
+------------------+-----------------------------+-------------------------+
| 99999.9 + 99.999 | precision(99999.9 + 99.999) | scale(99999.9 + 99.999) |
+------------------+-----------------------------+-------------------------+
| 100099.899 | 9 | 3 |
+------------------+-----------------------------+-------------------------+
]]>
</codeblock>
</li>
<li>
<p>
For multiplication, the precision is the sum of the precisions of the input values. The scale is the sum
of the scales of the input values.
</p>
</li>
<!-- Need to add some specifics to discussion of division. Details here: http://blogs.msdn.com/b/sqlprogrammability/archive/2006/03/29/564110.aspx -->
<li>
<p>
For division, Impala sets the precision and scale to values large enough to represent the whole and
fractional parts of the result.
</p>
</li>
<li>
<p>
For <codeph>UNION</codeph>, the scale is the larger of the scales of the input values, and the precision
is increased if necessary to accommodate any additional fractional digits. If the same input value has
the largest precision and the largest scale, the result value has the same precision and scale. If one
value has a larger precision but smaller scale, the scale of the result value is increased. For example,
<codeph>DECIMAL(20,2) UNION DECIMAL(8,6)</codeph> produces a result of type
<codeph>DECIMAL(24,6)</codeph>. The extra 4 fractional digits of scale (6-2) are accommodated by
extending the precision by the same amount (20+4).
</p>
</li>
<li>
<p>
          To double-check, you can always call the <codeph>PRECISION()</codeph> and <codeph>SCALE()</codeph>
functions on the results of an arithmetic expression to see the relevant values, or use a <codeph>CREATE
TABLE AS SELECT</codeph> statement to define a column based on the return type of the expression.
</p>
</li>
</ul>
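    <p>
      For example, the following hypothetical query applies the multiplication rule described above:
      multiplying a <codeph>DECIMAL(2,1)</codeph> value by a <codeph>DECIMAL(3,2)</codeph> value is
      expected to produce a result with precision 2+3=5 and scale 1+2=3.
    </p>
<codeblock>-- Multiplication: result precision = sum of precisions, result scale = sum of scales.
select precision(cast(1.5 as decimal(2,1)) * cast(2.25 as decimal(3,2))),
       scale(cast(1.5 as decimal(2,1)) * cast(2.25 as decimal(3,2)));
</codeblock>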
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<ul>
<li>
        The <codeph>DECIMAL</codeph> type is supported only in <keyword keyref="impala14_full"/> and higher.
</li>
<li>
Use the <codeph>DECIMAL</codeph> data type in Impala for applications where you used the
<codeph>NUMBER</codeph> data type in Oracle. The Impala <codeph>DECIMAL</codeph> type does not support the
Oracle idioms of <codeph>*</codeph> for scale or negative values for precision.
</li>
</ul>
<p>
<b>Conversions and casting:</b>
</p>
<p>
<ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
</p>
<p>
Impala automatically converts between <codeph>DECIMAL</codeph> and other numeric types where possible. A
<codeph>DECIMAL</codeph> with zero scale is converted to or from the smallest appropriate integral type. A
<codeph>DECIMAL</codeph> with a fractional part is automatically converted to or from the smallest
appropriate floating-point type. If the destination type does not have sufficient precision or scale to hold
all possible values of the source type, Impala raises an error and does not convert the value.
</p>
<p>
For example, these statements show how expressions of <codeph>DECIMAL</codeph> and other types are reconciled
to the same type in the context of <codeph>UNION</codeph> queries and <codeph>INSERT</codeph> statements:
</p>
<codeblock><![CDATA[[localhost:21000] > select cast(1 as int) as x union select cast(1.5 as decimal(9,4)) as x;
+----------------+
| x |
+----------------+
| 1.5000 |
| 1.0000 |
+----------------+
[localhost:21000] > create table int_vs_decimal as select cast(1 as int) as x union select cast(1.5 as decimal(9,4)) as x;
+-------------------+
| summary |
+-------------------+
| Inserted 2 row(s) |
+-------------------+
[localhost:21000] > desc int_vs_decimal;
+------+---------------+---------+
| name | type | comment |
+------+---------------+---------+
| x | decimal(14,4) | |
+------+---------------+---------+
]]>
</codeblock>
<p>
To avoid potential conversion errors, you can use <codeph>CAST()</codeph> to convert <codeph>DECIMAL</codeph>
values to <codeph>FLOAT</codeph>, <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>,
<codeph>BIGINT</codeph>, <codeph>STRING</codeph>, <codeph>TIMESTAMP</codeph>, or <codeph>BOOLEAN</codeph>.
You can use exponential notation in <codeph>DECIMAL</codeph> literals or when casting from
<codeph>STRING</codeph>, for example <codeph>1.0e6</codeph> to represent one million.
</p>
<p>
If you cast a value with more fractional digits than the scale of the destination type, any extra fractional
digits are truncated (not rounded). Casting a value to a target type with not enough precision produces a
result of <codeph>NULL</codeph> and displays a runtime warning.
</p>
<codeblock><![CDATA[[localhost:21000] > select cast(1.239 as decimal(3,2));
+-----------------------------+
| cast(1.239 as decimal(3,2)) |
+-----------------------------+
| 1.23 |
+-----------------------------+
[localhost:21000] > select cast(1234 as decimal(3));
+----------------------------+
| cast(1234 as decimal(3,0)) |
+----------------------------+
| NULL |
+----------------------------+
WARNINGS: Expression overflowed, returning NULL
]]>
</codeblock>
<p>
When you specify integer literals, for example in <codeph>INSERT ... VALUES</codeph> statements or arithmetic
expressions, those numbers are interpreted as the smallest applicable integer type. You must use
<codeph>CAST()</codeph> calls for some combinations of integer literals and <codeph>DECIMAL</codeph>
precision. For example, <codeph>INT</codeph> has a maximum value that is 10 digits long,
<codeph>TINYINT</codeph> has a maximum value that is 3 digits long, and so on. If you specify a value such as
123456 to go into a <codeph>DECIMAL</codeph> column, Impala checks if the column has enough precision to
represent the largest value of that integer type, and raises an error if not. Therefore, use an expression
      like <codeph>CAST(123456 AS DECIMAL(9,0))</codeph> for <codeph>DECIMAL</codeph> columns with precision 9 or
      less, <codeph>CAST(50 AS DECIMAL(2,0))</codeph> for <codeph>DECIMAL</codeph> columns with precision 2 or
less, and so on. For <codeph>DECIMAL</codeph> columns with precision 10 or greater, Impala automatically
interprets the value as the correct <codeph>DECIMAL</codeph> type; however, because
<codeph>DECIMAL(10)</codeph> requires 8 bytes of storage while <codeph>DECIMAL(9)</codeph> requires only 4
bytes, only use precision of 10 or higher when actually needed.
</p>
<codeblock><![CDATA[[localhost:21000] > create table decimals_9_0 (x decimal);
[localhost:21000] > insert into decimals_9_0 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000);
ERROR: AnalysisException: Possible loss of precision for target table 'decimal_testing.decimals_9_0'.
Expression '1' (type: INT) would need to be cast to DECIMAL(9,0) for column 'x'
[localhost:21000] > insert into decimals_9_0 values (cast(1 as decimal)), (cast(2 as decimal)), (cast(4 as decimal)), (cast(8 as decimal)), (cast(16 as decimal)), (cast(1024 as decimal)), (cast(32768 as decimal)), (cast(65536 as decimal)), (cast(1000000 as decimal));
[localhost:21000] > create table decimals_10_0 (x decimal(10,0));
[localhost:21000] > insert into decimals_10_0 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000);
]]>
</codeblock>
<p>
Be aware that in memory and for binary file formats such as Parquet or Avro, <codeph>DECIMAL(10)</codeph> or
higher consumes 8 bytes while <codeph>DECIMAL(9)</codeph> (the default for <codeph>DECIMAL</codeph>) or lower
consumes 4 bytes. Therefore, to conserve space in large tables, use the smallest-precision
<codeph>DECIMAL</codeph> type that is appropriate and <codeph>CAST()</codeph> literal values where necessary,
rather than declaring <codeph>DECIMAL</codeph> columns with high precision for convenience.
</p>
<p>
To represent a very large or precise <codeph>DECIMAL</codeph> value as a literal, for example one that
contains more digits than can be represented by a <codeph>BIGINT</codeph> literal, use a quoted string or a
floating-point value for the number, and <codeph>CAST()</codeph> to the desired <codeph>DECIMAL</codeph>
type:
</p>
<codeblock>insert into decimals_38_5 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000),
(cast("999999999999999999999999999999" as decimal(38,5))),
(cast(999999999999999999999999999999. as decimal(38,5)));
</codeblock>
<ul>
<li>
        <p> The result of the <codeph>SUM()</codeph> aggregate function on
          <codeph>DECIMAL</codeph> values is promoted to a precision of 38,
          with the same scale as the underlying column. Thus, the result can
          represent the largest possible value at that particular scale. </p>
</li>
<li>
<p>
<codeph>STRING</codeph> columns, literals, or expressions can be converted to <codeph>DECIMAL</codeph> as
long as the overall number of digits and digits to the right of the decimal point fit within the
specified precision and scale for the declared <codeph>DECIMAL</codeph> type. By default, a
<codeph>DECIMAL</codeph> value with no specified scale or precision can hold a maximum of 9 digits of an
integer value. If there are more digits in the string value than are allowed by the
<codeph>DECIMAL</codeph> scale and precision, the result is <codeph>NULL</codeph>.
</p>
<p>
The following examples demonstrate how <codeph>STRING</codeph> values with integer and fractional parts
are represented when converted to <codeph>DECIMAL</codeph>. If the scale is 0, the number is treated
          as an integer value with a maximum of <varname>precision</varname> digits. If the scale is greater than
          0, the precision must be increased to account for the digits both to the left and right of the decimal point.
          As the scale increases, output values are printed with additional trailing zeros after the decimal
          point if needed. Any trailing zeros after the decimal point in the <codeph>STRING</codeph> value must fit
          within the number of digits specified by the scale.
</p>
<codeblock><![CDATA[[localhost:21000] > select cast('100' as decimal); -- Small integer value fits within 9 digits of scale.
+-----------------------------+
| cast('100' as decimal(9,0)) |
+-----------------------------+
| 100 |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,0)); -- Small integer value fits within 3 digits of precision.
+-----------------------------+
| cast('100' as decimal(3,0)) |
+-----------------------------+
| 100 |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(2,0)); -- 2 digits of precision is not enough!
+-----------------------------+
| cast('100' as decimal(2,0)) |
+-----------------------------+
| NULL |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,1)); -- (3,1) = 2 digits left of the decimal point, 1 to the right. Not enough.
+-----------------------------+
| cast('100' as decimal(3,1)) |
+-----------------------------+
| NULL |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(4,1)); -- 4 digits total, 1 to the right of the decimal point.
+-----------------------------+
| cast('100' as decimal(4,1)) |
+-----------------------------+
| 100.0 |
+-----------------------------+
[localhost:21000] > select cast('98.6' as decimal(3,1)); -- (3,1) can hold a 3 digit number with 1 fractional digit.
+------------------------------+
| cast('98.6' as decimal(3,1)) |
+------------------------------+
| 98.6 |
+------------------------------+
[localhost:21000] > select cast('98.6' as decimal(15,1)); -- Larger precision allows bigger numbers but still only 1 fractional digit.
+-------------------------------+
| cast('98.6' as decimal(15,1)) |
+-------------------------------+
| 98.6 |
+-------------------------------+
[localhost:21000] > select cast('98.6' as decimal(15,5)); -- Larger scale allows more fractional digits, outputs trailing zeros.
+-------------------------------+
| cast('98.6' as decimal(15,5)) |
+-------------------------------+
| 98.60000 |
+-------------------------------+
[localhost:21000] > select cast('98.60000' as decimal(15,1)); -- Trailing zeros in the string must fit within 'scale' digits (1 in this case).
+-----------------------------------+
| cast('98.60000' as decimal(15,1)) |
+-----------------------------------+
| NULL |
+-----------------------------------+
]]>
</codeblock>
</li>
<li>
        Most built-in arithmetic functions such as <codeph>SIN()</codeph> and <codeph>COS()</codeph> continue to
        accept only <codeph>DOUBLE</codeph> values because they are so commonly used in scientific contexts for
        calculations of IEEE 754-compliant values. The built-in functions that accept and return
<codeph>DECIMAL</codeph> are:
<!-- List from Skye: positive, negative, least, greatest, fnv_hash, if, nullif, zeroifnull, isnull, coalesce -->
<!-- Nong had already told me about abs, ceil, floor, round, truncate -->
<ul>
<li>
<codeph>ABS()</codeph>
</li>
<li>
<codeph>CEIL()</codeph>
</li>
<li>
<codeph>COALESCE()</codeph>
</li>
<li>
<codeph>FLOOR()</codeph>
</li>
<li>
<codeph>FNV_HASH()</codeph>
</li>
<li>
<codeph>GREATEST()</codeph>
</li>
<li>
<codeph>IF()</codeph>
</li>
<li>
<codeph>ISNULL()</codeph>
</li>
<li>
<codeph>LEAST()</codeph>
</li>
<li>
<codeph>NEGATIVE()</codeph>
</li>
<li>
<codeph>NULLIF()</codeph>
</li>
<li>
<codeph>POSITIVE()</codeph>
</li>
<li>
<codeph>PRECISION()</codeph>
</li>
<li>
<codeph>ROUND()</codeph>
</li>
<li>
<codeph>SCALE()</codeph>
</li>
<li>
<codeph>TRUNCATE()</codeph>
</li>
<li>
<codeph>ZEROIFNULL()</codeph>
</li>
</ul>
See <xref href="impala_functions.xml#builtins"/> for details.
</li>
<li>
<p>
<codeph>BIGINT</codeph>, <codeph>INT</codeph>, <codeph>SMALLINT</codeph>, and <codeph>TINYINT</codeph>
values can all be cast to <codeph>DECIMAL</codeph>. The number of digits to the left of the decimal point
in the <codeph>DECIMAL</codeph> type must be sufficient to hold the largest value of the corresponding
integer type. Note that integer literals are treated as the smallest appropriate integer type, meaning
          there is sometimes a range of values that require one more digit of <codeph>DECIMAL</codeph> precision than
          you might expect. For integer values, the scale of the <codeph>DECIMAL</codeph> type can be zero; if
          the scale is greater than zero, remember to increase the precision by an equivalent amount to hold
          the required number of digits to the left of the decimal point.
</p>
<p>
The following examples show how different integer types are converted to <codeph>DECIMAL</codeph>.
</p>
<!-- According to Nong, it's a bug that so many integer digits can be converted to a DECIMAL
value with small (s,p) spec. So expect to re-do this example. -->
<codeblock><![CDATA[[localhost:21000] > select cast(1 as decimal(1,0));
+-------------------------+
| cast(1 as decimal(1,0)) |
+-------------------------+
| 1 |
+-------------------------+
[localhost:21000] > select cast(9 as decimal(1,0));
+-------------------------+
| cast(9 as decimal(1,0)) |
+-------------------------+
| 9 |
+-------------------------+
[localhost:21000] > select cast(10 as decimal(1,0));
+--------------------------+
| cast(10 as decimal(1,0)) |
+--------------------------+
| 10 |
+--------------------------+
[localhost:21000] > select cast(10 as decimal(1,1));
+--------------------------+
| cast(10 as decimal(1,1)) |
+--------------------------+
| 10.0 |
+--------------------------+
[localhost:21000] > select cast(100 as decimal(1,1));
+---------------------------+
| cast(100 as decimal(1,1)) |
+---------------------------+
| 100.0 |
+---------------------------+
[localhost:21000] > select cast(1000 as decimal(1,1));
+----------------------------+
| cast(1000 as decimal(1,1)) |
+----------------------------+
| 1000.0 |
+----------------------------+
]]>
</codeblock>
</li>
<li>
<p>
When a <codeph>DECIMAL</codeph> value is converted to any of the integer types, any fractional part is
truncated (that is, rounded towards zero):
</p>
<codeblock><![CDATA[[localhost:21000] > create table num_dec_days (x decimal(4,1));
[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as decimal(4,1)));
[localhost:21000] > insert into num_dec_days values (cast(0.1 as decimal(4,1))), (cast(.9 as decimal(4,1))), (cast(9.1 as decimal(4,1))), (cast(9.9 as decimal(4,1)));
[localhost:21000] > select cast(x as int) from num_dec_days;
+----------------+
| cast(x as int) |
+----------------+
| 1 |
| 2 |
| 4 |
| 0 |
| 0 |
| 9 |
| 9 |
+----------------+
]]>
</codeblock>
</li>
<li>
<p>
You cannot directly cast <codeph>TIMESTAMP</codeph> or <codeph>BOOLEAN</codeph> values to or from
<codeph>DECIMAL</codeph> values. You can turn a <codeph>DECIMAL</codeph> value into a time-related
representation using a two-step process, by converting it to an integer value and then using that result
in a call to a date and time function such as <codeph>from_unixtime()</codeph>.
</p>
<codeblock><![CDATA[[localhost:21000] > select from_unixtime(cast(cast(1000.0 as decimal) as bigint));
+-------------------------------------------------------------+
| from_unixtime(cast(cast(1000.0 as decimal(9,0)) as bigint)) |
+-------------------------------------------------------------+
| 1970-01-01 00:16:40 |
+-------------------------------------------------------------+
[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days; -- x is a DECIMAL column.
[localhost:21000] > create table num_dec_days (x decimal(4,1));
[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as decimal(4,1)));
[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days; -- The 4.5 value is truncated to 4 and becomes '4 days'.
+--------------------------------------+
| now() + interval cast(x as int) days |
+--------------------------------------+
| 2014-05-13 23:11:55.163284000 |
| 2014-05-14 23:11:55.163284000 |
| 2014-05-16 23:11:55.163284000 |
+--------------------------------------+
]]>
</codeblock>
</li>
<li>
<p>
Because values in <codeph>INSERT</codeph> statements are checked rigorously for type compatibility, be
prepared to use <codeph>CAST()</codeph> function calls around literals, column references, or other
expressions that you are inserting into a <codeph>DECIMAL</codeph> column.
</p>
</li>
</ul>
<p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/>
<p>
<b>DECIMAL differences from integer and floating-point types:</b>
</p>
<p>
With the <codeph>DECIMAL</codeph> type, you are concerned with the number of overall digits of a number
rather than powers of 2 (as in <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, and so on). Therefore,
the limits with integral values of <codeph>DECIMAL</codeph> types fall around 99, 999, 9999, and so on rather
      than 32767, 65535, 2<sup>32</sup>-1, and so on. For fractional values, you do not need to account for imprecise representation of the
      fractional part according to the IEEE-754 standard (as in <codeph>FLOAT</codeph> and
<codeph>DOUBLE</codeph>). Therefore, when you insert a fractional value into a <codeph>DECIMAL</codeph>
column, you can compare, sum, query, <codeph>GROUP BY</codeph>, and so on that column and get back the
original values rather than some <q>close but not identical</q> value.
</p>
<p>
<codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> can cause problems or unexpected behavior due to inability
to precisely represent certain fractional values, for example dollar and cents values for currency. You might
find output values slightly different than you inserted, equality tests that do not match precisely, or
unexpected values for <codeph>GROUP BY</codeph> columns. <codeph>DECIMAL</codeph> can help reduce unexpected
behavior and rounding errors, at the expense of some performance overhead for assignments and comparisons.
</p>
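    <p>
      As a hypothetical illustration (the table names are examples only), summing cent values stored
      as <codeph>DECIMAL</codeph> returns the exact total, while the same values stored as
      <codeph>DOUBLE</codeph> might produce a total that is very close to, but not exactly equal to,
      the expected amount:
    </p>
<codeblock>create table exact_cents (amt decimal(8,2));
create table approx_cents (amt double);
insert into exact_cents values (cast(0.10 as decimal(8,2))), (cast(0.20 as decimal(8,2)));
insert into approx_cents values (0.1), (0.2);
-- The DECIMAL total is exactly 0.30.
select sum(amt) from exact_cents;
-- The DOUBLE total might display as something like 0.30000000000000004.
select sum(amt) from approx_cents;
</codeblock>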
<p>
<b>Literals and expressions:</b>
<ul>
<li>
<p>
When you use an integer literal such as <codeph>1</codeph> or <codeph>999</codeph> in a SQL statement,
depending on the context, Impala will treat it as either the smallest appropriate
<codeph>DECIMAL</codeph> type, or the smallest integer type (<codeph>TINYINT</codeph>,
<codeph>SMALLINT</codeph>, <codeph>INT</codeph>, or <codeph>BIGINT</codeph>). To minimize memory usage,
Impala prefers to treat the literal as the smallest appropriate integer type.
</p>
</li>
<li>
<p>
When you use a floating-point literal such as <codeph>1.1</codeph> or <codeph>999.44</codeph> in a SQL
statement, depending on the context, Impala will treat it as either the smallest appropriate
<codeph>DECIMAL</codeph> type, or the smallest floating-point type (<codeph>FLOAT</codeph> or
<codeph>DOUBLE</codeph>). To avoid loss of accuracy, Impala prefers to treat the literal as a
<codeph>DECIMAL</codeph>.
</p>
</li>
</ul>
</p>
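    <p>
      As a hedged sketch of how to observe this behavior, you can use the <codeph>CREATE TABLE AS
      SELECT</codeph> technique mentioned earlier to see how Impala types particular literals
      (the table and column names are hypothetical, and the resulting types can vary by Impala version):
    </p>
<codeblock>create table literal_types as select 1 as int_literal, 999.44 as fp_literal;
describe literal_types;
</codeblock>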
<p>
<b>Storage considerations:</b>
</p>
<ul>
<li>
Only the precision determines the storage size for <codeph>DECIMAL</codeph> values; the scale setting has
no effect on the storage size.
</li>
<li>
Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these text-based file formats,
leading zeros are not stored, but trailing zeros are stored. In these tables, each <codeph>DECIMAL</codeph>
value takes up as many bytes as there are digits in the value, plus an extra byte if the decimal point is
present and an extra byte for negative values. Once the values are loaded into memory, they are represented
in 4, 8, or 16 bytes as described in the following list items. The on-disk representation varies depending
on the file format of the table.
</li>
<!-- Next couple of points can be conref'ed with identical list bullets farther down under File Format Considerations. -->
<li>
        Parquet and Avro tables use binary formats. In these tables, Impala stores each value in as few bytes as
possible
<!-- 4, 8, or 16 bytes -->
depending on the precision specified for the <codeph>DECIMAL</codeph> column.
<ul>
<li>
In memory, <codeph>DECIMAL</codeph> values with precision of 9 or less are stored in 4 bytes.
</li>
<li>
In memory, <codeph>DECIMAL</codeph> values with precision of 10 through 18 are stored in 8 bytes.
</li>
<li>
In memory, <codeph>DECIMAL</codeph> values with precision greater than 18 are stored in 16 bytes.
</li>
</ul>
</li>
</ul>
<p conref="../shared/impala_common.xml#common/file_format_blurb"/>
<ul>
<li>
The <codeph>DECIMAL</codeph> data type can be stored in any of the file formats supported by Impala, as
described in <xref href="impala_file_formats.xml#file_formats"/>. Impala only writes to tables that use the
Parquet and text formats, so those formats are the focus for file format compatibility.
</li>
<li>
Impala can query Avro, RCFile, or SequenceFile tables containing <codeph>DECIMAL</codeph> columns, created
by other Hadoop components, on CDH 5 only.
</li>
<li>
You can use <codeph>DECIMAL</codeph> columns in Impala tables that are mapped to HBase tables. Impala can
query and insert into such tables.
</li>
<li>
Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these tables, each
<codeph>DECIMAL</codeph> value takes up as many bytes as there are digits in the value, plus an extra byte
if the decimal point is present. The binary format of Parquet or Avro files offers more compact storage for
<codeph>DECIMAL</codeph> columns.
</li>
<li>
        Parquet and Avro tables use binary formats. In these tables, Impala stores each value in 4, 8, or 16 bytes
depending on the precision specified for the <codeph>DECIMAL</codeph> column.
</li>
<li>
Parquet files containing <codeph>DECIMAL</codeph> columns are not expected to be readable under CDH 4. See
the <b>Compatibility</b> section for details.
</li>
</ul>
<p>
<b>UDF considerations:</b> When writing a C++ UDF, use the <codeph>DecimalVal</codeph> data type defined in
<filepath>/usr/include/impala_udf/udf.h</filepath>.
</p>
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
<p>
You can use a <codeph>DECIMAL</codeph> column as a partition key. Doing so provides a better match between
the partition key values and the HDFS directory names than using a <codeph>DOUBLE</codeph> or
<codeph>FLOAT</codeph> partitioning column:
</p>
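    <p>
      For example, here is a hypothetical table partitioned by a <codeph>DECIMAL</codeph> column
      (the table and column names are illustrative only):
    </p>
<codeblock>-- Partition directory names reflect the exact DECIMAL values, for example tax_rate=8.25.
create table sales_by_rate (id bigint, amount decimal(10,2))
  partitioned by (tax_rate decimal(4,2));
</codeblock>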
<p conref="../shared/impala_common.xml#common/schema_evolution_blurb"/>
<ul>
<li>
For text-based formats (text, RCFile, and SequenceFile tables), you can issue an <codeph>ALTER TABLE ...
REPLACE COLUMNS</codeph> statement to change the precision and scale of an existing
<codeph>DECIMAL</codeph> column. As long as the values in the column fit within the new precision and
scale, they are returned correctly by a query. Any values that do not fit within the new precision and
scale are returned as <codeph>NULL</codeph>, and Impala reports the conversion error. Leading zeros do not
count against the precision value, but trailing zeros after the decimal point do.
<codeblock><![CDATA[[localhost:21000] > create table text_decimals (x string);
[localhost:21000] > insert into text_decimals values ("1"), ("2"), ("99.99"), ("1.234"), ("000001"), ("1.000000000");
[localhost:21000] > select * from text_decimals;
+-------------+
| x |
+-------------+
| 1 |
| 2 |
| 99.99 |
| 1.234 |
| 000001 |
| 1.000000000 |
+-------------+
[localhost:21000] > alter table text_decimals replace columns (x decimal(4,2));
[localhost:21000] > select * from text_decimals;
+-------+
| x |
+-------+
| 1.00 |
| 2.00 |
| 99.99 |
| NULL |
| 1.00 |
| NULL |
+-------+
ERRORS:
Backend 0:Error converting column: 0 TO DECIMAL(4, 2) (Data is: 1.234)
file: hdfs://127.0.0.1:8020/user/hive/warehouse/decimal_testing.db/text_decimals/634d4bd3aa0
e8420-b4b13bab7f1be787_56794587_data.0
record: 1.234
Error converting column: 0 TO DECIMAL(4, 2) (Data is: 1.000000000)
file: hdfs://127.0.0.1:8020/user/hive/warehouse/decimal_testing.db/text_decimals/cd40dc68e20
c565a-cc4bd86c724c96ba_311873428_data.0
record: 1.000000000
]]>
</codeblock>
</li>
<li>
For binary formats (Parquet and Avro tables), although an <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph>
statement that changes the precision or scale of a <codeph>DECIMAL</codeph> column succeeds, any subsequent
attempt to query the changed column results in a fatal error. (The other columns can still be queried
successfully.) This is because the metadata about the columns is stored in the data files themselves, and
<codeph>ALTER TABLE</codeph> does not actually make any updates to the data files. If the metadata in the
data files disagrees with the metadata in the metastore database, Impala cancels the query.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>CREATE TABLE t1 (x DECIMAL, y DECIMAL(5,2), z DECIMAL(25,0));
INSERT INTO t1 VALUES (5, 99.44, 123456), (300, 6.7, 999999999);
SELECT x+y, ROUND(y,1), z/98.6 FROM t1;
SELECT CAST(1000.5 AS DECIMAL);
</codeblock>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p conref="../shared/impala_common.xml#common/decimal_no_stats"/>
<!-- <p conref="../shared/impala_common.xml#common/partitioning_good"/> -->
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
<p conref="../shared/impala_common.xml#common/parquet_ok"/>
<p conref="../shared/impala_common.xml#common/text_bulky"/>
<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
<!-- <p conref="../shared/impala_common.xml#common/internals_blurb"/> -->
<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>,
<xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>,
<xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>,
<xref href="impala_math_functions.xml#math_functions"/> (especially <codeph>PRECISION()</codeph> and
<codeph>SCALE()</codeph>)
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,37 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="obwl" id="default_order_by_limit">
<title>DEFAULT_ORDER_BY_LIMIT Query Option</title>
<titlealts audience="PDF"><navtitle>DEFAULT_ORDER_BY_LIMIT</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p conref="../shared/impala_common.xml#common/obwl_query_options"/>
<p rev="1.4.0">
      Prior to Impala 1.4.0, Impala queries that used the <codeph><xref href="impala_order_by.xml#order_by">ORDER
      BY</xref></codeph> clause were also required to include a
      <codeph><xref href="impala_limit.xml#limit">LIMIT</xref></codeph> clause, to avoid accidentally producing
huge result sets that must be sorted. Sorting a huge result set is a memory-intensive operation. In Impala
1.4.0 and higher, Impala uses a temporary disk work area to perform the sort if that operation would
otherwise exceed the Impala memory limit on a particular host.
</p>
<p>
      <b>Type:</b> numeric
</p>
<p>
<b>Default:</b> -1 (no default limit)
</p>
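    <p>
      For example, in releases prior to Impala 1.4.0 you could set this option so that
      <codeph>ORDER BY</codeph> queries without a <codeph>LIMIT</codeph> clause are given an
      implicit one (the table and column names below are hypothetical):
    </p>
<codeblock>set default_order_by_limit=1000;
-- Behaves as if LIMIT 1000 were appended to the query.
select c_name from customer order by c_name;
</codeblock>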
</conbody>
</concept>

View File

@@ -0,0 +1,88 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="delegation">
<title>Configuring Impala Delegation for Hue and BI Tools</title>
<prolog>
<metadata>
<data name="Category" value="Security"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Authentication"/>
<data name="Category" value="Delegation"/>
<data name="Category" value="Hue"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<!--
When users connect to Impala directly through the <cmdname>impala-shell</cmdname> interpreter, the Sentry
authorization framework determines what actions they can take and what data they can see.
-->
When users submit Impala queries through a separate application, such as Hue or a business intelligence tool,
typically all requests are treated as coming from the same user. In Impala 1.2 and higher, authentication is
extended by a new feature that allows applications to pass along credentials for the users that connect to
them (known as <q>delegation</q>), and issue Impala queries with the privileges for those users. Currently,
the delegation feature is available only for Impala queries submitted through application interfaces such as
Hue and BI tools; for example, Impala cannot issue queries using the privileges of the HDFS user.
</p>
<p>
The delegation feature is enabled by a startup option for <cmdname>impalad</cmdname>:
<codeph>--authorized_proxy_user_config</codeph>. When you specify this option, users whose names you specify
(such as <codeph>hue</codeph>) can delegate the execution of a query to another user. The query runs with the
privileges of the delegated user, not the original user such as <codeph>hue</codeph>. The name of the
delegated user is passed using the HiveServer2 configuration property <codeph>impala.doas.user</codeph>.
</p>
<p>
You can specify a list of users that the application user can delegate to, or <codeph>*</codeph> to allow a
superuser to delegate to any other user. For example:
</p>
<codeblock>impalad --authorized_proxy_user_config 'hue=user1,user2;admin=*' ...</codeblock>
<note>
Make sure to use single quotes or escape characters to ensure that any <codeph>*</codeph> characters do not
undergo wildcard expansion when specified in command-line arguments.
</note>
<p>
See <xref href="impala_config_options.xml#config_options"/> for details about adding or changing
<cmdname>impalad</cmdname> startup options. See
<xref href="http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/" scope="external" format="html">this
Cloudera blog post</xref> for background information about the delegation capability in HiveServer2.
</p>
<p>
To set up authentication for the delegated users:
</p>
<ul>
<li>
<p>
On the server side, configure either user/password authentication through LDAP, or Kerberos
authentication, for all the delegated users. See <xref href="impala_ldap.xml#ldap"/> or
<xref href="impala_kerberos.xml#kerberos"/> for details.
</p>
</li>
<li>
<p>
On the client side, follow the instructions in the <q>Using User Name and Password</q> section in the
<xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/PDF/Cloudera-ODBC-Driver-for-Impala-Install-Guide.pdf" scope="external" format="pdf">ODBC
driver installation guide</xref>. Then search for <q>delegation</q> in that same installation guide to
learn about the <uicontrol>Delegation UID</uicontrol> field and <codeph>DelegationUID</codeph> configuration keyword to enable the delegation feature for
ODBC-based BI tools.
</p>
</li>
</ul>
</conbody>
</concept>

View File

@@ -0,0 +1,65 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="delete">
<title>DELETE Statement (<keyword keyref="impala28"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DELETE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Kudu"/>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
<data name="Category" value="DML"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DELETE statement</indexterm>
Deletes one or more rows from a Kudu table.
Although deleting a single row or a range of rows would be inefficient for tables using HDFS
data files, Kudu is able to perform this operation efficiently. Therefore, this statement
only works for Impala tables that use the Kudu storage engine.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>
</codeblock>
<p rev="kudu">
      Normally, a <codeph>DELETE</codeph> operation for a Kudu table fails if
      some partition key values are not found, due to their being deleted or changed
      by a concurrent <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> operation.
      Specify <codeph>DELETE IGNORE <varname>rest_of_statement</varname></codeph> to
      make the <codeph>DELETE</codeph> continue in this case. The rows with the nonexistent
      partition key values are not removed.
</p>
<p conref="../shared/impala_common.xml#common/dml_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<note conref="../shared/impala_common.xml#common/compute_stats_next"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>
</codeblock>
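    <p>
      For instance, here is a hypothetical sketch of a single-row delete and a range delete
      against a Kudu table (the table and column names are assumptions):
    </p>
<codeblock>-- Remove a single row identified by its primary key value.
delete from kudu_events where event_id = 12345;
-- Remove a range of rows matching a condition.
delete from kudu_events where event_date between '2016-01-01' and '2016-01-31';
</codeblock>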
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_kudu.xml#impala_kudu"/>
</p>
</conbody>
</concept>

View File

@@ -0,0 +1,689 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="describe">
<title id="desc">DESCRIBE Statement</title>
<titlealts audience="PDF"><navtitle>DESCRIBE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DESCRIBE statement</indexterm>
The <codeph>DESCRIBE</codeph> statement displays metadata about a table, such as the column names and their
data types.
<ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, you can specify the name of a complex type column, which takes
the form of a dotted path. The path might include multiple components in the case of a nested type definition.</ph>
<ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, the <codeph>DESCRIBE DATABASE</codeph> form can display
information about a database.</ph>
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock rev="2.5.0">DESCRIBE [DATABASE] [FORMATTED|EXTENDED] <varname>object_name</varname>
object_name ::=
[<varname>db_name</varname>.]<varname>table_name</varname>[.<varname>complex_col_name</varname> ...]
| <varname>db_name</varname>
</codeblock>
<p>
You can use the abbreviation <codeph>DESC</codeph> for the <codeph>DESCRIBE</codeph> statement.
</p>
<p rev="1.1">
The <codeph>DESCRIBE FORMATTED</codeph> variation displays additional information, in a format familiar to
users of Apache Hive. The extra information includes low-level details such as whether the table is internal
or external, when it was created, the file format, the location of the data in HDFS, whether the object is a
table or a view, and (for views) the text of the query from the view definition.
</p>
<note>
The <codeph>Compressed</codeph> field is not a reliable indicator of whether the table contains compressed
      data. It typically shows <codeph>No</codeph>, because the compression settings only apply during the
session that loads data and are not stored persistently with the table metadata.
</note>
<p rev="2.5.0 IMPALA-2196">
<b>Describing databases:</b>
</p>
<p rev="2.5.0">
By default, the <codeph>DESCRIBE</codeph> output for a database includes the location
and the comment, which can be set by the <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph>
clauses on the <codeph>CREATE DATABASE</codeph> statement.
</p>
<p rev="2.5.0">
The additional information displayed by the <codeph>FORMATTED</codeph> or <codeph>EXTENDED</codeph>
keyword includes the HDFS user ID that is considered the owner of the database, and any
optional database properties. The properties could be specified by the <codeph>WITH DBPROPERTIES</codeph>
clause if the database is created using a Hive <codeph>CREATE DATABASE</codeph> statement.
Impala currently does not set or do any special processing based on those properties.
</p>
<p rev="2.5.0">
The following examples show the variations in syntax and output for
describing databases. This feature is available in <keyword keyref="impala25_full"/>
and higher.
</p>
<codeblock rev="2.5.0">
describe database default;
+---------+----------------------+-----------------------+
| name | location | comment |
+---------+----------------------+-----------------------+
| default | /user/hive/warehouse | Default Hive database |
+---------+----------------------+-----------------------+
describe database formatted default;
+---------+----------------------+-----------------------+
| name | location | comment |
+---------+----------------------+-----------------------+
| default | /user/hive/warehouse | Default Hive database |
| Owner: | | |
| | public | ROLE |
+---------+----------------------+-----------------------+
describe database extended default;
+---------+----------------------+-----------------------+
| name | location | comment |
+---------+----------------------+-----------------------+
| default | /user/hive/warehouse | Default Hive database |
| Owner: | | |
| | public | ROLE |
+---------+----------------------+-----------------------+
</codeblock>
<p>
<b>Describing tables:</b>
</p>
<p>
If the <codeph>DATABASE</codeph> keyword is omitted, the default
for the <codeph>DESCRIBE</codeph> statement is to refer to a table.
</p>
<codeblock>
-- By default, the table is assumed to be in the current database.
describe my_table;
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| x | int | |
| s | string | |
+------+--------+---------+
-- Use a fully qualified table name to specify a table in any database.
describe my_database.my_table;
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| x | int | |
| s | string | |
+------+--------+---------+
-- The formatted or extended output includes additional useful information.
-- The LOCATION field is especially useful to know for DDL statements and HDFS commands
-- during ETL jobs. (The LOCATION includes a full hdfs:// URL, omitted here for readability.)
describe formatted my_table;
+------------------------------+----------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+----------------------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| x | int | NULL |
| s | string | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | my_database | NULL |
| Owner: | jrussell | NULL |
| CreateTime: | Fri Mar 18 15:58:00 PDT 2016 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | /user/hive/warehouse/my_database.db/my_table | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1458341880 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org. ... .LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org. ... .HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+----------------------------------------------+----------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p rev="2.3.0">
Because the column definitions for complex types can become long, particularly when such types are nested,
the <codeph>DESCRIBE</codeph> statement uses special formatting for complex type columns to make the output readable.
</p>
<p rev="2.3.0">
For the <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types available in
<keyword keyref="impala23_full"/> and higher, the <codeph>DESCRIBE</codeph> output is formatted to avoid
excessively long lines for multiple fields within a <codeph>STRUCT</codeph>, or a nested sequence of
complex types.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_describe"/>
<p rev="2.3.0">
For example, here is the <codeph>DESCRIBE</codeph> output for a table containing a single top-level column
of each complex type:
</p>
<codeblock rev="2.3.0"><![CDATA[create table t1 (x int, a array<int>, s struct<f1: string, f2: bigint>, m map<string,int>) stored as parquet;
describe t1;
+------+-----------------+---------+
| name | type | comment |
+------+-----------------+---------+
| x | int | |
| a | array<int> | |
| s | struct< | |
| | f1:string, | |
| | f2:bigint | |
| | > | |
| m | map<string,int> | |
+------+-----------------+---------+
]]>
</codeblock>
<p rev="2.3.0">
Here are examples showing how to <q>drill down</q> into the layouts of complex types, including
using multi-part names to examine the definitions of nested types.
The <codeph>&lt; &gt;</codeph> delimiters identify the columns with complex types;
these are the columns where you can descend another level to see the parts that make up
the complex type.
This technique helps you to understand the multi-part names you use as table references in queries
involving complex types, and the corresponding column names you refer to in the <codeph>SELECT</codeph> list.
These tables are from the <q>nested TPC-H</q> schema, shown in detail in
<xref href="impala_complex_types.xml#complex_sample_schema"/>.
</p>
<p>
The <codeph>REGION</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>
elements:
</p>
<ul>
<li>
<p>
The first <codeph>DESCRIBE</codeph> specifies the table name, to display the definition
of each top-level column.
</p>
</li>
<li>
<p>
The second <codeph>DESCRIBE</codeph> specifies the name of a complex
column, <codeph>REGION.R_NATIONS</codeph>, showing that when you include the name of an <codeph>ARRAY</codeph>
column in a <codeph>FROM</codeph> clause, that table reference acts like a two-column table with
columns <codeph>ITEM</codeph> and <codeph>POS</codeph>.
</p>
</li>
<li>
<p>
The final <codeph>DESCRIBE</codeph> specifies the fully qualified name of the <codeph>ITEM</codeph> field,
to display the layout of its underlying <codeph>STRUCT</codeph> type in table format, with the fields
mapped to column names.
</p>
</li>
</ul>
<codeblock rev="2.3.0"><![CDATA[
-- #1: The overall layout of the entire table.
describe region;
+-------------+-------------------------+---------+
| name | type | comment |
+-------------+-------------------------+---------+
| r_regionkey | smallint | |
| r_name | string | |
| r_comment | string | |
| r_nations | array<struct< | |
| | n_nationkey:smallint, | |
| | n_name:string, | |
| | n_comment:string | |
| | >> | |
+-------------+-------------------------+---------+
-- #2: The ARRAY column within the table.
describe region.r_nations;
+------+-------------------------+---------+
| name | type | comment |
+------+-------------------------+---------+
| item | struct< | |
| | n_nationkey:smallint, | |
| | n_name:string, | |
| | n_comment:string | |
| | > | |
| pos | bigint | |
+------+-------------------------+---------+
-- #3: The STRUCT that makes up each ARRAY element.
-- The fields of the STRUCT act like columns of a table.
describe region.r_nations.item;
+-------------+----------+---------+
| name | type | comment |
+-------------+----------+---------+
| n_nationkey | smallint | |
| n_name | string | |
| n_comment | string | |
+-------------+----------+---------+
]]>
</codeblock>
<p>
The <codeph>CUSTOMER</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>
elements, where one field in the <codeph>STRUCT</codeph> is another <codeph>ARRAY</codeph> of
<codeph>STRUCT</codeph> elements:
</p>
<ul>
<li>
<p>
Again, the initial <codeph>DESCRIBE</codeph> specifies only the table name.
</p>
</li>
<li>
<p>
The second <codeph>DESCRIBE</codeph> specifies the qualified name of the complex
column, <codeph>CUSTOMER.C_ORDERS</codeph>, showing how an <codeph>ARRAY</codeph>
is represented as a two-column table with columns <codeph>ITEM</codeph> and <codeph>POS</codeph>.
</p>
</li>
<li>
<p>
The third <codeph>DESCRIBE</codeph> specifies the qualified name of the <codeph>ITEM</codeph>
of the <codeph>ARRAY</codeph> column, to see the structure of the nested <codeph>ARRAY</codeph>.
          Again, it has two parts, <codeph>ITEM</codeph> and <codeph>POS</codeph>. Because the
<codeph>ARRAY</codeph> contains a <codeph>STRUCT</codeph>, the layout of the <codeph>STRUCT</codeph>
is shown.
</p>
</li>
<li>
<p>
The fourth and fifth <codeph>DESCRIBE</codeph> statements drill down into a <codeph>STRUCT</codeph> field that
is itself a complex type, an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>.
The <codeph>ITEM</codeph> portion of the qualified name is only required when the <codeph>ARRAY</codeph>
elements are anonymous. The fields of the <codeph>STRUCT</codeph> give names to any other complex types
nested inside the <codeph>STRUCT</codeph>. Therefore, the <codeph>DESCRIBE</codeph> parameters
<codeph>CUSTOMER.C_ORDERS.ITEM.O_LINEITEMS</codeph> and <codeph>CUSTOMER.C_ORDERS.O_LINEITEMS</codeph>
are equivalent. (For brevity, leave out the <codeph>ITEM</codeph> portion of
a qualified name when it is not required.)
</p>
</li>
<li>
<p>
The final <codeph>DESCRIBE</codeph> shows the layout of the deeply nested <codeph>STRUCT</codeph> type.
Because there are no more complex types nested inside this <codeph>STRUCT</codeph>, this is as far
as you can drill down into the layout for this table.
</p>
</li>
</ul>
<codeblock rev="2.3.0"><![CDATA[-- #1: The overall layout of the entire table.
describe customer;
+--------------+------------------------------------+
| name | type |
+--------------+------------------------------------+
| c_custkey | bigint |
... more scalar columns ...
| c_orders | array<struct< |
| | o_orderkey:bigint, |
| | o_orderstatus:string, |
| | o_totalprice:decimal(12,2), |
| | o_orderdate:string, |
| | o_orderpriority:string, |
| | o_clerk:string, |
| | o_shippriority:int, |
| | o_comment:string, |
| | o_lineitems:array<struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
| | l_linenumber:int, |
| | l_quantity:decimal(12,2), |
| | l_extendedprice:decimal(12,2), |
| | l_discount:decimal(12,2), |
| | l_tax:decimal(12,2), |
| | l_returnflag:string, |
| | l_linestatus:string, |
| | l_shipdate:string, |
| | l_commitdate:string, |
| | l_receiptdate:string, |
| | l_shipinstruct:string, |
| | l_shipmode:string, |
| | l_comment:string |
| | >> |
| | >> |
+--------------+------------------------------------+
-- #2: The ARRAY column within the table.
describe customer.c_orders;
+------+------------------------------------+
| name | type |
+------+------------------------------------+
| item | struct< |
| | o_orderkey:bigint, |
| | o_orderstatus:string, |
... more struct fields ...
| | o_lineitems:array<struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
... more nested struct fields ...
| | l_comment:string |
| | >> |
| | > |
| pos | bigint |
+------+------------------------------------+
-- #3: The STRUCT that makes up each ARRAY element.
-- The fields of the STRUCT act like columns of a table.
describe customer.c_orders.item;
+-----------------+----------------------------------+
| name | type |
+-----------------+----------------------------------+
| o_orderkey | bigint |
| o_orderstatus | string |
| o_totalprice | decimal(12,2) |
| o_orderdate | string |
| o_orderpriority | string |
| o_clerk | string |
| o_shippriority | int |
| o_comment | string |
| o_lineitems | array<struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
... more struct fields ...
| | l_comment:string |
| | >> |
+-----------------+----------------------------------+
-- #4: The ARRAY nested inside the STRUCT elements of the first ARRAY.
describe customer.c_orders.item.o_lineitems;
+------+----------------------------------+
| name | type |
+------+----------------------------------+
| item | struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
... more struct fields ...
| | l_comment:string |
| | > |
| pos | bigint |
+------+----------------------------------+
-- #5: Shorter form of the previous DESCRIBE. Omits the .ITEM portion of the name
-- because O_LINEITEMS and other field names provide a way to refer to things
-- inside the ARRAY element.
describe customer.c_orders.o_lineitems;
+------+----------------------------------+
| name | type |
+------+----------------------------------+
| item | struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
... more struct fields ...
| | l_comment:string |
| | > |
| pos | bigint |
+------+----------------------------------+
-- #6: The STRUCT representing ARRAY elements nested inside
-- another ARRAY of STRUCTs. The lack of any complex types
-- in this output means this is as far as DESCRIBE can
-- descend into the table layout.
describe customer.c_orders.o_lineitems.item;
+-----------------+---------------+
| name | type |
+-----------------+---------------+
| l_partkey | bigint |
| l_suppkey | bigint |
... more scalar columns ...
| l_comment | string |
+-----------------+---------------+
]]>
</codeblock>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
After the <cmdname>impalad</cmdname> daemons are restarted, the first query against a table can take longer
than subsequent queries, because the metadata for the table is loaded before the query is processed. This
one-time delay for each table can cause misleading results in benchmark tests or cause unnecessary concern.
To <q>warm up</q> the Impala metadata cache, you can issue a <codeph>DESCRIBE</codeph> statement in advance
for each table you intend to access later.
</p>
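    <p>
      For example, a hypothetical warm-up script might run statements such as the following after
      a restart (the table names are illustrative):
    </p>
<codeblock>describe my_database.sales_data;
describe my_database.customer_dim;
</codeblock>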
<p>
When you are dealing with data files stored in HDFS, sometimes it is important to know details such as the
path of the data files for an Impala table, and the hostname for the namenode. You can get this information
from the <codeph>DESCRIBE FORMATTED</codeph> output. You specify HDFS URIs or path specifications with
statements such as <codeph>LOAD DATA</codeph> and the <codeph>LOCATION</codeph> clause of <codeph>CREATE
TABLE</codeph> or <codeph>ALTER TABLE</codeph>. You might also use HDFS URIs or paths with Linux commands
such as <cmdname>hadoop</cmdname> and <cmdname>hdfs</cmdname> to copy, rename, and so on, data files in HDFS.
</p>
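    <p>
      For example, a hypothetical workflow might look up the table location and then examine the
      data files from a Linux shell (the table name and path here are assumptions):
    </p>
<codeblock>-- Find the Location value in the output.
describe formatted my_database.sales_data;
-- Then, in a Linux shell, list the data files at that location:
-- hdfs dfs -ls /user/hive/warehouse/my_database.db/sales_data
</codeblock>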
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
<p rev="1.2.1">
Each table can also have associated table statistics and column statistics. To see these categories of
information, use the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and <codeph>SHOW COLUMN
STATS <varname>table_name</varname></codeph> statements.
<!--
For example, the table statistics can often show you the number
and total size of the files in the table, even if you have not
run <codeph>COMPUTE STATS</codeph>.
-->
See <xref href="impala_show.xml#show"/> for details.
</p>
<note conref="../shared/impala_common.xml#common/compute_stats_next"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following example shows the results of both a standard <codeph>DESCRIBE</codeph> and <codeph>DESCRIBE
FORMATTED</codeph> for different kinds of schema objects:
</p>
<ul>
<li>
<codeph>DESCRIBE</codeph> for a table or a view returns the name, type, and comment for each of the
columns. For a view, if the column value is computed by an expression, the column name is automatically
generated as <codeph>_c0</codeph>, <codeph>_c1</codeph>, and so on depending on the ordinal number of the
column.
</li>
<li>
A table created with no special format or storage clauses is designated as a <codeph>MANAGED_TABLE</codeph>
(an <q>internal table</q> in Impala terminology). Its data files are stored in an HDFS directory under the
default Hive data directory. By default, it uses Text data format.
</li>
<li>
A view is designated as <codeph>VIRTUAL_VIEW</codeph> in <codeph>DESCRIBE FORMATTED</codeph> output. Some
of its properties are <codeph>NULL</codeph> or blank because they are inherited from the base table. The
text of the query that defines the view is part of the <codeph>DESCRIBE FORMATTED</codeph> output.
</li>
<li>
A table with additional clauses in the <codeph>CREATE TABLE</codeph> statement has differences in
<codeph>DESCRIBE FORMATTED</codeph> output. The output for <codeph>T2</codeph> includes the
<codeph>EXTERNAL_TABLE</codeph> keyword because of the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, and
different <codeph>InputFormat</codeph> and <codeph>OutputFormat</codeph> fields to reflect the Parquet file
format.
</li>
</ul>
<codeblock>[localhost:21000] &gt; create table t1 (x int, y int, s string);
Query: create table t1 (x int, y int, s string)
[localhost:21000] &gt; describe t1;
Query: describe t1
Query finished, fetching results ...
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| x | int | |
| y | int | |
| s | string | |
+------+--------+---------+
Returned 3 row(s) in 0.13s
[localhost:21000] &gt; describe formatted t1;
Query: describe formatted t1
Query finished, fetching results ...
+------------------------------+--------------------------------------------+------------+
| name | type | comment |
+------------------------------+--------------------------------------------+------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| x | int | None |
| y | int | None |
| s | string | None |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | describe_formatted | NULL |
| Owner: | cloudera | NULL |
| CreateTime: | Mon Jul 22 17:03:16 EDT 2013 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://127.0.0.1:8020/user/hive/warehouse/ | |
| | describe_formatted.db/t1 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1374526996 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy. | |
| | LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io. | |
| | HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+--------------------------------------------+------------+
Returned 26 row(s) in 0.03s
[localhost:21000] &gt; create view v1 as select x, upper(s) from t1;
Query: create view v1 as select x, upper(s) from t1
[localhost:21000] &gt; describe v1;
Query: describe v1
Query finished, fetching results ...
+------+--------+---------+
| name | type | comment |
+------+--------+---------+
| x | int | |
| _c1 | string | |
+------+--------+---------+
Returned 2 row(s) in 0.10s
[localhost:21000] &gt; describe formatted v1;
Query: describe formatted v1
Query finished, fetching results ...
+------------------------------+------------------------------+----------------------+
| name | type | comment |
+------------------------------+------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| x | int | None |
| _c1 | string | None |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | describe_formatted | NULL |
| Owner: | cloudera | NULL |
| CreateTime: | Mon Jul 22 16:56:38 EDT 2013 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Table Type: | VIRTUAL_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1374526598 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | null | NULL |
| InputFormat: | null | NULL |
| OutputFormat: | null | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| | NULL | NULL |
| # View Information | NULL | NULL |
| View Original Text: | SELECT x, upper(s) FROM t1 | NULL |
| View Expanded Text: | SELECT x, upper(s) FROM t1 | NULL |
+------------------------------+------------------------------+----------------------+
Returned 28 row(s) in 0.03s
[localhost:21000] &gt; create external table t2 (x int, y int, s string) stored as parquet location '/user/cloudera/sample_data';
[localhost:21000] &gt; describe formatted t2;
Query: describe formatted t2
Query finished, fetching results ...
+------------------------------+----------------------------------------------------+------------+
| name | type | comment |
+------------------------------+----------------------------------------------------+------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| x | int | None |
| y | int | None |
| s | string | None |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | describe_formatted | NULL |
| Owner: | cloudera | NULL |
| CreateTime: | Mon Jul 22 17:01:47 EDT 2013 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://127.0.0.1:8020/user/cloudera/sample_data | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | transient_lastDdlTime | 1374526907 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | com.cloudera.impala.hive.serde.ParquetInputFormat | NULL |
| OutputFormat: | com.cloudera.impala.hive.serde.ParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+----------------------------------------------------+------------+
Returned 27 row(s) in 0.17s</codeblock>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have read and execute
permissions for all directories that are part of the table.
(A table could span multiple different HDFS directories if it is partitioned.
The directories could be widely scattered because a partition can reside
in an arbitrary HDFS directory based on its <codeph>LOCATION</codeph> attribute.)
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_tables.xml#tables"/>, <xref href="impala_create_table.xml#create_table"/>,
<xref href="impala_show.xml#show_tables"/>, <xref href="impala_show.xml#show_create_table"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,231 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_dev">
<title>Developing Impala Applications</title>
<titlealts audience="PDF"><navtitle>Developing Applications</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
The core development language with Impala is SQL. You can also use Java or other languages to interact with
Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For
specialized kinds of analysis, you can supplement the SQL built-in functions by writing
<xref href="impala_udf.xml#udfs">user-defined functions (UDFs)</xref> in C++ or Java.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="intro_sql">
<title>Overview of the Impala SQL Dialect</title>
<prolog>
<metadata>
<data name="Category" value="SQL"/>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As
such, it feels familiar to users who already run SQL queries on the Hadoop
infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in
functions. Impala also includes additional built-in functions for common industry features, to simplify
porting SQL from non-Hadoop systems.
</p>
<p>
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect
might seem familiar:
</p>
<ul>
<li>
<p>
The <xref href="impala_select.xml#select">SELECT statement</xref> includes familiar clauses such as <codeph>WHERE</codeph>,
<codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, and <codeph>WITH</codeph>.
You will find familiar notions such as
<xref href="impala_joins.xml#joins">joins</xref>, <xref href="impala_functions.xml#builtins">built-in
functions</xref> for processing strings, numbers, and dates,
<xref href="impala_aggregate_functions.xml#aggregate_functions">aggregate functions</xref>,
<xref href="impala_subqueries.xml#subqueries">subqueries</xref>, and
<xref href="impala_operators.xml#comparison_operators">comparison operators</xref>
such as <codeph>IN()</codeph> and <codeph>BETWEEN</codeph>.
The <codeph>SELECT</codeph> statement is the place where SQL standards compliance is most important.
</p>
</li>
<li>
<p>
From the data warehousing world, you will recognize the notion of
<xref href="impala_partitioning.xml#partitioning">partitioned tables</xref>.
One or more columns serve as partition keys, and the data is physically arranged so that
queries that refer to the partition key columns in the <codeph>WHERE</codeph> clause
can skip partitions that do not match the filter conditions. For example, if you have 10
years' worth of data and use a clause such as <codeph>WHERE year = 2015</codeph>,
<codeph>WHERE year &gt; 2010</codeph>, or <codeph>WHERE year IN (2014, 2015)</codeph>,
Impala skips all the data for non-matching years, greatly reducing the amount of I/O
for the query. (A short sketch of this technique follows this list.)
</p>
</li>
<li rev="1.2">
<p>
In Impala 1.2 and higher, <xref href="impala_udf.xml#udfs">UDFs</xref> let you perform custom comparisons
and transformation logic during <codeph>SELECT</codeph> and <codeph>INSERT...SELECT</codeph> statements.
</p>
</li>
</ul>
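    <p>
      The following sketch illustrates partition pruning with a hypothetical table and query; only the
      partitions that match the <codeph>WHERE</codeph> clause are scanned:
    </p>
<codeblock>create table sales (id bigint, amount double) partitioned by (year int);
-- Impala only reads the partition directories for 2014 and 2015,
-- skipping the data files for all other years.
select sum(amount) from sales where year in (2014, 2015);</codeblock>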
<p>
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect
might require some learning and practice for you to become proficient in the Hadoop environment:
</p>
<ul>
<li>
<p>
Impala SQL is focused on queries and includes relatively little DML. There is no <codeph>UPDATE</codeph>
or <codeph>DELETE</codeph> statement. Stale data is typically discarded (by <codeph>DROP TABLE</codeph>
or <codeph>ALTER TABLE ... DROP PARTITION</codeph> statements) or replaced (by <codeph>INSERT
OVERWRITE</codeph> statements).
</p>
</li>
<li>
<p>
All data creation is done by <codeph>INSERT</codeph> statements, which typically insert data in bulk by
querying from other tables. There are two variations, <codeph>INSERT INTO</codeph> which appends to the
existing data, and <codeph>INSERT OVERWRITE</codeph> which replaces the entire contents of a table or
partition (similar to <codeph>TRUNCATE TABLE</codeph> followed by a new <codeph>INSERT</codeph>).
Although there is an <codeph>INSERT ... VALUES</codeph> syntax to create a small number of values in
a single statement, it is far more efficient to use <codeph>INSERT ... SELECT</codeph> to copy
and transform large amounts of data from one table to another in a single operation.
</p>
</li>
<li>
<p>
You often construct Impala table definitions and data files in some other environment, and then attach
Impala so that it can run real-time queries. The same data files and table metadata are shared with other
components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data
inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components
can write files in formats such as Parquet and Avro, which can then be queried by Impala.
</p>
</li>
<li>
<p>
Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL
includes some idioms that you might find in the import utilities for traditional database systems. For
example, you can create a table that reads comma-separated or tab-separated text files, specifying the
separator in the <codeph>CREATE TABLE</codeph> statement. You can create <b>external tables</b> that read
existing data files but do not move or transform them. (See the sketch after this list.)
</p>
</li>
<li>
<p>
Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does
not require length constraints on string data types. For example, you can define a database column as
<codeph>STRING</codeph> with unlimited length, rather than <codeph>CHAR(1)</codeph> or
<codeph>VARCHAR(64)</codeph>. <ph rev="2.0.0">(Although in Impala 2.0 and later, you can also use
length-constrained <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types.)</ph>
</p>
</li>
</ul>
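    <p>
      As a sketch (all table names and paths here are hypothetical), the following statements create an
      external table over existing comma-separated text files, then copy the data in bulk into a Parquet
      table:
    </p>
<codeblock>create external table raw_events (event_time string, user_id bigint, detail string)
  row format delimited fields terminated by ','
  location '/user/impala/staging/raw_events';

-- Copy and convert the data in one bulk operation.
create table events stored as parquet as select * from raw_events;

-- Replace the contents of the Parquet table with a fresh copy of the raw data.
insert overwrite events select * from raw_events;</codeblock>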
<p>
<b>Related information:</b> <xref href="impala_langref.xml#langref"/>, especially
<xref href="impala_langref_sql.xml#langref_sql"/> and <xref href="impala_functions.xml#builtins"/>
</p>
</conbody>
</concept>
<!-- Bunch of potential concept topics for future consideration. Major areas of Impala modelled on areas of discussion for Oracle Database, and distributed databases in general. -->
<concept id="intro_datatypes" audience="Cloudera">
<title>Overview of Impala SQL Data Types</title>
<conbody/>
</concept>
<concept id="intro_network" audience="Cloudera">
<title>Overview of Impala Network Topology</title>
<conbody/>
</concept>
<concept id="intro_cluster" audience="Cloudera">
<title>Overview of Impala Cluster Topology</title>
<conbody/>
</concept>
<concept id="intro_apis">
<title>Overview of Impala Programming Interfaces</title>
<prolog>
<metadata>
<data name="Category" value="JDBC"/>
<data name="Category" value="ODBC"/>
<data name="Category" value="Hue"/>
</metadata>
</prolog>
<conbody>
<p>
You can connect and submit requests to the Impala daemons through:
</p>
<ul>
<li>
The <codeph><xref href="impala_impala_shell.xml#impala_shell">impala-shell</xref></codeph> interactive
command interpreter.
</li>
<li>
The <xref href="http://gethue.com/" scope="external" format="html">Hue</xref> web-based user interface.
</li>
<li>
<xref href="impala_jdbc.xml#impala_jdbc">JDBC</xref>.
</li>
<li>
<xref href="impala_odbc.xml#impala_odbc">ODBC</xref>.
</li>
</ul>
<p>
With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications
running on non-Linux platforms. You can also use Impala in combination with various Business Intelligence
tools that use the JDBC and ODBC interfaces.
</p>
<p>
Each <codeph>impalad</codeph> daemon process, running on separate nodes in a cluster, listens to
<xref href="impala_ports.xml#ports">several ports</xref> for incoming requests. Requests from
<codeph>impala-shell</codeph> and Hue are routed to the <codeph>impalad</codeph> daemons through the same
port. The <codeph>impalad</codeph> daemons listen on separate ports for JDBC and ODBC requests.
</p>
</conbody>
</concept>
</concept>


@@ -0,0 +1,36 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_cached_reads" rev="1.4.0">
<title>DISABLE_CACHED_READS Query Option</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="HDFS"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DISABLE_CACHED_READS query option</indexterm>
Prevents Impala from reading data files that are <q>pinned</q> in memory
through the HDFS caching feature. Primarily a debugging option for
cases where processing of HDFS cached data is concentrated on a single
host, leading to excessive CPU usage on that host.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false"/>
<p conref="../shared/impala_common.xml#common/added_in_140"/>
</conbody>
</concept>


@@ -0,0 +1,38 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_codegen">
<title>DISABLE_CODEGEN Query Option</title>
<titlealts audience="PDF"><navtitle>DISABLE_CODEGEN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Performance"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DISABLE_CODEGEN query option</indexterm>
This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query
fails with an <q>illegal instruction</q> or other hardware-specific message, try setting
<codeph>DISABLE_CODEGEN=true</codeph> and running the query again. If the query succeeds only when the
<codeph>DISABLE_CODEGEN</codeph> option is turned on, submit the problem to <keyword keyref="support_org"/> and include that
detail in the problem report. Do not otherwise run with this setting turned on, because it results in lower
overall performance.
</p>
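    <p>
      For example, in <cmdname>impala-shell</cmdname> you might work around such a crash as follows
      (the query itself is just a placeholder):
    </p>
<codeblock>set disable_codegen=true;
-- Re-run the query that previously failed with an "illegal instruction" error.
select count(*) from t1 join t2 using (id);
set disable_codegen=false;</codeblock>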
<p>
Because the code generation phase adds a small amount of overhead for each query, you might turn on the
<codeph>DISABLE_CODEGEN</codeph> option to achieve maximum throughput when running many short-lived queries
against small tables.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
</conbody>
</concept>


@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_outermost_topn" rev="2.5.0">
<title>DISABLE_OUTERMOST_TOPN Query Option</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.5.0">
<indexterm audience="Cloudera">DISABLE_OUTERMOST_TOPN query option</indexterm>
</p>
<p>
<b>Type:</b>
</p>
<p>
<b>Default:</b>
</p>
</conbody>
</concept>


@@ -0,0 +1,65 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_row_runtime_filtering" rev="2.5.0">
<title>DISABLE_ROW_RUNTIME_FILTERING Query Option (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DISABLE_ROW_RUNTIME_FILTERING</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.5.0">
<indexterm audience="Cloudera">DISABLE_ROW_RUNTIME_FILTERING query option</indexterm>
The <codeph>DISABLE_ROW_RUNTIME_FILTERING</codeph> query option
reduces the scope of the runtime filtering feature. Queries still dynamically prune
partitions, but do not apply the filtering logic to individual rows within partitions.
</p>
<p>
This option only applies to queries against Parquet tables. For other file formats, Impala
only prunes at the level of partitions, not individual rows.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false"/>
<p conref="../shared/impala_common.xml#common/added_in_250"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Impala automatically evaluates whether the per-row filters are
effective at reducing the amount of intermediate data. Therefore,
this option is typically only needed for the rare case where Impala
cannot accurately determine how effective the per-row filtering is
for a query.
</p>
<p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
<p>
Because this setting only improves query performance in very specific
circumstances, depending on the query characteristics and data distribution,
only use it when you determine through benchmarking that it improves
performance of specific expensive queries.
Consider setting this query option immediately before the expensive query and
unsetting it immediately afterward.
</p>
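    <p>
      For example, a session might bracket a single expensive query with this option
      (the table and column names are hypothetical):
    </p>
<codeblock>set disable_row_runtime_filtering=true;
select count(*) from big_facts f join dims d on f.dim_id = d.id where d.region = 'EU';
set disable_row_runtime_filtering=false;</codeblock>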
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/>
<!-- , <xref href="impala_partitioning.xml#dynamic_partition_pruning"/> -->
</p>
</conbody>
</concept>


@@ -0,0 +1,45 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disable_streaming_preaggregations" rev="2.5.0 IMPALA-1305">
<title>DISABLE_STREAMING_PREAGGREGATIONS Query Option (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DISABLE_STREAMING_PREAGGREGATIONS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.5.0 IMPALA-1305">
<indexterm audience="Cloudera">DISABLE_STREAMING_PREAGGREGATIONS query option</indexterm>
Turns off the <q>streaming preaggregation</q> optimization that is available in <keyword keyref="impala25_full"/>
and higher. This optimization reduces unnecessary work performed by queries that perform aggregation
operations on columns with few or no duplicate values, for example <codeph>DISTINCT <varname>id_column</varname></codeph>
or <codeph>GROUP BY <varname>unique_column</varname></codeph>. If the optimization causes regressions in
existing queries that use aggregation functions, you can turn it off as needed by setting this query option.
</p>
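    <p rev="2.5.0 IMPALA-1305">
      For example (with hypothetical table and column names):
    </p>
<codeblock>set disable_streaming_preaggregations=1;
select count(distinct visitor_id) from clickstream;</codeblock>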
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
<note conref="../shared/impala_common.xml#common/one_but_not_true"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Typically, queries that would require enabling this option involve very large numbers of
aggregated values, such as a billion or more distinct keys being processed on each
worker node.
</p>
<p conref="../shared/impala_common.xml#common/added_in_250"/>
</conbody>
</concept>


@@ -0,0 +1,53 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="disable_unsafe_spills">
<title>DISABLE_UNSAFE_SPILLS Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DISABLE_UNSAFE_SPILLS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Memory"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.0.0">
<indexterm audience="Cloudera">DISABLE_UNSAFE_SPILLS query option</indexterm>
Enable this option if you prefer to have queries fail when they exceed the Impala memory limit, rather than
write temporary data to disk.
</p>
<p>
Queries that <q>spill</q> to disk typically complete successfully, whereas in earlier Impala releases they would have failed.
However, queries with exorbitant memory requirements due to missing statistics or inefficient join clauses could
become so slow as a result that you would rather have them cancelled automatically and reduce the memory
usage through standard Impala tuning techniques.
</p>
<p>
This option prevents only <q>unsafe</q> spill operations, meaning that one or more tables are missing
statistics or the query does not include a hint to set the most efficient mechanism for a join or
<codeph>INSERT ... SELECT</codeph> into a partitioned table. These are the tables most likely to result in
suboptimal execution plans that could cause unnecessary spilling. Therefore, leaving this option enabled is a
good way to find tables on which to run the <codeph>COMPUTE STATS</codeph> statement.
</p>
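    <p>
      A minimal sketch of that workflow, with hypothetical table names:
    </p>
<codeblock>set disable_unsafe_spills=true;
-- If this query is cancelled because statistics are missing,
-- compute stats on the tables involved and try again.
select c.region, sum(o.total) from orders o join customers c on o.cust_id = c.id group by c.region;
compute stats orders;
compute stats customers;</codeblock>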
<p>
See <xref href="impala_scalability.xml#spill_to_disk"/> for information about the <q>spill to disk</q>
feature for queries processing large result sets with joins, <codeph>ORDER BY</codeph>, <codeph>GROUP
BY</codeph>, <codeph>DISTINCT</codeph>, aggregation functions, or analytic functions.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
<p conref="../shared/impala_common.xml#common/added_in_20"/>
</conbody>
</concept>


@@ -0,0 +1,129 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disk_space">
<title>Managing Disk Space for Impala Data</title>
<titlealts audience="PDF"><navtitle>Managing Disk Space</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Disk Storage"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Compression"/>
</metadata>
</prolog>
<conbody>
<p>
Although Impala typically works with many large files in an HDFS storage system with plenty of capacity,
there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques
to minimize space consumption and file duplication.
</p>
<ul>
<li>
<p>
Use compact binary file formats where practical. Numeric and time-based data in particular can be stored
in more compact form in binary data files. Depending on the file format, various compression and encoding
features can reduce file size even further. You can specify the <codeph>STORED AS</codeph> clause as part
of the <codeph>CREATE TABLE</codeph> statement, or <codeph>ALTER TABLE</codeph> with the <codeph>SET
FILEFORMAT</codeph> clause for an existing table or partition within a partitioned table. See
<xref href="impala_file_formats.xml#file_formats"/> for details about file formats, especially
<xref href="impala_parquet.xml#parquet"/>. See <xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/> for syntax details.
</p>
</li>
<li>
<p>
You manage underlying data files differently depending on whether the corresponding Impala table is
defined as an <xref href="impala_tables.xml#internal_tables">internal</xref> or
<xref href="impala_tables.xml#external_tables">external</xref> table:
</p>
<ul>
<li>
Use the <codeph>DESCRIBE FORMATTED</codeph> statement to check if a particular table is internal
(managed by Impala) or external, and to see the physical location of the data files in HDFS. See
<xref href="impala_describe.xml#describe"/> for details.
</li>
<li>
For Impala-managed (<q>internal</q>) tables, use <codeph>DROP TABLE</codeph> statements to remove
data files. See <xref href="impala_drop_table.xml#drop_table"/> for details.
</li>
<li>
For tables not managed by Impala (<q>external</q> tables), use appropriate HDFS-related commands such
as <codeph>hadoop fs</codeph>, <codeph>hdfs dfs</codeph>, or <codeph>distcp</codeph>, to create, move,
copy, or delete files within HDFS directories that are accessible by the <codeph>impala</codeph> user.
Issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement after adding or removing any
files from the data directory of an external table (see the sketch after this list). See
<xref href="impala_refresh.xml#refresh"/> for details.
</li>
<li>
Use external tables to reference HDFS data files in their original location. With this technique, you
avoid copying the files, and you can map more than one Impala table to the same set of data files. When
you drop the Impala table, the data files are left undisturbed. See
<xref href="impala_tables.xml#external_tables"/> for details.
</li>
<li>
Use the <codeph>LOAD DATA</codeph> statement to move HDFS files into the data directory for an Impala
table from inside Impala, without the need to specify the HDFS path of the destination directory. This
technique works for both internal and external tables. See
<xref href="impala_load_data.xml#load_data"/> for details.
</li>
</ul>
</li>
<li>
<p>
Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space
might not be reclaimed for use by other files until sometime later, when the trashcan is emptied. See
<xref href="impala_drop_table.xml#drop_table"/> and the FAQ entry
<xref href="impala_faq.xml#faq_sql/faq_drop_table_space"/> for details. See
<xref href="impala_prereqs.xml#prereqs_account"/> for permissions needed for the HDFS trashcan to operate
correctly.
</p>
</li>
<li>
<p>
Drop all tables in a database before dropping the database itself. See
<xref href="impala_drop_database.xml#drop_database"/> for details.
</p>
</li>
<li>
<p>
Clean up temporary files after failed <codeph>INSERT</codeph> statements. If an <codeph>INSERT</codeph>
statement encounters an error, and you see a directory named <filepath>.impala_insert_staging</filepath>
or <filepath>_impala_insert_staging</filepath> left behind in the data directory for the table, it might
contain temporary data files taking up space in HDFS. You might be able to salvage these data files, for
example if they are complete but could not be moved into place due to a permission error. Or, you might
delete those files through commands such as <codeph>hadoop fs</codeph> or <codeph>hdfs dfs</codeph>, to
reclaim space before re-trying the <codeph>INSERT</codeph>. Issue <codeph>DESCRIBE FORMATTED
<varname>table_name</varname></codeph> to see the HDFS path where you can check for temporary files.
</p>
</li>
<li rev="1.4.0">
<p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
</li>
<li rev="2.2.0">
<p>
If you use the Amazon Simple Storage Service (S3) as a place to offload
data to reduce the volume of local storage, Impala 2.2.0 and higher
can query the data directly from S3.
See <xref href="impala_s3.xml#s3"/> for details.
</p>
</li>
</ul>
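    <p>
      The following sketch (with a hypothetical table name and paths) shows the typical sequence for an
      external table: manipulate the data files with HDFS commands, then make Impala aware of the change:
    </p>
<codeblock>-- From the Linux command line, add or remove data files, for example:
--   hdfs dfs -put new_batch.csv /user/impala/external/web_logs/
-- Then, from impala-shell, refresh the table metadata:
refresh web_logs;</codeblock>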
</conbody>
</concept>


@@ -0,0 +1,61 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="distinct">
<title>DISTINCT Operator</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Aggregate Functions"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DISTINCT operator</indexterm>
The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
remove duplicates:
</p>
<codeblock>-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;</codeblock>
<p>
You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
<codeph>COUNT()</codeph>, to find how many different values a column contains:
</p>
<codeblock>-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
select count(distinct c_salutation, c_last_name) from customer;</codeblock>
<p>
One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
aggregation function in the same query. For example, you could not have a single query with both
<codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
<codeph>SELECT</codeph> list.
</p>
<p conref="../shared/impala_common.xml#common/zero_length_strings"/>
<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
<note>
<p>
In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
</p>
</note>
</conbody>
</concept>


@@ -0,0 +1,91 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="dml">
<title>DML Statements</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DML"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Tables"/>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
</metadata>
</prolog>
<conbody>
<p>
DML refers to <q>Data Manipulation Language</q>, a subset of SQL statements that modify the data stored in
tables. Because Impala focuses on query performance and leverages the append-only nature of HDFS storage,
currently Impala only supports a small set of DML statements:
</p>
<ul>
<li>
<xref keyref="delete"/>. Works for Kudu tables only.
</li>
<li>
<xref keyref="insert"/>.
</li>
<li>
<xref keyref="load_data"/>. Does not apply for HBase or Kudu tables.
</li>
<li>
<xref keyref="update"/>. Works for Kudu tables only.
</li>
<li>
<xref keyref="upsert"/>. Works for Kudu tables only.
</li>
</ul>
<p>
<codeph>INSERT</codeph> in Impala is primarily optimized for inserting large volumes of data in a single
statement, to make effective use of the multi-megabyte HDFS blocks. This is how Impala creates new
data files. If you intend to insert one or a few rows at a time, such as with the <codeph>INSERT ...
VALUES</codeph> syntax, that technique is much more efficient for Impala tables stored in HBase. See
<xref href="impala_hbase.xml#impala_hbase"/> for details.
</p>
<p>
<codeph>LOAD DATA</codeph> moves existing data files into the directory for an Impala table, making them
immediately available for Impala queries. This is one way in Impala to work with data files produced by other
Hadoop components. (<codeph>CREATE EXTERNAL TABLE</codeph> is the other alternative; with external tables,
you can query existing data files, while the files remain in their original location.)
</p>
<p>
In <keyword keyref="impala28_full"/> and higher, Impala does support the <codeph>UPDATE</codeph>, <codeph>DELETE</codeph>,
and <codeph>UPSERT</codeph> statements for Kudu tables.
For HDFS or S3 tables, to simulate the effects of an <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> statement
in other database systems, typically you use <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> to copy data
from one table to another, filtering out or changing the appropriate rows during the copy operation.
</p>
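    <p>
      For example, the following sketch (hypothetical table and column names) simulates a
      <codeph>DELETE</codeph> by copying only the rows to keep into a new table, then swapping the tables:
    </p>
<codeblock>-- Keep only the rows that would survive a DELETE ... WHERE status = 'obsolete'.
create table t1_new stored as parquet as
  select * from t1 where status != 'obsolete';

-- Swap the new table into place.
alter table t1 rename to t1_old;
alter table t1_new rename to t1;
drop table t1_old;</codeblock>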
<p>
You can also achieve a result similar to <codeph>UPDATE</codeph> by using Impala tables stored in HBase.
When you insert a row into an HBase table, and the table
already contains a row with the same value for the key column, the older row is hidden, effectively the same
as a single-row <codeph>UPDATE</codeph>.
</p>
<p rev="2.6.0">
Impala can perform DML operations for tables or partitions stored in the Amazon S3 filesystem
with <keyword keyref="impala26_full"/> and higher. See <xref href="impala_s3.xml#s3"/> for details.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
The other major classifications of SQL statements are data definition language (see
<xref href="impala_ddl.xml#ddl"/>) and queries (see <xref href="impala_select.xml#select"/>).
</p>
</conbody>
</concept>


@@ -0,0 +1,100 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="double">
<title>DOUBLE Data Type</title>
<titlealts audience="PDF"><navtitle>DOUBLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
A double precision floating-point data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER
TABLE</codeph> statements.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock><varname>column_name</varname> DOUBLE</codeblock>
<p>
<b>Range:</b> 4.94065645841246544e-324d .. 1.79769313486231570e+308, positive or negative
</p>
<p>
<b>Precision:</b> 15 to 17 significant digits, depending on usage. The number of significant digits does
not depend on the position of the decimal point.
</p>
<p>
<b>Representation:</b> The values are stored in 8 bytes, using
<xref href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format" scope="external" format="html">IEEE 754 Double Precision Binary Floating Point</xref> format.
</p>
<p>
<b>Conversions:</b> Impala does not automatically convert <codeph>DOUBLE</codeph> to any other type. You can
use <codeph>CAST()</codeph> to convert <codeph>DOUBLE</codeph> values to <codeph>FLOAT</codeph>,
<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>, <codeph>BIGINT</codeph>,
<codeph>STRING</codeph>, <codeph>TIMESTAMP</codeph>, or <codeph>BOOLEAN</codeph>. You can use exponential
notation in <codeph>DOUBLE</codeph> literals or when casting from <codeph>STRING</codeph>, for example
<codeph>1.0e6</codeph> to represent one million.
<ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
</p>
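    <p>
      For example:
    </p>
<codeblock>-- Exponential notation in a DOUBLE literal.
select 1.0e6 * 2;
-- Explicit conversions; DOUBLE is never converted to another type implicitly.
select cast('2.5e3' as double);
select cast(pi() as string);</codeblock>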
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
The data type <codeph>REAL</codeph> is an alias for <codeph>DOUBLE</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>CREATE TABLE t1 (x DOUBLE);
SELECT CAST(1000.5 AS DOUBLE);
</codeblock>
<p conref="../shared/impala_common.xml#common/partitioning_imprecise"/>
<p conref="../shared/impala_common.xml#common/hbase_ok"/>
<p conref="../shared/impala_common.xml#common/parquet_ok"/>
<p conref="../shared/impala_common.xml#common/text_bulky"/>
<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
<p conref="../shared/impala_common.xml#common/internals_8_bytes"/>
<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<!-- This conref appears under SUM(), AVG(), FLOAT, and DOUBLE topics. -->
<p conref="../shared/impala_common.xml#common/sum_double"/>
<p conref="../shared/impala_common.xml#common/float_double_decimal_caveat"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_math_functions.xml#math_functions"/>,
<xref href="impala_float.xml#float"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept audience="Cloudera" rev="1.4.0" id="drop_data_source">
<title>DROP DATA SOURCE Statement</title>
<titlealts audience="PDF"><navtitle>DROP DATA SOURCE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DROP DATA SOURCE statement</indexterm>
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
</conbody>
</concept>


@@ -0,0 +1,130 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="drop_database">
<title>DROP DATABASE Statement</title>
<titlealts audience="PDF"><navtitle>DROP DATABASE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Databases"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DROP DATABASE statement</indexterm>
Removes a database from the system. The physical operations involve removing the metadata for the database
from the metastore, and deleting the corresponding <codeph>*.db</codeph> directory from HDFS.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>DROP (DATABASE|SCHEMA) [IF EXISTS] <varname>database_name</varname> <ph rev="2.3.0">[RESTRICT | CASCADE]</ph>;</codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
By default, the database must be empty before it can be dropped, to avoid losing any data.
</p>
<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, you can include the <codeph>CASCADE</codeph>
clause to make Impala drop all tables and other objects in the database before dropping the database itself.
The <codeph>RESTRICT</codeph> clause enforces the original requirement that the database be empty
before being dropped. Because the <codeph>RESTRICT</codeph> behavior is still the default, this
clause is optional.
</p>
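    <p rev="2.3.0">
      For example, with a hypothetical database name:
    </p>
<codeblock>-- Fails if temp_analysis still contains any tables or other objects.
drop database temp_analysis restrict;
-- Drops all tables, views, and functions in the database first, then the database itself.
drop database temp_analysis cascade;</codeblock>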
<p rev="2.3.0">
The automatic dropping resulting from the <codeph>CASCADE</codeph> clause follows the same rules as the
corresponding <codeph>DROP TABLE</codeph>, <codeph>DROP VIEW</codeph>, and <codeph>DROP FUNCTION</codeph> statements.
In particular, the HDFS directories and data files for any external tables are left behind when the
tables are removed.
</p>
<p>
When you do not use the <codeph>CASCADE</codeph> clause, drop or move all the objects inside the database manually
before dropping the database itself:
</p>
<ul>
<li>
<p>
Use the <codeph>SHOW TABLES</codeph> statement to locate all tables and views in the database,
and issue <codeph>DROP TABLE</codeph> and <codeph>DROP VIEW</codeph> statements to remove them all.
</p>
</li>
<li>
<p>
Use the <codeph>SHOW FUNCTIONS</codeph> and <codeph>SHOW AGGREGATE FUNCTIONS</codeph> statements
to locate all user-defined functions in the database, and issue <codeph>DROP FUNCTION</codeph>
and <codeph>DROP AGGREGATE FUNCTION</codeph> statements to remove them all.
</p>
</li>
<li>
<p>
To keep tables or views contained by a database while removing the database itself, use
<codeph>ALTER TABLE</codeph> and <codeph>ALTER VIEW</codeph> to move the relevant
objects to a different database before dropping the original database.
</p>
</li>
</ul>
<p>
You cannot drop the current database, that is, the database your session connected to
either through the <codeph>USE</codeph> statement or the <codeph>-d</codeph> option of <cmdname>impala-shell</cmdname>.
Issue a <codeph>USE</codeph> statement to switch to a different database first.
Because the <codeph>default</codeph> database is always available, issuing
<codeph>USE default</codeph> is a convenient way to leave the current database
before dropping it.
</p>
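    <p>
      For example (with a hypothetical database name):
    </p>
<codeblock>-- Assume the session is currently using the database to be dropped.
use default;
drop database staging_db cascade;</codeblock>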
<p conref="../shared/impala_common.xml#common/hive_blurb"/>
<p>
When you drop a database in Impala, the database can no longer be used by Hive.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<!-- Better to conref the same examples in both places. -->
<p>
See <xref href="impala_create_database.xml#create_database"/> for examples covering <codeph>CREATE
DATABASE</codeph>, <codeph>USE</codeph>, and <codeph>DROP DATABASE</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p conref="../shared/impala_common.xml#common/s3_ddl"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have write
permission for the directory associated with the database.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock conref="../shared/impala_common.xml#common/create_drop_db_example"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_databases.xml#databases"/>, <xref href="impala_create_database.xml#create_database"/>,
<xref href="impala_use.xml#use"/>, <xref href="impala_show.xml#show_databases"/>, <xref href="impala_drop_table.xml#drop_table"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,127 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="drop_function">
<title>DROP FUNCTION Statement</title>
<titlealts audience="PDF"><navtitle>DROP FUNCTION</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="UDFs"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DROP FUNCTION statement</indexterm>
Removes a user-defined function (UDF), so that it is not available for execution during Impala
<codeph>SELECT</codeph> or <codeph>INSERT</codeph> operations.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
To drop C++ UDFs and UDAs:
</p>
<codeblock>DROP [AGGREGATE] FUNCTION [IF EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>(<varname>type</varname>[, <varname>type</varname>...])</codeblock>
<note rev="2.5.0 IMPALA-2843 CDH-39148">
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The preceding syntax, which includes the function signature, also applies to Java UDFs that were created
using the corresponding <codeph>CREATE FUNCTION</codeph> syntax that includes the argument and return types.
After upgrading to <keyword keyref="impala25_full"/> or higher, consider re-creating all Java UDFs with the
<codeph>CREATE FUNCTION</codeph> syntax that does not include the function signature. Java UDFs created this
way are now persisted in the metastore database and do not need to be re-created after an Impala restart.
</p>
</note>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
To drop Java UDFs (created using the <codeph>CREATE FUNCTION</codeph> syntax with no function signature):
</p>
<codeblock rev="2.5.0">DROP FUNCTION [IF EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname></codeblock>
<!--
Examples:
CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;
-->
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
Because the same function name could be overloaded with different argument signatures, you specify the
argument types to identify the exact function to drop.
</p>
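    <p>
      For example, if a hypothetical UDF named <codeph>my_lower</codeph> is overloaded for different argument
      types, include the signature to remove one specific variant:
    </p>
<codeblock>drop function my_lower(string);
drop function if exists my_db.my_lower(string, string);</codeblock>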
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, does not need any
particular HDFS permissions to perform this statement.
All read and write operations are on the metastore database,
not HDFS files and directories.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p rev="2.5.0 IMPALA-2843 CDH-39148">
The following example shows how to drop Java functions created with the signatureless
<codeph>CREATE FUNCTION</codeph> syntax in <keyword keyref="impala25_full"/> and higher.
Issuing <codeph>DROP FUNCTION <varname>function_name</varname></codeph> removes all the
overloaded functions under that name.
(See <xref href="impala_create_function.xml#create_function"/> for a longer example
showing how to set up such functions in the first place.)
</p>
<codeblock rev="2.5.0 IMPALA-2843 CDH-39148">
create function my_func location '/user/impala/udfs/udf-examples-cdh570.jar'
symbol='com.cloudera.impala.TestUdf';
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | my_func(BIGINT) | JAVA | true |
| BOOLEAN | my_func(BOOLEAN) | JAVA | true |
| BOOLEAN | my_func(BOOLEAN, BOOLEAN) | JAVA | true |
...
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
...
drop function my_func;
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
...
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_udf.xml#udfs"/>, <xref href="impala_create_function.xml#create_function"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,71 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4.0" id="drop_role">
<title>DROP ROLE Statement (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>DROP ROLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="DDL"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Sentry"/>
<data name="Category" value="Security"/>
<data name="Category" value="Roles"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. -->
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DROP ROLE statement</indexterm>
<!-- Copied from Sentry docs. Turn into conref. I did some rewording for clarity. -->
The <codeph>DROP ROLE</codeph> statement removes a role from the metastore database. Once dropped, the role
is revoked for all users to whom it was previously assigned, and all privileges granted to that role are
revoked. Queries that are already executing are not affected. Impala verifies the role information
approximately every 60 seconds, so <codeph>DROP ROLE</codeph> might not take effect for new
Impala queries for a brief period.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>DROP ROLE <varname>role_name</varname>
</codeblock>
<p conref="../shared/impala_common.xml#common/privileges_blurb"/>
<p>
Only administrative users (initially, a predefined set of users specified in the Sentry service configuration
file) can use this statement.
</p>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<p>
Impala makes use of any roles and privileges specified by the <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements in Hive, and Hive makes use of any roles and privileges specified by the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala. The Impala <codeph>GRANT</codeph>
and <codeph>REVOKE</codeph> statements for privileges do not require the <codeph>ROLE</codeph> keyword to be
repeated before each role name, unlike the equivalent Hive statements.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>
<xref href="impala_revoke.xml#revoke"/>, <xref href="impala_create_role.xml#create_role"/>,
<xref href="impala_show.xml#show"/>
</p>
<!-- To do: nail down the new SHOW syntax, e.g. SHOW ROLES, SHOW CURRENT ROLES, SHOW GROUPS. -->
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
</conbody>
</concept>


@@ -0,0 +1,279 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.1.0" id="drop_stats">
<title>DROP STATS Statement</title>
<titlealts audience="PDF"><navtitle>DROP STATS</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.1.0">
<indexterm audience="Cloudera">DROP STATS statement</indexterm>
Removes the specified statistics from a table or partition. The statistics were originally created by the
<codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> statement.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock rev="2.1.0">DROP STATS [<varname>database_name</varname>.]<varname>table_name</varname>
DROP INCREMENTAL STATS [<varname>database_name</varname>.]<varname>table_name</varname> PARTITION (<varname>partition_spec</varname>)
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
</codeblock>
<p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
<p>
<codeph>DROP STATS</codeph> removes all statistics from the table, whether created by <codeph>COMPUTE
STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>.
</p>
<p rev="2.1.0">
<codeph>DROP INCREMENTAL STATS</codeph> only affects incremental statistics for a single partition, specified
through the <codeph>PARTITION</codeph> clause. The incremental stats are marked as outdated, so that they are
recomputed by the next <codeph>COMPUTE INCREMENTAL STATS</codeph> statement.
</p>
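    <p rev="2.1.0">
      For example, as a sketch using the <codeph>item_partitioned</codeph> table from the examples below:
    </p>
<codeblock>-- Remove all statistics from the table.
drop stats item_partitioned;
-- Mark the incremental stats for one partition as outdated, so the next
-- COMPUTE INCREMENTAL STATS statement rescans just that partition.
drop incremental stats item_partitioned partition (i_category='Books');</codeblock>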
<!-- To do: what release was this added in? -->
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
You typically use this statement when the statistics for a table or a partition have become stale due to data
files being added to or removed from the associated HDFS data directories, whether by manual HDFS operations
or <codeph>INSERT</codeph>, <codeph>INSERT OVERWRITE</codeph>, or <codeph>LOAD DATA</codeph> statements, or
adding or dropping partitions.
</p>
<p>
When a table or partition has no associated statistics, Impala treats it as essentially zero-sized when
constructing the execution plan for a query. In particular, the statistics influence the order in which
tables are joined in a join query. To ensure proper query planning and good query performance and
scalability, make sure to run <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> on
the table or partition after removing any stale statistics.
</p>
<p>
Dropping the statistics is not required for an unpartitioned table or a partitioned table covered by the
original type of statistics. A subsequent <codeph>COMPUTE STATS</codeph> statement replaces any existing
statistics with new ones, for all partitions, regardless of whether the old ones were outdated. Therefore,
this statement was rarely used before the introduction of incremental statistics.
</p>
<p>
Dropping the statistics is required for a partitioned table containing incremental statistics, to make a
subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> statement rescan an existing partition. See
<xref href="impala_perf_stats.xml#perf_stats"/> for information about incremental statistics, a new feature
available in Impala 2.1.0 and higher.
</p>
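    <p>
      For example, the following sketch (using a hypothetical table <codeph>sales_data</codeph> partitioned by
      <codeph>year</codeph>) discards the incremental statistics for one partition so that the next
      <codeph>COMPUTE INCREMENTAL STATS</codeph> statement rescans it:
    </p>
    <codeblock>-- Mark the incremental stats for one partition as outdated.
drop incremental stats sales_data partition (year=2014);
-- Only partitions without up-to-date incremental stats are rescanned.
compute incremental stats sales_data;</codeblock>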
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, does not need any
particular HDFS permissions to perform this statement.
All read and write operations are on the metastore database,
not HDFS files and directories.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following example shows a partitioned table that has associated statistics produced by the
<codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and how the situation evolves as statistics are dropped
from specific partitions, then the entire table.
</p>
<p>
Initially, all table and column statistics are filled in.
</p>
<!-- Note: chopped off any excess characters at position 87 and after,
to avoid weird wrapping in PDF.
Applies to any subsequent examples with output from SHOW ... STATS too. -->
<codeblock>show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+-----------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+-----------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+-----------------
show column stats item_partitioned;
+------------------+-----------+------------------+--------+----------+--------------
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size
+------------------+-----------+------------------+--------+----------+--------------
| i_item_sk | INT | 19443 | -1 | 4 | 4
| i_item_id | STRING | 9025 | -1 | 16 | 16
| i_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16
| i_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16
| i_item_desc | STRING | 13330 | -1 | 200 | 100.302803039
| i_current_price | FLOAT | 2807 | -1 | 4 | 4
| i_wholesale_cost | FLOAT | 2105 | -1 | 4 | 4
| i_brand_id | INT | 965 | -1 | 4 | 4
| i_brand | STRING | 725 | -1 | 22 | 16.1776008605
| i_class_id | INT | 16 | -1 | 4 | 4
| i_class | STRING | 101 | -1 | 15 | 7.76749992370
| i_category_id | INT | 10 | -1 | 4 | 4
| i_manufact_id | INT | 1857 | -1 | 4 | 4
| i_manufact | STRING | 1028 | -1 | 15 | 11.3295001983
| i_size | STRING | 8 | -1 | 11 | 4.33459997177
| i_formulation | STRING | 12884 | -1 | 20 | 19.9799995422
| i_color | STRING | 92 | -1 | 10 | 5.38089990615
| i_units | STRING | 22 | -1 | 7 | 4.18690013885
| i_container | STRING | 2 | -1 | 7 | 6.99259996414
| i_manager_id | INT | 105 | -1 | 4 | 4
| i_product_name | STRING | 19094 | -1 | 25 | 18.0233001708
| i_category | STRING | 10 | 0 | -1 | -1
+------------------+-----------+------------------+--------+----------+--------------
</codeblock>
<p>
To remove statistics for particular partitions, use the <codeph>DROP INCREMENTAL STATS</codeph> statement.
After removing statistics for two partitions, the table-level statistics reflect that change in the
<codeph>#Rows</codeph> and <codeph>Incremental stats</codeph> fields. The counts, maximums, and averages of
the column-level statistics are unaffected.
</p>
<note>
      (The row count might be preserved in the future after a <codeph>DROP INCREMENTAL
STATS</codeph> statement. Check the resolution of the issue
<xref href="https://issues.cloudera.org/browse/IMPALA-1615" scope="external" format="html">IMPALA-1615</xref>.)
</note>
<codeblock>drop incremental stats item_partitioned partition (i_category='Sports');
drop incremental stats item_partitioned partition (i_category='Electronics');
show table stats item_partitioned
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+-----------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | -1 | 1 | 232.67KB | NOT CACHED | PARQUET | false
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+-----------------
show column stats item_partitioned
+------------------+-----------+------------------+--------+----------+--------------
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size
+------------------+-----------+------------------+--------+----------+--------------
| i_item_sk | INT | 19443 | -1 | 4 | 4
| i_item_id | STRING | 9025 | -1 | 16 | 16
| i_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16
| i_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16
| i_item_desc | STRING | 13330 | -1 | 200 | 100.302803039
| i_current_price | FLOAT | 2807 | -1 | 4 | 4
| i_wholesale_cost | FLOAT | 2105 | -1 | 4 | 4
| i_brand_id | INT | 965 | -1 | 4 | 4
| i_brand | STRING | 725 | -1 | 22 | 16.1776008605
| i_class_id | INT | 16 | -1 | 4 | 4
| i_class | STRING | 101 | -1 | 15 | 7.76749992370
| i_category_id | INT | 10 | -1 | 4 | 4
| i_manufact_id | INT | 1857 | -1 | 4 | 4
| i_manufact | STRING | 1028 | -1 | 15 | 11.3295001983
| i_size | STRING | 8 | -1 | 11 | 4.33459997177
| i_formulation | STRING | 12884 | -1 | 20 | 19.9799995422
| i_color | STRING | 92 | -1 | 10 | 5.38089990615
| i_units | STRING | 22 | -1 | 7 | 4.18690013885
| i_container | STRING | 2 | -1 | 7 | 6.99259996414
| i_manager_id | INT | 105 | -1 | 4 | 4
| i_product_name | STRING | 19094 | -1 | 25 | 18.0233001708
| i_category | STRING | 10 | 0 | -1 | -1
+------------------+-----------+------------------+--------+----------+--------------
</codeblock>
<p>
To remove all statistics from the table, whether produced by <codeph>COMPUTE STATS</codeph> or
<codeph>COMPUTE INCREMENTAL STATS</codeph>, use the <codeph>DROP STATS</codeph> statement without the
      <codeph>INCREMENTAL</codeph> clause. Now, both table-level and column-level statistics are reset.
</p>
<codeblock>drop stats item_partitioned;
show table stats item_partitioned
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | -1 | 1 | 223.74KB | NOT CACHED | PARQUET | false
| Children | -1 | 1 | 230.05KB | NOT CACHED | PARQUET | false
| Electronics | -1 | 1 | 232.67KB | NOT CACHED | PARQUET | false
| Home | -1 | 1 | 232.56KB | NOT CACHED | PARQUET | false
| Jewelry | -1 | 1 | 223.72KB | NOT CACHED | PARQUET | false
| Men | -1 | 1 | 231.25KB | NOT CACHED | PARQUET | false
| Music | -1 | 1 | 237.90KB | NOT CACHED | PARQUET | false
| Shoes | -1 | 1 | 234.90KB | NOT CACHED | PARQUET | false
| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false
| Women | -1 | 1 | 226.27KB | NOT CACHED | PARQUET | false
| Total | -1 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
show column stats item_partitioned
+------------------+-----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+------------------+-----------+------------------+--------+----------+----------+
| i_item_sk | INT | -1 | -1 | 4 | 4 |
| i_item_id | STRING | -1 | -1 | -1 | -1 |
| i_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| i_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| i_item_desc | STRING | -1 | -1 | -1 | -1 |
| i_current_price | FLOAT | -1 | -1 | 4 | 4 |
| i_wholesale_cost | FLOAT | -1 | -1 | 4 | 4 |
| i_brand_id | INT | -1 | -1 | 4 | 4 |
| i_brand | STRING | -1 | -1 | -1 | -1 |
| i_class_id | INT | -1 | -1 | 4 | 4 |
| i_class | STRING | -1 | -1 | -1 | -1 |
| i_category_id | INT | -1 | -1 | 4 | 4 |
| i_manufact_id | INT | -1 | -1 | 4 | 4 |
| i_manufact | STRING | -1 | -1 | -1 | -1 |
| i_size | STRING | -1 | -1 | -1 | -1 |
| i_formulation | STRING | -1 | -1 | -1 | -1 |
| i_color | STRING | -1 | -1 | -1 | -1 |
| i_units | STRING | -1 | -1 | -1 | -1 |
| i_container | STRING | -1 | -1 | -1 | -1 |
| i_manager_id | INT | -1 | -1 | 4 | 4 |
| i_product_name | STRING | -1 | -1 | -1 | -1 |
| i_category | STRING | 10 | 0 | -1 | -1 |
+------------------+-----------+------------------+--------+----------+----------+
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_compute_stats.xml#compute_stats"/>, <xref href="impala_show.xml#show_table_stats"/>,
<xref href="impala_show.xml#show_column_stats"/>, <xref href="impala_perf_stats.xml#perf_stats"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,150 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="drop_table">
<title>DROP TABLE Statement</title>
<titlealts audience="PDF"><navtitle>DROP TABLE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="S3"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DROP TABLE statement</indexterm>
Removes an Impala table. Also removes the underlying HDFS data files for internal tables, although not for
external tables.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>DROP TABLE [IF EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> <ph rev="2.3.0">[PURGE]</ph></codeblock>
<p>
<b>IF EXISTS clause:</b>
</p>
<p>
The optional <codeph>IF EXISTS</codeph> clause makes the statement succeed whether or not the table exists.
If the table does exist, it is dropped; if it does not exist, the statement has no effect. This capability is
useful in standardized setup scripts that remove existing schema objects and create new ones. By using some
combination of <codeph>IF EXISTS</codeph> for the <codeph>DROP</codeph> statements and <codeph>IF NOT
EXISTS</codeph> clauses for the <codeph>CREATE</codeph> statements, the script can run successfully the first
time you run it (when the objects do not exist yet) and subsequent times (when some or all of the objects do
already exist).
</p>
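    <p>
      For example, a setup script might use the following sequence for a hypothetical table <codeph>t1</codeph>,
      so that the script succeeds whether or not the table already exists:
    </p>
    <codeblock>-- Remove any leftover copy of the table from a previous run.
drop table if exists t1;
-- Then recreate it from scratch.
create table if not exists t1 (x int, s string);</codeblock>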
<p rev="2.3.0">
<b>PURGE clause:</b>
</p>
<p rev="2.3.0"> The optional <codeph>PURGE</codeph> keyword, available in
<keyword keyref="impala23_full"/> and higher, causes Impala to remove the associated
HDFS data files immediately, rather than going through the HDFS trashcan
mechanism. Use this keyword when dropping a table if it is crucial to
remove the data as quickly as possible to free up space, or if there is a
      problem with the trashcan, such as the trashcan not being configured or
being in a different HDFS encryption zone than the data files. </p>
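    <p>
      For example, the following sketch (with a hypothetical table name) removes the table and its data files
      immediately, bypassing the HDFS trashcan:
    </p>
    <codeblock>drop table if exists staging_data purge;</codeblock>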
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
By default, Impala removes the associated HDFS directory and data files for the table. If you issue a
<codeph>DROP TABLE</codeph> and the data files are not deleted, it might be for the following reasons:
</p>
<ul>
<li>
If the table was created with the
<codeph><xref href="impala_tables.xml#external_tables">EXTERNAL</xref></codeph> clause, Impala leaves all
files and directories untouched. Use external tables when the data is under the control of other Hadoop
components, and Impala is only used to query the data files from their original locations.
</li>
<li>
Impala might leave the data files behind unintentionally, if there is no HDFS location available to hold
the HDFS trashcan for the <codeph>impala</codeph> user. See
<xref href="impala_prereqs.xml#prereqs_account"/> for the procedure to set up the required HDFS home
directory.
</li>
</ul>
<p>
Make sure that you are in the correct database before dropping a table, either by issuing a
<codeph>USE</codeph> statement first or by using a fully qualified name
<codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>.
</p>
<p>
If you intend to issue a <codeph>DROP DATABASE</codeph> statement, first issue <codeph>DROP TABLE</codeph>
statements to remove all the tables in that database.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>create database temporary;
use temporary;
create table unimportant (x int);
create table trivial (s string);
-- Drop a table in the current database.
drop table unimportant;
-- Switch to a different database.
use default;
-- To drop a table in a different database...
drop table trivial;
<i>ERROR: AnalysisException: Table does not exist: default.trivial</i>
-- ...use a fully qualified name.
drop table temporary.trivial;</codeblock>
<p conref="../shared/impala_common.xml#common/disk_space_blurb"/>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.6.0 CDH-39913 IMPALA-1878">
The <codeph>DROP TABLE</codeph> statement can remove data files from S3
if the associated S3 table is an internal table.
In <keyword keyref="impala26_full"/> and higher, as part of improved support for writing
to S3, Impala also removes the associated folder when dropping an internal table
that resides on S3.
See <xref href="impala_s3.xml#s3"/> for details about working with S3 tables.
</p>
<p conref="../shared/impala_common.xml#common/s3_drop_table_purge"/>
<p conref="../shared/impala_common.xml#common/s3_ddl"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
For an internal table, the user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have write
permission for all the files and directories that make up the table.
</p>
<p>
For an external table, dropping the table only involves changes to metadata in the metastore database.
Because Impala does not remove any HDFS files or directories when external tables are dropped,
no particular permissions are needed for the associated HDFS files or directories.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_tables.xml#tables"/>,
<xref href="impala_alter_table.xml#alter_table"/>, <xref href="impala_create_table.xml#create_table"/>,
<xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
<xref href="impala_tables.xml#external_tables"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,49 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="drop_view">
<title>DROP VIEW Statement</title>
<titlealts audience="PDF"><navtitle>DROP VIEW</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="DDL"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Views"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">DROP VIEW statement</indexterm>
Removes the specified view, which was originally created by the <codeph>CREATE VIEW</codeph> statement.
Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
<codeph>DROP VIEW</codeph> only involves changes to metadata in the metastore database, not any data files in
HDFS.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>DROP VIEW [IF EXISTS] [<varname>db_name</varname>.]<varname>view_name</varname></codeblock>
<p conref="../shared/impala_common.xml#common/ddl_blurb"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p conref="../shared/impala_common.xml#common/create_drop_view_examples"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_views.xml#views"/>, <xref href="impala_create_view.xml#create_view"/>,
<xref href="impala_alter_view.xml#alter_view"/>
</p>
</conbody>
</concept>

File diff suppressed because it is too large.


@@ -0,0 +1,96 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="exec_single_node_rows_threshold">
<title>EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (<keyword keyref="impala21"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>EXEC_SINGLE_NODE_ROWS_THRESHOLD</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.0.0">
<indexterm audience="Cloudera">EXEC_SINGLE_NODE_ROWS_THRESHOLD query option</indexterm>
This setting controls the cutoff point (in terms of number of rows scanned) below which Impala treats a query
as a <q>small</q> query, turning off optimizations such as parallel execution and native code generation. The
      overhead for these optimizations is worthwhile for queries involving substantial amounts of data, but it
makes sense to skip them for queries involving tiny amounts of data. Reducing the overhead for small queries
allows Impala to complete them more quickly, keeping YARN resources, admission control slots, and so on
available for data-intensive queries.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=<varname>number_of_rows</varname></codeblock>
<p>
<b>Type:</b> numeric
</p>
<p>
<b>Default:</b> 100
</p>
<p>
<b>Usage notes:</b> Typically, you increase the default value to make this optimization apply to more queries.
If incorrect or corrupted table and column statistics cause Impala to apply this optimization
incorrectly to queries that actually involve substantial work, you might see the queries being slower as a
result of remote reads. In that case, recompute statistics with the <codeph>COMPUTE STATS</codeph>
or <codeph>COMPUTE INCREMENTAL STATS</codeph> statement. If there is a problem collecting accurate
statistics, you can turn this feature off by setting the value to -1.
</p>
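    <p>
      For example, the following sketch first raises the cutoff so that more queries qualify as <q>small</q>,
      then turns the optimization off entirely:
    </p>
    <codeblock>-- Treat queries scanning up to 5000 rows as "small" queries.
SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=5000;
-- Disable the small-query optimization.
SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=-1;</codeblock>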
<p conref="../shared/impala_common.xml#common/internals_blurb"/>
<p>
This setting applies to query fragments where the amount of data to scan can be accurately determined, either
through table and column statistics, or by the presence of a <codeph>LIMIT</codeph> clause. If Impala cannot
accurately estimate the size of the input data, this setting does not apply.
</p>
<p rev="2.3.0">
In <keyword keyref="impala23_full"/> and higher, where Impala supports the complex data types <codeph>STRUCT</codeph>,
<codeph>ARRAY</codeph>, and <codeph>MAP</codeph>, if a query refers to any column of those types,
the small-query optimization is turned off for that query regardless of the
<codeph>EXEC_SINGLE_NODE_ROWS_THRESHOLD</codeph> setting.
</p>
<p>
For a query that is determined to be <q>small</q>, all work is performed on the coordinator node. This might
result in some I/O being performed by remote reads. The savings from not distributing the query work and not
generating native code are expected to outweigh any overhead from the remote reads.
</p>
<p conref="../shared/impala_common.xml#common/added_in_210"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
A common use case is to query just a few rows from a table to inspect typical data values. In this example,
Impala does not parallelize the query or perform native code generation because the result set is guaranteed
to be smaller than the threshold value from this query option:
</p>
<codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;
SELECT * FROM enormous_table LIMIT 300;
</codeblock>
<!-- Don't have any other places that tie into this particular optimization technique yet.
Potentially: conceptual topics about code generation, distributed queries
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
</p>
-->
</conbody>
</concept>


@@ -0,0 +1,228 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="explain">
<title>EXPLAIN Statement</title>
<titlealts audience="PDF"><navtitle>EXPLAIN</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">EXPLAIN statement</indexterm>
Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the
data, divide the work among nodes in the cluster, and transmit intermediate and final results across the
      network. Use <codeph>EXPLAIN</codeph> followed by a complete <codeph>SELECT</codeph>, <codeph>INSERT</codeph>, or <codeph>CREATE TABLE AS SELECT</codeph> statement.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>EXPLAIN { <varname>select_query</varname> | <varname>ctas_stmt</varname> | <varname>insert_stmt</varname> }
</codeblock>
<p>
The <varname>select_query</varname> is a <codeph>SELECT</codeph> statement, optionally prefixed by a
<codeph>WITH</codeph> clause. See <xref href="impala_select.xml#select"/> for details.
</p>
<p>
The <varname>insert_stmt</varname> is an <codeph>INSERT</codeph> statement that inserts into or overwrites an
existing table. It can use either the <codeph>INSERT ... SELECT</codeph> or <codeph>INSERT ...
VALUES</codeph> syntax. See <xref href="impala_insert.xml#insert"/> for details.
</p>
<p>
The <varname>ctas_stmt</varname> is a <codeph>CREATE TABLE</codeph> statement using the <codeph>AS
SELECT</codeph> clause, typically abbreviated as a <q>CTAS</q> operation. See
<xref href="impala_create_table.xml#create_table"/> for details.
</p>
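    <p>
      For example, each of the following forms is accepted. (The table and column names are hypothetical.)
    </p>
    <codeblock>EXPLAIN SELECT c1, c2 FROM t1 WHERE c3 = 'x';
EXPLAIN INSERT INTO t2 SELECT * FROM t1;
EXPLAIN CREATE TABLE t3 AS SELECT c1, c2 FROM t1;</codeblock>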
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
You can interpret the output to judge whether the query is performing efficiently, and adjust the query
and/or the schema if not. For example, you might change the tests in the <codeph>WHERE</codeph> clause, add
hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add
or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other
performance tuning steps.
</p>
<p>
The <codeph>EXPLAIN</codeph> output reminds you if table or column statistics are missing from any table
involved in the query. These statistics are important for optimizing queries involving large tables or
multi-table joins. See <xref href="impala_compute_stats.xml#compute_stats"/> for how to gather statistics,
and <xref href="impala_perf_stats.xml#perf_stats"/> for how to use this information for query tuning.
</p>
<p conref="../shared/impala_common.xml#common/explain_interpret"/>
<p>
If you come from a traditional database background and are not familiar with data warehousing, keep in mind
that Impala is optimized for full table scans across very large tables. The structure and distribution of
this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP
environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of
an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for
example by using a query that affects only certain partitions within a partitioned table, then you might be
able to optimize a query so that it executes in seconds rather than minutes.
</p>
<p>
For more information and examples to help you interpret <codeph>EXPLAIN</codeph> output, see
<xref href="impala_explain_plan.xml#perf_explain"/>.
</p>
<p rev="1.2">
<b>Extended EXPLAIN output:</b>
</p>
<p rev="1.2">
For performance tuning of complex queries, and capacity planning (such as using the admission control and
resource management features), you can enable more detailed and informative output for the
<codeph>EXPLAIN</codeph> statement. In the <cmdname>impala-shell</cmdname> interpreter, issue the command
<codeph>SET EXPLAIN_LEVEL=<varname>level</varname></codeph>, where <varname>level</varname> is an integer
from 0 to 3 or corresponding mnemonic values <codeph>minimal</codeph>, <codeph>standard</codeph>,
<codeph>extended</codeph>, or <codeph>verbose</codeph>.
</p>
<p rev="1.2">
When extended <codeph>EXPLAIN</codeph> output is enabled, <codeph>EXPLAIN</codeph> statements print
information about estimated memory requirements, minimum number of virtual cores, and so on.
<!--
that you can use to fine-tune the resource management options explained in <xref href="impala_resource_management.xml#rm_options"/>.
(The estimated memory requirements are intentionally on the high side, to allow a margin for error,
to avoid cancelling a query unnecessarily if you set the <codeph>MEM_LIMIT</codeph> option to the estimated memory figure.)
-->
</p>
<p>
See <xref href="impala_explain_level.xml#explain_level"/> for details and examples.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
This example shows how the standard <codeph>EXPLAIN</codeph> output moves from the lowest (physical) level to
the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an
aggregation operation (evaluating <codeph>COUNT(*)</codeph>) on some subset of data that is local to that
node; the intermediate results are transmitted back to the coordinator node (labelled here as the
<codeph>EXCHANGE</codeph> node); lastly, the intermediate results are summed to display the final result.
</p>
<codeblock id="explain_plan_simple">[impalad-host:21000] &gt; explain select count(*) from customer_address;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HDFS [default.customer_address] |
| partitions=1/1 size=5.25MB |
+----------------------------------------------------------+
</codeblock>
<p>
These examples show how the extended <codeph>EXPLAIN</codeph> output becomes more accurate and informative as
statistics are gathered by the <codeph>COMPUTE STATS</codeph> statement. Initially, much of the information
about data size and distribution is marked <q>unavailable</q>. Impala can determine the raw data size, but
not the number of rows or number of distinct values for each column without additional analysis. The
<codeph>COMPUTE STATS</codeph> statement performs this analysis, so a subsequent <codeph>EXPLAIN</codeph>
statement has additional information to use in deciding how to optimize the distributed query.
</p>
<!-- To do:
Re-run these examples with more substantial tables populated with data.
-->
<codeblock rev="1.2">[localhost:21000] &gt; set explain_level=extended;
EXPLAIN_LEVEL set to extended
[localhost:21000] &gt; explain select x from t1;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 |
| |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| | hosts=1 per-host-mem=unavailable |
<b>| | tuple-ids=0 row-size=4B cardinality=unavailable |</b>
| | |
| 00:SCAN HDFS [default.t1, PARTITION=RANDOM]              |
| partitions=1/1 size=36B |
<b>| table stats: unavailable |</b>
<b>| column stats: unavailable |</b>
| hosts=1 per-host-mem=32.00MB |
<b>| tuple-ids=0 row-size=4B cardinality=unavailable |</b>
+----------------------------------------------------------+
</codeblock>
<codeblock rev="1.2">[localhost:21000] &gt; compute stats t1;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] &gt; explain select x from t1;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 |
| |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| | hosts=1 per-host-mem=unavailable |
| | tuple-ids=0 row-size=4B cardinality=0 |
| | |
| 00:SCAN HDFS [default.t1, PARTITION=RANDOM] |
| partitions=1/1 size=36B |
<b>| table stats: 0 rows total |</b>
<b>| column stats: all |</b>
| hosts=1 per-host-mem=64.00MB |
<b>| tuple-ids=0 row-size=4B cardinality=0 |</b>
+----------------------------------------------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/security_blurb"/>
<p conref="../shared/impala_common.xml#common/redaction_yes"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
<!-- Doublecheck these details. Does EXPLAIN really need any permissions? -->
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
typically the <codeph>impala</codeph> user, must have read
and execute permissions for all applicable directories in all source tables
for the query that is being explained.
(A <codeph>SELECT</codeph> operation could read files from multiple different HDFS directories
if the source table is partitioned.)
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_select.xml#select"/>,
<xref href="impala_insert.xml#insert"/>,
<xref href="impala_create_table.xml#create_table"/>,
<xref href="impala_explain_plan.xml#explain_plan"/>
</p>
</conbody>
</concept>


@@ -0,0 +1,350 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="explain_level">
<title>EXPLAIN_LEVEL Query Option</title>
<titlealts audience="PDF"><navtitle>EXPLAIN_LEVEL</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="Cloudera">EXPLAIN_LEVEL query option</indexterm>
Controls the amount of detail provided in the output of the <codeph>EXPLAIN</codeph> statement. The basic
output can help you identify high-level performance issues such as scanning a higher volume of data or more
partitions than you expect. The higher levels of detail show how intermediate results flow between nodes and
how different SQL operations such as <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, joins, and
<codeph>WHERE</codeph> clauses are implemented within a distributed query.
</p>
<p>
<b>Type:</b> <codeph>STRING</codeph> or <codeph>INT</codeph>
</p>
<p>
<b>Default:</b> <codeph>1</codeph>
</p>
<p>
<b>Arguments:</b>
</p>
<p>
The allowed range of numeric values for this option is 0 to 3:
</p>
<ul>
<li>
<codeph>0</codeph> or <codeph>MINIMAL</codeph>: A barebones list, one line per operation. Primarily useful
for checking the join order in very long queries where the regular <codeph>EXPLAIN</codeph> output is too
long to read easily.
</li>
<li>
<codeph>1</codeph> or <codeph>STANDARD</codeph>: The default level of detail, showing the logical way that
work is split up for the distributed query.
</li>
<li>
<codeph>2</codeph> or <codeph>EXTENDED</codeph>: Includes additional detail about how the query planner
        uses statistics in its decision-making process, so that you can understand how a query could be tuned by gathering
statistics, using query hints, adding or removing predicates, and so on.
</li>
<li>
<codeph>3</codeph> or <codeph>VERBOSE</codeph>: The maximum level of detail, showing how work is split up
within each node into <q>query fragments</q> that are connected in a pipeline. This extra detail is
primarily useful for low-level performance testing and tuning within Impala itself, rather than for
rewriting the SQL code at the user level.
</li>
</ul>
<note>
Prior to Impala 1.3, the allowed argument range for <codeph>EXPLAIN_LEVEL</codeph> was 0 to 1: level 0 had
the mnemonic <codeph>NORMAL</codeph>, and level 1 was <codeph>VERBOSE</codeph>. In Impala 1.3 and higher,
<codeph>NORMAL</codeph> is not a valid mnemonic value, and <codeph>VERBOSE</codeph> still applies to the
highest level of detail but now corresponds to level 3. You might need to adjust the values if you have any
older <codeph>impala-shell</codeph> script files that set the <codeph>EXPLAIN_LEVEL</codeph> query option.
</note>
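    <p>
      You can specify the level either as a number or as the corresponding mnemonic. For example, either of the
      following statements requests the most detailed output:
    </p>
    <codeblock>SET EXPLAIN_LEVEL=3;
SET EXPLAIN_LEVEL=VERBOSE;</codeblock>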
<p>
Changing the value of this option controls the amount of detail in the output of the <codeph>EXPLAIN</codeph>
statement. The extended information from level 2 or 3 is especially useful during performance tuning, when
you need to confirm whether the work for the query is distributed the way you expect, particularly for the
most resource-intensive operations such as join queries against large tables, queries against tables with
large numbers of partitions, and insert operations for Parquet tables. The extended information also helps to
check estimated resource usage when you use the admission control or resource management features explained
in <xref href="impala_resource_management.xml#resource_management"/>. See
<xref href="impala_explain.xml#explain"/> for the syntax of the <codeph>EXPLAIN</codeph> statement, and
<xref href="impala_explain_plan.xml#perf_explain"/> for details about how to use the extended information.
</p>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
As always, read the <codeph>EXPLAIN</codeph> output from bottom to top. The lowest lines represent the
initial work of the query (scanning data files), the lines in the middle represent calculations done on each
node and how intermediate results are transmitted from one node to another, and the topmost lines represent
the final results being sent back to the coordinator node.
</p>
<p>
The numbers in the left column are generated internally during the initial planning phase and do not
represent the actual order of operations, so it is not significant if they appear out of order in the
<codeph>EXPLAIN</codeph> output.
</p>
<p>
At all <codeph>EXPLAIN</codeph> levels, the plan contains a warning if any tables in the query are missing
statistics. Use the <codeph>COMPUTE STATS</codeph> statement to gather statistics for each table and suppress
this warning. See <xref href="impala_perf_stats.xml#perf_stats"/> for details about how the statistics help
query performance.
</p>
<p>
The <codeph>PROFILE</codeph> command in <cmdname>impala-shell</cmdname> always starts with an explain plan
showing full detail, the same as with <codeph>EXPLAIN_LEVEL=3</codeph>. <ph rev="1.4.0">After the explain
plan comes the executive summary, the same output as produced by the <codeph>SUMMARY</codeph> command in
<cmdname>impala-shell</cmdname>.</ph>
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples use a trivial, empty table to illustrate how the essential aspects of query planning are shown
in <codeph>EXPLAIN</codeph> output:
</p>
<codeblock>[localhost:21000] &gt; create table t1 (x int, s string);
[localhost:21000] &gt; set explain_level=1;
[localhost:21000] &gt; explain select count(*) from t1;
+------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column |
| statistics. |
| explain_plan.t1 |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HDFS [explain_plan.t1] |
| partitions=1/1 size=0B |
+------------------------------------------------------------------------+
[localhost:21000] &gt; explain select * from t1;
+------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
| WARNING: The following tables are missing relevant table and/or column |
| statistics. |
| explain_plan.t1 |
| |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 00:SCAN HDFS [explain_plan.t1] |
| partitions=1/1 size=0B |
+------------------------------------------------------------------------+
[localhost:21000] &gt; set explain_level=2;
[localhost:21000] &gt; explain select * from t1;
+------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
| WARNING: The following tables are missing relevant table and/or column |
| statistics. |
| explain_plan.t1 |
| |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| | hosts=0 per-host-mem=unavailable |
| | tuple-ids=0 row-size=19B cardinality=unavailable |
| | |
| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] |
| partitions=1/1 size=0B |
| table stats: unavailable |
| column stats: unavailable |
| hosts=0 per-host-mem=0B |
| tuple-ids=0 row-size=19B cardinality=unavailable |
+------------------------------------------------------------------------+
[localhost:21000] &gt; set explain_level=3;
[localhost:21000] &gt; explain select * from t1;
+------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
<b>| WARNING: The following tables are missing relevant table and/or column |</b>
<b>| statistics. |</b>
<b>| explain_plan.t1 |</b>
| |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| hosts=0 per-host-mem=unavailable |
| tuple-ids=0 row-size=19B cardinality=unavailable |
| |
| F00:PLAN FRAGMENT [PARTITION=RANDOM] |
| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] |
| partitions=1/1 size=0B |
<b>| table stats: unavailable |</b>
<b>| column stats: unavailable |</b>
| hosts=0 per-host-mem=0B |
| tuple-ids=0 row-size=19B cardinality=unavailable |
+------------------------------------------------------------------------+
</codeblock>
<p>
As the warning message demonstrates, most of the information needed for Impala to do efficient query
planning, and for you to understand the performance characteristics of the query, requires running the
<codeph>COMPUTE STATS</codeph> statement for the table:
</p>
<codeblock>[localhost:21000] &gt; compute stats t1;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
[localhost:21000] &gt; explain select * from t1;
+------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
| |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| hosts=0 per-host-mem=unavailable |
| tuple-ids=0 row-size=20B cardinality=0 |
| |
| F00:PLAN FRAGMENT [PARTITION=RANDOM] |
| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] |
| partitions=1/1 size=0B |
<b>| table stats: 0 rows total |</b>
<b>| column stats: all |</b>
| hosts=0 per-host-mem=0B |
| tuple-ids=0 row-size=20B cardinality=0 |
+------------------------------------------------------------------------+
</codeblock>
<p>
Joins and other complicated, multi-part queries are the ones where you most commonly need to examine the
<codeph>EXPLAIN</codeph> output and customize the amount of detail in the output. This example shows the
default <codeph>EXPLAIN</codeph> output for a three-way join query, then the equivalent output with a
<codeph>[SHUFFLE]</codeph> hint to change the join mechanism between the first two tables from a broadcast
join to a shuffle join.
</p>
<codeblock>[localhost:21000] &gt; set explain_level=1;
[localhost:21000] &gt; explain select one.*, two.*, three.* from t1 one, t1 two, t1 three where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
| |
| 07:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
<b>| 04:HASH JOIN [INNER JOIN, BROADCAST] |</b>
| | hash predicates: two.x = three.x |
| | |
<b>| |--06:EXCHANGE [BROADCAST] |</b>
| | | |
| | 02:SCAN HDFS [explain_plan.t1 three] |
| | partitions=1/1 size=0B |
| | |
<b>| 03:HASH JOIN [INNER JOIN, BROADCAST] |</b>
| | hash predicates: one.x = two.x |
| | |
<b>| |--05:EXCHANGE [BROADCAST] |</b>
| | | |
| | 01:SCAN HDFS [explain_plan.t1 two] |
| | partitions=1/1 size=0B |
| | |
| 00:SCAN HDFS [explain_plan.t1 one] |
| partitions=1/1 size=0B |
+---------------------------------------------------------+
[localhost:21000] &gt; explain select one.*, two.*, three.*
&gt; from t1 one join [shuffle] t1 two join t1 three
&gt; where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
| |
| 08:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
<b>| 04:HASH JOIN [INNER JOIN, BROADCAST] |</b>
| | hash predicates: two.x = three.x |
| | |
<b>| |--07:EXCHANGE [BROADCAST] |</b>
| | | |
| | 02:SCAN HDFS [explain_plan.t1 three] |
| | partitions=1/1 size=0B |
| | |
<b>| 03:HASH JOIN [INNER JOIN, PARTITIONED] |</b>
| | hash predicates: one.x = two.x |
| | |
<b>| |--06:EXCHANGE [PARTITION=HASH(two.x)] |</b>
| | | |
| | 01:SCAN HDFS [explain_plan.t1 two] |
| | partitions=1/1 size=0B |
| | |
<b>| 05:EXCHANGE [PARTITION=HASH(one.x)] |</b>
| | |
| 00:SCAN HDFS [explain_plan.t1 one] |
| partitions=1/1 size=0B |
+---------------------------------------------------------+
</codeblock>
<p>
For a join involving many different tables, the default <codeph>EXPLAIN</codeph> output might stretch over
several pages, and the only details you care about might be the join order and the mechanism (broadcast or
shuffle) for joining each pair of tables. In that case, you might set <codeph>EXPLAIN_LEVEL</codeph> to its
lowest value of 0, to focus on just the join order and join mechanism for each stage. The following example
shows how the rows from the first and second joined tables are hashed and divided among the nodes of the
cluster for further filtering; then the entire contents of the third table are broadcast to all nodes for the
final stage of join processing.
</p>
<codeblock>[localhost:21000] &gt; set explain_level=0;
[localhost:21000] &gt; explain select one.*, two.*, three.*
&gt; from t1 one join [shuffle] t1 two join t1 three
&gt; where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
| |
| 08:EXCHANGE [PARTITION=UNPARTITIONED] |
<b>| 04:HASH JOIN [INNER JOIN, BROADCAST] |</b>
<b>| |--07:EXCHANGE [BROADCAST] |</b>
| | 02:SCAN HDFS [explain_plan.t1 three] |
<b>| 03:HASH JOIN [INNER JOIN, PARTITIONED] |</b>
<b>| |--06:EXCHANGE [PARTITION=HASH(two.x)] |</b>
| | 01:SCAN HDFS [explain_plan.t1 two] |
<b>| 05:EXCHANGE [PARTITION=HASH(one.x)] |</b>
| 00:SCAN HDFS [explain_plan.t1 one] |
+---------------------------------------------------------+
</codeblock>
<!-- Consider adding a related info section to collect the xrefs earlier on this page. -->
</conbody>
</concept>

View File

@@ -0,0 +1,568 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="explain_plan">
<title>Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</title>
<titlealts audience="PDF"><navtitle>EXPLAIN Plans and Query Profiles</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Reports"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
To understand the high-level performance considerations for Impala queries, read the output of the
<codeph>EXPLAIN</codeph> statement for the query. You can get the <codeph>EXPLAIN</codeph> plan without
actually running the query itself.
</p>
<p rev="1.4.0">
For an overview of the physical performance characteristics for a query, issue the <codeph>SUMMARY</codeph>
statement in <cmdname>impala-shell</cmdname> immediately after executing a query. This condensed information
shows which phases of execution took the most time, and how the estimates for memory usage and number of rows
at each phase compare to the actual values.
</p>
<p>
To understand the detailed performance characteristics for a query, issue the <codeph>PROFILE</codeph>
statement in <cmdname>impala-shell</cmdname> immediately after executing a query. This low-level information
includes physical details about memory, CPU, I/O, and network usage, and thus is only available after the
query is actually run.
</p>
<p outputclass="toc inpage"/>
<p>
Also, see <xref href="impala_hbase.xml#hbase_performance"/>
and <xref href="impala_s3.xml#s3_performance"/>
for examples of interpreting
<codeph>EXPLAIN</codeph> plans for queries against HBase tables
<ph rev="2.2.0">and data stored in the Amazon Simple Storage System (S3)</ph>.
</p>
</conbody>
<concept id="perf_explain">
<title>Using the EXPLAIN Plan for Performance Tuning</title>
<conbody>
<p>
The <codeph><xref href="impala_explain.xml#explain">EXPLAIN</xref></codeph> statement gives you an outline
of the logical steps that a query will perform, such as how the work will be distributed among the nodes
and how intermediate results will be combined to produce the final result set. You can see these details
before actually running the query. You can use this information to check that the query will not operate in
some very unexpected or inefficient way.
</p>
<!-- Turn into a conref in ciiu_langref too. Relocate to common.xml. -->
<codeblock conref="impala_explain.xml#explain/explain_plan_simple"/>
<p conref="../shared/impala_common.xml#common/explain_interpret"/>
<p>
The <codeph>EXPLAIN</codeph> plan is also printed at the beginning of the query profile report described in
<xref href="#perf_profile"/>, for convenience in examining both the logical and physical aspects of the
query side-by-side.
</p>
<p rev="1.2">
The amount of detail displayed in the <codeph>EXPLAIN</codeph> output is controlled by the
<xref href="impala_explain_level.xml#explain_level">EXPLAIN_LEVEL</xref> query option. You typically
        increase this setting from <codeph>standard</codeph> to <codeph>extended</codeph> or <codeph>verbose</codeph>
        (from <codeph>1</codeph> to <codeph>2</codeph> or <codeph>3</codeph>) when doublechecking the presence of table and column statistics during performance
tuning, or when estimating query resource usage in conjunction with the resource management features in CDH
5.
</p>
<!-- To do:
This is a good place to have a few examples.
-->
</conbody>
</concept>
<concept id="perf_summary">
<title>Using the SUMMARY Report for Performance Tuning</title>
<conbody>
<p>
The <codeph><xref href="impala_shell_commands.xml#shell_commands">SUMMARY</xref></codeph> command within
the <cmdname>impala-shell</cmdname> interpreter gives you an easy-to-digest overview of the timings for the
different phases of execution for a query. Like the <codeph>EXPLAIN</codeph> plan, it is easy to see
potential performance bottlenecks. Like the <codeph>PROFILE</codeph> output, it is available after the
query is run and so displays actual timing numbers.
</p>
<p>
The <codeph>SUMMARY</codeph> report is also printed at the beginning of the query profile report described
in <xref href="#perf_profile"/>, for convenience in examining high-level and low-level aspects of the query
side-by-side.
</p>
<p>
For example, here is a query involving an aggregate function, on a single-node VM. The different stages of
the query and their timings are shown (rolled up for all nodes), along with estimated and actual values
used in planning the query. In this case, the <codeph>AVG()</codeph> function is computed for a subset of
data on each node (stage 01) and then the aggregated results from all nodes are combined at the end (stage
03). You can see which stages took the most time, and whether any estimates were substantially different
than the actual data distribution. (When examining the time values, be sure to consider the suffixes such
as <codeph>us</codeph> for microseconds and <codeph>ms</codeph> for milliseconds, rather than just looking
for the largest numbers.)
</p>
<codeblock>[localhost:21000] &gt; select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327 |
+---------------------+
[localhost:21000] &gt; summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1 | 1.03ms | 1.03ms | 1 | 1 | 48.00 KB | -1 B | MERGE FINALIZE |
| 02:EXCHANGE | 1 | 0ns | 0ns | 1 | 1 | 0 B | -1 B | UNPARTITIONED |
| 01:AGGREGATE | 1 | 30.79ms | 30.79ms | 1 | 1 | 80.00 KB | 10.00 MB | |
| 00:SCAN HDFS | 1 | 5.45s | 5.45s | 2.21M | -1 | 64.05 MB | 432.00 MB | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
</codeblock>
<p>
Notice how the longest initial phase of the query is measured in seconds (s), while later phases working on
smaller intermediate results are measured in milliseconds (ms) or even nanoseconds (ns).
</p>
<p>
Here is an example from a more complicated query, as it would appear in the <codeph>PROFILE</codeph>
output:
</p>
<!-- This example taken from: https://github.com/cloudera/Impala/commit/af85d3b518089b8840ddea4356947e40d1aca9bd -->
<codeblock>Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED
05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B
04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE
08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE
07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority)
03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB
02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST
|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST
| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o
00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l
</codeblock>
</conbody>
</concept>
<concept id="perf_profile">
<title>Using the Query Profile for Performance Tuning</title>
<conbody>
<p>
The <codeph>PROFILE</codeph> statement, available in the <cmdname>impala-shell</cmdname> interpreter,
produces a detailed low-level report showing how the most recent query was executed. Unlike the
<codeph>EXPLAIN</codeph> plan described in <xref href="#perf_explain"/>, this information is only available
after the query has finished. It shows physical details such as the number of bytes read, maximum memory
usage, and so on for each node. You can use this information to determine if the query is I/O-bound or
CPU-bound, whether some network condition is imposing a bottleneck, whether a slowdown is affecting some
nodes but not others, and to check that recommended configuration settings such as short-circuit local
reads are in effect.
</p>
<p rev="CDH-29157">
By default, time values in the profile output reflect the wall-clock time taken by an operation.
For values denoting system time or user time, the measurement unit is reflected in the metric
name, such as <codeph>ScannerThreadsSysTime</codeph> or <codeph>ScannerThreadsUserTime</codeph>.
For example, a multi-threaded I/O operation might show a small figure for wall-clock time,
while the corresponding system time is larger, representing the sum of the CPU time taken by each thread.
Or a wall-clock time figure might be larger because it counts time spent waiting, while
the corresponding system and user time figures only measure the time while the operation
is actively using CPU cycles.
</p>
<p>
The <xref href="impala_explain_plan.xml#perf_explain"><codeph>EXPLAIN</codeph> plan</xref> is also printed
at the beginning of the query profile report, for convenience in examining both the logical and physical
aspects of the query side-by-side. The
<xref href="impala_explain_level.xml#explain_level">EXPLAIN_LEVEL</xref> query option also controls the
verbosity of the <codeph>EXPLAIN</codeph> output printed by the <codeph>PROFILE</codeph> command.
</p>
<!-- To do:
This is a good place to have a few more examples.
-->
      <p>
        Here is an example of a query profile from a relatively straightforward query, run on a single-node
        pseudo-distributed cluster to keep the output brief.
      </p>
<codeblock>[localhost:21000] &gt; profile;
Query Runtime Profile:
Query (id=6540a03d4bee0691:4963d6269b210ebd):
Summary:
Session ID: ea4a197f1c7bf858:c74e66f72e3a33ba
Session Type: BEESWAX
Start Time: 2013-12-02 17:10:30.263067000
End Time: 2013-12-02 17:10:50.932044000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 1.2.1 RELEASE (build edb5af1bcad63d410bc5d47cc203df3a880e9324)
User: cloudera
Network Address: 127.0.0.1:49161
Default Db: stats_testing
Sql Statement: select t1.s, t2.s from t1 join t2 on (t1.id = t2.parent)
Plan:
----------------
Estimated Per-Host Requirements: Memory=2.09GB VCores=2
PLAN FRAGMENT 0
PARTITION: UNPARTITIONED
4:EXCHANGE
cardinality: unavailable
per-host memory: unavailable
tuple ids: 0 1
PLAN FRAGMENT 1
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 4
UNPARTITIONED
2:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| t1.id = t2.parent
| cardinality: unavailable
| per-host memory: 2.00GB
| tuple ids: 0 1
|
|----3:EXCHANGE
| cardinality: unavailable
| per-host memory: 0B
| tuple ids: 1
|
0:SCAN HDFS
table=stats_testing.t1 #partitions=1/1 size=33B
table stats: unavailable
column stats: unavailable
cardinality: unavailable
per-host memory: 32.00MB
tuple ids: 0
PLAN FRAGMENT 2
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 3
UNPARTITIONED
1:SCAN HDFS
table=stats_testing.t2 #partitions=1/1 size=960.00KB
table stats: unavailable
column stats: unavailable
cardinality: unavailable
per-host memory: 96.00MB
tuple ids: 1
----------------
Query Timeline: 20s670ms
- Start execution: 2.559ms (2.559ms)
- Planning finished: 23.587ms (21.27ms)
- Rows available: 666.199ms (642.612ms)
- First row fetched: 668.919ms (2.719ms)
- Unregister query: 20s668ms (20s000ms)
ImpalaServer:
- ClientFetchWaitTimer: 19s637ms
- RowMaterializationTimer: 167.121ms
Execution Profile 6540a03d4bee0691:4963d6269b210ebd:(Active: 837.815ms, % non-child: 0.00%)
Per Node Peak Memory Usage: impala-1.example.com:22000(7.42 MB)
- FinalizationTimer: 0ns
Coordinator Fragment:(Active: 195.198ms, % non-child: 0.00%)
MemoryUsage(500.0ms): 16.00 KB, 7.42 MB, 7.33 MB, 7.10 MB, 6.94 MB, 6.71 MB, 6.56 MB, 6.40 MB, 6.17 MB, 6.02 MB, 5.79 MB, 5.63 MB, 5.48 MB, 5.25 MB, 5.09 MB, 4.86 MB, 4.71 MB, 4.47 MB, 4.32 MB, 4.09 MB, 3.93 MB, 3.78 MB, 3.55 MB, 3.39 MB, 3.16 MB, 3.01 MB, 2.78 MB, 2.62 MB, 2.39 MB, 2.24 MB, 2.08 MB, 1.85 MB, 1.70 MB, 1.54 MB, 1.31 MB, 1.16 MB, 948.00 KB, 790.00 KB, 553.00 KB, 395.00 KB, 237.00 KB
ThreadUsage(500.0ms): 1
- AverageThreadTokens: 1.00
- PeakMemoryUsage: 7.42 MB
- PrepareTime: 36.144us
- RowsProduced: 98.30K (98304)
- TotalCpuTime: 20s449ms
- TotalNetworkWaitTime: 191.630ms
- TotalStorageWaitTime: 0ns
CodeGen:(Active: 150.679ms, % non-child: 77.19%)
- CodegenTime: 0ns
- CompileTime: 139.503ms
- LoadTime: 10.7ms
- ModuleFileSize: 95.27 KB
EXCHANGE_NODE (id=4):(Active: 194.858ms, % non-child: 99.83%)
- BytesReceived: 2.33 MB
- ConvertRowBatchTime: 2.732ms
- DataArrivalWaitTime: 191.118ms
- DeserializeRowBatchTimer: 14.943ms
- FirstBatchArrivalWaitTime: 191.117ms
- PeakMemoryUsage: 7.41 MB
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 504.49 K/sec
- SendersBlockedTimer: 0ns
- SendersBlockedTotalTimer(*): 0ns
Averaged Fragment 1:(Active: 442.360ms, % non-child: 0.00%)
split sizes: min: 33.00 B, max: 33.00 B, avg: 33.00 B, stddev: 0.00
completion times: min:443.720ms max:443.720ms mean: 443.720ms stddev:0ns
execution rates: min:74.00 B/sec max:74.00 B/sec mean:74.00 B/sec stddev:0.00 /sec
num instances: 1
- AverageThreadTokens: 1.00
- PeakMemoryUsage: 6.06 MB
- PrepareTime: 7.291ms
- RowsProduced: 98.30K (98304)
- TotalCpuTime: 784.259ms
- TotalNetworkWaitTime: 388.818ms
- TotalStorageWaitTime: 3.934ms
CodeGen:(Active: 312.862ms, % non-child: 70.73%)
- CodegenTime: 2.669ms
- CompileTime: 302.467ms
- LoadTime: 9.231ms
- ModuleFileSize: 95.27 KB
DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
- BytesSent: 2.33 MB
- NetworkThroughput(*): 35.89 MB/sec
- OverallThroughput: 29.06 MB/sec
- PeakMemoryUsage: 5.33 KB
- SerializeBatchTime: 26.487ms
- ThriftTransmitTime(*): 64.814ms
- UncompressedRowBatchSize: 6.66 MB
HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
- BuildBuckets: 1.02K (1024)
- BuildRows: 98.30K (98304)
- BuildTime: 12.622ms
- LoadFactor: 0.00
- PeakMemoryUsage: 6.02 MB
- ProbeRows: 3
- ProbeTime: 3.579ms
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 271.54 K/sec
EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
- BytesReceived: 1.15 MB
- ConvertRowBatchTime: 2.792ms
- DataArrivalWaitTime: 339.936ms
- DeserializeRowBatchTimer: 9.910ms
- FirstBatchArrivalWaitTime: 199.474ms
- PeakMemoryUsage: 156.00 KB
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 285.20 K/sec
- SendersBlockedTimer: 0ns
- SendersBlockedTotalTimer(*): 0ns
HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.00
- BytesRead: 33.00 B
- BytesReadLocal: 33.00 B
- BytesReadShortCircuit: 33.00 B
- NumDisksAccessed: 1
- NumScannerThreadsStarted: 1
- PeakMemoryUsage: 46.00 KB
- PerReadThreadRawHdfsThroughput: 287.52 KB/sec
- RowsRead: 3
- RowsReturned: 3
- RowsReturnedRate: 220.33 K/sec
- ScanRangesComplete: 1
- ScannerThreadsInvoluntaryContextSwitches: 26
- ScannerThreadsTotalWallClockTime: 55.199ms
- DelimiterParseTime: 2.463us
- MaterializeTupleTime(*): 1.226us
- ScannerThreadsSysTime: 0ns
- ScannerThreadsUserTime: 42.993ms
- ScannerThreadsVoluntaryContextSwitches: 1
- TotalRawHdfsReadTime(*): 112.86us
- TotalReadThroughput: 0.00 /sec
Averaged Fragment 2:(Active: 190.120ms, % non-child: 0.00%)
split sizes: min: 960.00 KB, max: 960.00 KB, avg: 960.00 KB, stddev: 0.00
completion times: min:191.736ms max:191.736ms mean: 191.736ms stddev:0ns
execution rates: min:4.89 MB/sec max:4.89 MB/sec mean:4.89 MB/sec stddev:0.00 /sec
num instances: 1
- AverageThreadTokens: 0.00
- PeakMemoryUsage: 906.33 KB
- PrepareTime: 3.67ms
- RowsProduced: 98.30K (98304)
- TotalCpuTime: 403.351ms
- TotalNetworkWaitTime: 34.999ms
- TotalStorageWaitTime: 108.675ms
CodeGen:(Active: 162.57ms, % non-child: 85.24%)
- CodegenTime: 3.133ms
- CompileTime: 148.316ms
- LoadTime: 12.317ms
- ModuleFileSize: 95.27 KB
DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
- BytesSent: 1.15 MB
- NetworkThroughput(*): 23.30 MB/sec
- OverallThroughput: 16.23 MB/sec
- PeakMemoryUsage: 5.33 KB
- SerializeBatchTime: 22.69ms
- ThriftTransmitTime(*): 49.178ms
- UncompressedRowBatchSize: 3.28 MB
HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.00
- BytesRead: 960.00 KB
- BytesReadLocal: 960.00 KB
- BytesReadShortCircuit: 960.00 KB
- NumDisksAccessed: 1
- NumScannerThreadsStarted: 1
- PeakMemoryUsage: 869.00 KB
- PerReadThreadRawHdfsThroughput: 130.21 MB/sec
- RowsRead: 98.30K (98304)
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 827.20 K/sec
- ScanRangesComplete: 15
- ScannerThreadsInvoluntaryContextSwitches: 34
- ScannerThreadsTotalWallClockTime: 189.774ms
- DelimiterParseTime: 15.703ms
- MaterializeTupleTime(*): 3.419ms
- ScannerThreadsSysTime: 1.999ms
- ScannerThreadsUserTime: 44.993ms
- ScannerThreadsVoluntaryContextSwitches: 118
- TotalRawHdfsReadTime(*): 7.199ms
- TotalReadThroughput: 0.00 /sec
Fragment 1:
Instance 6540a03d4bee0691:4963d6269b210ebf (host=impala-1.example.com:22000):(Active: 442.360ms, % non-child: 0.00%)
Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split lengths&gt;): 0:1/33.00 B
MemoryUsage(500.0ms): 69.33 KB
ThreadUsage(500.0ms): 1
- AverageThreadTokens: 1.00
- PeakMemoryUsage: 6.06 MB
- PrepareTime: 7.291ms
- RowsProduced: 98.30K (98304)
- TotalCpuTime: 784.259ms
- TotalNetworkWaitTime: 388.818ms
- TotalStorageWaitTime: 3.934ms
CodeGen:(Active: 312.862ms, % non-child: 70.73%)
- CodegenTime: 2.669ms
- CompileTime: 302.467ms
- LoadTime: 9.231ms
- ModuleFileSize: 95.27 KB
DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
- BytesSent: 2.33 MB
- NetworkThroughput(*): 35.89 MB/sec
- OverallThroughput: 29.06 MB/sec
- PeakMemoryUsage: 5.33 KB
- SerializeBatchTime: 26.487ms
- ThriftTransmitTime(*): 64.814ms
- UncompressedRowBatchSize: 6.66 MB
HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
ExecOption: Build Side Codegen Enabled, Probe Side Codegen Enabled, Hash Table Built Asynchronously
- BuildBuckets: 1.02K (1024)
- BuildRows: 98.30K (98304)
- BuildTime: 12.622ms
- LoadFactor: 0.00
- PeakMemoryUsage: 6.02 MB
- ProbeRows: 3
- ProbeTime: 3.579ms
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 271.54 K/sec
EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
- BytesReceived: 1.15 MB
- ConvertRowBatchTime: 2.792ms
- DataArrivalWaitTime: 339.936ms
- DeserializeRowBatchTimer: 9.910ms
- FirstBatchArrivalWaitTime: 199.474ms
- PeakMemoryUsage: 156.00 KB
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 285.20 K/sec
- SendersBlockedTimer: 0ns
- SendersBlockedTotalTimer(*): 0ns
HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split lengths&gt;): 0:1/33.00 B
Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
File Formats: TEXT/NONE:1
ExecOption: Codegen enabled: 1 out of 1
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.00
- BytesRead: 33.00 B
- BytesReadLocal: 33.00 B
- BytesReadShortCircuit: 33.00 B
- NumDisksAccessed: 1
- NumScannerThreadsStarted: 1
- PeakMemoryUsage: 46.00 KB
- PerReadThreadRawHdfsThroughput: 287.52 KB/sec
- RowsRead: 3
- RowsReturned: 3
- RowsReturnedRate: 220.33 K/sec
- ScanRangesComplete: 1
- ScannerThreadsInvoluntaryContextSwitches: 26
- ScannerThreadsTotalWallClockTime: 55.199ms
- DelimiterParseTime: 2.463us
- MaterializeTupleTime(*): 1.226us
- ScannerThreadsSysTime: 0ns
- ScannerThreadsUserTime: 42.993ms
- ScannerThreadsVoluntaryContextSwitches: 1
- TotalRawHdfsReadTime(*): 112.86us
- TotalReadThroughput: 0.00 /sec
Fragment 2:
Instance 6540a03d4bee0691:4963d6269b210ec0 (host=impala-1.example.com:22000):(Active: 190.120ms, % non-child: 0.00%)
Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split lengths&gt;): 0:15/960.00 KB
- AverageThreadTokens: 0.00
- PeakMemoryUsage: 906.33 KB
- PrepareTime: 3.67ms
- RowsProduced: 98.30K (98304)
- TotalCpuTime: 403.351ms
- TotalNetworkWaitTime: 34.999ms
- TotalStorageWaitTime: 108.675ms
CodeGen:(Active: 162.57ms, % non-child: 85.24%)
- CodegenTime: 3.133ms
- CompileTime: 148.316ms
- LoadTime: 12.317ms
- ModuleFileSize: 95.27 KB
DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
- BytesSent: 1.15 MB
- NetworkThroughput(*): 23.30 MB/sec
- OverallThroughput: 16.23 MB/sec
- PeakMemoryUsage: 5.33 KB
- SerializeBatchTime: 22.69ms
- ThriftTransmitTime(*): 49.178ms
- UncompressedRowBatchSize: 3.28 MB
HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split lengths&gt;): 0:15/960.00 KB
Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
File Formats: TEXT/NONE:15
ExecOption: Codegen enabled: 15 out of 15
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.00
- BytesRead: 960.00 KB
- BytesReadLocal: 960.00 KB
- BytesReadShortCircuit: 960.00 KB
- NumDisksAccessed: 1
- NumScannerThreadsStarted: 1
- PeakMemoryUsage: 869.00 KB
- PerReadThreadRawHdfsThroughput: 130.21 MB/sec
- RowsRead: 98.30K (98304)
- RowsReturned: 98.30K (98304)
- RowsReturnedRate: 827.20 K/sec
- ScanRangesComplete: 15
- ScannerThreadsInvoluntaryContextSwitches: 34
- ScannerThreadsTotalWallClockTime: 189.774ms
- DelimiterParseTime: 15.703ms
- MaterializeTupleTime(*): 3.419ms
- ScannerThreadsSysTime: 1.999ms
- ScannerThreadsUserTime: 44.993ms
- ScannerThreadsVoluntaryContextSwitches: 118
- TotalRawHdfsReadTime(*): 7.199ms
- TotalReadThroughput: 0.00 /sec</codeblock>
</conbody>
</concept>
</concept>

1877 docs/topics/impala_faq.xml Normal file

File diff suppressed because it is too large

View File

@@ -0,0 +1,24 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="obsolete_faq">
<title>Impala Frequently Asked Questions</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="FAQs"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<draft-comment translate="no">
Obsolete. Content all moved into impala_faq.xml.
</draft-comment>
</conbody>
</concept>

View File

@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="features">
<title>Primary Impala Features</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p conref="../shared/impala_common.xml#common/feature_list"/>
</conbody>
</concept>

Some files were not shown because too many files have changed in this diff