<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_dev">

  <title>Developing Impala Applications</title>
  <titlealts audience="PDF"><navtitle>Developing Applications</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Concepts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      The core development language for Impala is SQL. You can also use Java or other languages to interact with
      Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For
      specialized kinds of analysis, you can supplement the SQL built-in functions by writing
      <xref href="impala_udf.xml#udfs">user-defined functions (UDFs)</xref> in C++ or Java.
    </p>
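
    <p>
      For example, after a UDF is compiled and registered, you call it the same way as a built-in function.
      (This is a minimal sketch: the library path, symbol, and table and column names below are hypothetical;
      see the UDF topic for the complete syntax and options.)
    </p>

    <codeblock>-- Register a hypothetical C++ UDF from a shared library stored in HDFS.
CREATE FUNCTION my_normalize(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libsample_udf.so' SYMBOL='MyNormalize';

-- Call it like any built-in function.
SELECT my_normalize(c_name) FROM customers LIMIT 10;</codeblock>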

    <p outputclass="toc inpage"/>
  </conbody>

  <concept id="intro_sql">

    <title>Overview of the Impala SQL Dialect</title>
    <prolog>
      <metadata>
        <data name="Category" value="SQL"/>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component
        (HiveQL), so it is familiar to users who already run SQL queries on the Hadoop
        infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in
        functions. Impala also includes additional built-in functions for common industry features, to simplify
        porting SQL from non-Hadoop systems.
      </p>

      <p>
        For users coming to Impala from traditional database or data warehousing backgrounds, the following
        aspects of the SQL dialect might seem familiar:
      </p>

      <ul>
        <li>
          <p>
            The <xref href="impala_select.xml#select">SELECT statement</xref> includes familiar clauses such as <codeph>WHERE</codeph>,
            <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, and <codeph>WITH</codeph>.
            You will find familiar notions such as
            <xref href="impala_joins.xml#joins">joins</xref>, <xref href="impala_functions.xml#builtins">built-in
            functions</xref> for processing strings, numbers, and dates,
            <xref href="impala_aggregate_functions.xml#aggregate_functions">aggregate functions</xref>,
            <xref href="impala_subqueries.xml#subqueries">subqueries</xref>, and
            <xref href="impala_operators.xml#comparison_operators">comparison operators</xref>
            such as <codeph>IN()</codeph> and <codeph>BETWEEN</codeph>.
            The <codeph>SELECT</codeph> statement is the place where SQL standards compliance is most important.
          </p>
        </li>

        <li>
          <p>
            From the data warehousing world, you will recognize the notion of
            <xref href="impala_partitioning.xml#partitioning">partitioned tables</xref>.
            One or more columns serve as partition keys, and the data is physically arranged so that
            queries that refer to the partition key columns in the <codeph>WHERE</codeph> clause
            can skip partitions that do not match the filter conditions. For example, if you have 10
            years' worth of data and use a clause such as <codeph>WHERE year = 2015</codeph>,
            <codeph>WHERE year > 2010</codeph>, or <codeph>WHERE year IN (2014, 2015)</codeph>,
            Impala skips all the data for non-matching years, greatly reducing the amount of I/O
            for the query, as in the example following this list.
          </p>
        </li>

        <li rev="1.2">
          <p>
            In Impala 1.2 and higher, <xref href="impala_udf.xml#udfs">UDFs</xref> let you perform custom comparisons
            and transformation logic during <codeph>SELECT</codeph> and <codeph>INSERT...SELECT</codeph> statements.
          </p>
        </li>
      </ul>
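
      <p>
        The following example pulls these points together. The <codeph>sales</codeph> and <codeph>stores</codeph>
        tables, their columns, and the <codeph>year</codeph> partition key are hypothetical names used only for
        illustration. Because <codeph>year</codeph> is assumed to be a partition key column, the
        <codeph>WHERE</codeph> clause lets Impala skip all partitions for other years:
      </p>

      <codeblock>-- Hypothetical schema: sales is partitioned by year; stores is a small dimension table.
SELECT s.store_name,
       count(*) AS num_orders,
       round(sum(f.total), 2) AS revenue
FROM sales f
JOIN stores s ON f.store_id = s.store_id
WHERE f.year = 2015                  -- partition key predicate: non-matching partitions are skipped
  AND s.region IN ('EMEA', 'APAC')   -- familiar comparison operators
GROUP BY s.store_name
ORDER BY revenue DESC
LIMIT 10;</codeblock>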

      <p>
        For users coming to Impala from traditional database or data warehousing backgrounds, the following
        aspects of the SQL dialect might require some learning and practice to become proficient in the
        Hadoop environment:
      </p>

      <ul>
        <li>
          <p>
            Impala SQL is focused on queries and includes relatively little DML. There is no <codeph>UPDATE</codeph>
            or <codeph>DELETE</codeph> statement. Stale data is typically discarded (by <codeph>DROP TABLE</codeph>
            or <codeph>ALTER TABLE ... DROP PARTITION</codeph> statements) or replaced (by <codeph>INSERT
            OVERWRITE</codeph> statements).
          </p>
        </li>

        <li>
          <p>
            All data creation is done by <codeph>INSERT</codeph> statements, which typically insert data in bulk by
            querying from other tables. There are two variations: <codeph>INSERT INTO</codeph>, which appends to the
            existing data, and <codeph>INSERT OVERWRITE</codeph>, which replaces the entire contents of a table or
            partition (similar to <codeph>TRUNCATE TABLE</codeph> followed by a new <codeph>INSERT</codeph>).
            Although there is an <codeph>INSERT ... VALUES</codeph> syntax to insert a small number of rows in
            a single statement, it is far more efficient to use <codeph>INSERT ... SELECT</codeph> to copy
            and transform large amounts of data from one table to another in a single operation.
          </p>
        </li>

        <li>
          <p>
            You often construct Impala table definitions and data files in some other environment, and then attach
            Impala so that it can run real-time queries. The same data files and table metadata are shared with other
            components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data
            inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components
            can write files in formats such as Parquet and Avro, which Impala can then query.
          </p>
        </li>

        <li>
          <p>
            Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL
            includes some idioms that you might find in the import utilities for traditional database systems. For
            example, you can create a table that reads comma-separated or tab-separated text files, specifying the
            separator in the <codeph>CREATE TABLE</codeph> statement. You can create <b>external tables</b> that read
            existing data files but do not move or transform them, as in the example following this list.
          </p>
        </li>

        <li>
          <p>
            Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does
            not require length constraints on string data types. For example, you can define a database column as
            <codeph>STRING</codeph> with unlimited length, rather than <codeph>CHAR(1)</codeph> or
            <codeph>VARCHAR(64)</codeph>. <ph rev="2.0.0">(Although in Impala 2.0 and later, you can also use
            length-constrained <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types.)</ph>
          </p>
        </li>

      </ul>
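
      <p>
        The following example shows some of these idioms together. The table names, columns, and HDFS paths are
        hypothetical, and the <codeph>events</codeph> table is assumed to already exist and to be partitioned by
        a <codeph>year</codeph> column:
      </p>

      <codeblock>-- External table over existing tab-separated files; the files are read in place,
-- not moved or transformed.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id BIGINT,
  detail STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/staging/raw_events';

-- Bulk load: replace the entire contents of one partition by querying the staging table.
INSERT OVERWRITE events PARTITION (year = 2015)
  SELECT event_time, user_id, detail
  FROM raw_events
  WHERE substr(event_time, 1, 4) = '2015';

-- Discard stale data by dropping an entire partition.
ALTER TABLE events DROP PARTITION (year = 2010);</codeblock>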

      <p>
        <b>Related information:</b> <xref href="impala_langref.xml#langref"/>, especially
        <xref href="impala_langref_sql.xml#langref_sql"/> and <xref href="impala_functions.xml#builtins"/>
      </p>
    </conbody>
  </concept>

  <!-- Bunch of potential concept topics for future consideration. Major areas of Impala modelled on areas of discussion for Oracle Database, and distributed databases in general. -->

  <concept id="intro_datatypes" audience="hidden">

    <title>Overview of Impala SQL Data Types</title>

    <conbody/>
  </concept>

  <concept id="intro_network" audience="hidden">

    <title>Overview of Impala Network Topology</title>

    <conbody/>
  </concept>

  <concept id="intro_cluster" audience="hidden">

    <title>Overview of Impala Cluster Topology</title>

    <conbody/>
  </concept>

  <concept id="intro_apis">

    <title>Overview of Impala Programming Interfaces</title>
    <prolog>
      <metadata>
        <data name="Category" value="JDBC"/>
        <data name="Category" value="ODBC"/>
        <data name="Category" value="Hue"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        You can connect and submit requests to the Impala daemons through:
      </p>

      <ul>
        <li>
          The <codeph><xref href="impala_impala_shell.xml#impala_shell">impala-shell</xref></codeph> interactive
          command interpreter.
        </li>

        <li>
          The <xref href="http://gethue.com/" scope="external" format="html">Hue</xref> web-based user interface.
        </li>

        <li>
          <xref href="impala_jdbc.xml#impala_jdbc">JDBC</xref>.
        </li>

        <li>
          <xref href="impala_odbc.xml#impala_odbc">ODBC</xref>.
        </li>
      </ul>

      <p>
        With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications
        running on non-Linux platforms. You can also use Impala in combination with various business intelligence
        tools that use the JDBC and ODBC interfaces.
      </p>

      <p>
        Each <codeph>impalad</codeph> daemon process, running on separate nodes in a cluster, listens on
        <xref href="impala_ports.xml#ports">several ports</xref> for incoming requests. Requests from
        <codeph>impala-shell</codeph> and Hue are routed to the <codeph>impalad</codeph> daemons through the same
        port. The <codeph>impalad</codeph> daemons listen on separate ports for JDBC and ODBC requests.
      </p>
    </conbody>
  </concept>
</concept>