<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_hbase">

<title id="hbase">Using Impala to Query HBase Tables</title>
<titlealts audience="PDF"><navtitle>HBase Tables</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="HBase"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Tables"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">HBase</indexterm>
You can use Impala to query HBase tables. This is useful for accessing any of
your existing HBase tables via SQL and performing analytics over them. HDFS
and Kudu tables are preferred over HBase for analytic workloads and offer
superior performance. Kudu supports efficient inserts, updates and deletes
of small numbers of rows and can replace HBase for most analytics-oriented use
cases. See <xref href="impala_kudu.xml#impala_kudu"/> for information on using
Impala with Kudu.
</p>

<p>
From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store
where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the
various fields of the value are mapped to the other columns in the Impala table.
</p>

<p>
For background information on HBase, see <xref keyref="upstream_hbase_docs"/>.
</p>

<p outputclass="toc inpage"/>
</conbody>

<concept id="hbase_using">

<title>Overview of Using HBase with Impala</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>

<conbody>

<p>
When you use Impala with HBase:
</p>

<ul>
<li>
You create the tables on the Impala side using the Hive shell, because the Impala <codeph>CREATE
TABLE</codeph> statement currently does not support custom SerDes and some other syntax needed for these
tables:
<ul>
<li>
You designate it as an HBase table using the <codeph>STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</codeph> clause on the Hive <codeph>CREATE
TABLE</codeph> statement.
</li>

<li>
You map these specially created tables to corresponding tables that exist in HBase, with the clause
<codeph>TBLPROPERTIES("hbase.table.name" = "<varname>table_name_in_hbase</varname>")</codeph> on the
Hive <codeph>CREATE TABLE</codeph> statement.
</li>

<li>
See <xref href="#hbase_queries"/> for a full example.
</li>
</ul>
</li>

<li>
You define the column corresponding to the HBase row key as a string with the <codeph>#string</codeph>
keyword, or map it to a <codeph>STRING</codeph> column.
</li>

<li>
Because Impala and Hive share the same metastore database, once you create the table in Hive, you can
query or insert into it through Impala. (After creating a new table through Hive, issue the
<codeph>INVALIDATE METADATA</codeph> statement in <cmdname>impala-shell</cmdname> to make Impala aware of
the new table.)
</li>

<li> You issue queries against the Impala tables. For efficient queries,
use the <codeph>WHERE</codeph> clause to find a single key value or a
range of key values wherever practical, by testing the Impala column
corresponding to the HBase row key. Avoid queries that do full-table
scans, which are efficient for regular Impala tables but inefficient
in HBase. </li>
</ul>

<p>
To work with an HBase table from Impala, ensure that the <codeph>impala</codeph> user has read/write
privileges for the HBase table, using the <codeph>GRANT</codeph> command in the HBase shell. For details
about HBase security, see <xref keyref="upstream_hbase_security_docs"/>.
</p>
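<p>
For example, on a cluster with HBase authorization enabled, a minimal sketch of granting read and write
access on a table named <codeph>hbasealltypessmall</codeph> (the table used in the examples later in this
topic) to the <codeph>impala</codeph> user might look like the following; the exact privileges and any
column family arguments depend on your own security setup:
</p>
<codeblock>hbase(main):001:0> grant 'impala', 'RW', 'hbasealltypessmall'
</codeblock>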
</conbody>
</concept>

<concept id="hbase_config">

<title>Configuring HBase for Use with Impala</title>
<prolog>
<metadata>
<data name="Category" value="Configuring"/>
</metadata>
</prolog>

<conbody>

<p>
HBase works out of the box with Impala. There is no mandatory configuration needed to use these two
components together.
</p>

<p>
To avoid delays if HBase is unavailable during Impala startup or after an <codeph>INVALIDATE
METADATA</codeph> statement, set timeout values similar to the following in
<filepath>/etc/impala/conf/hbase-site.xml</filepath>:
</p>

<codeblock><property>
  <name>hbase.client.retries.number</name>
  <value>3</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>3000</value>
</property>
</codeblock>

</conbody>
</concept>

<concept id="hbase_types">

<title>Supported Data Types for HBase Columns</title>

<conbody>

<p>
To understand how Impala column data types are mapped to fields in HBase, you should have some background
knowledge about HBase first. You set up the mapping by running the <codeph>CREATE TABLE</codeph> statement
in the Hive shell. See
<xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the
Hive wiki</xref> for a starting point, and <xref href="#hbase_queries"/> for examples.
</p>

<p>
HBase works as a kind of <q>bit bucket</q>, in the sense that HBase does not enforce any typing for the
key or value fields. All the type enforcement is done on the Impala side.
</p>

<p> For best performance of Impala queries against HBase tables, most
queries will perform comparisons in the <codeph>WHERE</codeph> clause
against the column that corresponds to the HBase row key. When creating
the table through the Hive shell, use the <codeph>STRING</codeph> data
type for the column that corresponds to the HBase row key. Impala can
translate predicates (through operators such as <codeph>=</codeph>,
<codeph><</codeph>, and <codeph>BETWEEN</codeph>) against this
column into fast lookups in HBase, but this optimization (<q>predicate
pushdown</q>) only works when that column is defined as
<codeph>STRING</codeph>. </p>

<p>
Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive
<codeph>CREATE TABLE</codeph> statement using binary data types, represented in the Hive table definition
using the <codeph>#binary</codeph> keyword, often abbreviated as <codeph>#b</codeph>. Defining numeric
columns as binary can reduce the overall data volume in the HBase tables. You should still define the
column that corresponds to the HBase row key as a <codeph>STRING</codeph>, to allow fast lookups using
those columns.
</p>
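<p>
The following Hive <codeph>CREATE TABLE</codeph> statement is a minimal sketch of such a mapping, using a
hypothetical HBase table and column family. The row key column remains a <codeph>STRING</codeph>, while
the numeric column is marked as binary by appending <codeph>#binary</codeph> (or <codeph>#b</codeph>) to
its entry in the <codeph>hbase.columns.mapping</codeph> value:
</p>
<codeblock>CREATE EXTERNAL TABLE hbase_binary_demo (
  id string,
  page_views bigint
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,countersCF:page_views#binary"
)
TBLPROPERTIES("hbase.table.name" = "binary_demo");
</codeblock>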
</conbody>
</concept>

<concept id="hbase_performance">

<title>Performance Considerations for the Impala-HBase Integration</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
</metadata>
</prolog>

<conbody>

<p>
To understand the performance characteristics of SQL queries against data stored in HBase, you should have
some background knowledge about how HBase interacts with SQL-oriented systems first. See
<xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the
Hive wiki</xref> for a starting point; because Impala shares the same metastore database as Hive, the
information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too.
</p>

<p>
Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This
querying does not read HFiles directly. The extra communication overhead makes it important to choose what
data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data
efficiently:
</p>

<ul>
<li>
Use HBase tables for queries that return a single row or a small range of rows,
not queries that perform a full table scan of an entire table. (If a query includes
an HBase table but no <codeph>WHERE</codeph> clause referencing that table,
that is a strong indicator that it is an inefficient query for an HBase table.)
</li>

<li>
HBase may offer acceptable performance for small dimension tables, where the table
is small enough that executing a full table scan for every query is still efficient.
However, Kudu is almost always a superior alternative for storing dimension tables.
HDFS tables are also appropriate for dimension tables that do not need to support
update queries, delete queries, or insert queries with small numbers of rows.
</li>
</ul>

<p>
Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular
lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison
operations do not work; for example, evaluating for greater than (>) or less than (<) cannot be completed.
</p>

<p>
Predicates on non-key columns can be sent to HBase as <codeph>SingleColumnValueFilters</codeph>,
providing some performance gains. In such a case, HBase returns fewer rows to Impala than if those same
predicates were applied on the Impala side. While there is some improvement, it is not as great as when
start and stop rows are used, because the number of rows that HBase must examine is not limited in the
same way. As long as the row key predicate applies to only a single row, HBase locates and returns that
row directly. Conversely, if a non-key predicate is used, even if it applies to only a single row, HBase
must still scan the entire table to find the correct result.
</p>

<example>

<title>Interpreting EXPLAIN Output for HBase Queries</title>

<p>
For example, here are some queries against the following Impala table, which is mapped to an HBase table.
The examples show excerpts from the output of the <codeph>EXPLAIN</codeph> statement, demonstrating what
things to look for to indicate an efficient or inefficient query against an HBase table.
</p>

<p>
The first column (<codeph>cust_id</codeph>) was specified as the key column in the <codeph>CREATE
EXTERNAL TABLE</codeph> statement; for performance, it is important to declare this column as
<codeph>STRING</codeph>. Other columns, such as <codeph>BIRTH_YEAR</codeph> and
<codeph>NEVER_LOGGED_ON</codeph>, are also declared as <codeph>STRING</codeph>, rather than their
<q>natural</q> types of <codeph>INT</codeph> or <codeph>BOOLEAN</codeph>, because Impala can optimize
those types more effectively in HBase tables. For comparison, we leave one column,
<codeph>YEAR_REGISTERED</codeph>, as <codeph>INT</codeph> to show that filtering on this column is
inefficient.
</p>

<codeblock>describe hbase_table;
Query: describe hbase_table
+-----------------------+--------+---------+
| name | type | comment |
+-----------------------+--------+---------+
| cust_id | <b>string</b> | |
| birth_year | <b>string</b> | |
| never_logged_on | <b>string</b> | |
| private_email_address | string | |
| year_registered | <b>int</b> | |
+-----------------------+--------+---------+
</codeblock>

<p>
The best case for performance involves a single row lookup using an equality comparison on the column
defined as the row key:
</p>

<codeblock>explain select count(*) from hbase_table where cust_id = 'some_user@example.com';
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
<b>| 00:SCAN HBASE [hbase.hbase_table] |</b>
<b>| start key: some_user@example.com |</b>
<b>| stop key: some_user@example.com\0 |</b>
+------------------------------------------------------------------------------------+
</codeblock>

<p>
Another type of efficient query involves a range lookup on the row key column, using SQL operators such
as greater than (or equal), less than (or equal), or <codeph>BETWEEN</codeph>. This example also includes
an equality test on a non-key column; because that column is a <codeph>STRING</codeph>, Impala can let
HBase perform that test, indicated by the <codeph>hbase filters:</codeph> line in the
<codeph>EXPLAIN</codeph> output. Doing the filtering within HBase is more efficient than transmitting all
the data to Impala and doing the filtering on the Impala side.
</p>

<codeblock>explain select count(*) from hbase_table where cust_id between 'a' and 'b'
and never_logged_on = 'true';
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |-->
| 01:AGGREGATE |
| | output: count(*) |
| | |
<b>| 00:SCAN HBASE [hbase.hbase_table] |</b>
<b>| start key: a |</b>
<b>| stop key: b\0 |</b>
<b>| hbase filters: cols:never_logged_on EQUAL 'true' |</b>
+------------------------------------------------------------------------------------+
</codeblock>

<p>
The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the
entire HBase table. Impala can only push down predicates to HBase for columns declared as
<codeph>STRING</codeph>. This example tests a column declared as <codeph>INT</codeph>, and the
<codeph>predicates:</codeph> line in the <codeph>EXPLAIN</codeph> output indicates that the test is
performed after the data is transmitted to Impala.
</p>

<codeblock>explain select count(*) from hbase_table where year_registered = 2010;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |-->
| 01:AGGREGATE |
| | output: count(*) |
| | |
<b>| 00:SCAN HBASE [hbase.hbase_table] |</b>
<b>| predicates: year_registered = 2010 |</b>
+------------------------------------------------------------------------------------+
</codeblock>

<p>
The same inefficiency applies if the key column is compared to any non-constant value. Here, even though
the key column is a <codeph>STRING</codeph>, and is tested using an equality operator, Impala must scan
the entire HBase table because the key column is compared to another column value rather than a constant.
</p>

<codeblock>explain select count(*) from hbase_table where cust_id = private_email_address;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |-->
| 01:AGGREGATE |
| | output: count(*) |
| | |
<b>| 00:SCAN HBASE [hbase.hbase_table] |</b>
<b>| predicates: cust_id = private_email_address |</b>
+------------------------------------------------------------------------------------+
</codeblock>

<p>
Currently, tests on the row key using <codeph>OR</codeph> or <codeph>IN</codeph> clauses are not
optimized into direct lookups either. Such limitations might be lifted in the future, so always check the
<codeph>EXPLAIN</codeph> output to be sure whether a particular SQL construct results in an efficient
query or not for HBase tables.
</p>

<codeblock>explain select count(*) from hbase_table where
cust_id = 'some_user@example.com' or cust_id = 'other_user@example.com';
+----------------------------------------------------------------------------------------+
| Explain String |
+----------------------------------------------------------------------------------------+
...
<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |-->
| 01:AGGREGATE |
| | output: count(*) |
| | |
<b>| 00:SCAN HBASE [hbase.hbase_table] |</b>
<b>| predicates: cust_id = 'some_user@example.com' OR cust_id = 'other_user@example.com' |</b>
+----------------------------------------------------------------------------------------+

explain select count(*) from hbase_table where
cust_id in ('some_user@example.com', 'other_user@example.com');
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |-->
| 01:AGGREGATE |
| | output: count(*) |
| | |
<b>| 00:SCAN HBASE [hbase.hbase_table] |</b>
<b>| predicates: cust_id IN ('some_user@example.com', 'other_user@example.com') |</b>
+------------------------------------------------------------------------------------+
</codeblock>

<p>
Either rewrite into separate queries for each value and combine the results in the application, or
combine the single-row queries using <codeph>UNION ALL</codeph>:
</p>

<codeblock>select count(*) from hbase_table where cust_id = 'some_user@example.com';
select count(*) from hbase_table where cust_id = 'other_user@example.com';

explain
select count(*) from hbase_table where cust_id = 'some_user@example.com'
union all
select count(*) from hbase_table where cust_id = 'other_user@example.com';
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 09:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| |--11:MERGE |
| | | |
| | 08:AGGREGATE [MERGE FINALIZE] |
| | | output: sum(count(*)) |
| | | |
| | 07:EXCHANGE [PARTITION=UNPARTITIONED] |
| | | |-->
| | 04:AGGREGATE |
| | | output: count(*) |
| | | |
<b>| | 03:SCAN HBASE [hbase.hbase_table] |</b>
<b>| | start key: other_user@example.com |</b>
<b>| | stop key: other_user@example.com\0 |</b>
| | |
| 10:MERGE |
...
<!--| | |
| 06:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 05:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |-->
| 02:AGGREGATE |
| | output: count(*) |
| | |
<b>| 01:SCAN HBASE [hbase.hbase_table] |</b>
<b>| start key: some_user@example.com |</b>
<b>| stop key: some_user@example.com\0 |</b>
+------------------------------------------------------------------------------------+
</codeblock>

</example>

<example>

<title>Configuration Options for Java HBase Applications</title>

<p> If you have an HBase Java application that calls the
<codeph>setCacheBlocks</codeph> or <codeph>setCaching</codeph>
methods of the class <xref
href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
scope="external" format="html"
>org.apache.hadoop.hbase.client.Scan</xref>, you can set these same
caching behaviors through Impala query options, to control the memory
pressure on the HBase RegionServer. For example, when doing queries in
HBase that result in full-table scans (which by default are
inefficient for HBase), you can reduce memory usage and speed up the
queries by turning off the <codeph>HBASE_CACHE_BLOCKS</codeph> setting
and specifying a large number for the <codeph>HBASE_CACHING</codeph>
setting.
</p>

<p>
To set these options, issue commands like the following in <cmdname>impala-shell</cmdname>:
</p>

<codeblock>-- Same as calling setCacheBlocks(true) or setCacheBlocks(false).
set hbase_cache_blocks=true;
set hbase_cache_blocks=false;

-- Same as calling setCaching(rows).
set hbase_caching=1000;
</codeblock>

<p>
Or update the <cmdname>impalad</cmdname> defaults file <filepath>/etc/default/impala</filepath> and
include settings for <codeph>HBASE_CACHE_BLOCKS</codeph> and/or <codeph>HBASE_CACHING</codeph> in the
<codeph>-default_query_options</codeph> setting for <codeph>IMPALA_SERVER_ARGS</codeph>. See
<xref href="impala_config_options.xml#config_options"/> for details.
</p>
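<p>
For example, the relevant portion of <filepath>/etc/default/impala</filepath> might look something like
the following sketch; the surrounding <cmdname>impalad</cmdname> flags are omitted and the values are
only illustrative:
</p>
<codeblock>IMPALA_SERVER_ARGS=" ... \
  -default_query_options=hbase_cache_blocks=false,hbase_caching=1000"
</codeblock>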

<note>
In Impala 2.0 and later, these options are settable through the JDBC or ODBC interfaces using the
<codeph>SET</codeph> statement.
</note>

</example>
</conbody>
</concept>

<concept id="hbase_scenarios">

<title>Use Cases for Querying HBase through Impala</title>
<prolog>
<metadata>
<data name="Category" value="Use Cases"/>
</metadata>
</prolog>

<conbody>

<p>
The following are representative use cases for using Impala to query HBase tables:
</p>

<ul>
<li>
Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or
on a social network, how many connections a user has or how many votes a post received. HBase is
efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing
each change to disk, and a query always returns the latest value. An application could query specific
totals like these from HBase, and combine the results with a broader set of data queried from Impala.
</li>

<li>
<p>
Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically
recording many attributes for an important subject such as a user of an online service. These tables
are also often sparse, that is, most of the column values are <codeph>NULL</codeph>, 0,
<codeph>false</codeph>, empty string, or some other blank or placeholder value. (For example, any particular
web site user might have never used some site feature, filled in a certain field in their profile,
visited a particular part of the site, and so on.) A typical query against this kind of table is to
look up a single row to retrieve all the information about a specific subject, rather than summing,
averaging, or filtering millions of rows as in typical Impala-managed tables.
</p>
</li>
</ul>
</conbody>
</concept>

<concept audience="hidden" id="hbase_create_new">

<title>Creating a New HBase Table for Impala to Use</title>

<conbody>

<p>
You can create an HBase-backed table through a <codeph>CREATE TABLE</codeph> statement in the Hive shell,
without going into the HBase shell at all:
</p>
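<p>
A minimal sketch, with hypothetical table and column family names: because the table is not declared
<codeph>EXTERNAL</codeph>, Hive creates the underlying HBase table if it does not already exist.
</p>
<codeblock>hive> CREATE TABLE hbase_new_demo (
    >   id string,
    >   metric_value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES (
    >   "hbase.columns.mapping" = ":key,metricsCF:metric_value"
    > )
    > TBLPROPERTIES("hbase.table.name" = "hbase_new_demo");
</codeblock>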

<!-- To do:
Add example. (Not critical because this subtopic is currently hidden.)
-->
</conbody>
</concept>

<concept audience="hidden" id="hbase_reuse_existing">

<title>Associate Impala with an Existing HBase Table</title>

<conbody>

<p>
If you already have some HBase tables created through the HBase shell, you can make them accessible to
Impala through a <codeph>CREATE TABLE</codeph> statement in the Hive shell:
</p>

<!-- To do:
Add example. (Not critical because this subtopic is currently hidden.)
-->
</conbody>
</concept>

<concept audience="hidden" id="hbase_column_families">

<title>Map HBase Columns and Column Families to Impala Columns</title>

<conbody>

<p/>
</conbody>
</concept>

<concept id="hbase_loading">

<title>Loading Data into an HBase Table</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
</metadata>
</prolog>

<conbody>

<p>
The Impala <codeph>INSERT</codeph> statement works for HBase tables. The <codeph>INSERT ... VALUES</codeph>
syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an
HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by
<codeph>INSERT ... VALUES</codeph> are extremely inefficient, so you would not use that technique with
tables containing any significant data volume.)
</p>

<!-- To do:
Add examples throughout this section.
-->

<p>
When you use the <codeph>INSERT ... SELECT</codeph> syntax, the result in the HBase table could be fewer
rows than you expect. HBase only stores the most recent version of each unique row key, so if an
<codeph>INSERT ... SELECT</codeph> statement copies over multiple rows containing the same value for the
key column, subsequent queries will only return one row with each key column value:
</p>
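<p>
For example, the following sketch uses a hypothetical source table in which three rows share the key
value <codeph>'k1'</codeph>; after the copy, the HBase table holds only one row per distinct key:
</p>
<codeblock>-- Hypothetical source rows: ('k1','a'), ('k1','b'), ('k1','c'), ('k2','d').
insert into hbase_demo_table select key_col, val_col from source_table;

-- Only one row per unique key value is stored in the HBase table.
select count(*) from hbase_demo_table;   -- returns 2, not 4
</codeblock>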

<p>
Although Impala does not support the <codeph>UPDATE</codeph> statement for HBase tables, you can achieve
the same effect by issuing successive <codeph>INSERT</codeph> statements that use the same value for the
key column each time:
</p>
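<p>
A minimal sketch, again with a hypothetical table: the second <codeph>INSERT</codeph> reuses the key
value, so the final query returns a single row containing the most recently inserted values:
</p>
<codeblock>insert into hbase_demo_table values ('user1234', 'inactive');
insert into hbase_demo_table values ('user1234', 'active');

-- Returns one row, with the value from the second INSERT.
select * from hbase_demo_table where key_col = 'user1234';
</codeblock>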

</conbody>
</concept>

<concept id="hbase_limitations">

<title>Limitations and Restrictions of the Impala and HBase Integration</title>

<conbody>

<p>
The Impala integration with HBase has the following limitations and restrictions, some inherited from the
integration between HBase and Hive, and some unique to Impala:
</p>

<ul>
<li>
<p>
If you issue a <codeph>DROP TABLE</codeph> for an internal (Impala-managed) table that is mapped to an
HBase table, the underlying table is not removed in HBase. The Hive <codeph>DROP TABLE</codeph>
statement, in contrast, does remove the underlying HBase table in this case.
</p>
</li>

<li>
<p>
The <codeph>INSERT OVERWRITE</codeph> statement is not available for HBase tables. You can insert new
data, or modify an existing row by inserting a new row with the same key value, but not replace the
entire contents of the table. You can do an <codeph>INSERT OVERWRITE</codeph> in Hive if you need this
capability.
</p>
</li>

<li>
<p>
If you issue a <codeph>CREATE TABLE LIKE</codeph> statement for a table mapped to an HBase table, the
new table is also an HBase table, but inherits the same underlying HBase table name as the original.
The new table is effectively an alias for the old one, not a new table with identical column structure.
Avoid using <codeph>CREATE TABLE LIKE</codeph> for HBase tables, to avoid any confusion.
</p>
</li>

<li>
<p>
Copying data into an HBase table using the Impala <codeph>INSERT ... SELECT</codeph> syntax might
produce fewer new rows than are in the query result set. If the result set contains multiple rows with
the same value for the key column, each row supersedes any previous rows with the same key value.
Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve
the <q>latest</q> version of a particular key value.
</p>
</li>
<li rev="2.3.0">
<p>
Because the complex data types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>)
available in <keyword keyref="impala23_full"/> and higher are currently only supported in Parquet tables, you cannot
use these types in HBase tables that are queried through Impala.
</p>
</li>
<li>
<p conref="../shared/impala_common.xml#common/hbase_no_load_data"/>
</li>
<li>
<p conref="../shared/impala_common.xml#common/tablesample_caveat"/>
</li>
</ul>
</conbody>
</concept>

<concept id="hbase_queries">

<title>Examples of Querying HBase Tables from Impala</title>

<conbody>

<p>
The following examples create an HBase table with four column families,
create a corresponding table through Hive,
then insert and query the table through Impala.
</p>
<p>
In the HBase shell, the table
name is quoted in <codeph>CREATE</codeph> and <codeph>DROP</codeph> statements. Tables created in HBase
begin in <q>enabled</q> state; before dropping them through the HBase shell, you must issue a
<codeph>disable '<varname>table_name</varname>'</codeph> statement.
</p>
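<p>
For example, when you are finished with the table created below, removing it from the HBase shell
requires a sequence like the following:
</p>
<codeblock>hbase(main):001:0> disable 'hbasealltypessmall'
hbase(main):002:0> drop 'hbasealltypessmall'
</codeblock>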

<codeblock>$ hbase shell
15/02/10 16:07:45
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
...

hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF'
0 row(s) in 4.6520 seconds

=> Hbase::Table - hbasealltypessmall
hbase(main):006:0> quit
</codeblock>

<p>
Issue the following <codeph>CREATE TABLE</codeph> statement in the Hive shell. (The Impala <codeph>CREATE
TABLE</codeph> statement currently does not support the <codeph>STORED BY</codeph> clause, so you switch into Hive to
create the table, then back to Impala and the <cmdname>impala-shell</cmdname> interpreter to issue the
queries.)
</p>

<p>
This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is
defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all.
</p>

<p>
The <codeph>WITH SERDEPROPERTIES</codeph> clause
specifies that the first column (<codeph>ID</codeph>) represents the row key, and maps the remaining
columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the
columns in the table, not the column names in the <codeph>CREATE TABLE</codeph> statement.
The first column is defined to be the lookup key; the
<codeph>STRING</codeph> data type produces the fastest key-based lookups for HBase tables.
</p>

<note>
For Impala with HBase tables, the most important aspect to ensure good performance is to use a
<codeph>STRING</codeph> column as the row key, as shown in this example.
</note>

<codeblock>$ hive
...
hive> use hbase;
OK
Time taken: 4.095 seconds
hive> CREATE EXTERNAL TABLE hbasestringids (
> id string,
> bool_col boolean,
> tinyint_col tinyint,
> smallint_col smallint,
> int_col int,
> bigint_col bigint,
> float_col float,
> double_col double,
> date_string_col string,
> string_col string,
> timestamp_col timestamp)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
> "hbase.columns.mapping" =
> ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\
> bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\
> stringsCF:string_col,stringsCF:timestamp_col"
> )
> TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
OK
Time taken: 2.879 seconds
hive> quit;
</codeblock>

<p>
Once you have established the mapping to an HBase table, you can issue DML statements and queries
from Impala. The following example shows a series of <codeph>INSERT</codeph>
statements followed by a query.
The ideal kind of query from a performance standpoint
retrieves a row from the table based on a row key
mapped to a string column.
An initial <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
statement makes the table created through Hive visible to Impala.
</p>

<codeblock>$ impala-shell -i localhost -d hbase
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
...
Query: use `hbase`
[localhost:21000] > invalidate metadata hbasestringids;
Fetched 0 row(s) in 0.09s
[localhost:21000] > desc hbasestringids;
+-----------------+-----------+---------+
| name | type | comment |
+-----------------+-----------+---------+
| id | string | |
| bool_col | boolean | |
| double_col | double | |
| float_col | float | |
| bigint_col | bigint | |
| int_col | int | |
| smallint_col | smallint | |
| tinyint_col | tinyint | |
| date_string_col | string | |
| string_col | string | |
| timestamp_col | timestamp | |
+-----------------+-----------+---------+
Fetched 11 row(s) in 0.02s
[localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now());
Inserted 1 row(s) in 0.26s
[localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now());
Inserted 1 row(s) in 0.12s
[localhost:21000] > select * from hbasestringids where id = '0001';
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
| id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col |
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
| 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 |
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
Fetched 1 row(s) in 0.54s
</codeblock>

<note conref="../shared/impala_common.xml#common/invalidate_metadata_hbase"/>
</conbody>
</concept>
</concept>