mirror of
https://github.com/apache/impala.git
synced 2026-01-26 03:01:30 -05:00
In the Choosing LB algorithm section, it says that Round-robin is not recommended for Impalad. Change-Id: I418931c028ddc6e8f5d894da6c92bc7994bbca56 Reviewed-on: http://gerrit.cloudera.org:8080/11496 Reviewed-by: Alex Rodoni <arodoni@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
532 lines
19 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
||
<!--
|
||
Licensed to the Apache Software Foundation (ASF) under one
|
||
or more contributor license agreements. See the NOTICE file
|
||
distributed with this work for additional information
|
||
regarding copyright ownership. The ASF licenses this file
|
||
to you under the Apache License, Version 2.0 (the
|
||
"License"); you may not use this file except in compliance
|
||
with the License. You may obtain a copy of the License at
|
||
|
||
http://www.apache.org/licenses/LICENSE-2.0
|
||
|
||
Unless required by applicable law or agreed to in writing,
|
||
software distributed under the License is distributed on an
|
||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||
KIND, either express or implied. See the License for the
|
||
specific language governing permissions and limitations
|
||
under the License.
|
||
-->
|
||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||
<concept id="proxy">
|
||
|
||
<title>Using Impala through a Proxy for High Availability</title>
|
||
<titlealts audience="PDF"><navtitle>Load-Balancing Proxy for HA</navtitle></titlealts>
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="High Availability"/>
|
||
<data name="Category" value="Impala"/>
|
||
<data name="Category" value="Network"/>
|
||
<data name="Category" value="Proxy"/>
|
||
<data name="Category" value="Administrators"/>
|
||
<data name="Category" value="Developers"/>
|
||
<data name="Category" value="Data Analysts"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
For most clusters that have multiple users and production availability requirements, you might set up a proxy
|
||
server to relay requests to and from Impala.
|
||
</p>
|
||
|
||
<p>
|
||
Currently, the Impala statestore mechanism does not include such proxying and load-balancing features. Set up
|
||
a software package of your choice to perform these functions.
|
||
</p>
|
||
|
||
<note>
|
||
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
|
||
</note>
|
||
|
||
<p outputclass="toc inpage"/>
|
||
|
||
</conbody>
|
||
|
||
<concept id="proxy_overview">
|
||
|
||
<title>Overview of Proxy Usage and Load Balancing for Impala</title>
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="Concepts"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Using a load-balancing proxy server for Impala has the following advantages:
|
||
</p>
|
||
|
||
<ul>
|
||
<li>
|
||
Applications connect to a single well-known host and port, rather than keeping track of the hosts where
|
||
the <cmdname>impalad</cmdname> daemon is running.
|
||
</li>
|
||
|
||
<li>
|
||
If any host running the <cmdname>impalad</cmdname> daemon becomes unavailable, application connection
|
||
requests still succeed because you always connect to the proxy server rather than a specific host running
|
||
the <cmdname>impalad</cmdname> daemon.
|
||
</li>
|
||
|
||
<li>
|
||
The coordinator node for each Impala query potentially requires
|
||
more memory and CPU cycles than the other nodes that process the
|
||
query. The proxy server can issue queries so that each connection uses
|
||
a different coordinator node. This load-balancing technique lets the
|
||
Impala nodes share this additional work, rather than concentrating it
|
||
on a single machine.
|
||
</li>
|
||
</ul>
|
||
|
||
<p>
|
||
The following setup steps are a general outline that applies to any load-balancing proxy software:
|
||
</p>
|
||
|
||
<ol>
|
||
<li>
|
||
Select and download the load-balancing proxy software or other
|
||
load-balancing hardware appliance. It should only need to be installed
|
||
and configured on a single host, typically on an edge node. Pick a
|
||
host other than the DataNodes where <cmdname>impalad</cmdname> is
|
||
running, because the intention is to protect against the possibility
|
||
of one or more of these DataNodes becoming unavailable.
|
||
</li>
|
||
|
||
<li>
|
||
Configure the load balancer (typically by editing a configuration file).
|
||
In particular:
|
||
<ul>
|
||
<li>
|
||
Set up a port that the load balancer will listen on to relay
|
||
Impala requests back and forth. </li>
|
||
<li>
|
||
See <xref href="#proxy_balancing" format="dita"/> for load
|
||
balancing algorithm options.
|
||
</li>
|
||
<li>
|
||
For Kerberized clusters, follow the instructions in
<xref href="impala_proxy.xml#proxy_kerberos"/>.
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
|
||
<li>
|
||
If you are using Hue or JDBC-based applications, you typically set
|
||
up load balancing for both ports 21000 and 21050, because these client
|
||
applications connect through port 21050 while the
|
||
<cmdname>impala-shell</cmdname> command connects through port
|
||
21000. See <xref href="impala_ports.xml#ports"/> for when to use port
|
||
21000, 21050, or another value depending on what type of connections
|
||
you are load balancing.
|
||
</li>
|
||
|
||
<li>
|
||
Run the load-balancing proxy server, pointing it at the configuration file that you set up.
|
||
</li>
|
||
|
||
<li>
|
||
For any scripts, jobs, or configuration settings for applications
|
||
that formerly connected to a specific DataNode to run Impala SQL
|
||
statements, change the connection information (such as the
|
||
<codeph>-i</codeph> option in <cmdname>impala-shell</cmdname>) to
|
||
point to the load balancer instead.
|
||
</li>
|
||
</ol>
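<p>
As a minimal illustration of the last step, assume a hypothetical load balancer host
<codeph>loadbalancer-1.mydomain.com</codeph> that listens on port 25003 (the port used in the
sample HAProxy configuration later in this topic). Under those assumptions, an
<cmdname>impala-shell</cmdname> invocation might change as follows:
</p>
<codeblock># Before: connect directly to one impalad host (port 21000 for impala-shell).
impala-shell -i impalad-1.mydomain.com:21000

# After: connect to the load balancer, which relays the session to an impalad host.
impala-shell -i loadbalancer-1.mydomain.com:25003</codeblock>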
|
||
|
||
<note>
|
||
The following sections use the HAProxy software as a representative example of a load balancer
|
||
that you can use with Impala.
|
||
</note>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="proxy_balancing" rev="">
|
||
<title>Choosing the Load-Balancing Algorithm</title>
|
||
<conbody>
|
||
<p>
|
||
Load-balancing software offers a number of algorithms to distribute requests.
|
||
Each algorithm has its own characteristics that make it suitable in some situations
|
||
but not others.
|
||
</p>
|
||
|
||
<dl>
|
||
<dlentry>
|
||
<dt>Leastconn</dt>
|
||
<dd>
|
||
Connects sessions to the coordinator with the fewest connections,
|
||
to balance the load evenly. Typically used for workloads consisting
|
||
of many independent, short-running queries. In configurations with
|
||
only a few client machines, this setting can avoid having all
|
||
requests go to only a small set of coordinators.
|
||
</dd>
|
||
<dd>
|
||
Recommended for Impala with F5.
|
||
</dd>
|
||
</dlentry>
|
||
<dlentry>
|
||
<dt>Source IP Persistence</dt>
|
||
<dd>
|
||
<p>
|
||
Sessions from the same IP address always go to the same
|
||
coordinator. A good choice for Impala workloads containing a mix
|
||
of queries and DDL statements, such as <codeph>CREATE TABLE</codeph>
|
||
and <codeph>ALTER TABLE</codeph>. Because the metadata changes from
|
||
a DDL statement take time to propagate across the cluster, Source IP
Persistence is the preferred choice in this case. If you are unable
|
||
to choose Source IP Persistence, run the DDL and subsequent queries
|
||
that depend on the results of the DDL through the same session,
|
||
for example by running <codeph>impala-shell -f <varname>script_file</varname></codeph>
|
||
to submit several statements through a single session.
|
||
</p>
|
||
</dd>
|
||
<dd>
|
||
<p>
|
||
Required for setting up high availability with Hue.
|
||
</p>
|
||
</dd>
|
||
</dlentry>
|
||
<dlentry>
|
||
<dt>Round-robin</dt>
|
||
<dd>
|
||
<p>
|
||
Distributes connections to all coordinator nodes.
|
||
Typically not recommended for Impala.
|
||
</p>
|
||
</dd>
|
||
</dlentry>
|
||
</dl>
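<p>
As a sketch of the single-session approach described for DDL statements, assume a hypothetical
script file named <filepath>ddl_then_query.sql</filepath> that creates a hypothetical table
<codeph>t1</codeph> and then queries it. The table name, columns, and file name are
illustrative only:
</p>
<codeblock>CREATE TABLE t1 (id INT, s STRING);
INSERT INTO t1 VALUES (1, 'a');
SELECT COUNT(*) FROM t1;</codeblock>
<p>
Submitting the file with the <codeph>-f</codeph> option runs all three statements through one
session, and therefore through one coordinator, so the query does not depend on how quickly the
new table's metadata reaches other nodes. The host and port below are placeholders for your
load balancer:
</p>
<codeblock>impala-shell -i <varname>load_balancer_host</varname>:<varname>port</varname> -f ddl_then_query.sql</codeblock>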
|
||
|
||
<p>
|
||
You might need to perform benchmarks and load testing to determine
|
||
which setting is optimal for your use case. Always configure two
load-balancing algorithms: Source IP Persistence for Hue and Leastconn
for other clients.
|
||
</p>
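<p>
In HAProxy, the load balancer used for the examples in this topic, the algorithm is selected
with the <codeph>balance</codeph> directive. The fragment below is only a sketch of how the
names above map to HAProxy keywords. Each <codeph>listen</codeph> section takes a single
<codeph>balance</codeph> line, and other load balancers use their own names for the same
algorithms:
</p>
<codeblock># Leastconn: send each new connection to the server with the fewest connections.
balance leastconn

# Source IP Persistence: hash the client IP address so a client stays on one server.
balance source

# Round-robin: rotate through the servers in turn.
balance roundrobin</codeblock>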
|
||
|
||
</conbody>
|
||
</concept>
|
||
|
||
<concept id="proxy_kerberos">
|
||
|
||
<title>Special Proxy Considerations for Clusters Using Kerberos</title>
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="Security"/>
|
||
<data name="Category" value="Kerberos"/>
|
||
<data name="Category" value="Authentication"/>
|
||
<data name="Category" value="Proxy"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
In a cluster using Kerberos, applications check host credentials to
|
||
verify that the host they are connecting to is the same one that is
|
||
actually processing the request, to prevent man-in-the-middle attacks.
|
||
</p>
|
||
<p>
|
||
In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower
|
||
versions, once you enable a proxy server in a Kerberized cluster, users
|
||
cannot connect to individual Impala daemons directly from
<cmdname>impala-shell</cmdname>.
|
||
</p>
|
||
|
||
<p>
|
||
In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher,
|
||
if you enable a proxy server in a Kerberized cluster, users have an
|
||
option to connect to Impala daemons directly from
|
||
<cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> /
|
||
<codeph>--kerberos_host_fqdn</codeph> option when you start
|
||
<cmdname>impala-shell</cmdname>. This option can be used for testing or
troubleshooting purposes, but it is not recommended for live production
environments because it defeats the purpose of a load balancer/proxy.
|
||
</p>
|
||
|
||
<p>
|
||
Example:
|
||
<codeblock>
|
||
impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
|
||
</codeblock>
|
||
</p>
|
||
|
||
<p>
|
||
Alternatively, using the long-form option names:
|
||
<codeblock>impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com</codeblock>
|
||
</p>
|
||
<p>
|
||
See <xref href="impala_shell_options.xml#shell_options"/> for
|
||
information about the option.
|
||
</p>
|
||
|
||
<p>
|
||
To establish that the load-balancing proxy server is legitimate, perform
|
||
these extra Kerberos setup steps:
|
||
</p>
|
||
|
||
<ol>
|
||
<li>
|
||
This section assumes you are starting with a Kerberos-enabled cluster. See
|
||
<xref href="impala_kerberos.xml#kerberos"/> for instructions for setting up Impala with Kerberos. See
|
||
<xref keyref="cdh_sg_kerberos_prin_keytab_deploy"/> for general steps to set up Kerberos.
|
||
</li>
|
||
|
||
<li>
|
||
Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should
|
||
already have an entry <codeph>impala/<varname>proxy_host</varname>@<varname>realm</varname></codeph> in
|
||
its keytab. If not, go back over the initial Kerberos configuration steps for the keytab on each host
|
||
running the <cmdname>impalad</cmdname> daemon.
|
||
</li>
|
||
|
||
<li>
|
||
Copy the keytab file from the proxy host to all other hosts in the cluster that run the
|
||
<cmdname>impalad</cmdname> daemon. (For optimal performance, <cmdname>impalad</cmdname> should be running
|
||
on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.
|
||
</li>
|
||
|
||
<li>
|
||
Add an entry <codeph>impala/<varname>actual_hostname</varname>@<varname>realm</varname></codeph> to the keytab on each
|
||
host running the <cmdname>impalad</cmdname> daemon.
|
||
</li>
|
||
|
||
<li>
|
||
|
||
For each impalad node, merge the existing keytab with the proxy’s keytab using
|
||
<cmdname>ktutil</cmdname>, producing a new keytab file. For example:
|
||
<codeblock>$ ktutil
|
||
ktutil: read_kt proxy.keytab
|
||
ktutil: read_kt impala.keytab
|
||
ktutil: write_kt proxy_impala.keytab
|
||
ktutil: quit</codeblock>
|
||
|
||
</li>
|
||
|
||
<li>
|
||
|
||
To verify that the keytabs are merged, run the command:
|
||
<codeblock>
|
||
klist -k <varname>keytabfile</varname>
|
||
</codeblock>
|
||
Run this command on every node. The output should list the credentials for both
<codeph>principal</codeph> and <codeph>be_principal</codeph>.
|
||
</li>
|
||
|
||
|
||
<li>
|
||
|
||
Make sure that the <codeph>impala</codeph> user has permission to read this merged keytab file.
|
||
|
||
</li>
|
||
|
||
<li>
|
||
Change the following configuration settings for each host in the cluster that participates
|
||
in the load balancing:
|
||
<ul>
|
||
<li>
|
||
In the <cmdname>impalad</cmdname> option definition, add:
|
||
<codeblock>
|
||
--principal=impala/<i>proxy_host@realm</i>
|
||
--be_principal=impala/<i>actual_host@realm</i>
|
||
--keytab_file=<i>path_to_merged_keytab</i>
|
||
</codeblock>
|
||
<note>
|
||
Every host has a different <codeph>--be_principal</codeph> value because the actual hostname
is different on each host.
|
||
|
||
Specify the fully qualified domain name (FQDN) for the proxy host, not the IP
|
||
address. Use the exact FQDN as returned by a reverse DNS lookup for the associated
|
||
IP address.
|
||
|
||
</note>
|
||
</li>
|
||
|
||
<li>
|
||
Modify the startup options. See <xref href="impala_config_options.xml#config_options"/> for the procedure to modify the startup
|
||
options.
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
|
||
<li>
|
||
Restart Impala to make the changes take effect. Restart the <cmdname>impalad</cmdname> daemons on all
|
||
hosts in the cluster, as well as the <cmdname>statestored</cmdname> and <cmdname>catalogd</cmdname>
|
||
daemons.
|
||
</li>
|
||
|
||
</ol>
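<p>
The following sketch shows how the later steps might look on one host. Every name in it is an
assumption made for illustration: the proxy host <codeph>loadbalancer-1.mydomain.com</codeph>,
the <cmdname>impalad</cmdname> host <codeph>impalad-1.mydomain.com</codeph>, the realm
<codeph>EXAMPLE.COM</codeph>, and the keytab path
<filepath>/etc/impala/conf/proxy_impala.keytab</filepath>. Substitute the values for your
cluster.
</p>
<codeblock># Restrict the merged keytab to the impala user, then confirm that user can read it.
chown impala /etc/impala/conf/proxy_impala.keytab
chmod 400 /etc/impala/conf/proxy_impala.keytab
sudo -u impala klist -k /etc/impala/conf/proxy_impala.keytab</codeblock>
<p>
With those assumptions, the <cmdname>impalad</cmdname> startup options on
<codeph>impalad-1.mydomain.com</codeph> would be as follows. Only the
<codeph>--be_principal</codeph> value changes from host to host:
</p>
<codeblock>--principal=impala/loadbalancer-1.mydomain.com@EXAMPLE.COM
--be_principal=impala/impalad-1.mydomain.com@EXAMPLE.COM
--keytab_file=/etc/impala/conf/proxy_impala.keytab</codeblock>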
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="tut_proxy">
|
||
|
||
<title>Example of Configuring HAProxy Load Balancer for Impala</title>
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="Configuring"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
If you are not already using a load-balancing proxy, you can experiment with
|
||
<xref href="http://haproxy.1wt.eu/" scope="external" format="html">HAProxy</xref>, a free, open source load
|
||
balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise
|
||
Linux system.
|
||
</p>
|
||
|
||
<ul>
|
||
<li>
|
||
<p>
|
||
Install the load balancer: <codeph>yum install haproxy</codeph>
|
||
</p>
|
||
</li>
|
||
|
||
<li>
|
||
<p>
|
||
Set up the configuration file: <filepath>/etc/haproxy/haproxy.cfg</filepath>. See the following section
|
||
for a sample configuration file.
|
||
</p>
|
||
</li>
|
||
|
||
<li>
|
||
<p>
|
||
Run the load balancer (on a single host, preferably one not running <cmdname>impalad</cmdname>):
|
||
</p>
|
||
<codeblock>/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg</codeblock>
|
||
</li>
|
||
|
||
<li>
|
||
<p>
|
||
In <cmdname>impala-shell</cmdname>, JDBC applications, or ODBC applications, connect to the listener
|
||
port of the proxy host, rather than port 21000 or 21050 on a host actually running <cmdname>impalad</cmdname>.
|
||
The sample configuration file sets HAProxy to listen on port 25003, so you would send all
|
||
requests to <codeph><varname>haproxy_host</varname>:25003</codeph>.
|
||
</p>
|
||
</li>
|
||
</ul>
|
||
|
||
<p>
|
||
This is the sample <filepath>haproxy.cfg</filepath> used in this example:
|
||
</p>
|
||
|
||
<codeblock>global
|
||
# To have these messages end up in /var/log/haproxy.log you will
|
||
# need to:
|
||
#
|
||
# 1) configure syslog to accept network log events. This is done
|
||
# by adding the '-r' option to the SYSLOGD_OPTIONS in
|
||
# /etc/sysconfig/syslog
|
||
#
|
||
# 2) configure local2 events to go to the /var/log/haproxy.log
|
||
# file. A line like the following can be added to
|
||
# /etc/sysconfig/syslog
|
||
#
|
||
# local2.* /var/log/haproxy.log
|
||
#
|
||
log 127.0.0.1 local0
|
||
log 127.0.0.1 local1 notice
|
||
chroot /var/lib/haproxy
|
||
pidfile /var/run/haproxy.pid
|
||
maxconn 4000
|
||
user haproxy
|
||
group haproxy
|
||
daemon
|
||
|
||
# turn on stats unix socket
|
||
#stats socket /var/lib/haproxy/stats
|
||
|
||
#---------------------------------------------------------------------
|
||
# common defaults that all the 'listen' and 'backend' sections will
|
||
# use if not designated in their block
|
||
#
|
||
# You might need to adjust timing values to prevent timeouts.
|
||
#
|
||
# The timeout values should be dependent on how you use the cluster
|
||
# and how long your queries run.
|
||
#---------------------------------------------------------------------
|
||
defaults
|
||
mode http
|
||
log global
|
||
option httplog
|
||
option dontlognull
|
||
option http-server-close
|
||
option forwardfor except 127.0.0.0/8
|
||
option redispatch
|
||
retries 3
|
||
maxconn 3000
|
||
timeout connect 5000
|
||
timeout client 3600s
|
||
timeout server 3600s
|
||
|
||
#
|
||
# This sets up the admin page for HA Proxy at port 25002.
|
||
#
|
||
listen stats :25002
|
||
balance
|
||
mode http
|
||
stats enable
|
||
stats auth <varname>username</varname>:<varname>password</varname>
|
||
|
||
# This is the setup for Impala. Impala clients connect to load_balancer_host:25003.
# HAProxy balances connections among the servers listed below.
# Each impalad listens on port 21000 for Beeswax (impala-shell) or the original ODBC driver.
# For JDBC or the ODBC version 2.x driver, use port 21050 instead of 21000.
|
||
listen impala :25003
|
||
mode tcp
|
||
option tcplog
|
||
balance leastconn
|
||
|
||
server <varname>symbolic_name_1</varname> impala-host-1.example.com:21000 check
|
||
server <varname>symbolic_name_2</varname> impala-host-2.example.com:21000 check
|
||
server <varname>symbolic_name_3</varname> impala-host-3.example.com:21000 check
|
||
server <varname>symbolic_name_4</varname> impala-host-4.example.com:21000 check
|
||
|
||
# Setup for Hue or other JDBC-enabled applications.
|
||
# In particular, Hue requires sticky sessions.
|
||
# The application connects to load_balancer_host:21051, and HAProxy balances
|
||
# connections to the associated hosts, where Impala listens for JDBC
|
||
# requests on port 21050.
|
||
listen impalajdbc :21051
|
||
mode tcp
|
||
option tcplog
|
||
balance source
|
||
server <varname>symbolic_name_5</varname> impala-host-1.example.com:21050 check
|
||
server <varname>symbolic_name_6</varname> impala-host-2.example.com:21050 check
|
||
server <varname>symbolic_name_7</varname> impala-host-3.example.com:21050 check
|
||
server <varname>symbolic_name_8</varname> impala-host-4.example.com:21050 check
|
||
</codeblock>
|
||
<note type="important">
|
||
Hue requires the <codeph>check</codeph> option at the end of each
<codeph>server</codeph> line in the above file so that HAProxy can detect
any unreachable <cmdname>impalad</cmdname> server and fail over successfully.
Without the TCP check, you may hit an error when the <cmdname>impalad</cmdname>
daemon to which Hue tries to connect is down.
|
||
</note>
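<p>
Before starting the proxy with a configuration like the one above, it can be useful to have
HAProxy check the file for errors, and afterward to confirm that a client can connect through
the listener. The commands below are a sketch that assumes the file location and the port 25003
listener from this example:
</p>
<codeblock># Parse the configuration and report any errors without starting the proxy.
/usr/sbin/haproxy -c -f /etc/haproxy/haproxy.cfg

# After starting HAProxy, connect through the listener instead of an impalad host.
impala-shell -i <varname>haproxy_host</varname>:25003</codeblock>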
|
||
|
||
<note conref="../shared/impala_common.xml#common/proxy_jdbc_caveat"/>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
</concept>
|