This change cleans up our FE pom.xml file by removing unneeded
dependencies and system dependencies (system dependencies are now pulled in
from the Maven release repository).
The upside is that our pom is cleaner and it will also help reduce the likelihood of
broken dependencies since Maven will pull in the right versions. The downside
is that we now pull in quite a few more JARs.
Note: I was unable to find release artifacts for Sentry and Parquet, so I am leaving
those as "system" for now.
Change-Id: I0b917b09a02243d78d89747591ab6bccacf7cf38
Saving changes
Change-Id: I3697a7b44884c40e077b3e354fef76625e1b881d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1011
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
The following changes have been made:
-- Update hbase
-- Update hive
-- Update hadoop
-- Update the parquet version to 1.2.5
Change-Id: Id6ceaef0e9eebab27ffd408160116fa84ed300fb
The audit logs currently have the "impersonator" field set to what we call the doAsUser
and the "user" field set as the connected user. They should be reversed.
Added basic tests to validate the correct event gets audited.
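As a minimal sketch of the intended field assignment (not the actual audit logger; the function name and dict layout are hypothetical), the "user" field carries the doAsUser when one is set, and the "impersonator" field carries the connected user:

```python
def audit_event(connected_user, do_as_user=None):
    """Build a hypothetical audit record with the corrected field assignment."""
    if do_as_user is None:
        # No impersonation: the connected user is the effective user.
        return {"user": connected_user, "impersonator": None}
    # Impersonation: the effective user is the doAsUser, and the connected
    # user is recorded as the impersonator.
    return {"user": do_as_user, "impersonator": connected_user}
```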
Change-Id: Idfa0aaa6c88debedc4993bd0489dbd3f696fcf17
Reviewed-on: http://gerrit.ent.cloudera.com:8080/958
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This helps speed up the restart time because we don't need to restart
the catalog server and reload the table metadata. This is useful if you
want to restart the impalad with a different command line parameter
or if you are making changes to only the impalad binary.
Change-Id: I0b714afaf7e508c450a353a53d67d95165de3486
Reviewed-on: http://gerrit.ent.cloudera.com:8080/897
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Test suites that derive from common.CustomClusterTestSuite have a brand
new cluster for every test case, which they can configure as they wish
with custom arguments using the @with_args() decorator.
A future improvement is to optionally only have one cluster per test
suite, to allow multiple tests to run more quickly if they share
configuration options.
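A rough sketch of how such a decorator could work, assuming it simply records the arguments on the test method so the suite can read them before starting the cluster. The class names and the _start_cluster/run_test helpers are hypothetical and simulate the cluster rather than starting real processes:

```python
def with_args(*cluster_args):
    """Attach custom cluster start-up arguments to a test method."""
    def decorate(test_fn):
        test_fn.cluster_args = list(cluster_args)
        return test_fn
    return decorate


class CustomClusterTestSuiteSketch:
    """Starts a fresh (simulated) cluster before every test case."""

    def __init__(self):
        self.started_with = []  # arguments of each simulated cluster start

    def _start_cluster(self, args):
        # A real suite would restart the cluster processes here.
        self.started_with.append(args)

    def run_test(self, name):
        method = getattr(self, name)
        # Undecorated tests get a cluster with no extra arguments.
        self._start_cluster(getattr(method, "cluster_args", []))
        return method()


class TestExample(CustomClusterTestSuiteSketch):
    @with_args("--log_level=2")
    def test_verbose(self):
        return "ok"

    def test_default(self):
        return "ok"
```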
Change-Id: I6abd5740e644996d7ca2800edf4ff11b839d1bc4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/882
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized; data loaded through Hive and HBase
remains serial.
On our build machines, the time taken to load all the data from snapshot was on the order
of 15 minutes.
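The per-combination split can be sketched as follows. This is an illustration only: the file naming scheme and the execute callback are assumptions, and the sketch enumerates the full cross product even though a real table-format list may exclude some combinations.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def sql_file_names(file_formats, codecs, schemes):
    """One hypothetical .sql file name per format/codec/scheme combination."""
    return [
        f"load-{fmt}-{codec}-{scheme}.sql"
        for fmt, codec, scheme in product(file_formats, codecs, schemes)
    ]


def run_in_parallel(sql_files, execute, max_workers=4):
    """Run execute(sql_file) for every file concurrently.

    Results are returned in the same order as sql_files.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(execute, sql_files))
```

Here execute stands in for whatever submits one file's statements to Impala.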
Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This brings back online the process failure tests and adds a basic failure
test for the catalog service. The timeouts had to be adjusted to account for the
extra time it takes to load the catalog and for the additional statestore
subscriber. Note: the statestore 'live.backends' metric, which is used in these
tests, should be renamed because it really means 'live.subscribers'. However, the
change requires some coordination with other teams.
Also updated start-impala-cluster to check the catalog.ready flag to ensure the impalad
catalog is ready to accept queries.
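The readiness check amounts to polling a flag with a timeout. A minimal sketch, assuming a caller-supplied read_flag callable that returns the current value of catalog.ready (how the flag is fetched is not shown; the function name and defaults are hypothetical):

```python
import time


def wait_for_flag(read_flag, timeout_s=30.0, interval_s=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll read_flag() until it returns True or timeout_s elapses.

    Returns True if the flag became true, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if read_flag():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```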
Change-Id: If22e25dba7dc83aa40bec937b5f82b815bed4645
Reviewed-on: http://gerrit.ent.cloudera.com:8080/730
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This is currently broken (query options do not get set via run-workload). If any
query options are provided to run-workload, it exits with an error. This patch
re-enables setting query options through run-workload and also moves their validation to
impala_beeswax.
Change-Id: I1df010990f9e57ebd4cf59ada5d9646a883df380
Reviewed-on: http://gerrit.ent.cloudera.com:8080/820
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
This patch reworks our Kerberos authentication layer to support multiple
authentication protocols, particularly PLAIN/SASL to support external
LDAP authentication.
There is now a system-wide AuthManager object, initialised by InitAuth()
which occurs during the usual InitCommonRuntime() setup. The AuthManager
is responsible for supplying AuthProvider objects to ThriftServers and
ThriftClients. The AuthProvider in turn generates Thrift transport
objects which are usually SASL-enabled, and which either employ GSSAPI
or PLAIN mechanisms.
In miscellaneous changes:
* Cyrus SASL now builds both with LDAP and the dummy '--enable-true'
external authentication mechanisms enabled.
* To test PLAIN/SASL authentication, you must now include
$IMPALA_HOME/thirdparty/${IMPALA_CYRUS_SASL_VERSION}/build/lib/sasl2 in
FLAGS_sasl_path.
* The shell now has an option to authenticate using LDAP, and will
prompt for a password at startup before doing so.
* Since the authentication code is almost entirely Thrift-specific, it
has been moved to the rpc lib.
Change-Id: I771de50f05630efdf1606ab9f0f48146ad54595e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/716
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
We've had at least one case of Sasl failing to build during
Saslauthd. We don't use that component, so it's fine to disable it
rather than figure out the actual issue.
Change-Id: I1e16063970806823f7fe3b40a1b0e74a32c4b57f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/736
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
Now you can write:
./build_thirdparty -sasl -gflags
or similar to build individual thirdparty libraries, which is handy if
you're upgrading a single library or changing its build flags.
The behaviour with no command-line flags is the same as before this
patch, except that the 'git clean' is called only from the individual
library directories, rather than /thirdparty as before; this avoids
blowing away unchecked-in directories while still removing build
artefacts as intended.
Change-Id: Iaafb6f6e42b0173c11eec3b08c8dea895dcd9199
Reviewed-on: http://gerrit.ent.cloudera.com:8080/725
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This change adds support for user impersonation for HS2 authorization
requests. It adds a new flag (--authorized_proxy_user_config) that, if
set, allows users (e.g. hue) to impersonate another user. The user they
wish to impersonate is passed using the HS2 configuration property
'impala.doas.user'.
The configuration also allows specifying the list of users a proxy user
can impersonate, or '*' to allow the proxy user to impersonate
any user. For example: hue=user1,user2,admin=*
Change-Id: I2a13e31e5bde2e6df47134458c803168415d0437
Reviewed-on: http://gerrit.ent.cloudera.com:8080/574
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Adds basic support for catalogd to our ImpalaCluster test library/object model.
This will allow us to write more programmatic tests targeting the catalogd process,
including process failure tests and metric check validators.
Change-Id: I8e5f7bc73f999f105437c6d3d52c6d436a354d2d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/617
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
There is a Hive Metastore concurrency bug (HIVE-5457) which causes concurrent
calls to getTable() to sometimes fail with DataNucleus exceptions. This
causes catalogd to fail to load ALL metadata for all tables. The fix is to
serialize our calls to getTable(). Additionally, tweaked the logging a bit and
improved start-impala-cluster to do a better job of reporting the status of catalog
initialization. It's too bad we have to serialize these calls, but we seem to be able
to run everything else in parallel with no problems (get col stats, block md, etc).
Also added a couple of changes in our hive-site to match the defaults for our cluster
metastore deployments.
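The locking pattern described above can be sketched like this. It is a Python illustration of the idea, not the catalog code: get_table and load_rest are hypothetical callables standing in for the metastore call and for the remaining per-table work (column stats, block metadata, and so on).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Single lock shared by all loader threads; only getTable() is guarded.
_get_table_lock = threading.Lock()


def load_table(name, get_table, load_rest):
    with _get_table_lock:
        table = get_table(name)  # serialized: one caller at a time
    return load_rest(table)      # everything else still runs in parallel


def load_all(names, get_table, load_rest, max_workers=8):
    """Load all tables concurrently; results keep the order of names."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda n: load_table(n, get_table, load_rest), names))
```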
Change-Id: Ic70e2a9b8190a56510e430d8da3942dca252eb4c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/609
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
directly connect and execute their DDL operations.
The CatalogService has two main components. The first is a C++ server that implements
StateStore integration, the Thrift service implementation, and exporting of the debug
webpage/metrics. The second is the Java Catalog that manages caching and updating of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.
Some Notes On the Changes
---
* The metadata is all sent as Thrift structs. To do this, all catalog objects (Tables/Views,
Databases, UDFs) have a Thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two separate sub-classes: an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.
What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
contains the change. An impalad will wait for the statestore heartbeat that contains this
version before returning from the DDL command.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
same JAR.
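The "wait for the catalog version" behaviour listed under "What is working" can be sketched with a condition variable. This is a simplified Python model, not the impalad implementation; the class and method names are hypothetical.

```python
import threading


class CatalogVersionTracker:
    """Tracks the newest catalog version seen in statestore heartbeats."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0

    def on_heartbeat(self, version):
        """Called for each heartbeat carrying a catalog version."""
        with self._cond:
            if version > self._version:
                self._version = version
                self._cond.notify_all()

    def wait_for_version(self, min_version, timeout_s=None):
        """Block until a heartbeat with at least min_version has arrived.

        Returns False if timeout_s elapses first.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._version >= min_version, timeout=timeout_s)
```

A DDL call would take the version returned by the Catalog Service and pass it to wait_for_version() before returning to the client.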
Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).
It introduces two new command line options in bin/run-workload.py:
* --execution_scope
The default scope is 'query', and it maintains previous semantics. The
new scope is 'workload', which toggles the unit of execution to a workload.
* --shuffle_query_exec_order.
Shuffles the order in which queries are executed (only applicable when the
execution_scope is 'workload'), defaults to False.
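To illustrate the difference between the two scopes, here is a small sketch of how units of execution could be formed. It is not the run-workload.py code: plan_execution, its seed parameter, and the dict-of-lists input are assumptions made for the example.

```python
import random


def plan_execution(workloads, execution_scope="query",
                   shuffle_query_exec_order=False, seed=None):
    """Group queries into units of execution.

    workloads maps a workload name to its list of query names. Each returned
    unit is a list of (workload, query) pairs that would run together.
    """
    rng = random.Random(seed)
    units = []
    for name, queries in workloads.items():
        if execution_scope == "query":
            # Previous semantics: every query is its own unit.
            units.extend([[(name, q)] for q in queries])
        elif execution_scope == "workload":
            ordered = list(queries)
            if shuffle_query_exec_order:
                rng.shuffle(ordered)
            units.append([(name, q) for q in ordered])
        else:
            raise ValueError("unknown execution scope: %s" % execution_scope)
    return units
```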
Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Python modules on Red Hat systems might be in lib or in lib64, unlike Debian systems, which
symlink one to the other.
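One way to cope with both layouts is to probe each candidate directory. A minimal sketch, assuming a conventional <prefix>/<libdir>/<pythonX.Y>/site-packages layout (the function names and that layout are assumptions, not taken from the build scripts):

```python
import os


def candidate_module_dirs(prefix, python_version):
    """Return the lib and lib64 site-packages paths under prefix."""
    return [
        os.path.join(prefix, libdir, python_version, "site-packages")
        for libdir in ("lib", "lib64")
    ]


def existing_module_dirs(prefix, python_version, exists=os.path.isdir):
    """Keep only the candidate directories that actually exist."""
    return [d for d in candidate_module_dirs(prefix, python_version)
            if exists(d)]
```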
Change-Id: Ia1e2d362e3d7e13b87c70e7578644827a5234a91
Reviewed-on: http://gerrit.ent.cloudera.com:8080/544
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This patch allows Impala to start either Beeswax or HS2 on an
SSL-secured port. SSL is a certificate-based authentication scheme,
where the server provides a certificate to the client as part of the
handshake process. The client verifies that certificate, either by
contacting a trusted third-party certificate authority (CA), or by
accepting a 'self-signed' certificate from the server that is also
provided to the client out-of-band; the client simply compares the two
certificate copies.
Once the certificate is verified, the client and server negotiate an
encryption key for the session, using a public key provided by the
server to encrypt that negotiation. Therefore the server has to have
access to a private key in order to decrypt the encryption key.
Both certificate and key are stored in industry standard .PEM
format. Impala uses the same certificate and key for both Beeswax and
HS2, and the files containing the certificate and key are provided via
--ssl_server_certificate and --ssl_private_key. If either is non-blank,
SSL is enabled for Beeswax and HS2.
The Python shell supports SSL as of this patch via new --ssl and
--ca_cert flags.
Finally, this patch also adds support for Impala's ThriftClients to use
SSL, paving the way for having the backend service use encryption on the
wire as well (although such a configuration is not used by this
patch). The client SSL support is only currently used for the new test
case.
This patch does not enable 'mutual' authentication, where clients
provide certificates to the server in order to authenticate
themselves. Impala has other authentication mechanisms for that purpose.
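For readers unfamiliar with the certificate flow, the following sketch shows the same roles using Python's standard ssl module. It is not Impala's Thrift-based code: the function names are hypothetical, a self-signed certificate is modelled by trusting the certificate file itself as the CA, and hostname checking is simply the standard library default.

```python
import ssl


def make_client_context(ca_cert=None):
    """Client side: verify the server's certificate.

    ca_cert is a path to a PEM file (a CA certificate, or a copy of a
    self-signed server certificate). With None, the system's default
    trusted CAs are used.
    """
    return ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH, cafile=ca_cert)


def make_server_context(certificate, private_key):
    """Server side: present a certificate and hold the matching private key.

    Both arguments are paths to PEM files. No client certificate is
    requested, so this is not mutual authentication.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=certificate, keyfile=private_key)
    return ctx
```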
Change-Id: I3942aa0d21b34b7cda748292f04a9523f35ee6d4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/514
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Previously, the user-specified command-line parameter --log_level was not being
taken into account while starting the mini impala cluster.
Change-Id: I433412b6a7057585136d2ad887010881217d9676
Reviewed-on: http://gerrit.ent.cloudera.com:8080/520
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
We now maintain our own internal version of the Mongoose webserver,
renamed to 'Squeasel' for differentiation. This patch imports the new
code, and swaps all mentions of mongoose or mg_ for squeasel / sq_.
In the future, we might consider making Squeasel a git subproject so
that we can pull in changes more easily.
Change-Id: I83b595dc336a32f2c8aba59eee420b71274b681b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/485
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.
Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This change cleans up start-impala-cluster to remove all the unneeded log4j setup code.
As part of this change, updated the start-impalad script to "exec" the impala binaries,
which removes the .sh wrapper script from the list of running processes.
Change-Id: I5dee49b72ff51012bf43ab9d2a3a21fd2b841ff5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/270
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
This change modifies run-all-tests to run the BE unit tests before all other tests.
This is beneficial for a number of reasons - it helps catch basic bugs earlier
(query tests will probably fail if there is a BE unit test failure) and also allows
us to keep the mini-impala-cluster running after a build to help with debugging.
Ideally, we could also run the FE unit tests before the query tests but there is
currently a dependency on the TPC-H temp tables generated by the query tests so this
cannot be done.
Change-Id: Id43dbac456236258cd9986e990779d27f5d41075
Reviewed-on: http://gerrit.ent.cloudera.com:8080/269
Tested-by: jenkins
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>