This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
includes USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
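A minimal sketch of the kind of guard this implies in the environment
scripts (the variable name comes from the text above; where and how the
ban is actually implemented in the patch may differ):
  if [[ "${IMPALA_HIVE_MAJOR_VERSION}" -lt 3 ]]; then
    echo "IMPALA_HIVE_MAJOR_VERSION < 3 is no longer supported" >&2
    return 1
  fi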
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
With HADOOP-16711, Hadoop added extra validation during the
initialization of S3AFileSystem that verified that the caller had
permissions on the S3 bucket specified. Some frontend tests use
non-existent S3 buckets in URIs to check analysis behavior. These
started to fail with the new validation.
This changes the core-site.xml configuration to disable the new
validation by setting fs.s3a.bucket.probe=1. This is equivalent
to the old behavior, so the frontend tests can again run without
AWS credentials.
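A quick way to confirm the effective value in a dev environment, assuming
the generated minicluster config is on the Hadoop classpath (the property
name comes from HADOOP-16711):
  hdfs getconf -confKey fs.s3a.bucket.probe   # expect 1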
Testing:
- Hand tested the failing tests (AnalyzeDDLTest, ExplainTest, PlannerTest)
- Ran the core job with USE_CDP_HIVE=true and USE_CDP_HIVE=false
Change-Id: Id61ffbf686f8b7827e7fbf13167cfc1dfc06a325
Reviewed-on: http://gerrit.cloudera.org:8080/15799
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Anurag Mantripragada <anurag@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Automatically assume IMPALA_HOME is the source directory
in a couple of places.
Delete the cache_tables.py script and the MINI_DFS_BASE_DATA_DIR
config var, both of which had bit-rotted and were unused.
Allow setting IMPALA_CLUSTER_NODES_DIR to put the minicluster
nodes, most importantly the data, in a different location, e.g.
on a different filesystem.
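For example, to put the node directories (and their data) on a larger
filesystem before creating the cluster (the path below is illustrative):
  export IMPALA_CLUSTER_NODES_DIR=/data1/impala-cluster-nodes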
Testing:
I set up a dev environment using this code and was able to
load data and run some tests.
Change-Id: Ibd8b42a6d045d73e3ea29015aa6ccbbde278eec7
Reviewed-on: http://gerrit.cloudera.org:8080/15687
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recently Kudu made enhancements to time source configuration and
adjusted the time source for local clusters/tests to `system_unsync`.
This patch mirrors that behavior in Impala test clusters, since there is
no need for an NTP-synchronized clock in a test where all the
participating Kudu masters and tablet servers run on the same node
using the same local wall clock.
See the Kudu commit here for details:
eb2b70d4b9
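The flag this mirrors, as it would appear in the minicluster's Kudu
master/tserver configuration (the exact template files touched by this
patch are not listed here, so treat the location as an assumption):
  --time_source=system_unsync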
While making this change, I removed all NTP-related packages and special
handling, as they should no longer be needed in a development
environment. I also added curl and gawk, which were missing in my
Docker Ubuntu environment and broke my testing.
Testing:
I tested with the steps below using Docker for Mac:
docker rm impala-dev
docker volume rm impala
docker run --privileged --interactive --tty --name impala-dev -v impala:/home -p 25000:25000 -p 25010:25010 -p 25020:25020 ubuntu:16.04 /bin/bash
apt-get update
apt-get install sudo
adduser --disabled-password --gecos '' impdev
echo 'impdev ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
su - impdev
cd ~
sudo apt-get --yes install git
git clone https://git-wip-us.apache.org/repos/asf/impala.git ~/Impala
cd ~/Impala
export IMPALA_HOME=`pwd`
git remote add fork https://github.com/granthenke/impala.git
git fetch fork
git checkout kudu-system-time
$IMPALA_HOME/bin/bootstrap_development.sh
source $IMPALA_HOME/bin/impala-config.sh
(pushd fe && mvn -fae test -Dtest=AnalyzeDDLTest)
(pushd fe && mvn -fae test -Dtest=AnalyzeKuduDDLTest)
$IMPALA_HOME/bin/start-impala-cluster.py
./tests/run-tests.py query_test/test_kudu.py
Change-Id: Id99e5cb58ab988c3ad4f98484be8db193d5eaf99
Reviewed-on: http://gerrit.cloudera.org:8080/15568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps CDP_BUILD_NUMBER to 2244454, which contains a change
introduced by RANGER-2688. Due to this change, we add a cookie-related
configuration to ranger-admin-site.xml.template so that the Ranger
server can start properly.
Testing:
Verified that the data loading passes and that all the Ranger-related FE
and E2E tests are successful
- when $USE_CDP_HIVE is false, and
- when $USE_CDP_HIVE is true.
Change-Id: I7750f73834368c7109965e78b147238fc6316f49
Reviewed-on: http://gerrit.cloudera.org:8080/15533
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The kerberized minicluster is enabled by setting
IMPALA_KERBERIZE=true in impala-config-*.sh.
After setting it, you must run ./bin/create-test-configuration.sh
and then restart the minicluster.
This adds a script to partially automate setup of a local KDC,
in lieu of the unmaintained minikdc support (which has been ripped
out).
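A minimal sketch of that workflow (the restart step uses the run-all.sh
convention mentioned elsewhere in this log):
  # in impala-config-local.sh (or another impala-config-*.sh):
  export IMPALA_KERBERIZE=true
  # then regenerate configs and restart the minicluster:
  ./bin/create-test-configuration.sh
  ./testdata/bin/run-all.sh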
Testing:
I was able to run some queries against pre-created HDFS tables
with kerberos enabled.
Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee
Reviewed-on: http://gerrit.cloudera.org:8080/15159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that the
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
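A quick way to eyeball the resulting layout after dataload (paths from the
description above):
  hdfs dfs -ls /test-warehouse           # external (non-transactional) tables
  hdfs dfs -ls /test-warehouse/managed   # managed (transactional) tables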
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will use a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The minicluster init scripts currently keep track of pids
for HDFS, YARN, etc. by writing the pid of each service to a file.
They use the pid in the file to see what is running and
needs to be shut down or started. Currently, they do not remove the pid
file after the minicluster shuts down. This means they could be
reading a stale pid from the pid file when checking whether the service
is already running. If that pid has been reused by something else, they
can fail to start up a necessary service.
This removes the pid files when the minicluster components shut down
successfully.
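A minimal sketch of the idea, in the style of an init-script stop function
(the names are illustrative, not the actual scripts):
  stop_service() {
    if [[ -f "${PID_FILE}" ]]; then
      kill "$(cat "${PID_FILE}")" && rm -f "${PID_FILE}"
    fi
  }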
Change-Id: I5b14d74df8061b6595b9897df9c9667e3f569e34
Reviewed-on: http://gerrit.cloudera.org:8080/14950
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In IMPALA-9047, we disabled some Ranger-related FE and BE tests due to
changes in Ranger's behavior after upgrading Ranger from 1.2 to 2.0.
This patch aims to re-enable those disabled FE tests in
AuthorizationStmtTest.java and RangerAuditLogTest.java to increase
Impala's test coverage of authorization via Ranger.
There are at least two major changes in Ranger's behavior in the newer
versions.
1. The first is that the owner of the requested resource no longer has
to be explicitly granted privileges in order to access the resource.
2. The second is that a user not explicitly granted the privilege of
creating a database is able to do so.
Due to these changes, some of the previous Ranger authorization requests
that were expected to be rejected are now granted after the upgrade.
To re-enable the tests affected by the first change described above, we
modify AuthorizationTestBase.java to allow our FE Ranger authorization
tests to specify the requesting user in an authorization test. Those
tests failed after the upgrade because the default requesting user in
Impala's AuthorizationTestBase.java happens to be the owner of the
resources involved in our FE authorization tests. After this patch, a
requesting user can be either a non-owner user or an owner user in a
Ranger authorization test, and the requesting user defaults to a
non-owner user if it is not explicitly specified. Note that in a Sentry
authorization test, we do not default the requesting user to the
non-owner user as we do in the Ranger authorization tests. Instead, we
set the requesting user's name to be the same as that of the
owner user in the Ranger authorization tests, to avoid having to provide
a customized group mapping service when instantiating a Sentry
ResourceAuthorizationProvider, as we do in AuthorizationTest.java, our
FE tests specifically for testing authorization via Sentry.
On the other hand, to re-enable the tests affected by the second change,
we remove from the Ranger policy for all databases the allowed
condition that grants any user the privilege of creating a database,
which is not by default granted by Sentry. After the removal of the
allowed condition, those tests in AuthorizationStmtTest.java and
RangerAuditLogTest.java affected by the second change now result in the
same authorization errors as before the upgrade of Ranger.
Testing:
- Passed AuthorizationStmtTest.java in a local dev environment
- Passed RangerAuditLogTest.java in a local dev environment
Change-Id: I228533aae34b9ac03bdbbcd51a380770ff17c7f2
Reviewed-on: http://gerrit.cloudera.org:8080/14798
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds support for the syntax of CREATE TABLE (and CTAS)
statements for managed Kudu tables with Kudu/HMS integration. A follow
up patch will address the actual handling of CREATE TABLE statement
with Kudu/HMS integration.
For a managed table the syntax remains the same. However, the detailed
changes include:
1) A Kudu table will always be created with the new Kudu storage handler
'org.apache.kudu.hive.KuduStorageHandler', even when Kudu/HMS integration
is disabled. The legacy storage handler will eventually be deprecated.
2) When Kudu/HMS integration is enabled, the Kudu table underneath the
managed HMS table will follow the naming convention 'db_name.table_name'
instead of 'impala::db_name.table_name'.
3) Add 'kudu.table_id' table property to be used with Kudu/HMS integration.
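For illustration, the user-facing syntax stays the same; with the
integration enabled, the statement below would create a backing Kudu table
named 'default.kudu_example' rather than 'impala::default.kudu_example'
(table and column names here are made up):
  impala-shell -q "CREATE TABLE kudu_example (id BIGINT PRIMARY KEY, name STRING) PARTITION BY HASH(id) PARTITIONS 3 STORED AS KUDU"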
This commit also extracts Kudu-related DDL parsing and analyzing tests,
so that they can be run with or without Kudu/HMS integration enabled.
Change-Id: I465673d749221bd5f3772814b1c22c2673a53f5c
Reviewed-on: http://gerrit.cloudera.org:8080/13318
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Some tests can fail on S3 due to operations that are only eventually
consistent. S3Guard stores extra metadata in a DynamoDB table to solve
several of these consistency issues.
This adds support for running the minicluster on S3 with S3Guard.
S3Guard is configured by the following environment variables:
S3GUARD_ENABLED: defaults to false, set to true to enable S3Guard
S3GUARD_DYNAMODB_TABLE: name of the DynamoDB table to use. This must
be exclusively owned by this minicluster. The dataload scripts
initialize this table and will purge entries if the table already
exists. The table should be in the same region as the S3_BUCKET
for the minicluster.
S3GUARD_DYNAMODB_REGION: AWS region for S3GUARD_DYNAMODB_TABLE
These environment variables only impact S3 configurations.
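A hedged example of enabling this for a dev minicluster before dataload
(the table name and region are illustrative):
  export S3GUARD_ENABLED=true
  export S3GUARD_DYNAMODB_TABLE=my-impala-s3guard
  export S3GUARD_DYNAMODB_REGION=us-west-2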
The support comes from three pieces:
1. Configuration changes in core-site.xml to add the appropriate
parameters.
2. Updating dataload to initialize/purge the s3guard dynamodb table
and import data appropriately.
3. Update tests to manipulate files through the HDFS command line
rather than through s3 utilities. This takes the filesystem
utility code for ABFS (which actually calls HDFS command line),
makes it generic, and uses it for S3Guard.
Testing:
- Ran multiple rounds of s3 tests
- Aborted tests in the middle and restarted the s3 tests (to test
the s3guard reinitialization code)
Change-Id: I3c748529a494bb6e70fec96dc031523ff79bf61d
Reviewed-on: http://gerrit.cloudera.org:8080/13020
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Dataload on Hive 3 uses YARN, but YARN is not needed for Hive 2
dataload. This fixes the condition so that YARN does not start
for USE_CDP_HIVE=false (Hive 2). It also fixes a similar condition
for manipulating the classpath in testdata/bin/run-hive-server.sh.
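A minimal sketch of the kind of guard this implies (the function name is
illustrative, not the actual script):
  if [[ "${USE_CDP_HIVE}" == "true" ]]; then
    start_yarn   # Hive 3 dataload runs its queries on YARN
  fi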
Change-Id: If2b0a529eadd5a436f0318229600180f72a27207
Reviewed-on: http://gerrit.cloudera.org:8080/13343
Reviewed-by: Todd Lipcon <todd@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This switches away from Tez local mode to tez-on-YARN. After spending a
couple of days trying to debug issues with Tez local mode, it seemed
like it was just going to be too much of a lift.
This patch switches on starting a Yarn RM and NM when
USE_CDP_HIVE is enabled. It also switches to a new yarn-site.xml with a
minimized set of configurations, generated by the new Python templating.
In order for everything to work properly I also had to update the Hadoop
dependency to come from CDP instead of CDH when using CDP Hive.
Otherwise, the classpath of the launched Tez containers had conflicting
versions of various Hadoop classes which caused tasks to fail.
I verified that this fixes concurrent query execution by running queries
in parallel in two beeline sessions. With local mode, these queries
would periodically fail due to various races (HIVE-21682). I'm also able
to get farther along in data loading.
Change-Id: If96064f271582b2790a3cfb3d135f3834d46c41d
Reviewed-on: http://gerrit.cloudera.org:8080/13224
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
This patch bumps the CDP_BUILD_NUMBER to 1013201. It also
refactors bootstrap_toolchain.py to be more generic when dealing with
CDP components, e.g. Ranger and Hive 3.
The patch also fixes some TODOs to replace the rangerPlugin.init() hack
with rangerPlugin.refreshPoliciesAndTags() API available in this Ranger
build.
Testing:
- Ran core tests
- Manually verified that there is no regression when starting Hive 3 with
USE_CDP_HIVE=true
Change-Id: I18c7274085be4f87ecdaf0cd29a601715f594ada
Reviewed-on: http://gerrit.cloudera.org:8080/13002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GRANT privilege statements to GROUP and
REVOKE privilege statements from GROUP. The grammar has been updated to
support FROM GROUP and TO GROUP for GRANT/REVOKE statements, i.e.:
GRANT <privilege> ON <resource> TO GROUP <group>
REVOKE <privilege> ON <resource> FROM GROUP <group>
Currently, only Ranger's authorization implementation supports GROUP
based privileges. Sentry will throw an UnsupportedOperationException if
it is the enabled authorization provider and this new grammar is used.
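A hedged example of the new syntax in use, via impala-shell (the database
and group names are illustrative; per the above, this requires Ranger as
the authorization provider):
  impala-shell -q "GRANT SELECT ON DATABASE functional TO GROUP analysts"
  impala-shell -q "REVOKE SELECT ON DATABASE functional FROM GROUP analysts"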
Testing:
- AuthorizationStmtTest was updated to also test for GROUP
authorization.
- ToSqlTest was updated to test for GROUP changes to the grammar.
- A GROUP based E2E test was added to test_ranger.py
- ParserTest was updated to test combinations for GrantRevokePrivilege
- AnalyzeAuthStmtsTest was updated to test for USER and GROUP identities
- Ran all FE tests
- Ran authorization E2E tests
Change-Id: I28b7b3e4c776ad1bb5bdc184c7d733d0b5ef5e96
Reviewed-on: http://gerrit.cloudera.org:8080/12914
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds initial support for Ranger, which can be enabled for
enforcement via the following flags in both impalad and catalogd:
- ranger_service_type=hive
- ranger_app_id=some_app_id
- authorization_factory_class=\
org.apache.impala.authorization.ranger.RangerAuthorizationFactory
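A hedged example of wiring these flags into a dev minicluster (the app id
is illustrative; both daemons get the same flags):
  ./bin/start-impala-cluster.py \
    --impalad_args="--ranger_service_type=hive --ranger_app_id=impala --authorization_factory_class=org.apache.impala.authorization.ranger.RangerAuthorizationFactory" \
    --catalogd_args="--ranger_service_type=hive --ranger_app_id=impala --authorization_factory_class=org.apache.impala.authorization.ranger.RangerAuthorizationFactory"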
The Ranger plugin for Impala uses Hive service definition to allow
sharing Ranger policies between Hive and Impala. Temporarily, the REFRESH
privilege uses the "read" access type; this will be updated in a later
patch once Ranger supports a "refresh" access type.
There's a change in DESCRIBE <table> privilege requirement to use ANY
privilege instead of VIEW_METADATA privilege as the first-level check
to play nicely with Ranger. This is not a security risk since the
column-level filtering logic after the first-level check will use
VIEW_METADATA privilege to filter out unauthorized column access. In
other words, DESCRIBE <table> may return an empty result instead of
an authorization error as long as the user holds any privilege on the
given table.
This patch updates AuthorizationStmtTest with a parameterized test that
runs the tests against Sentry and Ranger.
Testing:
- Updated AuthorizationStmtTest with Ranger
- Ran all FE tests
- Ran all E2E authorization tests
Change-Id: I8cad9e609d20aae1ff645c84fd58a02afee70276
Reviewed-on: http://gerrit.cloudera.org:8080/12632
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch updates the build scripts to support Apache Ranger:
- Download Apache Ranger
- Setup Apache Ranger database
- Create Apache Ranger configuration files
- Start/stop Apache Ranger
Testing:
- Ran ./buildall.sh -format on a clean repository and was able to start
Ranger without any problem.
- Ran test-with-docker
Change-Id: I249cd64d74518946829e8588ed33d5ac454ffa7b
Reviewed-on: http://gerrit.cloudera.org:8080/12469
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
While ntp-wait is in the default $PATH on Ubuntu, it's not on CentOS.
Use the absolute path of the binary (same in both flavors).
Testing:
- Ran privately on CentOS both with and without ntp-wait installed; GVO
will cover Ubuntu.
Change-Id: I5d709d121a71e62b8ee4027db81b53108f389fdd
Reviewed-on: http://gerrit.cloudera.org:8080/12369
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch does the work to load data and run some end-to-end
query tests on a dockerised cluster. Changes were required
in start-impala-cluster.py/ImpalaCluster and in some configuration
files.
ImpalaCluster is used for various things, including discovering
service ports and testing for cluster readiness. This patch adds
basic support for dockerised clusters and uses it from
start-impala-cluster.py to check for cluster readiness. Some logic is moved from
start-impala-cluster.py to ImpalaCluster.
Limitations:
* We're fairly inconsistent about whether services listen only on
a single interface (e.g. loopback, traditionally) or whether they
listen on all interfaces. This doesn't fix all of those issues.
E.g. HDFS datanodes listen on all interfaces to work around
some issues.
* Many tests don't pass yet, particularly those using
ImpalaCluster(), which isn't initialised with the appropriate
docker arguments.
Testing:
Did a full data load locally using a dockerised Impala cluster:
START_CLUSTER_ARGS="--docker_network=impala-cluster" \
TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster" \
./buildall.sh -format -testdata -ninja -notests -skiptests -noclean
Ran a selection of end-to-end tests touching HDFS, Kudu and HBase
tables after I loaded data locally.
Ran exhaustive tests with non-dockerised impala cluster.
Change-Id: I98fb9c4f5a3a3bb15c7809eab28ec8e5f63ff517
Reviewed-on: http://gerrit.cloudera.org:8080/12189
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This refactors start-impala-cluster.py to allow multiple implementations
of the minicluster operations like start and stop. There are now
two classes implementing the same set of operations -
MiniClusterOperations and DockerMiniClusterOperations. The docker
versions start and stop the containers added in IMPALA-7948.
With some configuration (see instructions below), the containers can
connect back to services (HDFS, HMS, Kudu, Sentry, etc) running on the
host. Config generation was modified so that services optionally
communicate via the docker bridge network rather than loopback
(the host's loopback interface is not accessible to the containers).
Notes:
* I improved the container build to regenerate containers when cluster
configs are regenerated (previously the containers could have stale
configs).
* Switch from CMD to ENTRYPOINT to allow passing in arguments to "docker
run" without clobbering default args.
* Python 2.6 is not supported for this code path. This only affects
CentOS 6, which has limited support for docker anyway.
* I deferred implementing wait_for_cluster(), since the existing
code requires surgery to abstract out assumptions about locating
processes and web UI ports - see IMPALA-7988.
How to use:
==========
Create a docker network to use for internal cluster communication,
e.g.:
docker network create -d bridge --gateway=172.17.0.1 \
--subnet=172.17.0.1/16 impala-cluster
Add the gateway address of the docker network you created to
impala-config-local.sh, e.g.:
export INTERNAL_LISTEN_HOST=172.17.0.1
export DEFAULT_FS=hdfs://${INTERNAL_LISTEN_HOST}:20500
Regenerate configs and docker images:
. bin/impala-config.sh
./bin/create-test-configuration.sh
ninja -j $IMPALA_BUILD_THREADS docker_images
Restart the minicluster and Impala services to pick up the config:
./testdata/bin/run-all.sh
start-impala-cluster.py --docker_network impala-cluster
You can connect with impala-shell and run some queries. You will
likely run into issues, particularly if running against an existing
data load, since "localhost" or "127.0.0.1" get baked into HMS
table definitions.
Testing:
Ran exhaustive tests (not using Docker) to make sure I didn't break
anything.
Change-Id: I5975cced33fa93df43101dd47d19b8af12e93d11
Reviewed-on: http://gerrit.cloudera.org:8080/12095
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A few unversioned artifacts crept in over time without corresponding
.gitignore entries. These updates are based on the git status output
in my dev env.
Change-Id: I281ab3b5c98ac32e5d60663562628ffda6606a6a
Reviewed-on: http://gerrit.cloudera.org:8080/11787
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HADOOP-15407 adds a new FileSystem implementation called "ABFS" for the
ADLS Gen2 service. It's in the hadoop-azure module as a replacement for
WASB. Filesystem semantics should be the same, so skipped tests and
other behavior changes have simply mirrored what is done for ADLS Gen1
by default. Tests skipped on ADLS Gen1 due to eventual consistency of
the Python client can be run against ADLS Gen2.
Change-Id: I5120b071760e7655e78902dce8483f8f54de445d
Reviewed-on: http://gerrit.cloudera.org:8080/11630
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch transitions from pulling in Kudu (libkudu_client.so and the
minicluster tarballs) from the toolchain to pulling Kudu in with
the other CDH components.
For OSes where the CDH binaries are not provided but the toolchain
binaries are (only Ubuntu 14), we set USE_CDH_KUDU to false to
continue to download the toolchain binaries. We also continue
to use the toolchain binaries to build the client stub for OSes
where KUDU_IS_SUPPORTED is false.
This patch also fixes an issue in bootstrap_toolchain.py where we were
using the wrong g++ to compile the Kudu stub.
Testing:
- Verified building and running Impala works as expected for supported
combinations of KUDU_IS_SUPPORTED/USE_CDH_KUDU
Change-Id: If6e1048438b6d09a1b38c58371d6212bb6dcc06c
Reviewed-on: http://gerrit.cloudera.org:8080/11363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch removes the use of IMPALA_MINICLUSTER_PROFILE. The code that
uses IMPALA_MINICLUSTER_PROFILE=2 is removed, and the code from
IMPALA_MINICLUSTER_PROFILE=3 becomes the default. To avoid too many code
changes in this patch, there is no code change for the shims: the shims
for IMPALA_MINICLUSTER_PROFILE=3 automatically become the default
implementation.
Testing:
- Ran core and exhaustive tests
Change-Id: Iba4a81165b3d2012dc04d4115454372c41e39f08
Reviewed-on: http://gerrit.cloudera.org:8080/10940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HDFS trash checkpointing renames files in the trash folder and breaks
Impala tests. Impala set the trash checkpointing interval to 1440 to try
to postpone checkpointing for 24 hours. Unfortunately, that told HDFS to
checkpoint whenever the UNIX time is a multiple of 1440 * 60, which broke
trash-related tests run around midnight GMT. This patch sets the interval
to 541728000 so that HDFS won't do the checkpointing until Jan 1st 3000,
and HDFS will checkpoint every 1030 years after that.
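For reference, the arithmetic behind the chosen value (the interval is in
minutes; GNU date shown):
  echo $((541728000 * 60))    # 32503680000 seconds
  date -u -d @32503680000     # Jan 1st 3000, 00:00 UTC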
Change-Id: I9452f7e44c7679f86a947cd20115c078757223d8
Reviewed-on: http://gerrit.cloudera.org:8080/10742
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This avoids leftover keystore files from previous instances
of the cluster, e.g. testdata/cluster/cdh6/node-1/data/kms.keystore.
I checked on my local system and that is the only file outside of
the nn and dn subdirectories.
Change-Id: I9e6c1afa0df95b26446ccafa0c942e28ba848dae
Reviewed-on: http://gerrit.cloudera.org:8080/10613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Configures a Java property for KMS to account for JDK 8u171's security fixes. I
was seeing impala-py.test tests/metadata/test_hdfs_encryption.py fail with the
following error:
AssertionError: Error creating encryption zone: RemoteException: Can't recover key for testkey1 from keystore file:/home/impdev/Impala/testdata/cluster/cdh6/node-1/data/kms.keystore
The issue is described in HDFS-13494, and I imagine it'll be fixed in due time. In the
meantime, setting this property seems to do the trick.
Change-Id: I2d21c9cce3b91e8fd8b2b4f1cda75e3958c977d5
Reviewed-on: http://gerrit.cloudera.org:8080/10418
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Erasure coding data loading is flaky in two ways:
1. HBase sometimes doesn't work because of HBASE-19369
2. Nested data loading sometimes fails because the HDFS namenode cannot
find enough good datanodes.
For problem 1, this patch enables erasure coding only on /test-warehouse
directory. For problem 2, this patch sets
dfs.namenode.redundancy.considerLoad to false, preventing namenode from
excluding heavily-loaded datanodes.
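A hedged sketch of the per-directory approach (the policy name is
illustrative; the one used by dataload may differ):
  hdfs ec -setPolicy -path /test-warehouse -policy RS-3-2-1024k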
Change-Id: I219106cd3ec7ffab7a834700f2a722b165e5f66c
Reviewed-on: http://gerrit.cloudera.org:8080/10362
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch we add the "ERASURE_CODING" environment variable. If we
enable it, a cluster with 5 data nodes will be created during data
loading and HDFS will be started with erasure coding enabled.
Testing:
I ran the core build, and verified that erasure coding gets enabled in
HDFS. Many of our EE tests failed, however.
Cherry-picks: not for 2.x
Change-Id: I397aed491354be21b0a8441ca671232dca25146c
Reviewed-on: http://gerrit.cloudera.org:8080/10275
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory settings and parallelism, and trying to speed
things up.
Bug fixes:
* Embarrassingly, I was still skipping thrift-server-test in the backend
tests. This was a mistake in handling feedback from my last review.
* I made the timeline a little bit taller to clip less.
Adding workloads:
* I added the RAT licensing check.
* I added exhaustive runs. This led me to model the suites a little
bit more in Python, with a class representing a suite with a
bunch of data about the suite. It's not perfect and still
coupled with the entrypoint.sh shell script, but it feels
workable. As part of adding exhaustive tests, I had
to re-work the timeout handling, since now different
suites meaningfully have different timeouts.
Speed ups:
* To speed up test runs, I added a mechanism to split py.test suites into
multiple shards with a py.test argument. This involved a little bit of work in
conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.
Furthermore, I moved a bit more logic about managing the
list of suites into Python.
* Doing the full build with "-notests" and only building
the backend tests in the relevant target that needs them. This speeds
up "docker commit" significantly by removing about 20GB from the
container. I had to indicate that expr-codegen-test depends on
expr-codegen-test-ir, which was missing.
* I sped up copying the Kudu data: previously I did
both a move and a copy; now I'm doing a move followed by a move. One
of the moves is cross-filesystem and so is slow, but this halves the
amount of copying.
Memory usage:
* I tweaked the memlimit_gb settings to have a higher default. I've been
fighting empirically to have the tests run well on c4.8xlarge and
m4.10xlarge.
The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:
* The non-first Impalad does very little JVM work, so having
an Xmx keeps it small, even in the parallel tests.
* Datanodes do work, but they essentially were never garbage
collecting, because JVM defaults let them use up to 1/4th
the machine memory. (I observed this based on RSS at the
end of the run; nothing fancier.) Adding Xms/Xmx settings
helped.
* Similarly, I piped the settings through to HBase.
A few daemons still run without resource limitations, but they don't
seem to be a problem.
Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
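A hedged example of turning the scanner off on a dev minicluster (the flag
name is from the text above; start-impala-cluster.py arguments as used
elsewhere in this log):
  ./bin/start-impala-cluster.py --impalad_args=--enable_orc_scanner=false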
Currently, we only support reading primitive types. Writing to ORC
tables is not supported yet either.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust for corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Remove Yarn from minicluster by default.
Turns out that we start Yarn as part of the minicluster, but we never use it.
(HiveServer2 is configured to run MR jobs "locally" in process.) Likely, this
Yarn integration is a vestige of Yarn/Llama integration. We can save memory by
not starting it by default.
There are some less-common tools like tests/comparison/cluster.py which use
Yarn (and Hadoop Streaming). In deference to those tools, I've left a mechanism
to start Yarn rather than excising it altogether. After running
buildall the regular way, add Yarn to the cluster by running:
testdata/cluster/admin -y start_cluster
I tested by running core tests. I did not test the kerberized minicluster.
[Due to a git mishap, a version of this was previously checked in and reverted.]
Change-Id: I97053a44bbe32048e6c35cc28680d1c7696af13f
Reviewed-on: http://gerrit.cloudera.org:8080/9970
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Turns out that we start Yarn as part of the minicluster, but we never use it.
(HiveServer2 is configured to run MR jobs "locally" in process.) Likely, this
Yarn integration is a vestige of Yarn/Llama integration. We can save memory by
not starting it by default.
There are some less-common tools like tests/comparison/cluster.py which use
Yarn (and Hadoop Streaming). In deference to those tools, I've left a mechanism
to start Yarn rather than excising it altogether. After running
buildall the regular way, add Yarn to the cluster by running:
testdata/cluster/admin -y start_cluster
I tested by running core tests. I did not test the kerberized minicluster.
Change-Id: I5504cc40b89e3c6d53fac0b7aa4b395fa63e8d79
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).
We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support will be removed soon in the Impala 3.x line.
The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.
There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. (This follows the
pattern established by IMPALA-5184, but, to avoid a proliferation
of directories, I've moved the Hive files into the same tree and
consolidated the Hive changes into the same directory structure.)
For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.
For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java
The Sentry work is based on a change by Zach Amsden.
In addition, we recently added an explicit "refresh" permission. In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.
For Parquet, the difference is even more mechanical. The package names
gone from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.
In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the divergence in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.
The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.
To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.
We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences. Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging is changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.
parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.
For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.
For AuthorizationTest, different Hive versions show slightly different
things for extended output.
To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)
CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.
rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)
Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.
Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Due to the way in which we instantiate the fair scheduler allocation
loader, we do not read the config overrides from the HDFS config
files.
This is unexpected behavior from the users' POV since we typically
support overrides like custom user -> group mapping via HDFS
config (for example, LDAPGroupsMapping) that eventually affect the
query -> pool assignment.
Fix: This patch loads the Hadoop default configuration so that the
underlying QueuePlacementPolicy is based on user-specified overrides.
Testing (manual): Changed the core-site.xml to use LDAPGroupsMapping
instead of the default ShellBasedUnixGroupsMapping and confirmed that
the correct group mapping plugin is loaded, by adding additional logging.
Also, modified TestRequestPoolService to assert that the core-site xml
overrides are loaded.
Change-Id: Ibb93870c0cc37e2432a643a274931f1d3d13fb96
Reviewed-on: http://gerrit.cloudera.org:8080/9000
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
For some time Impala in a production environment has been able
to access data stored in Amazon S3 buckets using credentials specified
in a number of ways:
- storing Amazon access keys in environment variables or
in core-site.xml.
- using proprietary management tools to store Amazon access keys
securely
- using Amazon IAM roles bound to VMs running in EC2.
The development minicluster environment used the first approach,
which risked leaking these keys.
This change enables Impala builds to use IAM
roles to access S3 buckets when running on an Amazon EC2 virtual
machine. The changes mainly ensure that environment variables carrying
the traditional AWS credentials do not conflict with credentials supplied
by the IAM role attached to the VM instance.
IAM role based credentials are accessible through the EC2
instance-property mechanism; for further details see Amazon's docs at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials
The change also removes the remaining references to the s3n: provider.
In the FE tests all URIs referring to s3n: are replaced with their
s3a: equivalents, except for a single negative test in
AnalyzeStmtsTest.java, which is removed.
In addition to the code changes, the s3n: and s3a: credential properties
are also removed from core-site.xml.tmpl. The s3a: provider can pick up
AWS S3 credentials from environment variables or IAM properties bound
to the VM instance, which is a more flexible approach.
As environment variables have precedence over IAM roles, care must be
taken when managing the canonical environment variables carrying
AWS credentials. There are two requirements to be reconciled:
1. The FE tests have code that examines s3a: URIs; this code needs
existing, but not necessarily valid AWS credentials.
2. When the Impala test suite is executed on an EC2 VM, AWS credentials
can be supplied via IAM roles. These credentials can be used only
if the AWS_* environment variables are unset (do not exist).
The tradeoff is managed following these rules:
1. When AWS_* environment variables are set before invoking the
Impala configuration scripts, their value is preserved and
the config scripts ensure that the variables are exported.
2. If the AWS_* variables are missing or empty, they will be unset
to ensure that credentials supplied by Amazon's IAM roles can be
accessed,
3. except if the scripts are running outside of EC2 (so there can be
no IAM roles) and TARGET_FILESYSTEM is not set "s3". This combination
is most often the case on a developer's local workstation.
In this case the AWS_* credential variables are forcibly set to
dummy values to allow the FE tests to succeed.
The removal of S3 credential parameters from core-site.xml[.tmpl]
also allows users to set up their own credentials there,
the config scripts will not change those settings.
Environment variables carrying AWS security credentials will be set
up according to the following table:
           Instance: | Running outside EC2 ||   Running in EC2    |
---------------------+----------+----------++----------+----------+
   TARGET_FILESYSTEM |    S3    |  not S3  ||    S3    |  not S3  |
---------------------+----------+----------++----------+----------+
 AWS_*     empty     |  unset   |  dummy   ||  unset   |  unset   |
 env var   ----------+----------+----------++----------+----------+
           not empty |  export  |  export  ||  export  |  export  |
---------------------+----------+----------++----------+----------+
Legend: unset:  the variable is unset
        export: the variable is exported with its current value
        dummy:  the variable is set to a preset dummy value and exported
Running on an EC2 VM is indicated by setting RUNNING_IN_EC2 to "true" and
exporting it before impala-config.sh is invoked.
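A hedged example of the EC2 setup described above (the bucket name is
illustrative):
  export RUNNING_IN_EC2=true
  export TARGET_FILESYSTEM=s3
  export S3_BUCKET=my-impala-test-bucket
  . bin/impala-config.sh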
The change also moves the logic performing the S3 access checks into a separate
script file: bin/check-s3-access.sh. This file now contains all the S3-specific
logic and network access to check if the requested S3 bucket can be accessed.
Testing:
Performed local builds for HDFS as well as automated builds against
HDFS and S3, using both IAM roles and explicit AWS_* credentials for
authentication.
Verified that FE tests that parse s3a: URLs are still successful in
all these combinations (when they are run).
Change-Id: I14cd9d4453a91baad3c379aa7e4944993fca95ae
Reviewed-on: http://gerrit.cloudera.org:8080/8294
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
This patch leverages the AdlFileSystem in Hadoop to allow
Impala to talk to the Azure Data Lake Store. This patch has
functional changes as well as adds test infrastructure for
testing Impala over ADLS.
We do not support ACLs on ADLS since the Hadoop ADLS
connector does not integrate ADLS ACLs with Hadoop users/groups.
For testing, we use the azure-data-lake-store-python client
from Microsoft. This client seems to have some consistency
issues. For example, a drop table through Impala will delete
the files in ADLS; however, listing that directory through
the Python client immediately after the drop will still show
the files. This behavior is unexpected since ADLS claims to be
strongly consistent. Some tests have been skipped due to this
limitation with the tag SkipIfADLS.slow_client. Tracked by
IMPALA-5335.
The azure-data-lake-store-python client also only works on CentOS 6.6
and over, so the python dependencies for Azure will not be downloaded
when the TARGET_FILESYSTEM is not "adls". While running ADLS tests,
the expectation will be that it runs on a machine that is at least
running CentOS 6.6.
Note: This is only a test limitation, not a functional one. Clusters
with older OSes like CentOS 6.4 will still work with ADLS.
Added another dependency to bootstrap_build.sh for the ADLS Python
client.
Testing: Ran core tests with and without TARGET_FILESYSTEM as
'adls' to make sure that all tests pass and that nothing breaks.
Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542
Reviewed-on: http://gerrit.cloudera.org:8080/6910
Tested-by: Impala Public Jenkins
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
By default, Kudu assumes it has 80% of system memory, which
is far too high for the minicluster. This sets a memory limit
of 2 GB and lowers the limit of the block cache. These values
were tested on a gerrit-verify-dryrun job as well as an
exhaustive run.
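A hedged illustration of the kind of Kudu gflag this corresponds to in the
minicluster config (2 GB expressed in bytes; the block cache limit is
lowered with a similar flag, whose value is not repeated here):
  --memory_limit_hard_bytes=2147483648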
This patch also simplifies TestKuduMemLimits which was
unnecessarily creating a large table during test execution.
Change-Id: I7fd7e1cd9dc781aaa672a2c68c845cb57ec885d5
Reviewed-on: http://gerrit.cloudera.org:8080/6844
Reviewed-by: Todd Lipcon <todd@apache.org>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The minicluster setup logic assigned fixed port numbers to several
but not all listening sockets of the data nodes. This change
assigns similar port ranges to all the listening ports that were
so far allowed to pick their own port numbers, interfering with
other components, e.g. HBase.
Change-Id: Iecf312873b7026c52b0ac0e71adbecab181925a0
Reviewed-on: http://gerrit.cloudera.org:8080/6531
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
We've seen repeated test failures because HBase tries to bind to ports
in the ephemeral port range, which sometimes would already be occupied
by outgoing connections of other processes.
This change changes the ports to the new default HBase ports
(HBASE-10123):
HBase Master Port: 60000 -> 16000
HBase Master Web UI Port: 60010 -> 16010
HBase RegionServer Port: 60020 -> 16020
HBase RegionServer Web UI Port: 60030 -> 16030
HBase Status Multicast Port: 60100 -> 16100
This made it necessary to change the default KMS port, too
(HADOOP-12811):
KMS HTTP port: 16000 -> 9600
Change-Id: I6f8af325e34b6e352afd75ce5ddd2446ce73d857
Reviewed-on: http://gerrit.cloudera.org:8080/6524
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-4553 required ntp-wait to succeed before Kudu would start (assuming
ntp-wait was installed), in order to prevent a litany of errors on EC2
about unsynchronized clocks. This patch disables that waiting if no
internet connection is detected in order to make it possible to start
the minicluster when offline.
Change-Id: Ifbb5babebb0ca6d2553be1b001e20e2270e052b6
Reviewed-on: http://gerrit.cloudera.org:8080/5412
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
When ntpd is not synchronized, kudu initialization fails on the master
node:
F1129 16:37:28.969956 15230 master_main.cc:68] Check failed:
_s.ok() Bad status: Service unavailable: Cannot initialize clock:
Error reading clock. Clock considered unsynchronized
Change-Id: I371e01e21246a8c0ece98ca7d4bf6761615127b4
Reviewed-on: http://gerrit.cloudera.org:8080/5258
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
In our IPMC vote to release 2.7.0 rc3, Justin Mclean pointed out a
number of issues of compliance with ASF policy. He asked:
1. "Please place build instruction and supported platforms in the
README. The wiki may change over time and that may make it difficult
to build older versions."
2. Remove binary file llvm-ir/test-loop.bc
3. Add be/src/gutil/valgrind.h,
shell/ext-py/sqlparse-0.1.14/sqlparse/pipeline.py and
cmake_modules/FindJNI.cmake, normalize.css (embedded in bootstrap.css)
to LICENSE.txt
4. Fix be/src/thirdparty/squeasel/squeasel* in LICENSE.txt
5. Remove outdated copyright lines from HBase (see
https://issues.apache.org/jira/browse/HBASE-3870)
6. Remove duplicate jquery notice from LICENSE.txt
Change-Id: I30ff77d7ac28ce67511c200764fba19ae69922e0
Reviewed-on: http://gerrit.cloudera.org:8080/4582
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins