Downstream system vendors, users and customers have lately expressed
interest in consuming Impala in containerized forms, taking advantage of
various specialized, hardened container base images, such as those
based on the Wolfi project by Chainguard;
see: https://github.com/wolfi-dev.
This patch enables Impala container images to be built on top of custom
base images, and adds an implementation example that uses the publicly
available Wolfi base image.
Building a customized Docker image follows a hybrid approach. Instead of
replicating the complete Impala build process inside a Wolfi container
for a fully native binary build, it relies on an existing build platform
that is compatible with the binary packages available inside the custom
container image. For Wolfi the Impala binaries are supplied by the
Red Hat 9 build of Impala. This is made possible by the fact that major
library dependencies of Impala have the same versions on Wolfi OS and
Red Hat 9, so binaries built on Red Hat 9 can be run on Wolfi
with no changes.
The binaries produced by the regular build process are then installed
into a Docker image built on top of an explicitly specified custom base
image. The selection of a custom base image is controlled by two
environment variables:
- USE_CUSTOM_IMPALA_BASE_IMAGE (boolean):
If set to 'true', triggers the use of the custom image.
When set to 'false' or left unspecified, the Docker base image is
selected by the existing logic of matching the build platform's
operating system.
- IMPALA_CUSTOM_DOCKER_BASE (string): specifies the URI of the base image
These environment variables can be set in the environment,
in impala-config-branch.sh, or in impala-config-local.sh.
They are reported at the end of bin/impala-config.sh where important
environment variables are listed. They are also added to the list of
variables in bin/jenkins/dockerized-impala-preserve-vars.py to ensure
that they can be used in the context of Jenkins jobs as well.
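
The selection rule described above can be sketched in Python as follows.
This is only an illustration, not the code in bin/impala-config.sh: the
function name, the platform_default parameter (standing in for the image
picked by the existing OS-matching logic) and the error raised for a
missing URI are all assumptions.

```python
def select_docker_base(environ, platform_default):
    """Return the base image URI for the Docker build.

    environ is a mapping such as os.environ. platform_default is a
    hypothetical stand-in for the image chosen by the existing logic
    that matches the build platform's operating system.
    """
    # Only the literal string 'true' enables the custom base image;
    # 'false' or an unset variable keeps the existing behaviour.
    use_custom = environ.get("USE_CUSTOM_IMPALA_BASE_IMAGE", "false") == "true"
    if not use_custom:
        return platform_default

    custom = environ.get("IMPALA_CUSTOM_DOCKER_BASE", "")
    if not custom:
        # Assumption: failing early is preferable to building on an
        # empty image name. The real scripts may behave differently.
        raise ValueError("IMPALA_CUSTOM_DOCKER_BASE must name a base image")
    return custom
```

For example, calling select_docker_base(os.environ, "ubuntu:20.04") returns
"ubuntu:20.04" unless both variables are set.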
The unified script that installs Impala's required dependencies into the
container image is extended for Wolfi to handle APK packages.
A new script is added to install Bash in the Docker image if it is
missing. Impala build scripts (including the scripts used during Docker
image builds) as well as container startup scripts require Bash,
but minimal container base images usually omit it, favoring a smaller
alternative.
To improve the debugging experience for a containerized Impala
minicluster, the minicluster starter script bin/start-impala-cluster.py
is extended with the following features:
- synchronizes every launched container's timezone to the host.
This is needed for the Iceberg time-travel tests, which create timestamped
Iceberg metadata items in the impalad context inside a container, but
check creation/modification times of the same items in the test scripts
running on the host, outside the containers. The test scripts have
the implicit expectation that the same local time is shared across
all these contexts, but this is not necessarily true if the host
where tests are running is set to a timezone other than UTC.
Time synchronization is achieved by injecting the TZ environment
variable into the container, holding the name of the timezone used
on the host. The timezone name is taken either from the host's TZ
variable (if set), or from the host's /etc/localtime symlink,
checking the name of the timezone file it points to.
In case /etc/localtime is not a symlink (and TZ is not set on the
host), the host's /etc/localtime file is mounted into the container.
- sets up a directory for each container to collect the Java VM's error
files (hs_err_pidNNNN.log) from the containers.
- adds the --mount_sources command line parameter, which mounts the
complete $IMPALA_HOME subtree into the container at
/opt/impala/sources to make source code available inside the container
for easier debugging.
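
The timezone lookup order described above (TZ first, then the
/etc/localtime symlink, then a bind mount) could look roughly like the
sketch below. It is not the code in bin/start-impala-cluster.py: the
function names are hypothetical, and falling back to the bind mount when a
symlink target has no "zoneinfo/" component is an assumption.

```python
import os

ZONEINFO_MARKER = "zoneinfo/"


def timezone_name_from_link(target):
    """Extract 'Area/City' from a path like /usr/share/zoneinfo/Area/City."""
    idx = target.find(ZONEINFO_MARKER)
    if idx < 0:
        return None
    return target[idx + len(ZONEINFO_MARKER):] or None


def timezone_docker_args(environ, localtime_path="/etc/localtime"):
    """Return extra 'docker run' arguments that align the container's
    timezone with the host's."""
    # 1. An explicit TZ on the host wins.
    tz = environ.get("TZ")
    if tz:
        return ["-e", "TZ=" + tz]
    # 2. Otherwise derive the name from the /etc/localtime symlink.
    if os.path.islink(localtime_path):
        name = timezone_name_from_link(os.path.realpath(localtime_path))
        if name:
            return ["-e", "TZ=" + name]
    # 3. Not a usable symlink: mount the host file read-only instead.
    if os.path.exists(localtime_path):
        return ["-v", "{0}:/etc/localtime:ro".format(localtime_path)]
    return []
```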
Tested by running core-mode tests in the following environments:
- Regular run (impalad running natively on the platform) on Ubuntu 20.04
- Regular run on Rocky Linux 9.2
- Dockerised run (impalad instances running in their individual
containers) using Ubuntu 20.04 containers
- Dockerised run (impalad instances running in their individual
containers) using Rocky Linux 9.2 containers
- Dockerised run (impalad instances running in their individual
containers) using Wolfi's wolfi-base containers
Change-Id: Ia5e39f399664fe66f3774caa316ed5d4df24befc
Reviewed-on: http://gerrit.cloudera.org:8080/22583
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When the Impala docker images are deployed in production environments,
it can be hard to add debugging tools at runtime. Two of the most
useful diagnostic tools are jstack and pstack, which can be used to
print Java and native stack traces. Install these tools into Redhat
images, which are the most commonly used in production.
To install pstack we install gdb.
To install jstack we install a development jdk on top of the headless
jdk.
Extend the install_os_packages.sh script to add an argument to
--install-debug-tools to set the level of diagnostic tools to install.
The possible arguments are:
none - install no extra tools
basic - install pstack and jstack
full - install more debugging tools.
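
As an illustration of the three levels, the sketch below maps a level to a
package list. The real logic lives in the shell script
install_os_packages.sh; here the function names are hypothetical, the JDK
package name and the extra tools for 'full' are left as caller-supplied
parameters because the text above does not name them, and treating 'full'
as a superset of 'basic' is an assumption.

```python
import argparse

LEVELS = ("none", "basic", "full")


def parse_level(argv):
    """Parse --install-debug-tools from an argument list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--install-debug-tools", choices=LEVELS, default="none")
    return parser.parse_args(argv).install_debug_tools


def packages_for_level(level, jdk_devel_pkg, extra_full_tools):
    """Return the packages to install for a given level.

    jdk_devel_pkg: name of the development JDK package (provides jstack).
    extra_full_tools: additional packages for 'full' (not specified here).
    """
    if level == "none":
        return []
    # gdb provides pstack; the development JDK provides jstack.
    basic = ["gdb", jdk_devel_pkg]
    if level == "basic":
        return basic
    return basic + list(extra_full_tools)
```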
In a Centos 8.5 build, the size of an impalad_coord_exec image increased
from 1.74GB to 1.85GB, as reported by 'docker image list'.
What other tools might be added?
- Installing perf is tricky as in a container perf requires an
installation specific to the underlying linux kernel image, which is
hard to predict at build time.
- Installing pprof is hard as installation seems to require compiling
from sources. Clearly there are many options and we cannot install
everything.
TESTING
Built release and debug docker images, and used jstack and pstack in a
running container to print Impala's stacks.
Change-Id: I25e6827b86564a9c0fc25678e4a194ee8e0be0e9
Reviewed-on: http://gerrit.cloudera.org:8080/21433
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Currently, Impala supports building and testing Docker
images on Ubuntu. This extends that same support to
Redhat-based distributions:
1. This splits out the Docker build's OS package
installation into a separate install_os_packages.sh
script. This script detects the OS and calls apt
or yum as appropriate. The script takes the argument
--install-debug-tools, which installs extra tools
like iproute2 and ping. This defaults to true for debug
images and false for release images.
2. This modifies daemon_entrypoint.sh to detect the
OS and set LD_LIBRARY_PATH appropriately to account
for different locations of Java.
3. This modifies docker/setup_build_context.py to
handle different locations of libkudu_client.so
and add extra sanity checks on various libraries
found via globs.
4. This modifies bin/jenkins/dockerized-*.sh test
infrastructure to be able to install docker on
either Ubuntu or Redhat. It also changes the exit
logic to collect the container logs.
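
The OS detection in step 1 can be pictured with the Python sketch below,
which reads the standard ID and ID_LIKE fields of /etc/os-release. This is
an assumption about how such detection might work, not a transcription of
install_os_packages.sh, and the function names are hypothetical.

```python
def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def package_manager(fields):
    """Pick apt-get or yum based on the distribution family."""
    ids = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    if any(i in ("ubuntu", "debian") for i in ids):
        return "apt-get"
    if any(i in ("rhel", "centos", "fedora") for i in ids):
        return "yum"
    raise ValueError("unsupported distribution: {0}".format(ids))
```

With the contents of /etc/os-release from a Rocky Linux host, package_manager
returns "yum" because ID_LIKE lists rhel, centos and fedora.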
Developers can override the base image for Redhat 7
and Redhat 8 builds via the IMPALA_REDHAT7_DOCKER_BASE
and IMPALA_REDHAT8_DOCKER_BASE environment variables.
These default to open source Redhat equivalents
(Centos 7.9 and Rocky 8.5 respectively), but they are
also known to work with Redhat UBI images.
Testing:
- Ran dockerised testing on Rocky 8.5 via the
rocky-8.5-dockerised-tests job.
- Ran GVO
- Ran a Docker build on Centos7 with UBI7 as the base image
Change-Id: Ibaff2560ef971ac2c2231a8e43921164ea1d2f4d
Reviewed-on: http://gerrit.cloudera.org:8080/19006
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
What works:
* A single node cluster can be started up with docker-compose
* HMS data is stored in Derby database in a docker volume
* Filesystem data is stored in a shared docker volume, using the
localfs support in the Hadoop client.
* A Kudu cluster with a single master can be optionally added on
to the Impala cluster.
* TPC-DS data can be loaded automatically by a data loading container.
We need to set up a docker network called quickstart-network,
purely because docker-compose insists on generating network names
with underscores, which are part of the FQDN and end up causing
problems with Java's URL parsing, which rejects these technically
invalid domain names.
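
To see why the underscore matters, the sketch below checks a name against
the usual hostname rule (labels of letters, digits and hyphens only). The
helper name and the example hostnames are hypothetical; the check only
approximates what a strict URL parser enforces.

```python
import re

# One DNS label: letters, digits and hyphens, not starting or ending
# with a hyphen, at most 63 characters. Underscores are not allowed.
_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")


def is_valid_hostname(name):
    """Return True if every dot-separated label is a valid DNS label."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing dot
    if not name or len(name) > 253:
        return False
    return all(_LABEL.match(label) for label in name.split("."))
```

A container reachable as "hms.quickstart-network" passes this check, while
a name on a generated network such as "hms.quickstart_default" does not.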
How to run:
Instructions for running the quickstart cluster are in
docker/README.md.
How to build containers:
./buildall.sh -release -noclean -notests -ninja
ninja quickstart_hms_image quickstart_client_image docker_images
How to upload containers to dockerhub:
IMPALA_QUICKSTART_IMAGE_PREFIX=timgarmstrong/
for i in impalad_coord_exec impalad_coordinator statestored \
impalad_executor catalogd impala_quickstart_client \
impala_quickstart_hms
do
docker tag $i ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
docker push ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
done
I pushed containers built from commit f260cce22, which
was branched from 6cb7cecacf on master.
Misc other stuff:
* Added more metadata to all images.
TODO:
* Test and instructions to run against Kudu quickstart
* Upload latest version of containers before merging.
Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Reviewed-on: http://gerrit.cloudera.org:8080/15966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A recent patch (IMPALA-9930) introduces a new admission control rpc
service, which can be configured to perform admission control for
coordinators. In that patch, the admission service runs in an impalad.
This patch separates the service out to run in a new daemon, called
the admissiond. It also integrates this new daemon with the build
infrastructure around Docker.
Some notable changes:
- Adds a new class, AdmissiondEnv, which performs the same function
for the admissiond as ExecEnv does for impalads.
- The '/admission' http endpoint is exposed on the admissiond's webui
if the admission control service is in use, otherwise it is exposed
on coordinator impalad's webuis.
- start-impala-cluster.py takes a new flag --enable_admission_service
which configures the minicluster to have an admissiond with all
coordinators using it for admission control.
- Coordinators are now configured to use the admission service by
specifying the startup flag --admission_service_host. This is
intended to mirror the configuration of the statestored/catalogd
location.
Testing:
- Existing tests for the admission control service are modified to run
with an admissiond.
- Manually ran start-impala-cluster.py with --enable_admission_service
and --docker_network to verify Docker integration.
Change-Id: Id677814b31e9193035e8cf0d08aba0ce388a0ad9
Reviewed-on: http://gerrit.cloudera.org:8080/16891
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The convention in Linux is that UIDs and GIDs below 1000 are reserved
for system accounts, services, and other special accounts, and
regular user UIDs and GIDs start at 1000. This will ensure that the
'impala' user that runs the impala executable inside the
docker container gets assigned uid and gid 1000.
Testing:
Manually tested by running the docker container and checking the user.
Change-Id: I51b846ca5fb2c55ac1707b9581cee18447467b41
Reviewed-on: http://gerrit.cloudera.org:8080/16807
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Maven Changes:
Splits out all executor specific jar files into a separate pom file
under mvn-deps/executor-deps. The new pom file lists out all executor
specific jar files. fe/pom.xml has a dependency on
mvn-deps/executor-deps/pom.xml so that all executor specific jars are
still built as part of the fe/ build. mvn-deps/executor-deps/pom.xml
writes out a build-classpath.txt file that contains all dependencies in
the pom.xml file (similar to what is already done in fe/pom.xml).
Docker Build Changes:
setup_build_context.py was changed to leverage the aforementioned Maven
changes. The script still symlinks all dependencies into the lib/ folder,
but also creates an exec-lib/ and statestore-lib/ folder. The exec-lib/
folder contains all dependencies necessary to run Impala Executors, but
excludes any dependencies that are Coordinator specific. The
statestore-lib/ folder excludes all jar files entirely since it does not
run an embedded JVM.
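
The split into lib/, exec-lib/ and statestore-lib/ can be sketched as
below. This is not the code in setup_build_context.py: the function names
are hypothetical, the classpath file is assumed to be a single
':'-separated list of jar paths, and the sketch only plans which file names
go where instead of creating symlinks.

```python
import os


def classpath_jars(classpath_text):
    """Return the jar file names listed in a build-classpath.txt body.

    Assumes ':' as the separator, as on Linux build hosts.
    """
    paths = classpath_text.strip().split(":")
    return {os.path.basename(p) for p in paths if p.endswith(".jar")}


def plan_lib_dirs(all_deps, executor_jars):
    """Map each output directory to the dependency names it should hold."""
    non_jars = [d for d in all_deps if not d.endswith(".jar")]
    return {
        # Everything, as before the change.
        "lib": sorted(all_deps),
        # Native libraries plus only the executor-specific jars.
        "exec-lib": sorted(
            non_jars + [d for d in all_deps if d in executor_jars]),
        # No jars at all: the statestore has no embedded JVM.
        "statestore-lib": sorted(non_jars),
    }
```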
The docker/CMakeLists.txt was modified to support the new library layout
created by setup_build_context.py. Prior to this patch only the build
for the Impala base image had access to the dependencies created by
setup_build_context.py. This patch changes the build logic so all images
have access to the dependencies. This does increase build time because
the build context has to be copied and sent to the Docker daemon for
each image build.
Docker Image Changes:
The copy command for the lib/ folder was removed from the impala_base
Dockerfile and a corresponding copy command was added to each daemon
Docker image. This allows each daemon image to only copy in the
dependencies it actually requires to run.
Other:
* Deleted the hive-3 profile since Impala 4.0 only supports hive-3 builds
* Moved shaded-deps into the mvn-deps folder
Overall, this decreases the size of the impalad_executor image by 120 MB,
and the statestored image by 700 MB.
impalad_coord_exec and impalad_coordinator images are now 771 MB, and
impalad_executor images are 651MB.
Further improvements might be possible by decreasing the number of
transitive dependencies in mvn-deps/executor-deps/pom.xml. Moreover,
any new Coordinator specific jar files will not be included in the
Executor image.
Testing:
* Ran core tests
Change-Id: I899859a38d8ccab890de889a49ef132a89289dfd
Reviewed-on: http://gerrit.cloudera.org:8080/16320
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Automatically detect if we're on Ubuntu 16.04
or 18.04 and use the appropriate base image.
Testing:
Built an image locally on my Ubuntu 18.04 system and
made sure I could start a minicluster and run a query.
Change-Id: I8dfdb349e78fd76b91138a70449d51b0ef0021df
Reviewed-on: http://gerrit.cloudera.org:8080/15765
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I often find it tricky to debug network and Impala issues when using our
Docker images. This change adds a handful of tools that I frequently
miss having.
It adds about 6.5% to the image size: images grow from 894MB to 953MB. If
people feel that that is too much, I'm happy to cut back on the tools we
install.
Change-Id: I47c7aa7076cebfa3bfad2029fb1da9e64364f0e6
Reviewed-on: http://gerrit.cloudera.org:8080/13895
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
Create a ranger cache directory used by ranger clients when ranger
is enabled. For simplicity, it is added to the base image. It is
used only on the coordinators/catalogd.
Change-Id: Iad134636e1566a44acf7b010e6b6067a972798c6
Reviewed-on: http://gerrit.cloudera.org:8080/14007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some components depend on these utils (kinit, kdestroy..) for ticket
cache lifecycle management. These are also useful for debugging
in general, for example, to test KDC connectivity etc.
Local docker image size increased from 820MB to 865MB for a release
build (=5.4%).
Change-Id: I9c9e9ab5b027ea9d223928280bc94f2ed9f701d3
Reviewed-on: http://gerrit.cloudera.org:8080/13997
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Bharath Vissapragada <bharathv@cloudera.com>
This reduces the size of an image from 1.36GB to 705MB with
a release build on my system.
Thanks to Joe McDonnell for the suggestion.
Testing:
Precommit docker tests are sufficient to validate that
the containers are functional.
Change-Id: I5476a97a7a030499a60a6cef67f8c3cdffa7e756
Reviewed-on: http://gerrit.cloudera.org:8080/13699
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
* Symlink impalad/catalogd/statestored inside container.
This doesn't seem to really save any space - there's
some kind of deduplication going on.
* Don't include libfesupport.so, which shouldn't be needed.
* Strip debug symbols from the binary.
* Only include the libkudu_client.so libraries for Kudu.
This shaves ~1.1GB from the image size: 250MB as a result
of the impalad binary changes and the remainder from the
Kudu changes.
Change-Id: I95ff479bedd3b93e6569e72f03f42acd9dba8b14
Reviewed-on: http://gerrit.cloudera.org:8080/13487
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Set a default USER in the Dockerfile per best practices so that
consumers of the container don't accidentally run as root.
The default user is "impala" if the container is run in docker
without specifying a user.
Various frameworks, including kubernetes, will run the container with
an arbitrary user and group ID set.
This causes issues with some Hadoop libraries, which depend on the
user having a name. This is generally not the case because inside
the container usernames are resolved with the container's /etc/passwd.
To work around this, the entrypoint script checks if the current
user has a name and if not, assigns it one (either dummyuser or
$HADOOP_USER_NAME).
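
The workaround can be illustrated with the Python sketch below. The real
entrypoint is a shell script; the function name, the idea of producing an
/etc/passwd line, and the home directory and shell fields in that line are
assumptions made for the example.

```python
def passwd_line_for_nameless_user(uid, gid, lookup_user, environ):
    """Return an /etc/passwd line for uid if it has no name, else None.

    lookup_user is a callable such as pwd.getpwuid, which raises
    KeyError when the uid has no passwd entry.
    """
    try:
        lookup_user(uid)
        return None  # the user already has a name, nothing to do
    except KeyError:
        pass
    # Prefer $HADOOP_USER_NAME, fall back to 'dummyuser'.
    name = environ.get("HADOOP_USER_NAME") or "dummyuser"
    # Assumption: home directory and shell are placeholders.
    return "{0}:x:{1}:{2}::/tmp:/sbin/nologin".format(name, uid, gid)
```

In a container this might be called as
passwd_line_for_nameless_user(os.getuid(), os.getgid(), pwd.getpwuid, os.environ).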
Remove the umask setting that was required to make logs modifiable
by the host user - this is not needed for our tests since the
host and container users now match up.
Also run apt-get clean in Dockerfile to reduce cruft in the
image.
Change-Id: I0bea9f44a8199851ed04fbef8caf4a2350ae2c0e
Reviewed-on: http://gerrit.cloudera.org:8080/13451
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This modifies containers to put logs in /opt/impala/logs,
then mounts that directory to
$IMPALA_HOME/logs/.../<container_name> so that logs will
be collected on the host and scooped up by jenkins jobs.
The layout of the log directory is a little different from
the non-dockerised minicluster because I wanted to avoid
sharing log directories between containers.
Change-Id: I24bcaa521882d450d43d1f2ca34767e7ce36bbd2
Reviewed-on: http://gerrit.cloudera.org:8080/13393
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The docker containers currently have minicluster configs baked into
them. This is not necessary any more since the /opt/impala/conf
directory is mounted to point at the up-to-date configs, so there's
no reason to include configs in the container.
Testing:
Confirmed that I could build containers, start up a minicluster and run
queries.
Change-Id: I6d77f79620514187a5c45483e9051bd8c40dfc9e
Reviewed-on: http://gerrit.cloudera.org:8080/13104
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes all core e2e tests running on my local dockerised
minicluster build. I do not yet have a CI job or script running
but I wanted to get feedback on these changes sooner. The second
part of the change will include the CI script and any follow-on
fixes required for the exhaustive tests.
The following fixes were required:
* Detect docker_network from TEST_START_CLUSTER_ARGS
* get_webserver_port() does not depend on the caller passing in
the default webserver port. It failed previously because it
relied on start-impala-cluster.py setting -webserver_port
for *all* processes.
* Add SkipIf markers for tests that don't make sense or are
non-trivial to fix for containerised Impala.
* Support loading Impala-lzo plugin from host for tests that depend on
it.
* Fix some tests that had 'localhost' hardcoded - instead it should
be $INTERNAL_LISTEN_HOST, which defaults to localhost.
* Fix bug with sorting impala daemons by backend port, which is
the same for all dockerised impalads.
Testing:
I ran tests locally as follows after having set up a docker network and
starting other services:
./buildall.sh -noclean -notests -ninja
ninja -j $IMPALA_BUILD_THREADS docker_images
export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster"
export FE_TEST=false
export BE_TEST=false
export JDBC_TEST=false
export CLUSTER_TEST=false
./bin/run-all-tests.sh
Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755
Reviewed-on: http://gerrit.cloudera.org:8080/12639
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This builds an impala_base container that has all of the build artifacts
required to run the impala processes, then builds impalad, catalogd and
statestored containers based on that with the right ports exposed.
The images are based on the Ubuntu 16.04 image to align with the
most common development environment.
The container build process is integrated with CMake and is designed
to integrate with the rest of the build so that the container build
depends on the artifacts that will go into the container. You can
build the images with the following command, which will create
images called "impala_base", "impalad", "catalogd" and
"statestored":
ninja -j $IMPALA_BUILD_THREADS docker_images
The images need some refinement to be truly useful. The following
will be done in future patches:
* IMPALA-7947 - integrate with start-impala-cluster.py to
automatically create docker network with containers running on it
* Mechanism to pass in command-line flags
* Mechanisms to update the various config files to point to the
docker host rather than "localhost", which doesn't point to
the right thing inside the container.
* Mechanisms to set mem_limit, JVM heap sizes, etc, automatically.
Testing:
Manually started up the containers connected to a user-defined bridge
network, tweaked the configurations to point to the HMS/HDFS/etc
running on my host. I then used "docker ps" to figure out the
port mappings for beeswax and debug webserver.
Confirmed that I could run a query and access debug pages:
$ impala-shell.sh -i localhost:32860 -q "select coordinator()"
Starting Impala Shell without Kerberos authentication
Opened TCP connection to localhost:32860
Connected to localhost:32860
Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build
d7870fe03645490f95bd5ffd4a2177f90eb2f3c0)
Query: select coordinator()
Query submitted at: 2018-12-11 15:51:04 (Coordinator:
http://8063e77ce999:25000)
Query progress can be monitored at:
http://8063e77ce999:25000/query_plan?query_id=1b4d03f0f0f1fcfb:b0b37e5000000000
+---------------+
| coordinator() |
+---------------+
| 8063e77ce999 |
+---------------+
Fetched 1 row(s) in 0.11s
Change-Id: Ifea707aa3cc23e4facda8ac374160c6de23ffc4e
Reviewed-on: http://gerrit.cloudera.org:8080/12074
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>