This patch adds graceful and forced shutdown support to impala.sh (a rough
sketch of the pattern follows below). It also preserves the stdout and
stderr logs from startup. This patch also fixes some bugs in impala.sh,
including:
- a missing check for an empty service name.
- a restart command that did not work.
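As a hedged illustration of the graceful-then-forced shutdown pattern (the
function and file names here are hypothetical, not the actual impala.sh
contents):

  # Hypothetical sketch: try SIGTERM first, escalate to SIGKILL on timeout.
  stop_service() {
    local name="$1" timeout="${2:-30}"
    if [ -z "$name" ]; then
      echo "service name required" >&2
      return 1
    fi
    local pid
    pid=$(cat "/var/run/impala/${name}.pid" 2>/dev/null) || return 0
    kill -TERM "$pid" 2>/dev/null            # request a graceful shutdown
    for _ in $(seq "$timeout"); do
      kill -0 "$pid" 2>/dev/null || return 0 # process exited cleanly
      sleep 1
    done
    kill -KILL "$pid"                        # force shutdown after the timeout
  }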
Testing:
- Manually deployed the package on Ubuntu 22.04 and verified it.
Change-Id: Ib7743234952ba6b12694ecc68a920d59fea0d4ba
Reviewed-on: http://gerrit.cloudera.org:8080/21297
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Moves the Linux packaging related content into a standalone
package/CMakeLists.txt to make it clearer.
This patch also adds the LICENSE and NOTICE files to the final package.
Testing:
- Manually deployed the package on Ubuntu 22.04 and verified it.
Change-Id: If3914dcda69f81a735cdf70d76c59fa09454777b
Reviewed-on: http://gerrit.cloudera.org:8080/20263
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Updates .gitignore for files generated during bootstrap_development.
Fixes ignore rules that caused tracked files in be/src/thirdparty to be
deleted. Includes ignore rules for past versions of shell dependencies and
updates the ignores for current versions.
Change-Id: I03deba5e7fb151ef8e34039becdcc3fb47684084
Reviewed-on: http://gerrit.cloudera.org:8080/18499
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds a plain-text, space-separated image list in docker/docker-images.txt.
It is generated from the images built by CMake, so it stays in sync as
images are added to or removed from the CMake file.
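For example, a downstream script could consume the generated list like this
(a minimal sketch; the registry name is hypothetical):

  # Hypothetical consumer: push every generated image to a private registry.
  for image in $(cat docker/docker-images.txt); do
    docker tag "$image" "registry.example.com/$image"
    docker push "registry.example.com/$image"
  done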
Duplicated logic per image is removed - instead there is a helper
function that is called for each daemon image to be built.
Rips out the timestamp mechanism that was intended to avoid unnecessary
container rebuilds, but has turned out to be brittle. Instead the
containers are rebuilt each time the rule is invoked.
This moves some subdirectories so that the image tag matches the
subdirectory, to simplify the build scripts.
Change-Id: I4d8e215e9b07c6491faa4751969a30f0ed373fe3
Reviewed-on: http://gerrit.cloudera.org:8080/13899
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Lars Volker <lv@cloudera.com>
Avoids the rewrite if the resulting string literal would exceed a defined
limit (for example, constant folding that would otherwise materialize a
very large string literal in the rewritten expression).
Testing:
Added three statements in testFoldConstantsRule() to verify that the
expression rewrite is accepted only when the size of the rewritten
expression is below a specified threshold.
Change-Id: I8b078113ccc1aa49b0cea0c86dff2e02e1dd0e23
Reviewed-on: http://gerrit.cloudera.org:8080/12814
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch bumps the CDP_BUILD_NUMBER to 1013201. This patch also
refactors the bootstrap_toolchain.py to be more generic for dealing with
CDP components, e.g. Ranger and Hive 3.
The patch also fixes some TODOs, replacing the rangerPlugin.init() hack
with the rangerPlugin.refreshPoliciesAndTags() API available in this Ranger
build.
Testing:
- Ran core tests
- Manually verified that there is no regression when starting Hive 3 with
USE_CDP_HIVE=true
Change-Id: I18c7274085be4f87ecdaf0cd29a601715f594ada
Reviewed-on: http://gerrit.cloudera.org:8080/13002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This builds an impala_base container that has all of the build artifacts
required to run the Impala processes, then builds impalad, catalogd and
statestored containers based on it, with the right ports exposed.
The images are based on the Ubuntu 16.04 image to align with the
most common development environment.
The container build process is integrated with CMake and is designed
to integrate with the rest of the build so that the container build
depends on the artifacts that will go into the container. You can
build the images with the following command, which will create
images called "impala_base", "impalad", "catalogd" and
"statestored":
ninja -j $IMPALA_BUILD_THREADS docker_images
The images need some refinement to be truly useful. The following
will be done in future patches:
* IMPALA-7947 - integrate with start-impala-cluster.py to
automatically create docker network with containers running on it
* Mechanism to pass in command-line flags
* Mechanisms to update the various config files to point to the
docker host rather than "localhost", which doesn't point to
the right thing inside the container.
* Mechanisms to set mem_limit, JVM heap sizes, etc, automatically.
Testing:
Manually started up the containers connected to a user-defined bridge
network, tweaked the configurations to point to the HMS/HDFS/etc
running on my host. I then used "docker ps" to figure out the
port mappings for beeswax and debug webserver.
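The manual setup was roughly like the following (a hedged reconstruction;
the network name is hypothetical, and the configs still needed tweaking as
described above):

  # Hypothetical sketch of the manual container startup.
  docker network create impala-net
  docker run -d --network impala-net --name statestored statestored
  docker run -d --network impala-net --name catalogd catalogd
  docker run -d -P --network impala-net --name impalad impalad
  docker ps   # inspect the published port mappings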
Confirmed that I could run a query and access debug pages:
$ impala-shell.sh -i localhost:32860 -q "select coordinator()"
Starting Impala Shell without Kerberos authentication
Opened TCP connection to localhost:32860
Connected to localhost:32860
Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build
d7870fe03645490f95bd5ffd4a2177f90eb2f3c0)
Query: select coordinator()
Query submitted at: 2018-12-11 15:51:04 (Coordinator:
http://8063e77ce999:25000)
Query progress can be monitored at:
http://8063e77ce999:25000/query_plan?query_id=1b4d03f0f0f1fcfb:b0b37e5000000000
+---------------+
| coordinator() |
+---------------+
| 8063e77ce999 |
+---------------+
Fetched 1 row(s) in 0.11s
Change-Id: Ifea707aa3cc23e4facda8ac374160c6de23ffc4e
Reviewed-on: http://gerrit.cloudera.org:8080/12074
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
A few unversioned artifacts crept in over time without corresponding
.gitignore entries. These updates are based on the git status output in my
dev environment.
Change-Id: I281ab3b5c98ac32e5d60663562628ffda6606a6a
Reviewed-on: http://gerrit.cloudera.org:8080/11787
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Switching to a new CDH_BUILD_NUMBER requires downloading new CDH
components as well as forcing Maven to update its local repository.
This patch updates CDH_COMPONENTS_HOME to include the CDH_BUILD_NUMBER,
which automatically downloads the new CDH components after switching to a
new CDH_BUILD_NUMBER. If a build detects that the CDH_BUILD_NUMBER has
changed, it forces an update of the local Maven repository. This helps
prevent build failures caused by a stale local Maven repository, even on a
fresh Git clone.
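A minimal sketch of the change-detection idea (the stamp file name is
hypothetical):

  # Hypothetical sketch: force a Maven update when CDH_BUILD_NUMBER changes.
  stamp="$CDH_COMPONENTS_HOME/cdh-build-number.stamp"
  if [ ! -f "$stamp" ] || [ "$(cat "$stamp")" != "$CDH_BUILD_NUMBER" ]; then
    MVN_ARGS="-U"                # -U forces Maven to refresh its local repo
    echo "$CDH_BUILD_NUMBER" > "$stamp"
  fi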
Testing:
- Manually tested by running buildall.sh with different CDH_BUILD_NUMBER values
Change-Id: Ib0ad9c2258663d3bd7470e6df921041d1ca0c0be
Reviewed-on: http://gerrit.cloudera.org:8080/11099
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The fix for IMPALA-6241 is to increase the timeout for all slow builds.
While testing that fix, I discovered that the ASAN build detection logic
was failing silently, resulting in it assuming that it was testing a
DEBUG build. The error was:
Unexpected DW_AT_name in first CU:
/data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.9.1.src/projects/compiler-rt/lib/asan/asan_preinit.cc;
choosing DEBUG
The fix for that issue is to remove the build-type detection heuristic and
instead write a file containing the build type as part of the build process
(sketched below).
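A minimal sketch of the approach (the file name and variable names are
assumptions):

  # Hypothetical sketch: record the build type at build time, read it later.
  echo "$CMAKE_BUILD_TYPE" > "$IMPALA_HOME/.cmake_build_type"  # during build
  build_type=$(cat "$IMPALA_HOME/.cmake_build_type")           # in test code
  if [ "$build_type" = "ASAN" ]; then
    timeout_multiplier=3          # sanitizer builds get a longer timeout
  fi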
Testing:
Before this change I was able to reproduce the issue locally every 5-10
test iterations. After this change I haven't seen it reproduce.
Change-Id: Ia4ed949cac99b9925f72e19e4adaa2ead370b536
Reviewed-on: http://gerrit.cloudera.org:8080/8652
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Ran into these while compiling (files generated from Kudu), setting up
Eclipse, and testing.
Change-Id: Ife446e40756864f2a19ae4393ac503d17d91996b
Reviewed-on: http://gerrit.cloudera.org:8080/7902
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Previously we could get a developer's shell into a bad state where a
value of a config variable from a previous impala-config.sh version
would override the value from the new impala-config.sh version.
This change adds a new mechanism to override settings locally by adding
them to impala-config-local.sh (see the sketch below). This approach is
more robust because the config variables are reset to their intended
values whenever impala-config.sh is re-sourced.
impala-config-branch.sh can also be used to override settings in a
version-controlled way, e.g. to support having different settings for
different branches.
I did not convert all variables to use this approach, since many people
and Jenkins jobs depend on setting these variables from the environment.
The remaining "sticky" variables are ones where default values should
not change frequently, e.g. source directory locations and build
settings.
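A minimal sketch of the override hooks (assuming they are sourced at the
end of impala-config.sh; the paths are illustrative):

  # Hypothetical sketch: apply overrides after the defaults are set.
  if [ -f "$IMPALA_HOME/bin/impala-config-branch.sh" ]; then
    . "$IMPALA_HOME/bin/impala-config-branch.sh"  # version-controlled overrides
  fi
  if [ -f "$IMPALA_HOME/bin/impala-config-local.sh" ]; then
    . "$IMPALA_HOME/bin/impala-config-local.sh"   # developer-local overrides
  fi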
Change-Id: I930e2ca825142428d17a6981c77534ab0c8e3489
Reviewed-on: http://gerrit.cloudera.org:8080/5545
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
Files in be/ get wiped out by clean.sh if they're listed in .gitignore.
It is easier to just configure this via a project-specific .vimrc file.
Change-Id: I262f7a1ec8daace84a29518ba826c7c3b20fb9e9
Reviewed-on: http://gerrit.cloudera.org:8080/4854
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This has been working for several months, and it was written mainly by
Casey Ching while he was at Cloudera working on Impala.
Change-Id: Ia4bc78ad46dda13e4533183195af632f46377cae
Reviewed-on: http://gerrit.cloudera.org:8080/4820
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.
For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.
Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
By ASF rules, we can't have JARs in releases. The releases are just
tarballs of the repo.
This patch removes the single JAR in the repo, which was a version of a
JAR built during data load, with one string changed. The JAR is used only
for testing.
Instead of building that jar with the different string and saving the
result in git, data loading will now build the jar twice, with one Java
source file slightly changed.
Change-Id: Icee7b8c32b08e064dea4a14624acff6021ef5ce1
Reviewed-on: http://gerrit.cloudera.org:8080/4499
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.
The new structure is as follows:
$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala
$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading
$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests
$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests
$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests
$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests
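With everything under a single root, a Jenkins job can archive all of it in
one step, e.g. (hypothetical usage):

  # Hypothetical archiving step for a Jenkins run.
  tar czf "impala-logs-${BUILD_NUMBER}.tar.gz" -C "$IMPALA_HOME" logs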
I tested this change with a full data load which
was successful.
Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch adds logic to automatically download the pre-built toolchain
packages to the local developer machine using the bootstrap_toolchain.py
script in case they are not present. No manual user intervention is
necessary to initiate the download process.
If desired, the script can always be invoked from a correctly sourced
Impala environment to re-download the dependencies.
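A minimal sketch of the automatic hook in the build scripts (the exact
check is an assumption):

  # Hypothetical sketch: fetch the toolchain only when it is missing.
  if [ ! -d "$IMPALA_TOOLCHAIN" ] || [ -z "$(ls -A "$IMPALA_TOOLCHAIN")" ]; then
    python "$IMPALA_HOME/bin/bootstrap_toolchain.py"  # download packages
  fi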
Change-Id: I636160efeadfac4b5c1feb478da5ae5da0c9fd00
Reviewed-on: http://gerrit.cloudera.org:8080/1429
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Adds a static definition of the metric metadata used by Impala. The
metric names, descriptions, and other properties are defined in
common/thrift/metrics.json file, and the generate_metrics.py script
creates a thrift representation. The metric definitions are then
available in a constant map which is used at runtime to instantiate
metrics, looking them up in the map by the metric key.
New metrics should be defined by adding an entry to the list of metrics
in metrics.json with the following properties:
key: The unique string identifying the metric. If the metric can
be templated, e.g. rpc call duration, it may be a format
string (in the format used by strings::Substitute()).
description: A text description of the metric. May also be a format
string.
label: A brief title for the metric, not currently used by
Impala but provided for external tools.
units: The unit of the metric. Must be a valid value of TUnit.
kind: The kind of metric, e.g. GAUGE or COUNTER. Must be a valid
value of TMetricKind.
contexts: The context in which this metric may be instantiated.
Usually "IMPALAD", "STATESTORED", "CATALOGD", but may be
a different kind of 'entity'. Not currently used by
Impala but provided for modeling purposes for external
tools.
For example, adding the counter for the total number of queries run over
the lifetime of the impalad process might look like:
{
  "key": "impala-server.num-queries",
  "description": "The total number of queries processed.",
  "label": "Queries",
  "units": "UNIT",
  "kind": "COUNTER",
  "contexts": [
    "IMPALAD"
  ]
}
TODO: Incorporate 'label' into the metrics debug page.
TODO: Verify the context at runtime, e.g. DCHECK that 'contexts' contains
the current daemon.
After the metric definition is added, the generate_metrics.py script
will generate the TMetricDefs.thrift that contains a TMetricDef for
the metric definition. At runtime, the metric can be instantiated
using the key defined in metrics.json. Gauges, Counters, and
Properties are instantiated using static methods on MetricGroup. Other
metric types are instantiated using static CreateAndRegister methods
on their associated classes.
TODO: Generate a thrift enum used to lookup metric defs.
TODO: Consolidate the instantiation of metrics that are created
outside of metrics.h (i.e. collection metrics, memory metrics).
TODO: Need a better way to verify if metric definitions are missing.
Change-Id: Iba7f94144d0c34f273c502ce6b9a2130ea8fedaa
Reviewed-on: http://gerrit.cloudera.org:8080/330
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces the concept of error codes for errors that are
recorded in Impala and are going to be presented to the client. These
error codes are used to aggregate and group incoming error / warning
messages, to reduce the clutter in the shell output and to increase the
usefulness of the messages. By splitting the message string from the
implementation, the string can be edited independently of the code, paving
the way for internationalization.
Error messages are defined as a combination of an enum value and a
string. Both are defined in the Error.thrift file that is automatically
generated using the script in common/thrift/generate_error_codes.py. The
goal of the script is to have a central understandable repository of
error messages. Adding new messages to this file will require rebuilding
the thrift part. The proxy class ErrorMessage is responsible for
representing an error and capturing the parameters used to format the
error message string.
Error messages are recorded based on the following algorithm:
- If an error message is of type GENERAL, do not aggregate it; simply add
it to the total number of messages.
- If an error message is of a specific type, record the first occurrence
as a sample and increment a count for all further occurrences.
- The coordinator will merge all error messages except the ones of type
GENERAL and display a count.
For example, in the case of the parquet file spanning multiple blocks
the output will look like:
Parquet files should not be split into multiple hdfs-blocks.
file=hdfs://localhost:20500/fid.parq (1 of 321 similar)
All messages are always logged to VLOG. In the coordinator error
messages are merged across all backends to retain readability in the
case of large clusters.
The current version of this patch adds these new error codes to some of
the most important error messages as a reference implementation.
Change-Id: I1f1811631836d2dd6048035ad33f7194fb71d6b8
Reviewed-on: http://gerrit.cloudera.org:8080/39
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This change adds support for auxiliary workloads, tests, and datasets. This
is useful for augmenting the regular test runs with some additional tests
that do not belong in the main Impala repo.
This adds initial changes for the Impala failure testing library. It also
refactors run-workload into its own module so it can be used in other tests.
The failure testing has two main components. The first is an object model
on top of the Impala services in a cluster. This allows enumerating the
services in the cluster and executing commands on remote machines. This
initial cut is built on top of the CM service to help with
starting/stopping services. The long-term goal is to let this run on both
CM and non-CM clusters, as well as locally.
The other part of the failure injection change is the failure_injector
module, which uses the Impala service abstraction to select and inject
failures into random Impala services.
This failure testing framework hasn't been completely validated because the product code
is not yet ready, but it is important to get this checked in so all new changes to
run-workload are based off this refactor.
Change-Id: I73bf44f0ac881ec17bea7cb05d850b45e2ea5be5
Queries now return rows on both our small (query test) data set and the
10TB data set. This change also fixes a problem with Python not being set
properly and adds support for reporting query results using the geometric
mean.
Change-Id: Ia432148d96645ecda3f63900b3bfbd29c706d886
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the
associated query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query tests,
I moved all existing queries (under fe/src/test/resources/*) to the
workloads directory. You do NOT need to look through all the .test files,
I've just moved them. The one new file is 'hive-benchmark.test', which
contains the hive benchmark queries.
Also added support for generating schemas for different scale factors, as
well as executing against those scale factors. For example, let's say we
have a dataset with a scale factor called "SF3". We would first generate
the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names that are distinct from those of the
other scale factors.
Run the generated .sql file to load the data. Alternatively, the data can
be loaded by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
This adds most of the Hive TPCH queries to the functional Impala tests.
This code review doesn't actually include the TPCH data; the data set is
relatively large. Instead, I updated scripts to copy the data from a data
host.
This change has a few parts:
1) Update the benchmark schema generation/test vector generation to be more
generic. This way we can use the same schema creation/data loading steps for
TPCH as we do for benchmark tests.
2) Add in schema template for the TPCH workload along with test vectors and
dimensions which are used for schema generation.
3) Add in a new test file for each TPC-H query. The Hive TPCH work broke
the queries down to generate some "temp" tables, then execute joins/selects
against those temp tables. Since creating the temp tables does some real
work, it is good to execute these via Impala. Each test: a) runs all the
INSERT statements to generate the temp tables, and b) runs the additional
TPCH queries.
4) Updated all the TPCH insert statements and queries to be parameterized on
$TABLE name. This way we can run the tests across all combinations of file
format/compression/etc.
5) Updated data loading
Change-Id: I6891acc4c7464eaf1dc7dbbb532ddbeb6c259bab
This change updates the Impala performance schema and test vector
generation techniques. It also migrates the existing benchmark scripts from
Ruby to Python. The change has a few parts:
1) Conversion of test vector generation and benchmark statement generation
from Ruby to Python. As a result, the benchmark test vector and dimension
files are now written in CSV format (Python doesn't have built-in YAML
support).
2) Standardize the naming of benchmark tables (to somewhat match query
tests). In general the form is:
* If file_format=text and compression=none, do not use a table suffix
* Abbreviate sequence file as (seq), rc file as (rc), etc.
* If using BLOCK compression, don't append anything to the table name; if
using 'record' compression, append 'record'
3) Created a new way of adding new schemas: the
benchmark_schema_template.sql file. The generate_benchmark_statements.py
script reads this in and breaks up the sections. The section format is:
====
Data Set Name
---
BASE table name
---
CREATE STATEMENT Template
---
INSERT ... SELECT * format
---
LOAD Base statement
---
LOAD STATEMENT Format
Where the BASE table name identifies a table from which the other file
format/compression variants can be generated. This would generally be a
local file.
The thinking is that if the files already exist in HDFS then we can just
load the files directly rather than issue an INSERT ... SELECT * statement.
The generate_benchmark_statements.py script has been updated to use this
new template, and to query HDFS for each table to determine how it should
be created. It then outputs an ideal file called
load-benchmark-*-generated.sql. Since this file is generated dynamically,
we can remove the old benchmark statement files.
4) This has been hooked into load-benchmark-data.sh, and run_query has been
updated to use the new format as well.