Ubuntu 20 builds have been using the toolchain built for Ubuntu 18.
Now that Ubuntu 20 has been added to the toolchain, this switches
Impala to a toolchain version with Ubuntu 20 support and uses the
Ubuntu 20 binaries. This is expected to help with IMPALA-10962.
Testing:
- Ran a core build on Ubuntu 20
Change-Id: If2394b668ef3c56b1a4c0773fd5e4ff92be4a846
Reviewed-on: http://gerrit.cloudera.org:8080/18559
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are some API changes in the new version of the protobuf library.
This patch makes the necessary changes for the Impala code to compile.
Two tarballs of protobuf 3.14.0 are built in the toolchain:
protobuf-3.14.0 is used for normal builds and
protobuf-3.14.0-clangcompat-p2 for Clang builds.
Bump up Kudu to the latest version 67ba3cae45.
Testing:
- Passed core DEBUG build and exhaustive release build.
- Passed core ASAN build.
Change-Id: Ia1df4faceff9fda169c9d15fe8b1e69cfabe0d43
Reviewed-on: http://gerrit.cloudera.org:8080/17948
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As part of moving to a newer protobuf, this updates the Kudu version
to get the fix for KUDU-3334. With this newer Kudu version, Clang
builds hit an error while linking:
lib/libLLVMCodeGen.a(TargetPassConfig.cpp.o):TargetPassConfig.cpp:
function llvm::TargetPassConfig::createRegAllocPass(bool):
error: relocation refers to global symbol "std::call_once<void (&)()>(std::once_flag&, void (&)())::{lambda()#2}::_FUN()",
which is defined in a discarded section
section group signature: "_ZZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ENKUlvE0_clEv"
prevailing definition is from ../../build/debug/security/libsecurity.a(openssl_util.cc.o)
(This is from a newer binutils that will be pursued separately.)
As a hack to get around this error, this adds the calloncehack
shared library. The shared library publicly defines the symbol that
was coming from kudu_client. By linking it ahead of kudu_client, the
linker uses that rather than the one from kudu_client. This fixes
the Clang builds.
The new Kudu also requires a minor change to the flags for tserver
startup.
Testing:
- Ran debug tests and verified calloncehack is not used
- Ran ASAN tests
Change-Id: Ieccbe284f11445e1de792352ebc7c9e1fa2ca0c3
Reviewed-on: http://gerrit.cloudera.org:8080/18129
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch added functionality to download a JWKS from a given URL and
support key rotation by periodically checking the JWKS URL for updates.
We use Kudu's EasyCurl wrapper to download the file from the given URL.
curl was added to the native toolchain. This patch modified makefiles
and bootstrap_toolchain.py to integrate libcurl and libkudu_curl_util.
Added end-to-end JWT authentication test cases with the JWKS specified
as an HTTP/HTTPS URL.
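The production code is C++ built on Kudu's EasyCurl, but the
download-plus-periodic-refresh idea can be sketched in Python; the class
name, interval, and URL handling below are illustrative, not Impala's
actual API:

  import json
  import threading
  import time
  import urllib.request

  class JwksRefresher(object):
      """Keeps an in-memory copy of a JWKS that is re-fetched periodically."""

      def __init__(self, jwks_url, refresh_interval_s=60):
          self.jwks_url = jwks_url
          self.refresh_interval_s = refresh_interval_s
          self._lock = threading.Lock()
          self._keys = {}

      def refresh(self):
          # Download the JWKS JSON and index the keys by key id ("kid").
          with urllib.request.urlopen(self.jwks_url) as resp:
              jwks = json.load(resp)
          keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
          with self._lock:
              self._keys = keys

      def key_for(self, kid):
          with self._lock:
              return self._keys.get(kid)

      def start(self):
          self.refresh()  # fail fast if the URL is unreachable at startup
          def loop():
              while True:
                  time.sleep(self.refresh_interval_s)
                  try:
                      self.refresh()  # picks up rotated keys
                  except Exception:
                      pass  # keep the last good key set if a fetch fails
          threading.Thread(target=loop, daemon=True).start()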
Testing:
- Passed core run, including new test cases.
Change-Id: Ic6ac8cf0010c13db30219776d1d275709bf211df
Reviewed-on: http://gerrit.cloudera.org:8080/17802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch modifies the minicluster script to optionally use Apache
Hive 3.1.2 instead of CDP Hive 3.1.3.
In order to make sure that existing setups don't break, this is
enabled via an environment variable override in bin/impala-config.sh.
When the environment variable USE_APACHE_HIVE is set to true, the
bootstrap_toolchain script downloads the Apache Hive 3.1.2 tarballs and
extracts them into the toolchain directory. These binaries are used to
start the Hive services (Hiveserver2 and metastore). The default is
CDP Hive 3.1.3.
Since CDP Hive 3 uses some features of Apache Hive 4, this patch uses
a different database name so that it is easy to switch from one
environment which uses the CDP Hive 3.1.3 metastore to another which
uses the Apache Hive 3.1.2 metastore.
In order to start a minicluster which uses Apache Hive 3.1.2 users
should follow the steps below:
1. Make sure that the minicluster, if running, is stopped before you
run the following commands.
2. Open a new terminal and run the following commands.
> export USE_APACHE_HIVE=true
> source bin/impala-config.sh
> bin/bootstrap_toolchain.py
The above command downloads the Apache Hive 3.1.2 tarballs and
extracts them into the toolchain/apache_components directory.
> rm $HIVE_HOME/lib/guava-*jar
> cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-*.jar $HIVE_HOME/lib/
The above commands work around HIVE-22915.
> bin/create-test-configuration.sh -create_metastore
The "-create_metastore" flag should be provided only the first time,
so that a new metastore db is created and the Apache Hive 3.1.2 schema
is initialized.
> testdata/bin/run-all.sh
Follow-up:
- Add MetastoreShim to support Apache Hive 3.x in IMPALA-10871
Tests:
- Made sure that the cluster comes up with Apache Hive 3.1.2 when the
steps above are performed.
- Made sure that existing scripts work as they do currently when
USE_APACHE_HIVE is not set.
Change-Id: I1978909589ecacb15d32d874e97f050a85adf1f6
Reviewed-on: http://gerrit.cloudera.org:8080/17793
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch added JWT support with the following functionality:
* Load and parse a JWKS from a pre-installed JSON file.
* Read the JWT token from the HTTP header.
* Verify the JWT's signature with the public key in the JWKS.
* Get the username out of the payload of the JWT token.
* Support the following JSON Web Algorithms (JWA):
  HS256, HS384, HS512, RS256, RS384, RS512.
We use the third-party library jwt-cpp to verify JWT tokens. jwt-cpp is
a header-only C++ library. It was added to the native toolchain.
This patch modified bootstrap_toolchain.py to download jwt-cpp from the
toolchain S3 bucket, and modified makefiles to add jwt-cpp/include
to the include path.
Added BE unit tests for loading the JWKS file and verifying JWT tokens.
Also added an FE custom cluster test for JWT authentication.
Testing:
- Passed core run.
Change-Id: I6b71fa854c9ddc8ca882878853395e1eb866143c
Reviewed-on: http://gerrit.cloudera.org:8080/17435
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch Impala mainly used Thrift 0.9.3, but it was
possible to compile Impala shell with Thrift 0.11.0, so the 0.11.0
Thrift lib was already included in the toolchain.
Most of the changes are related to replacing boost:: with std::
shared_ptrs in C++ code (this is a continuation of a patch by Sahil).
The Thrift upgrade also needs an Impyla release with Thrift 0.11.0, as
Impala's test framework relies on Impyla. A thrift_sasl release is also
needed, because it currently pins the Thrift version to 0.9.3 for
Python 2. The current patch uses alpha releases of Impyla and
thrift_sasl that use Thrift 0.11.0.
Notable side effects:
- old logic to compile Thrift for impala-shell with 0.11.0 was removed
- impala_shell's utf8 handling had to be updated as the new 0.11.0
compilation happens with no_utf8strings. This also made things a
bit faster, e.g. the following takes ~0.22s instead of ~0.25s:
shell/impala_shell.py \
-B -q "select * from functional_parquet.alltypes;" > /dev/null
- THRIFT-3921 changed the stream operators to print an enum's name
instead of its number, leading to slightly different messages
in some cases.
- "templates" was added to the thift generator's parameters to avoid
a compilation issue (related to IMPALA-10600). I didn't notice any
change in compilation time. This option generated .tcc files with
templetized readers/writers for Thrift types. Currently we don't
use these, but they could potentially speed up (de)serialization.
Testing:
- ran Impyla's test suite with Python 2 and 3
- ran core tests
Change-Id: Idd13f177b4f7acc07872ea6399035aa180ef6ab6
Reviewed-on: http://gerrit.cloudera.org:8080/17170
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recent Red Hat Enterprise Linux 8.x versions return a shorter release
name string from `lsb_release -si` than earlier versions. This shorter
string was not recognized by the OS mapper logic in
bin/bootstrap_toolchain.py, making it -- and the build process -- break
on Red Hat 8.2.
The patch adds the shorter signature as a point fix.
Rename a local variable to fix an unrelated name conflict (shadowing)
found by flake8.
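A minimal sketch of the kind of OS mapping involved; the table entries
and label format are illustrative, not the actual bootstrap_toolchain.py
code:

  # Map the string from `lsb_release -si` to a toolchain package label.
  # Older RHEL releases report "RedHatEnterpriseServer"; recent 8.x
  # releases report the shorter "RedHatEnterprise", so both spellings
  # map to the same label.
  OS_MAPPING = {
      "centos": "ec2-package-centos-{major}",
      "redhatenterpriseserver": "ec2-package-centos-{major}",
      "redhatenterprise": "ec2-package-centos-{major}",
  }

  def toolchain_label(lsb_id, lsb_release):
      """lsb_id is `lsb_release -si`; lsb_release is `lsb_release -sr`, e.g. '8.2'."""
      key = lsb_id.strip().lower()
      if key not in OS_MAPPING:
          raise Exception("Unsupported OS distribution: %s" % lsb_id)
      major = lsb_release.split(".")[0]
      return OS_MAPPING[key].format(major=major)

  assert (toolchain_label("RedHatEnterprise", "8.2")
          == toolchain_label("RedHatEnterpriseServer", "8.2"))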
Tests: run bin/bootstrap_toolchain.py manually on Red Hat 8.2, then run
a complete build on the same OS.
Regression-tested (build and dataload only) on the following versions:
- Centos 8.2 (as opposed to Red Hat 8.2)
- Centos 7.9
- Ubuntu 18.04
Change-Id: Icb1a6c215b1b5a65691042bb7d94fb034392d135
Reviewed-on: http://gerrit.cloudera.org:8080/17292
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
- Add HIVE_VERSION_OVERRIDE, HIVE_STORAGE_API_VERSION_OVERRIDE,
HIVE_METASTORE_THRIFT_DIR_OVERRIDE, HIVE_HOME_OVERRIDE environment
variable support to impala-config.sh
- When used together with HIVE_SRC_DIR_OVERRIDE, these allow a user to
specify a locally compiled version of Hive for development and the
minicluster
- Hive jars are expected to have been installed into the local maven
repository
- Currently only version 3 of Hive is supported due to the absence of
API shims for Hive 4.0
Example:
~/hive $ mvn package install -Pdist -DskipTests
Example configuration:
export HIVE_VERSION_OVERRIDE=3.1.0-SNAPSHOT
export HIVE_STORAGE_API_VERSION_OVERRIDE=2.6.0
export HIVE_HOME_OVERRIDE=\
~/hive/packaging/target/apache-hive-3.1.0-SNAPSHOT-bin/apache-hive-3.1.0-SNAPSHOT-bin
export HIVE_SRC_DIR_OVERRIDE=~/hive
export HIVE_METASTORE_THRIFT_DIR_OVERRIDE=~/hive/standalone-metastore/src/main/thrift/
Change-Id: I21892c153c445e3a5d93f2bc8f5e0b799929dd34
Reviewed-on: http://gerrit.cloudera.org:8080/17094
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20.04
This is a minor amendment to a previously merged change with
ChangeId I4f592f60881fd8f34e2bf393a76f5a921505010a, to address
additional review comments. In particular, the original commit
referred to Ubuntu 20.4 whereas it should have used Ubuntu 20.04.
Change-Id: I7db302b4f1d57ec9aa2100d7589d5e814db75947
Reviewed-on: http://gerrit.cloudera.org:8080/16241
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Including the following changes:
1. Build the native toolchain locally by script on the aarch64 platform.
2. Change some of the native toolchain's lib version numbers.
3. Split SKIP_TOOLCHAIN_BOOTSTRAP and DOWNLOAD_CDH_COMPONENTS into two
separate settings, because on aarch64 we only need to download the CDP
components, not the toolchain.
4. Download the Hadoop aarch64 native libs, which the Impala build needs.
With this commit, on the Ubuntu 18.04 aarch64 version, you just need to
run bin/bootstrap_development.sh, just like on x86.
Change-Id: I769668c834ab0dd504a822ed9153186778275d59
Reviewed-on: http://gerrit.cloudera.org:8080/16065
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If DownloadUnpackTarball::download()'s wget_and_unpack_package call
hits an exception, the exception handler cleans up any created
directories. Currently, it erroneously cleans up the directory where
the tarballs are downloaded even when it is not a temporary directory.
This would delete the entire toolchain.
This fixes the cleanup to only delete that directory if it is a
temporary directory.
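A minimal sketch of the corrected cleanup, with the download wrapped in
a caller-supplied function so the example stays self-contained (names
are simplified, not the actual bootstrap_toolchain.py code):

  import shutil
  import tempfile

  def download_and_unpack(fetch_fn, download_dir=None):
      """fetch_fn stands in for wget_and_unpack_package: it downloads and
      unpacks into the given directory and may raise on failure."""
      is_temp_dir = download_dir is None
      if is_temp_dir:
          download_dir = tempfile.mkdtemp()
      try:
          fetch_fn(download_dir)
      except Exception:
          # Only clean up the staging directory if this call created it;
          # removing a caller-provided directory here could wipe out the
          # whole toolchain.
          if is_temp_dir:
              shutil.rmtree(download_dir, ignore_errors=True)
          raise
      return download_dir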
Testing:
- Simulated exception from wget_and_unpack_package and verified
behavior.
Change-Id: Ia57f56b6717635af94247fce50b955c07a57d113
Reviewed-on: http://gerrit.cloudera.org:8080/16294
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20.4
This work addresses the current limitation in the Impala development
environment that Ubuntu 20.4 is not supported. The fix modifies
bootstrap_system.sh and bootstrap_toolchain.py to specifically
allow bootstrapping the development environment on a machine
running Ubuntu 20.4. Limited use shows that the environment is useful
and stable, similar to the one running on Ubuntu 18.4.
Testing on a box running Ubuntu 20.4:
1. Successfully bootstrapped the entire Impala development environment
2. Interacted with the environment through the following tools:
gdb
jdb
clang-format
impalad GUI
vim
3. Ran all tests
Limitations found with the Ubuntu 20.4 environment:
1. gdb in the Impala toolchain is not compatible with the Impala C++ test
code ${IMPALA_HOME}/be/build/latest/service/unifiedbetests (invoked by
${IMPALA_HOME}/be/build/latest/scheduling/admission-controller-test)
and reports the following error after attaching to the test process:
BFD (GNU Binutils) 2.25.51 internal error, aborting at elf64-x86-64.c
line 5583 in elf_x86_64_get_plt_sym_val
Change-Id: I4f592f60881fd8f34e2bf393a76f5a921505010a
Reviewed-on: http://gerrit.cloudera.org:8080/16238
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
bin/bootstrap_toolchain.py failed to recognize SLES12 SP5, which broke
builds on that platform.
This patch simplifies OS version parsing and matching for SLES, omitting
the check for the OS minor version, which for SLES indicates the SP level.
This is similar to how Red Hat variants are handled. This gets
rid of the need for a constant update whenever a new SP level is released
for SLES12.
This is enabled by the native toolchain sharing a single set of artifacts
between all the SLES12 SP levels.
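The matching change amounts to keying SLES artifacts off the major
version only, similar to Red Hat; a simplified sketch with an
illustrative label format:

  def sles_toolchain_label(os_release):
      """Map a SLES release string such as '12.5' (SLES12 SP5) to a single
      toolchain label. Only the major version matters: all SLES12 service
      packs share one set of toolchain artifacts, so '12', '12.4' and '12.5'
      resolve to the same label and no update is needed for a new SP."""
      major = os_release.split(".")[0]
      return "ec2-package-sles-%s" % major

  assert sles_toolchain_label("12.5") == sles_toolchain_label("12.4")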
Test: ran a successful build on a SLES12sp5 box.
Change-Id: Id9ada210b915050fbceebb7364e130116e9244d0
Reviewed-on: http://gerrit.cloudera.org:8080/16102
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.
This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.
Testing:
- The only impediment to building with
IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
Impala-lzo. With a custom Impala-lzo, compilation succeeds.
Either Impala-lzo will be fixed or it will be removed.
- Core tests
Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.
Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
"ranger" requires extra parameters to be set. This changes the
default value of authorization_provider to "", which translates
internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
- authorization_policy_provider_class
- sentry_catalog_polling_frequency_s
- sentry_config
3. The authorization_factory_class may be obsolete now that
there is only one authorization policy, but this leaves it
in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
that is removed. There are still Maven dependencies coming
from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
is not removed and it is still called from testdata/bin/kill-all.sh.
Testing:
- Core job passes
Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
includes USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9626 broke the use case where the toolchain binaries are not
downloaded from the native-toolchain S3 bucket, because
SKIP_TOOLCHAIN_BOOTSTRAP is set to true.
Fix this use case by checking SKIP_TOOLCHAIN_BOOTSTRAP in
bin/bootstrap_environment.py:
- if true: just check if the specified version of the Python binary is
present at the expected toolchain location. If it is there, use it,
otherwise throw an exception and abort the bootstrap process.
- in any other case: proceed to download the Python binary as in
bootstrap_toolchain.py.
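A minimal sketch of this check, assuming simplified paths and names
rather than the actual script:

  import os

  def check_toolchain_python(toolchain_dir, python_version):
      """Returns the toolchain Python path, or None if it still needs to be
      downloaded. Raises if SKIP_TOOLCHAIN_BOOTSTRAP=true and it is missing."""
      skip = os.environ.get("SKIP_TOOLCHAIN_BOOTSTRAP", "false").lower() == "true"
      python_path = os.path.join(
          toolchain_dir, "python-%s" % python_version, "bin", "python")
      if os.path.exists(python_path):
          return python_path  # already provisioned, use it as-is
      if skip:
          # The caller promised a pre-populated toolchain; do not download.
          raise Exception("SKIP_TOOLCHAIN_BOOTSTRAP=true but %s is missing; "
                          "aborting the bootstrap process" % python_path)
      return None  # the caller proceeds to download it as usual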
Test:
- simulate the custom toolchain setup by downloading the toolchain
binaries from the S3 bucket, copying them to a separate directory,
symlinking them into Impala/toolchain, then executing buildall.sh
with SKIP_TOOLCHAIN_BOOTSTRAP set to "true".
Change-Id: Ic51b3c327b3cebc08edff90de931d07e35e0c319
Reviewed-on: http://gerrit.cloudera.org:8080/15759
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Historically Impala used the Python2 version that was available on
the hosting platform, as long as that version was at least v2.6.
This caused constant headache as all Python syntax had to be kept
compatible with Python 2.6 (for Centos 6). It also caused a recent problem
on Centos 8: here the system Python version was compiled with the
system's GCC version (v8.3), which was much more recent than the Impala
standard compiler version (GCC 4.9.2). When the Impala virtualenv was
built, the system Python version supplied C compiler switches for modules
containing native code that were unknown to the Impala version of GCC,
thus breaking virtualenv installation.
This patch changes the Impala virtualenv to always use the Python2
version from the toolchain, which is built with the toolchain compiler.
This ensures that
- Impala always has a known Python 2.7 version for all its scripts,
- virtualenv modules based on native code will always be installable, as
the Python environment and the modules are built with the same compiler
version.
Additional changes:
- Add an auto-use fixture to conftest.py to check that the tests are
being run with Python 2.7.x
- Make bootstrap_toolchain.py independent from the Impala virtualenv:
remove the dependency on the "sh" library
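The auto-use fixture mentioned above could look roughly like this
(a sketch, not the exact conftest.py code):

  import sys
  import pytest

  @pytest.fixture(autouse=True, scope="session")
  def check_python_version():
      # The test framework now assumes the toolchain Python; fail fast if the
      # tests were launched with some other interpreter version.
      assert sys.version_info[:2] == (2, 7), \
          "Tests must be run with Python 2.7.x, got %s" % sys.version.split()[0]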
Tests:
- Passed core-mode tests on CentOS 7.4
- Passed core-mode tests in Docker-based mode for centos:7
and ubuntu:16.04
Most of the content in this patch was developed earlier by Tim
Armstrong but not published.
Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
Reviewed-on: http://gerrit.cloudera.org:8080/15624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
CentOS 8.1 is a new major version of the CentOS family.
It is now stable and popular enough to start supporting it for Impala
development.
Prepare a raw CentOS 8.1 system to support Impala development and testing.
This should work on a standalone computer, on a virtual machine,
or inside a Docker container.
Details:
- snappy-devel moved to the PowerTools repo, so it needs to be installed
from there
- CentOS 8 has no default Python version. The bootstrap script installs
(or configures) Python2 with pip2, then makes them the default via the
"alternatives" mechanism. The installer is adaptive, it performs only
the necessary steps, so it works in various environments.
The installer logic is also shared between bin/bootstrap_system.sh and
docker/entrypoint.sh
- The toolchain package tag "ec2-centos-8" is added to
bootstrap_toolchain.py
- For some unknown reason, when the downloaded Maven tarball is extracted
in a Docker-based test, the "bin" and "boot" directories are created
with owner-only permissions. The 'impdev' user has no access to the
Maven executable, which then breaks the build.
This patch forcibly restores the correct permissions on these
directories; this is a no-op when the extraction happens correctly.
- TOOLCHAIN_ID is bumped to a build that already has CentOS 8 binaries.
- Centos8-specific bootstrap code was added to the Docker-based tests.
Tested:
- ran the Docker-based tests with --base-image=centos:8 to verify the following build
phases are successful:
* system prep
* build
* dataload
and that tests can start. Passing all tests was not a requirement for this
step, although plausible test results (i.e. not all of the tests failing) were.
- ran the Docker-based tests to verify nonregression with --base-image set to the
following: centos:7, ubuntu:16.04, ubuntu:18.04.
On centos:7 and ubuntu:16.04 the only failure was IMPALA-9097 (BE tests fail without
the minicluster running); ubuntu:18.04 showed the same failures as the current upstream
code.
- passed a core-mode test run on private infrastructure on Centos 7.4
- ran buildall.sh in core mode manually inside a Docker container, simulating a developer
workflow (prep-build-dataload-test). There were several observed test failures, but
the workflow itself was run to completion with no problems.
Change-Id: I3df5d48eca7a10219264e3604a4f05f072188e6e
Reviewed-on: http://gerrit.cloudera.org:8080/15623
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change the preferred way of getting Kudu was to pull
it in from the specified CDH build (even if USE_CDP_HIVE was set
to true). Optionally by setting USE_CDH_KUDU to false, one could
force Impala to use the native toolchain Kudu. But even then, the
Kudu Java artifacts would be downloaded from CDH.
Since Kudu VARCHAR support won't be backported to CDH, this
behavior blocks the Impala side of the Kudu/Impala VARCHAR
integration.
With this change:
1. Using the native toolchain Kudu (including the Java artifacts)
is the default behavior. From now on USE_CDH_KUDU will be set
to false by default. Impala can be forced to fall back on
using the CDH Kudu by explicitly setting USE_CDH_KUDU to true.
2. Kudu version is updated to include the VARCHAR support.
Testing:
Ran exhaustive tests with USE_CDH_KUDU=true and
USE_CDH_KUDU=false.
Change-Id: Iafe56342d43cb63e35c0bbb1b4a99327dda0a44a
Reviewed-on: http://gerrit.cloudera.org:8080/15134
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The kerberized minicluster is enabled by setting
IMPALA_KERBERIZE=true in impala-config-*.sh.
After setting it you must run ./bin/create-test-configuration.sh and
then restart the minicluster.
This adds a script to partially automate setup of a local KDC,
in lieu of the unmaintained minikdc support (which has been ripped
out).
Testing:
I was able to run some queries against pre-created HDFS tables
with kerberos enabled.
Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee
Reviewed-on: http://gerrit.cloudera.org:8080/15159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Bump our ORC version to include fixes for ORC-414, ORC-580, ORC-581,
ORC-586, ORC-589, ORC-590, and ORC-591. The new ORC version also
unblocks IMPALA-9226 which requires EncodedStringVectorBatch introduced
in ORC-1.6.
Due to other changes in native-toolchain, this patch also bumps versions
of LLVM and crcutil.
Tests:
- Run scanners test for orc/def/block.
Change-Id: I7eec92238b12179502d6a9001ee2ba24bfa96b77
Reviewed-on: http://gerrit.cloudera.org:8080/15089
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
bin/bootstrap_toolchain.py has accumulated complexity over time.
CDH, CDP, and the native toolchain all use different download
machinery and naming. One feature that is needed on the CDP side
is the ability to specify the download URL in an IMPALA_*_URL
environment variable.
This adds that support and refactors CDH and native toolchain
downloads to use the new system. This is essentially a rewrite
of bin/bootstrap_toolchain.py.
Currently, there are multiple phases of downloads, each with their
own download functions and peculiarities to account for package
names and destinations for downloads. This changes the logic
so that a package will generate a DownloadUnpackTarball that is
completely resolved. It contains everything about what to download
and where to put it as well as a needs_download() function and a
download() function. Once there is a list of DownloadUnpackTarball
objects, they can all be downloaded and unpacked in a single phase.
This implements different types of packages as subclasses of
DownloadUnpackTarball. Since most subclasses want to be able to
construct URLs and archive names using templates, the
TemplatedDownloadUnpackTarball takes the same arguments as
DownloadUnpackTarball along with a map of template substitutions,
which are applied to all string arguments.
Kudu requires special handling and gets its own set of subclasses
to handle various subtleties like toolchain vs CDH Kudu, the Kudu
stub, and making sure that the "kudu" package and the "kudu-java"
package don't confuse each other.
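In rough outline the structure looks like the sketch below; the
constructor arguments, template syntax, and the simplified download()
body are illustrative, not the exact bootstrap_toolchain.py code:

  import os
  import tarfile
  import tempfile
  import urllib.request

  class DownloadUnpackTarball(object):
      """A fully resolved download: what to fetch and where to unpack it."""

      def __init__(self, url, archive_name, destination_basedir, directory_name):
          self.url = url
          self.archive_name = archive_name
          self.destination_basedir = destination_basedir
          self.unpack_dir = os.path.join(destination_basedir, directory_name)

      def needs_download(self):
          # A package is skipped if its unpack directory already exists.
          return not os.path.isdir(self.unpack_dir)

      def download(self):
          # Simplified: the real code wraps wget_and_unpack_package with
          # retries, temp-dir staging and error handling.
          local_archive = os.path.join(tempfile.mkdtemp(), self.archive_name)
          urllib.request.urlretrieve(self.url, local_archive)
          with tarfile.open(local_archive) as tar:
              tar.extractall(self.destination_basedir)

  class TemplatedDownloadUnpackTarball(DownloadUnpackTarball):
      """Same as the base class, but string arguments may contain templates
      (e.g. '{version}') that are substituted from a map."""

      def __init__(self, url, archive_name, destination_basedir, directory_name,
                   template_subs):
          sub = lambda s: s.format(**template_subs)
          super(TemplatedDownloadUnpackTarball, self).__init__(
              sub(url), sub(archive_name), sub(destination_basedir),
              sub(directory_name))

  def bootstrap(packages):
      # Once every package is expressed as one of these objects, the whole
      # bootstrap collapses into a single download-and-unpack phase.
      for p in packages:
          if p.needs_download():
              p.download()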
As part of this change, USE_CDP_HIVE=true now uses the CDP version
of HBase rather than always using the CDH version.
Change-Id: I67824fd82b820e68e9f5c87939ec94ca6abadb8c
Reviewed-on: http://gerrit.cloudera.org:8080/13432
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-8503 added downloading kudu-hive.jar and adding it to
HADOOP_CLASSPATH in run-hive-server.sh to allow the Hive Metastore to
start with Kudu's HMS plugin.
There are two problems with this that are fixed by this patch:
- Previously, we fully specified the expected jar filename based on the
value of IMPALA_KUDU_JAVA_VERSION when adding it to HADOOP_CLASSPATH,
but this is overly restrictive for users who may wish to override
this value in impala-config-branch.sh to build their own branch with
a different version of the kudu-hive.jar. This patch relaxes this
restriction by adding any jar in IMPALA_KUDU_JAVA_HOME whose name
contains the string kudu-hive to HADOOP_CLASSPATH.
- In bootstrap_toolchain, we don't download a package if its directory
already exists. Since the 'kudu' and 'kudu-java' packages download
to the same directory, this led to a race condition where
'kudu-java' might not be downloaded if 'kudu' had already been
unpacked when it started. This patch fixes this by inspecting the
contents of the Kudu package directory to look for specific files
expected for each Kudu package.
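A simplified sketch of that directory-content inspection; the marker
file patterns are illustrative, not the exact ones used:

  import glob
  import os

  # Marker files that tell the two Kudu packages apart even though they
  # unpack into the same directory.
  KUDU_PACKAGE_MARKERS = {
      "kudu": ["debug/lib/libkudu_client*", "release/lib/libkudu_client*"],
      "kudu-java": ["kudu-hive*.jar"],
  }

  def kudu_package_needs_download(kudu_dir, package):
      """True unless at least one file expected for this Kudu package is
      present, even when the shared directory itself already exists."""
      patterns = KUDU_PACKAGE_MARKERS[package]
      return not any(glob.glob(os.path.join(kudu_dir, p)) for p in patterns)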
Change-Id: I4ac79c3e9b8625ba54145dba23c69fd5117f35c7
Reviewed-on: http://gerrit.cloudera.org:8080/13542
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Reviewed-by: Hao Hao <hao.hao@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make zstd headers and libs
accessible.
Class ZstandardCompressor/ZstandardDecompressor was added to provide
interfaces for calling ZSTD_compress/ZSTD_decompress functions. Zstd
supports different compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since the
negative values represent uncompressed data they won't be supported. The default
clevel is ZSTD_CLEVEL_DEFAULT.
HdfsParquetTableWriter was updated to support the ZSTD codec. The
new codec can be set using the existing query option as follows:
set COMPRESSION_CODEC=ZSTD:<clevel>;
set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
Testing:
- Added a unit test in the DecompressorTest class with the
ZSTD_CLEVEL_DEFAULT clevel and a random clevel. The test decompresses
compressed input data and validates the result. It also tests for
expected behavior when passing an over/under-sized buffer for
decompressing.
- Added unit tests for valid/invalid values for COMPRESSION_CODEC.
- Added an e2e test in test_insert_parquet.py which tests writing and
reading (null/non-null) data into/from a table (with different data type
columns) using multiple codecs. Other existing e2e tests were
updated to also use the parquet/zstd table format.
- Manual interoperability tests were run between Impala and Hive.
Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch allows starting the Hive Metastore with the Kudu plugin, which
is required for enabling Kudu's integration with the HMS. The Kudu plugin
is downloaded and extracted from the native-toolchain S3 bucket.
Change-Id: I4bd1488ced51840ec986d29ed371e26168abcc76
Reviewed-on: http://gerrit.cloudera.org:8080/13319
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
This switches away from Tez local mode to tez-on-YARN. After spending a
couple of days trying to debug issues with Tez local mode, it seemed
like it was just going to be too much of a lift.
This patch switches on the starting of a Yarn RM and NM when
USE_CDP_HIVE is enabled. It also switches to a new yarn-site.xml with a
minimized set of configurations, generated by the new python templating.
In order for everything to work properly I also had to update the Hadoop
dependency to come from CDP instead of CDH when using CDP Hive.
Otherwise, the classpath of the launched Tez containers had conflicting
versions of various Hadoop classes which caused tasks to fail.
I verified that this fixes concurrent query execution by running queries
in parallel in two beeline sessions. With local mode, these queries
would periodically fail due to various races (HIVE-21682). I'm also able
to get farther along in data loading.
Change-Id: If96064f271582b2790a3cfb3d135f3834d46c41d
Reviewed-on: http://gerrit.cloudera.org:8080/13224
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
Since CDP_BUILD_NUMBER was bumped to 1056671, the name of the Hive
source tarball changed. Not only did the tarball name change, the
directory it gets extracted to is also different from the tarball name
itself. Due to this, bootstrap_toolchain.py fails to detect that the
downloaded Hive source component already exists and downloads it again
unnecessarily.
This patch improves bootstrap_toolchain.py to handle non-standard
tarfiles which extract to a different directory name compared to the
tar file.
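Conceptually, each package description now carries an explicit unpack
directory when it differs from the tarball-derived name; a simplified
sketch with illustrative class and argument names:

  import os

  class CdpComponent(object):
      """A CDP tarball whose unpack directory may differ from the tarball name."""

      def __init__(self, basedir, tarball_name, unpack_directory=None):
          self.tarball_name = tarball_name
          # By default a tarball "foo.tar.gz" is assumed to unpack to "foo";
          # non-standard tarballs pass an explicit directory instead.
          default_dir = tarball_name.replace(".tar.gz", "")
          self.unpack_dir = os.path.join(basedir, unpack_directory or default_dir)

      def already_downloaded(self):
          # Checking the real unpack directory (not the tarball-derived name)
          # is what prevents the unnecessary re-download.
          return os.path.isdir(self.unpack_dir)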
Testing done:
1. Removed the local toolchain and ran the script a couple of times to
make sure that it downloads the Hive tarball only once.
Change-Id: Ifd04a1a367a0cc4aa0a2b490a45fbc93a862c83a
Reviewed-on: http://gerrit.cloudera.org:8080/13219
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change bumps the CDP_BUILD_NUMBER to 1056671 which includes all the
Hive and Tez patches required for building against Hive 3. With this
change we get rid of the custom builds for Hive and Tez introduced in
IMPALA-8369 and switch to more official sources of builds for the
minicluster.
Notes:
1. The tarball names and the directory to which they extract changed
from the previous CDP_BUILD_NUMBER. Due to this we need to change
bootstrap_toolchain.py and impala-config.sh so that the Hive environment
variables are set correctly.
Testing Done:
1. Built against Hive-3 and Hive-2 using the flag USE_CDP_HIVE
2. Did basic testing from Impala and Beeline for testing the Tez
patch
3. Currently running the full suite of tests to make sure there are no
regressions
Change-Id: Ic758a15b33e89b6804c12356aac8e3f230e07ae0
Reviewed-on: http://gerrit.cloudera.org:8080/13213
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a compatibility shim in fe so that Impala can
interoperate with Hive 3.1.0. It moves the existing Metastoreshim class
to a compat-hive-2 directory and adds a new Metastoreshim class under
a compat-hive-3 directory. These shim classes implement methods which are
different in Hive 2 vs. Hive 3 and are used by frontend code. At
build time, based on the environment variable
IMPALA_HIVE_MAJOR_VERSION, one of the two shims is added as a source
using the fe/pom.xml build plugin.
Additionally, in order to reduce the dependency footprint of Hive in
the frontend code, this patch also introduces a new module called
shaded-deps. This module uses the shade plugin to include only the source
files from hive-exec which are needed by the fe code. For the hive-2 build
path, no changes are made with respect to hive dependencies to minimize
the risk of destabilizing the master branch on the default build option
of using Hive-2.
The different set of dependencies are activated using maven profiles.
The activation of each profile is automatic based on the
IMPALA_HIVE_MAJOR_VERSION.
Testing:
1. Code compiles and runs against both HMS-3 and HMS-2
2. Ran full-suite of tests using the private jenkins job against HMS-2
3. Running the full tests against HMS-3 will need more work, like supporting
Tez in the mini-cluster (for dataloading) and HMS transaction support,
since HMS-3 creates transactional tables by default. This will be an
ongoing effort and test failures on Hive-3 will be fixed in additional
sub-tasks.
Notes:
1. Patch uses a custom build of Hive to be deployed in mini-cluster. This
build has the fixes for HIVE-21596. This hack will be removed when the
patches are available in official CDP Hive builds.
2. Some of the existing tests rely on the fact that the UDFs implement the
UDF interface in Hive (UDFLength, UDFHour, UDFYear). These built-in Hive
functions have been moved to the GenericUDF interface in Hive 3. Impala
currently only supports UDFExecutor. In order to have full
compatibility with all the functions in Hive 2.x we should support
GenericUDFs too. That would be taken up as a separate patch.
3. Sentry dependencies bring a lot of transitive hive dependencies. The
patch excludes such dependencies since they create problems while
building against Hive-3. Since these hive-2 dependencies are
already included when building against hive-2 this should not be a problem.
Change-Id: I45a4dadbdfe30a02f722dbd917a49bc182fc6436
Reviewed-on: http://gerrit.cloudera.org:8080/13005
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 no longer supports MR execution, so this sets up the appropriate
configuration and classpath so that HS2 can run queries using Tez.
The bulk of this patch is toolchain changes to download Tez itself. The
Tez tarball is slightly odd in that it has no top-level directory, so
the patch changes around bootstrap_toolchain a bit to support creating
its own top-level directory for a component.
The remainder of the patch is some classpath setup and hive-site changes
when Hive 3 is enabled.
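The top-level-directory handling can be sketched with Python's tarfile
module; this is a simplified illustration, not the bootstrap_toolchain.py
code:

  import os
  import tarfile

  def unpack_with_top_level_dir(tarball_path, destination_basedir, component_dir):
      """Unpack a tarball that has no top-level directory (like the Tez one)
      into a directory created for it, so it ends up looking like every
      other component."""
      target = os.path.join(destination_basedir, component_dir)
      os.makedirs(target, exist_ok=True)
      with tarfile.open(tarball_path) as tar:
          # Everything in the archive sits at the root, so extracting
          # directly into the new component directory gives it a top level.
          tar.extractall(target)
      return target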
So far I tested this manually by setting up a metastore and
impala-config with USE_CDP_HIVE=true, and then connecting to HS2 using
hive beeline -u 'jdbc:hive2://localhost:11050'
I was able to insert and query data, and was able to verify that queries
like 'select count(*)' were executing via Tez local mode.
NOTE: this patch relies on a custom build of Tez, based on a private
branch. I've submitted a PR to Tez upstream, referenced in the commits
here. Will remove this hack once the PR is accepted and makes its way
into an official build.
Change-Id: I76e47fbd1d6ff5103d81a8de430d5465dba284cd
Reviewed-on: http://gerrit.cloudera.org:8080/12931
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch bumps the CDP_BUILD_NUMBER to 1013201. This patch also
refactors the bootstrap_toolchain.py to be more generic for dealing with
CDP components, e.g. Ranger and Hive 3.
The patch also fixes some TODOs to replace the rangerPlugin.init() hack
with rangerPlugin.refreshPoliciesAndTags() API available in this Ranger
build.
Testing:
- Ran core tests
- Manually verified that no regression when starting Hive 3 with
USE_CDP_HIVE=true
Change-Id: I18c7274085be4f87ecdaf0cd29a601715f594ada
Reviewed-on: http://gerrit.cloudera.org:8080/13002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As a first step to integrate Impala with Hive 3.1.0 this patch modifies
the minicluster scripts to optionally use Hive 3.1.0 instead of
CDH Hive 2.1.1.
In order to make sure that existing setups don't break, this is
enabled via an environment variable override in bin/impala-config.sh.
When the environment variable USE_CDP_HIVE is set to true, the
bootstrap_toolchain script downloads the Hive 3.1.0 tarballs and extracts
them into the toolchain directory. These binaries are used to start the
Hive services (Hiveserver2 and metastore). The default is still CDH
Hive 2.1.1.
Also, since Hive 3.1.0 uses an upgraded metastore schema, this patch
makes use of a different database name so that it is easy to switch
from one environment which uses the Hive 2.1.1 metastore to another
which uses the Hive 3.1.0 metastore.
In order to start a minicluster which uses Hive 3.1.0 users should
follow the steps below:
1. Make sure that the minicluster, if running, is stopped
before you run the following commands.
2. Open a new terminal and run the following commands.
> export USE_CDP_HIVE=true
> source bin/impala-config.sh
> bin/bootstrap_toolchain.py
The above command downloads the Hive 3.1.0 tarballs and extracts them
in toolchain/cdp_components-${CDP_BUILD_NUMBER} directory. This is a
no-op if the CDP_BUILD_NUMBER has not changed and if the cdp_components
are already downloaded by a previous invocation of the script.
> source bin/create-test-configuration.sh -create-metastore
The above step should provide "-create-metastore" only the first time
so that a new metastore db is created and the Hive 3.1.0 schema is
initialized. For all subsequent invocations, the "-create-metastore"
argument can be skipped. We should still source this script since the
hive-site.xml of Hive 3.1.0 is different from that of Hive 2.1.1 and
needs to be regenerated.
> testdata/bin/run-all.sh
Note that the testing was performed locally by downloading the Hive 3.1
binaries into
toolchain/cdp_components-976603/apache-hive-3.1.0.6.0.99.0-9-bin. Once
the binaries are available in the S3 bucket, the bootstrap_toolchain script
should automatically do this for you.
Testing Done:
1. Made sure that the cluster comes up with Hive 3.1 when the steps
above are performed.
2. Made sure that existing scripts work as they do currently when
USE_CDP_HIVE is not set.
3. The Impala cluster comes up and connects to HMS 3.1.0 (note that Impala
still uses the Hive 2.1.1 client. Upgrading the client libraries in Impala
will be done as a separate change).
Change-Id: Icfed856c1f5429ed45fd3d9cb08a5d1bb96a9605
Reviewed-on: http://gerrit.cloudera.org:8080/12846
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch updates the build scripts to support Apache Ranger:
- Download Apache Ranger
- Setup Apache Ranger database
- Create Apache Ranger configuration files
- Start/stop Apache Ranger
Testing:
- Ran ./buildall.sh -format on a clean repository and was able to start
Ranger without any problem.
- Ran test-with-docker
Change-Id: I249cd64d74518946829e8588ed33d5ac454ffa7b
Reviewed-on: http://gerrit.cloudera.org:8080/12469
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades the version of the toolchain in order to pull in Thrift 0.11.0.
Updates the CMake build to write generated Python code using Thrift 0.11
to shell/build/thrift-11-gen/gen-py/.
The Thrift 0.11 Python deserialization code has some big performance
improvements that allow faster parsing of runtime profiles. By adding
the ability to generate the Thrift Python code using Thrift 0.11 we can
take advantage of the Python performance improvements without going
through a full Thrift upgrade from 0.9 to 0.11.
Set USE_THRIFT11_GEN_PY=true and then run bin/set-pythonpath.sh to add
the Thrift 0.11 Python generated code to the PYTHONPATH rather than the
0.9 generated code.
Testing:
- Ran core tests
Change-Id: I3432c3e29d28ec3ef6a0a22156a18910f511fed0
Reviewed-on: http://gerrit.cloudera.org:8080/12036
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows the IMPALA_KUDU_VERSION and IMPALA_KUDU_URL environment
variables to be overridden by impala-config-branch.sh.
Also adds a feature to bootstrap_toolchain.py that optionally
substitutes the CDH platform label into override values for
IMPALA_(CDH_COMPONENT)_URL, which makes it easier to override the
value of IMPALA_KUDU_URL.
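A sketch of that substitution; the placeholder token and helper below
are illustrative assumptions, not the exact bootstrap_toolchain.py
behavior:

  import os

  PLACEHOLDER = "%(platform_label)"  # illustrative token, not necessarily the real one

  def resolve_component_url(component, default_url, platform_label):
      """Return the download URL for a component, honoring an
      IMPALA_<COMPONENT>_URL override and substituting the CDH platform
      label (e.g. 'ubuntu1604') into it."""
      override = os.environ.get("IMPALA_%s_URL" % component.upper())
      if not override:
          return default_url
      return override.replace(PLACEHOLDER, platform_label)

  # Hypothetical usage:
  #   export IMPALA_KUDU_URL=https://host/kudu/kudu-%(platform_label).tar.gz
  #   resolve_component_url("kudu", default_url, "ubuntu1604")
  #     -> https://host/kudu/kudu-ubuntu1604.tar.gz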
Testing:
- Went through various combinations of a clean shell or overriding
these variables, then building and running the minicluster.
Change-Id: I36414b8772d615809463127a989e843b9d15d4a3
Reviewed-on: http://gerrit.cloudera.org:8080/11499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch transitions from pulling in Kudu (libkudu_client.so and the
minicluster tarballs) from the toolchain to instead pulling Kudu in with
the other CDH components.
For OSes where the CDH binaries are not provided but the toolchain
binaries are (only Ubuntu 14), we set USE_CDH_KUDU to false to
continue to download the toolchain binaries. We also continue
to use the toolchain binaries to build the client stub for OSes
where KUDU_IS_SUPPORTED is false.
This patch also fixes an issue in bootstrap_toolchain.py where we were
using the wrong g++ to compile the Kudu stub.
Testing:
- Verified building and running Impala works as expected for supported
combinations of KUDU_IS_SUPPORTED/USE_CDH_KUDU
Change-Id: If6e1048438b6d09a1b38c58371d6212bb6dcc06c
Reviewed-on: http://gerrit.cloudera.org:8080/11363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the toolchain bootstrap code with the toolchain
version of GDB (v7.9.1, built in the toolchain since its inception),
and adds it to the path. The goal is to provide a stable gdb version
for core dump analysis.
Change-Id: If4e094db93da4f5dab1e1b2da7f88a1dd06bc9e6
Reviewed-on: http://gerrit.cloudera.org:8080/11215
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Clang's run-clang-tidy.py script produces a lot of
output even when there are no warnings or errors.
None of this output is useful.
This patch has two parts:
1. Bump LLVM to 5.0.1-p1, which has patched run-clang-tidy.py
to make it reduce its own output when passed -quiet
(along with other enhancements).
2. Pass -quiet to run-clang-tidy.py and pipe the stderr output
to a temporary file. Display this output only if
run-clang-tidy.py hits an error, as this output is not
useful otherwise.
Testing with a known clang tidy issue shows that warnings
and errors are still in the output, and the output is
clean on a clean Impala checkout.
Change-Id: I63c46a7d57295eba38fac8ab49c7a15d2802df1d
Reviewed-on: http://gerrit.cloudera.org:8080/10615
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For IMPALA_MINICLUSTER_PROFILE=3 (Hadoop 3.x components), pin the
CDH dependencies by storing the CDH tarballs and Maven repository
in S3. This solves the issue of build coherency between the CDH
tarballs and Maven dependencies.
For IMPALA_MINICLUSTER_PROFILE=2 (Hadoop 2.x components), pin the
CDH dependencies by storing only the CDH tarballs in S3. The Maven
repository will still use https://repository.cloudera.com, so there
is still a possibility of a build coherency issue.
For each CDH dependency, the repository URL contains the build number
of the build that created those CDH dependencies.
This information can be useful for debugging issues related to CDH
dependencies.
This patch introduces CDH_DOWNLOAD_HOST and CDH_BUILD_NUMBER environment
variables that can be overridden, which can be useful for running an
integration job.
This patch also fixes dependency issues in Hadoop that transitively
depend on snapshot versions of dependencies that no longer exist, i.e.
- net.minidev:json-smart:2.3-SNAPSHOT (HADOOP-14903)
- org.glassfish:javax.el:3.0.1-b06-SNAPSHOT
The fix is to force the dependencies by using the released versions of
those dependencies.
Testing:
- Ran all core tests on IMPALA_MINICLUSTER_PROFILE=2 and
IMPALA_MINICLUSTER_PROFILE=3
Cherry-picks: not for 2.x
Change-Id: I66c0dcb8abdd0d187490a761f129cda3b3500990
Reviewed-on: http://gerrit.cloudera.org:8080/10748
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referred to afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This parallelizes downloading some Python libraries, speeding up
$IMPALA_HOME/infra/python/deps/download_requirements. I've seen this
take from 7-15 seconds before and from 2-5 seconds after.
I also checked that we always have at least Python 2.6 when
building Impala, so I was able to remove the try/except
handling in bootstrap_toolchain.
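The parallelization is the standard thread-pool pattern for I/O-bound
work; a minimal sketch with an illustrative worker count:

  from multiprocessing.pool import ThreadPool
  import urllib.request

  def download_all(url_dest_pairs, num_workers=8):
      """Download several files concurrently; the work is I/O bound, so a
      thread pool is enough to overlap the transfers."""
      def fetch(pair):
          url, dest = pair
          urllib.request.urlretrieve(url, dest)
          return dest
      pool = ThreadPool(processes=num_workers)
      try:
          return pool.map(fetch, url_dest_pairs)
      finally:
          pool.close()
          pool.join()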
Change-Id: I7cbf622adb7d037f1a53c519402dcd8ae3c0fe30
Reviewed-on: http://gerrit.cloudera.org:8080/10234
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the ORC reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing into ORC
tables is not supported yet either.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust against corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>