This patch adds the ability to create a new log file for each spawn of
the Sentry service. This enables better troubleshooting for the
custom cluster tests that restart the Sentry service.
Testing:
- Ran all custom cluster tests.
Change-Id: I6e538af7fd6e6ea21dc3f4442bdebf3b31558516
Reviewed-on: http://gerrit.cloudera.org:8080/11624
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the SHOW GRANT USER statement to show all privileges
granted to a user, either directly via object ownership or granted
through a role via a group the user belongs to. The output of SHOW
GRANT USER has two additional columns, principal type and principal
name, so the user can tell where each privilege comes from.
Truncated sample output showing the two columns that differ from
SHOW GRANT ROLE:
+----------------+----------------+--------+----------+-...
| principal_type | principal_name | scope  | database | ...
+----------------+----------------+--------+----------+-...
| USER           | foo            | table  | foo_db   | ...
| ROLE           | foo_role       | server |          | ...
+----------------+----------------+--------+----------+-...
Testing:
- Created a new custom cluster test with custom group mapping.
- Ran FE and custom cluster tests.
Change-Id: Ie9f6c88f5569e1c414ceb8a86e7b013eaa3ecde1
Reviewed-on: http://gerrit.cloudera.org:8080/11531
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The problem was caused by an update in Hive that changed notifications.
HIVE-15180 was added but was incomplete, and resulted in the breakage.
HIVE-17747 fixed the issue by properly creating the messages.
Change-Id: I4b9276c36bf96afccd7b8ff48803a30b47062c3d
Reviewed-on: http://gerrit.cloudera.org:8080/11466
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds calls to automatically create or remove owner
privileges in the catalog based on the statement. This is similar to
the existing pattern where after privileges are granted in Sentry,
they are created directly in the catalog instead of being pulled from
Sentry.
When object ownership is enabled:
CREATE DATABASE will grant the user OWNER privileges to that database.
ALTER DATABASE SET OWNER will transfer the OWNER privileges to the
new owner.
DROP DATABASE will revoke the OWNER privileges from the owner.
This will apply to DATABASE, TABLE, and VIEW.
Example:
If ownership is enabled, when a table is created, the creator is the
owner, and Sentry will create owner privileges for the created table so
the user can continue working with it without waiting for Sentry
refresh. Inserts will be available immediately.
Testing:
- Created new custom cluster tests for object ownership
Change-Id: I1e09332e007ed5aa6a0840683c879a8295c3d2b0
Reviewed-on: http://gerrit.cloudera.org:8080/11314
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
gcovr is a python library that uses gcov to generate
code coverage reports. This adds gcovr to the python
dependencies and adds bin/impala-gcovr to provide
easy access to gcovr's command line. gcovr 3.4
supports python 2.6+.
This also adds bin/coverage_helper.sh to provide a
simplified interface to generate reports and zero
coverage counters.
Code coverage data is written out when a program
exits, so it is important to avoid hard kills
to shut down the impalads when generating coverage.
This modifies testdata/bin/kill-all.sh to call
start-impala-cluster.py --kill when shutting down
the minicluster to try to avoid doing a hard kill.
It will still do a hard kill if impala is still
running after the softer kill.
Change-Id: I5b2e0b794c64f9343ec976de7a3f235e54d2badd
Reviewed-on: http://gerrit.cloudera.org:8080/10791
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch removes the use of IMPALA_MINICLUSTER_PROFILE. The code that
uses IMPALA_MINICLUSTER_PROFILE=2 is removed, and the code from
IMPALA_MINICLUSTER_PROFILE=3 becomes the default. To limit the number of
code changes in this patch, the shims themselves are not changed; the
shims for IMPALA_MINICLUSTER_PROFILE=3 simply become the default
implementation.
Testing:
- Ran core and exhaustive tests
Change-Id: Iba4a81165b3d2012dc04d4115454372c41e39f08
Reviewed-on: http://gerrit.cloudera.org:8080/10940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change ensures that the planner computes Parquet conjuncts
only for scans containing Parquet files. It also handles the
PARQUET_DICTIONARY_FILTERING and PARQUET_READ_STATISTICS query
options in the planner.
Testing was carried out independently on parquet and non-parquet
scans:
1. Parquet scans were tested via the existing parquet-filtering
planner test. Additionally, a new test
[parquet-filtering-disabled] was added to ensure that the
explain plan generated skips parquet predicates based on the
query options.
2. Non-parquet scans were tested manually to ensure that the
functions to compute parquet conjuncts were not invoked.
Additional test cases were added to the parquet-filtering
planner test to scan non parquet tables and ensure that the
plans do not contain conjuncts based on parquet statistics.
3. A parquet partition was added to the alltypesmixedformat
table in the functional database. Planner tests were added
to ensure that Parquet conjuncts are constructed only when
the Parquet partition is included in the query.
Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7
Reviewed-on: http://gerrit.cloudera.org:8080/10704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch we add a query option, ALLOW_ERASURE_CODED_FILES, that
allows us to enable or disable support for erasure-coded files. Even
though Impala should already be able to handle HDFS erasure-coded files,
this feature hasn't been tested thoroughly yet. Impala also lacks
metrics, observability, and DDL commands related to erasure coding. This
is a query option rather than a startup flag because we want to make it
possible for advanced users to enable the feature.
We may also need a follow-on patch to disable the write path with
this flag.
Cherry-picks: not for 2.x
Change-Id: Icd3b1754541262467a6e67068b0b447882a40fb3
Reviewed-on: http://gerrit.cloudera.org:8080/10646
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As part of IMPALA-3307, we copy a time-zone database
into HDFS. This command is failing on local filesystem
due to a missing FILESYSTEM_PREFIX.
This adds FILESYSTEM_PREFIX for this command.
Change-Id: I972192f22943baef6043a4c9db54d5d48089ea9d
Reviewed-on: http://gerrit.cloudera.org:8080/10803
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referred afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip (a sketch of the check
follows below).
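A minimal Python sketch of the Zip Slip check (illustrative only; the
actual ZipUtil is C++ and its interface differs):

    import os
    import zipfile

    def extract_zip_safely(archive_path, dest_dir):
        # Illustrative sketch: reject any entry whose resolved path would
        # escape dest_dir (the "Zip Slip" attack), then extract.
        dest_dir = os.path.realpath(dest_dir)
        with zipfile.ZipFile(archive_path) as archive:
            for entry in archive.namelist():
                target = os.path.realpath(os.path.join(dest_dir, entry))
                if not (target + os.sep).startswith(dest_dir + os.sep):
                    raise ValueError("Zip Slip entry rejected: %s" % entry)
            archive.extractall(dest_dir)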
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The minicluster args for dataload changed to
a bash array in IMPALA-7119, and this requires
a special syntax to dereference and get the whole
array.
This fixes the invocation to use the right
syntax ("${BASH_VAR[@]}" rather than "$BASH_VAR").
Change-Id: Ie9a24c0e9fa34e43697b16b48cf219f47f30c0cc
Reviewed-on: http://gerrit.cloudera.org:8080/10782
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
After loading data, we wait for HDFS to replicate
all of the blocks appropriately. If this takes too long,
we restart HDFS. However, HBase can fail if HDFS is
restarted and HBase is unable to write its logs.
In general, there is no real reason to keep HBase
and the other minicluster components running while
restarting HDFS.
This changes the HDFS health check to restart the
whole minicluster and Impala rather than just HDFS.
Testing:
- Tested with a modified version that always does
the restart in the HDFS health check and verified
that the tests pass
Change-Id: I58ffe301708c78c26ee61aa754a06f46c224c6e2
Reviewed-on: http://gerrit.cloudera.org:8080/10665
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some frontend PlannerTests rely on HBase tables being
arranged in a deterministic way. Specifically, the
HBase tables need to be split with specific region
boundaries and those regions need to be assigned to
specific HBase region servers.
Currently, the tables are created without splits and
testdata/bin/split-hbase.sh runs Java code in
HBaseTestDataRegionAssignment to split and assign
the tables. This runs during dataload via
testdata/bin/create-load-data.sh and during tests
with bin/run-all-tests.sh. There are problems with
both parts of this process. The table splitting is
flaky. Since significant time can pass between the
assignments and the tests, rebalancing means the
assignments are not always stable.
This changes the process so that the HBase tables are
created with the splits already specified via the
HBase shell. The splits remain stable over time.
PlannerTestBase runs the assignment code in
HBaseTestDataRegionAssignment at the start of
the PlannerTests. This makes the assignments
deterministic. No other test depends on the
exact assignments, so this does not regress anything.
Testing:
- Local testing
- Ran gerrit-verify-dryrun-external
- Verified minicluster profile 2 compiles
Change-Id: I3d639128a856254a6ccb93d6750f531974b5f897
Reviewed-on: http://gerrit.cloudera.org:8080/10447
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/
The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.
I updated Pooja's python tests as well.
Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HBase splitting can fail due to changes in HBase code. It
is useful to still run tests even if HBase splitting fails.
As it is today, buildall.sh will abort if
create-load-data.sh's invocation of split-hbase.sh fails.
No tests run, even though the HBase splitting affects only
a small portion of our tests.
This changes create-load-data.sh to keep going with
dataload if HBase splitting fails. It outputs the same
errors to the log as it would before this change.
It adds a message to explain that it is ignoring
the failure and there may be related test failures.
Change-Id: I7497fe8c9f1655a34b2743462d8b7248eb94554e
Reviewed-on: http://gerrit.cloudera.org:8080/10437
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Erasure coding data loading is flaky in two ways:
1. HBase sometimes doesn't work because of HBASE-19369
2. Nested data loading sometimes fails because the HDFS namenode cannot
find enough good datanodes.
For problem 1, this patch enables erasure coding only on /test-warehouse
directory. For problem 2, this patch sets
dfs.namenode.redundancy.considerLoad to false, preventing namenode from
excluding heavily-loaded datanodes.
Change-Id: I219106cd3ec7ffab7a834700f2a722b165e5f66c
Reviewed-on: http://gerrit.cloudera.org:8080/10362
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There is a Hive bug in Hive 1.1.0 that can result
in a NullPointerException when doing parallel Hive
operations (see IMPALA-6532). Since dataload goes
parallel on Hive loads starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is impacting
builds on the 2.x branch.
This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.
IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has a fix for this, so this continues
to use parallel dataload for that case.
Parallelism can be reenabled when Hive 1.1.0 gets the
fix from Hive 2.1.1.
Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch we add the "ERASURE_CODING" environment variable. If we
enable it, a cluster with 5 data nodes will be created during data
loading and HDFS will be started with erasure coding enabled.
Testing:
I ran the core build, and verified that erasure coding gets enabled in
HDFS. Many of our EE tests failed however.
Cherry-picks: not for 2.x
Change-Id: I397aed491354be21b0a8441ca671232dca25146c
Reviewed-on: http://gerrit.cloudera.org:8080/10275
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Running Hadoop 3 with Java 7 can result in some obscure error messages.
This change adds a warning to impala-config.sh when using Hadoop 3 with
Java 7.
Your development environment is configured for Hadoop 3 and Java 7.
Hadoop 3 requires at least Java 8. Your JAVA binary currently points
to /usr/lib/jvm/java-7-oracle-amd64/bin/java and reports the
following version:
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
It also catches failure of the minicluster start and prints an
additional warning when running with Hadoop 3 and Java 7.
Cherry-picks: not for 2.x
Change-Id: I4d8b505cf045eeb562d16ce4ce09da0712dc03eb
Reviewed-on: http://gerrit.cloudera.org:8080/10244
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This has two related changes.
IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.
For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoids the need to
estimate the ideal reservation in the planner.
We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.
IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that < 8MB columns are generally common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
can increase the reservation so we can do "efficient" 8MB I/Os for
large columns.
* The ideal reservation is not available - query performance is affected
because we can't overlap I/O and compute as much and may do smaller
(probably 4MB) I/Os. However, we should avoid pathological behaviour
like tiny I/Os.
When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the memory required can be reduced below that when the
heuristics tell us with high confidence that the column data for most
or all files is smaller than 4MB.
The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).
Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.
Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Looped the test for a while to flush
out flakiness.
Added targeted planner and query tests for reservation calculations and
increases.
Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.
Bug fixes:
* Embarrassingly, I was still skipping thrift-server-test in the backend
tests. This was a mistake in handling feedback from my last review.
* I made the timeline a little bit taller to clip less.
Adding workloads:
* I added the RAT licensing check.
* I added exhaustive runs. This led me to model the suites a little
bit more in Python, with a class representing a suite with a
bunch of data about the suite. It's not perfect and still
coupled with the entrypoint.sh shell script, but it feels
workable. As part of adding exhaustive tests, I had
to re-work the timeout handling, since now different
suites meaningfully have different timeouts.
Speed ups:
* To speed up test runs, I added a mechanism to split py.test suites into
multiple shards with a py.test argument (a conftest.py sketch follows
after this list). This involved a little bit of work in conftest.py, and
exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.
Furthermore, I moved a bit more logic about managing the
list of suites into Python.
* Doing the full build with "-notests" and only building
the backend tests in the relevant target that needs them. This speeds
up "docker commit" significantly by removing about 20GB from the
container. I had to indicate that expr-codegen-test depends on
expr-codegen-test-ir, which was missing.
* I sped up copying the Kudu data: previously I did
both a move and a copy; now I'm doing a move followed by a move. One
of the moves is cross-filesystem so is slow, but this does half the
amount of copying.
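A conftest.py sketch of the sharding idea (the option names here are
hypothetical, not necessarily the ones the patch added):

    # conftest.py (sketch)
    def pytest_addoption(parser):
        parser.addoption("--shard", type=int, default=0,
                         help="index of this shard")
        parser.addoption("--num-shards", type=int, default=1,
                         help="total number of shards")

    def pytest_collection_modifyitems(config, items):
        num_shards = config.getoption("--num-shards")
        if num_shards <= 1:
            return
        shard = config.getoption("--shard")
        # Keep only the tests whose collection index falls into this shard.
        items[:] = [item for i, item in enumerate(items)
                    if i % num_shards == shard]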
Memory usage:
* I tweaked the memlimit_gb settings to have a higher default. I've been
fighting empirically to have the tests run well on c4.8xlarge and
m4.10xlarge.
The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:
* The non-first Impalad does very little JVM work, so having
an Xmx keeps it small, even in the parallel tests.
* Datanodes do work, but they essentially were never garbage
collecting, because JVM defaults let them use up to 1/4th
the machine memory. (I observed this based on RSS at the
end of the run; nothing fancier.) Adding Xms/Xmx settings
helped.
* Similarly, I piped the settings through to HBase.
A few daemons still run without resource limitations, but they don't
seem to be a problem.
Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands will follow symbolic links and can copy
recursively, this can allow for further consolidation
by creating the full directory structure and
copying it in a single HDFS call.
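For illustration, a Python sketch of the consolidation idea (the helper
and paths are hypothetical; the actual changes live in the dataload
shell scripts):

    import subprocess

    def put_all_at_once(local_paths, hdfs_dest_dir):
        # One 'hdfs dfs -put' call (and one JVM startup) for the whole
        # batch instead of one call per path; put accepts multiple
        # local sources.
        subprocess.check_call(
            ["hdfs", "dfs", "-put"] + list(local_paths) + [hdfs_dest_dir])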
This does several of these optimizations throughout
the dataload codepath. It saves a few seconds here
and there:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25
Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
testdata/bin/create-load-data.sh does bin/load-data.py for
functional/exhaustive, tpch/core, and tpcds/core in a
first phase, then it loads functional and tpch for Kudu
in a second phase. For a full dataload, this second phase
is not necessary. functional/exhaustive and tpch/core
already include Kudu.
This avoids the second phase when doing a full dataload.
The second phase is still necessary when loading from
a snapshot, and this does not change that behavior.
This saves a couple minutes off of full dataload.
Change-Id: Ic023d230f99126ed37795106c38faae5f0cb608e
Reviewed-on: http://gerrit.cloudera.org:8080/10128
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch writes the output of Sentry to
$IMPALA_CLUSTER_LOGS_DIR/sentry/sentry.out to follow the
same convention as other service output logs.
Testing:
- Injected some failure in run-sentry-service.sh script to see if the
error message was captured
Change-Id: I76627bb5b986a548ec6e4f12b555bd6fc8c4dab8
Reviewed-on: http://gerrit.cloudera.org:8080/10064
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to go parallel on these
separate Hive SQL files. For correctness, the text
version of all tables must be loaded before any
of the other file formats.
load-data.py runs DDLs to create the tables in Impala
and goes parallel. Currently, there are some minor
dependencies so that text tables must be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove these dependencies. Now, the DDLs for the
text tables can run in parallel to the other file formats.
To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.
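A rough Python sketch of the fixed-size pool approach (illustrative
only; the real load-data.py drives both Impala and Hive clients and
differs in detail):

    import subprocess
    from multiprocessing import Pool

    def run_sql_file(sql_file):
        # Run one generated SQL file, logging to its own file;
        # impala-shell here stands in for either client.
        log_path = sql_file + ".log"
        with open(log_path, "w") as log:
            returncode = subprocess.call(["impala-shell", "-f", sql_file],
                                         stdout=log, stderr=log)
        return sql_file, log_path, returncode

    def load_in_parallel(sql_files, pool_size=4):
        # A single fixed-size pool bounds the total parallelism instead
        # of spawning one thread per SQL file.
        with Pool(pool_size) as pool:
            for sql_file, log_path, rc in pool.imap_unordered(run_sql_file,
                                                              sql_files):
                if rc != 0:
                    print("Failed: %s (see %s)" % (sql_file, log_path))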
This also modifies the locations that do invalidate to
use refresh where possible and eliminate global
invalidates.
For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.
This saves about 10-15 minutes on dataload (including
for GVO).
Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing to ORC
tables is not yet supported.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust for corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows running the tests that make up the "core" suite in about 2 hours.
By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend
tends to run in about 3.5 hours.
This commit:
* Adds "echo" statements in a few places, to facilitate timing.
* Adds --skip-parallel/--skip-serial flags to run-tests.py,
and exposes them in run-all-tests.sh.
* Marks TestRuntimeFilters as a serial test. This test runs
queries that need > 1GB of memory, and, combined with
other tests running in parallel, can kill the parallel test
suite.
* Adds "test-with-docker.py", which runs a full build, data load,
and executes tests inside of Docker containers, generating
a timeline at the end. In short, one container is used
to do the build and data load, and then this container is
re-used to run various tests in parallel. All logs are
left on the host system.
Besides the obvious win of getting test results more quickly, this
commit serves as an example of how to get various bits of Impala
development working inside of Docker containers. For example, Kudu
relies on atomic rename of directories, which isn't available in most
Docker filesystems, and entrypoint.sh works around it.
In addition, the timeline generated by the build suggests where further
optimizations can be made. Most obviously, dataload eats up a precious
~30-50 minutes, on a largely idle machine.
This work is significantly CPU and memory hungry. It was developed on a
32-core, 120GB RAM Google Compute Engine machine. I've worked out
parallelism configurations such that it runs nicely on 60GB of RAM
(c4.8xlarge) and over 100GB (eg., m4.10xlarge, which has 160GB). There is
some simple logic to guess at some knobs, and there are knobs. By and
large, EC2 and GCE price machines linearly, so, if CPU usage can be kept
up, it's not wasteful to run on bigger machines.
Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Reviewed-on: http://gerrit.cloudera.org:8080/9085
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit slightly loosens the coupling between IMPALA_HIVE_VERSION
and "hive.version" in the Maven sense.
Cherry-picks: not for 2.x
Change-Id: Ifbe6f5208b4ad0ffc9cbfe4e93d712ce698beb23
Reviewed-on: http://gerrit.cloudera.org:8080/9925
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
When an impalad is in executor-only mode, it receives no
catalog updates. As a result, lib-cache entries are never
refreshed. A consequence is that UDF queries can return
incorrect results or may not run due to resolution issues.
Both cases are caused by the executor using a stale copy
of the lib file. For incorrect results, an old version of
the method may be used. Resolution issues can come up if
a method is added to a lib file.
The solution in this change is to capture the coordinator's
view of the lib file's last modified time when planning.
This last modified time is then shipped with the plan to
executors. Executors must then use both the lib file path
and the last modified time as a key for the lib-cache.
If the coordinator's last modified time is more recent than
the executor's lib-cache entry, then the entry is refreshed.
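A small Python sketch of the refresh rule (the real lib-cache is C++;
the names and structure here are only illustrative):

    class LibCache(object):
        def __init__(self):
            self._entries = {}  # lib file path -> (last_modified, local copy)

        def get(self, path, coord_last_modified, fetch_fn):
            # Reuse the cached copy only if it is at least as new as the
            # coordinator's view of the lib file shipped with the plan.
            cached = self._entries.get(path)
            if cached is not None and cached[0] >= coord_last_modified:
                return cached[1]
            local_copy = fetch_fn(path)  # re-fetch the lib file
            self._entries[path] = (coord_last_modified, local_copy)
            return local_copy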
Brief discussion of alternatives:
- lib-cache always checks last modified time
  + easy/local change to lib-cache
  - adds a filesystem lookup on every access; rejected for this reason
- keep the last modified time in the catalog
  - the bound on staleness is too loose. Consider the case where
    fns f1, f2, f3 are created with last modified times of
    t1, t2, t3, and the fn's last modified time is treated as a
    low-watermark (if the cache entry has a more recent time, use it).
    Such a scheme would allow the version at t2 to persist, and an old
    fn may keep the state from converging to the latest. This could
    end up with strange cases where different versions of the lib are
    used across executors for a single query.
In contrast, the change in this patch relies on the statestore to
push versions forward at all coordinators, so it will push all
versions at all caches forward as well.
Testing:
- added an e2e custom cluster test
Change-Id: Icf740ea8c6a47e671427d30b4d139cb8507b7ff6
Reviewed-on: http://gerrit.cloudera.org:8080/9697
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).
We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support for profile 2 will be removed soon in the Impala 3.x line.
The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.
There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. (This follows the
pattern established by IMPALA-5184 of building the fe against both
Hive 1 & 2 APIs, but, to avoid a proliferation of directories, I've
moved the Hive files into the same tree and consolidated the Hive
changes into the same directory structure.)
For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.
For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java
The Sentry work is based on a change by Zach Amsden.
In addition, we recently added an explicit "refresh" permission. In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.
For Parquet, the difference is even more mechanical. The package names
went from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.
In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the divergence in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.
The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.
To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.
We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences. Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging is changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.
parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.
For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.
For AuthorizationTest, different Hive versions show slightly different
things for extended output.
To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)
CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.
rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)
Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.
Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
In wait-hdfs-replication, the frequent and eager restart might slow the
HDFS replication down. HDFS should be restarted only if no progress is
made in a certain amount of time, and we should wait longer before
failing the data loading.
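A Python sketch of the intended policy (the actual logic is in the
dataload shell scripts; the fsck summary wording parsed here is an
assumption):

    import re
    import subprocess
    import time

    def under_replicated_blocks():
        # Assumed wording of the 'hdfs fsck /' summary line.
        out = subprocess.check_output(["hdfs", "fsck", "/"]).decode()
        match = re.search(r"Under-replicated blocks:\s+(\d+)", out)
        return int(match.group(1)) if match else 0

    def wait_for_replication(max_stall_secs=600, poll_secs=30):
        # Only give up (and let the caller restart HDFS) when the count
        # has made no progress for max_stall_secs.
        best = under_replicated_blocks()
        last_progress = time.time()
        while best > 0:
            time.sleep(poll_secs)
            current = under_replicated_blocks()
            if current < best:
                best, last_progress = current, time.time()
            elif time.time() - last_progress > max_stall_secs:
                return False
        return True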
Testing: It's tested with a fake HDFS fsck script.
Change-Id: Ib059480254643dc032731b4b3c55204a93b61e77
Reviewed-on: http://gerrit.cloudera.org:8080/9698
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The retries in split-hbase.sh don't work in the common case,
because $MINIKDC_PRINC_HIVE is not set in non-kerberized (common)
environments. The regular data load scripts (create-load-data.sh)
have code to manage that, but split-hbase.sh blindly forges ahead,
leading to errors like:
/home/impdev/Impala/testdata/bin/split-hbase.sh: line 49: MINIKDC_PRINC_HIVE: unbound variable
Error in /home/impdev/Impala/testdata/bin/create-load-data.sh at line 48: LOAD_DATA_ARGS=""
Since this hasn't been working, I opted to remove it entirely, as a failure on
the line where HBase splitting actually failed would be significantly more
useful than the error here. A search of mailing lists suggested that I was at
least the second person to have run into this. (In my case, I did break HBase
splitting, but it took me a second to identify the error, since the log was
spammed with unrelated information relating to the cluster restart.)
Testing: core tests.
Change-Id: I715891c9e744f21002330c3ae3ebc14095d94ffd
Reviewed-on: http://gerrit.cloudera.org:8080/9588
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
HDFS sometimes fails to fully replicate all the blocks in 30 seconds
and no progress is made. This patch tries to restart HDFS several times
before aborting the data loading.
Change-Id: Iefd4c2fc6c287f054e385de52bdc42b0bdbd7915
Reviewed-on: http://gerrit.cloudera.org:8080/9469
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
When loading from an up-to-date snapshot, dataload will
load all of the metadata and load data into HDFS. Then,
it will skip load-data.py for functional/exhaustive,
tpch/core, and tpcds/core. It will invoke a special
round of load-data.py calls to populate Kudu tables,
and it always runs these with a force reload.
However, when loading from an old snapshot, dataload will
still load all of the metadata and load the data into
HDFS, but then it will still invoke load-data.py for
functional/exhaustive, tpch/core, and tpcds/core.
These invocations mostly do DDLs with very few load
statements. However, these invocations are a problem
for Kudu. The metadata of Impala tables referencing
Kudu entities has been imported along with all the other
metadata, but the Kudu entities have not been created, as
they are separate from HDFS. This means that Kudu tables
are not really valid in this circumstance.
Since Kudu has been added to the list of data formats
for tpch/core (see IMPALA-6475), load-data.py with
tpch/core will attempt to insert into these invalid
Kudu tables.
To avoid this, always force reload any Kudu tables.
generate-schema-statements.py will always generate a
drop table statement before any create of a Kudu table.
This guarantees that the create will also create the
corresponding Kudu entity.
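Sketched in Python, the generated DDL for a Kudu table then looks like
this (hypothetical helper; generate-schema-statements.py is structured
differently):

    def kudu_table_ddl(table_name, create_stmt):
        # Dropping first guarantees the CREATE also recreates the
        # underlying Kudu entity, even when Impala metadata came from
        # a snapshot.
        return "DROP TABLE IF EXISTS %s;\n%s" % (table_name, create_stmt)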
Change-Id: I2d07f3513c543e2590f2f62b96b37472316868ee
Reviewed-on: http://gerrit.cloudera.org:8080/9445
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload currently executes bin/load-data.py for TPC-H,
TPC-DS, and functional-query concurrently. One of the final
steps for bin/load-data.py is to run a global "invalidate
metadata". Global "invalidate metadata" commands are known
to cause problems on concurrent systems. See IMPALA-5087.
For dataload, if TPC-H executes "invalidate metadata" while
TPC-DS is still creating tables and adding partitions,
the TPC-DS executor might erroneously believe that a table
does not exist.
This changes dataload to invalidate metadata at an
individual table level rather than globally. This
prevents the concurrency issue.
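A minimal Python sketch of the per-table approach (illustrative; the
table names and impala-shell invocation are stand-ins for what
load-data.py actually does):

    import subprocess

    def invalidate_tables(tables):
        # Refresh catalog state one table at a time; a global
        # INVALIDATE METADATA is unsafe while other workloads still load.
        for table in tables:
            subprocess.check_call(
                ["impala-shell", "-q", "INVALIDATE METADATA %s" % table])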
This also changes the names of some of the intermediate
SQL files generated by generate-schema-statements.py
and consumed by load-data.py to make them less confusing.
Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
Reviewed-on: http://gerrit.cloudera.org:8080/9009
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under replicated. If impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Resubmitted with filesystem type check.
Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8
Reviewed-on: http://gerrit.cloudera.org:8080/8916
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using fsck breaks non-HDFS builds: local, S3, and Isilon.
This reverts commit 5a7c10ec3d.
Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc
Reviewed-on: http://gerrit.cloudera.org:8080/8880
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under replicated. If impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322
Reviewed-on: http://gerrit.cloudera.org:8080/8846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The two Kudu loads and Hive UDFs can all run in parallel. This
should shave about 4 minutes off of the data load. (Current
timings are 3.5, 4, and 0.6 minutes, see below.)
I've run dataload with this change many times.
Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)...
Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec)
Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)...
Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec)
Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)...
Loading Hive UDFs OK (Took: 0 min 41 sec)
Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407
Reviewed-on: http://gerrit.cloudera.org:8080/8822
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
I re-created the original patch for IMPALA-6068, but performed only
what I believe to be the limited, safe transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.
Any place that directly uploads via hadoop or hdfs commands
was left alone, as changing it cannot be proven to be correct.
Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load. I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
This reverts commit e4f585240a.
Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes, though, the data load script is run against a remote cluster,
and in those cases the data load process is now broken.
Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
With this commit, $IMPALA_MAVEN_OPTIONS is used by bin/mvn-quiet.sh
to configure Maven slightly. The default is no extra options.
This is handy for giving Maven a settings file with the "-s" flag, to
control, for example, repositories and their mirrors. In fact, I
considered exposing IMPALA_MAVEN_SETTINGS_FILE explicitly, but decided
that the generic option would be as good.
It's useful to customize how Maven works, especially
to provide a settings file with repository mirrors.
Change-Id: I2c62185476fd2388c7cda8884276b79a77370127
Reviewed-on: http://gerrit.cloudera.org:8080/8496
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This is a revert of a revert, re-enabling parallel data load. It avoids
the race condition by explicitly configuring the temporary directory in
question in load-data.py.
When the parallel data load change went in, we discovered
a race with a signature of:
java.io.FileNotFoundException: File
/tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist
The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp. The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).
I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.
Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
We may be seeing a race with errors like "java.io.FileNotFoundException:
File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist".
This reverts commit e020c37106.
Change-Id: I46da93f4315a5a4bdaa96fa464cb51922bd6c419
Reviewed-on: http://gerrit.cloudera.org:8080/8386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.
Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.
This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.
This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.
Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables, which now print the appropriate
LOAD DATA commands as a result of text_delims_table.py.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the DEPENDENT_LOAD_HIVE section.
Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)
To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.
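The backgrounding idea, sketched in Python for illustration (run-step.sh
implements it in bash; the helper names here are made up):

    import subprocess

    def start_in_background(name, cmd, log_path):
        # Launch one workload's load in the background, logging to its
        # own file.
        log = open(log_path, "w")
        proc = subprocess.Popen(cmd, stdout=log, stderr=log)
        print("Started %s in background; pid %d." % (name, proc.pid))
        return proc

    def wait_for_all(procs):
        # Block until every backgrounded load finishes; fail if any failed.
        failed = [p for p in procs if p.wait() != 0]
        if failed:
            raise RuntimeError("%d background task(s) failed" % len(failed))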
I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.
The resulting log output is currently like so (but without the
timestamps):
15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format
Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databases" and serialized "show tables" commands.
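For illustration, a minimal Python sketch of the thread-pool approach
(not the actual compute-table-stats.sh code; the impala-shell invocation
is a stand-in):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def compute_stats(table):
        # Issue one COMPUTE STATS statement through impala-shell.
        return subprocess.call(
            ["impala-shell", "-q", "COMPUTE STATS %s" % table])

    def compute_stats_parallel(tables, num_workers=4):
        # Keep several COMPUTE STATS statements in flight at once instead
        # of issuing them serially.
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            return list(executor.map(compute_stats, tables))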
This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:
old, cold: 7m44s
new, cold: 1m42s
old, warm: 1m23s
new, warm: 48s
The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.
Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Reviewed-on: http://gerrit.cloudera.org:8080/8354
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins