impala

mirror of https://github.com/apache/impala.git synced 2026-01-10 00:00:16 -05:00

Author	SHA1	Message	Date
Skye Wanderman-Milne	7767d300a3	IMPALA-3311: fix string data coming out of aggs in subplans The problem: varlen data (e.g. strings) produced by aggregations is freed by FreeLocalAllocations() after passing up the output batch. This works for streaming operators or blocking operators that copy their input, but results in memory corruption when the output reaches non-copying blocking operators, e.g. SubplanNode and NestedLoopJoinNode. The fix: this patch makes the PartitionedAggregationNode copy out produced string data if the node is in a subplan. Otherwise it calls MarkNeedsToReturn() on the output batch. Marking the batch would work in the subplan case as well, but would likely be less efficient since it would result in many small batches coming out of the subplan. The patch includes a test case. However, this test only exposes the problem with an ASAN build and the --disable_mem_pools flag, which we don't currently have automated testing for. Change-Id: Iada891504c261ba54f4eb8c9d7e4e5223668d7b9 Reviewed-on: http://gerrit.cloudera.org:8080/2929 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 23:06:36 -07:00
Henry Robinson	df1412c962	IMPALA-3480: Add query options for min/max filter sizes This patch adds two query options for runtime filters: RUNTIME_FILTER_MAX_SIZE RUNTIME_FILTER_MIN_SIZE These options define the minimum and maximum filter sizes for a filter, no matter what the estimates produced by the planner are. Filter sizes are rounded up to the nearest power of two. Change-Id: I5c13c200a0f1855f38a5da50ca34a737e741868b Reviewed-on: http://gerrit.cloudera.org:8080/2966 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-05-12 23:06:35 -07:00
Bharath Vissapragada	3092c96619	IMPALA-2660: Respect auth_to_local configs from hdfs configs This patch implements a new feature to read the auth_to_local configs from hdfs configuration files, using the parameter hadoop.security.auth_to_local. This is done by modifying the User#getShortName() method to use its hdfs equivalent. This patch includes an end-to-end authorization test using sentry where we add a specific auth_to_local setting for a certain user and test if the sentry authorization passes for this user after applying these rules. Given we don't have tests that run on a kerberized mini-cluster, this patch adds a hack to load this configuration even during non-kerberized test runs. However, this feature is disabled by default to preserve the existing behavior. To enable it, 1. Use kerberos as authentication mechanism (by setting --principal) and 2. Add "--load_auth_to_local_rules=true" to the cluster startup args Change-Id: I76485b83c14ba26f6fce66e5f83e8014667829e0 Reviewed-on: http://gerrit.cloudera.org:8080/2800 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:18:01 -07:00
Sailesh Mukil	27815818b9	IMPALA-3452: S3: Disable Impala staging for INSERTs via flag for speedup INSERTs on S3 are slower because of double buffering where we buffer once locally and once in a staging directory in S3 before moving the file(s) to the final location. Also, moving the file from the staging directory to the final location in HDFS is a quick rename which is only a metadata operation. However, on S3, renames are not supported, thus becoming a full file copy instead of just a metadata rename operation. This patch introduces a boolean query option "s3_skip_insert_staging" which avoids the staging step on S3 and allows the sinks to write to the final location directly. This trades consistency for performance. If one or more nodes fail during the query, then we will end up with inconsistent results in the final location. P.S: This option is disabled for INSERT OVERWRITE queries as that would require cleaning the destination directory before moving the final files there. However, the coordinator is responsible for the cleaning which takes place only after the table sinks have moved the files to the final location. Thus, INSERT OVERWRITE queries must still have their files moved to a staging location by the table sinks. Performance gains: - For non-partitioned tables, the INSERT queries run 4-4.5x faster on S3. (Tested on a 63GB INSERT to a table) - For heavily partitioned tables, there is considerable improvement in the order of 4-5 minutes on queries that take ~27 minutes but queries are still slow because of IMPALA-3482 where the catalog takes too long to update all the metadata. (Tested with a query that creates 2.4K partitions in a table totalling ~19GB). Change-Id: Iff9620d41ba0d5fb1aa0c9f4abb48866fc2b0698 Reviewed-on: http://gerrit.cloudera.org:8080/2905 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:18:00 -07:00
Alex Behm	12097a0707	IMPALA-3491: Use unique_database fixture in test_hidden_files.py. Testing: Tested the changes locally by running them in a loop 10 times. Also did a private core/hdfs run. Change-Id: I37e1528c02e598f3fb2d673b6559d55a34bf79b4 Reviewed-on: http://gerrit.cloudera.org:8080/3002 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:59 -07:00
Dimitris Tsirogiannis	5cae398a48	IMPALA-3133: Wrong privileges after a REVOKE ALL ON SERVER statement This commit fixes an issue where a GRANT ALL ON SERVER to role_name statement followed by a REVOKE ALL ON SERVER from role_name statement would not revoke all privileges from role_name. The problem was triggered by a specific combination of Sentry client API calls used in Impala during grant/revoke statements at server scope. In particular, during GRANT, Impala was using an API call that didn't explicitly specify the privilege action (Sentry uses '*' if no action is specified). In contrast, the corresponding REVOKE call was explicitly specifying the privilege action to be 'ALL'. Sentry doesn't seem to handle this case correctly, thereby failing to remove all the privileges after a REVOKE ALL ON SERVER call. The fix from the Impala side, that results in the correct behavior, is to always specify the privilege action by using the appropriate API calls. Change-Id: I6b3a0d10f5e88c6a0a10bd20f620562d2de7ab25 Reviewed-on: http://gerrit.cloudera.org:8080/2979 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:57 -07:00
Marcel Kornacker	3b7d5b7c17	MT: Planner for multi-threaded execution New classes: - ParallelPlanner: creates build plans, assigns plans to cohorts - JoinBuildSink: DataSink for plan fragments that materialize build sides - ids for plans, hash tables, plan fragments Tests: this adds a new test file section PARALLELPLANS and augments the tpc-h/-ds tests with those sections. In the interest of keeping this patch small I didn't augment other test files with that section yet (which will happen at a later date, to cover more corner cases). Change-Id: Ic3c34dd3f9190a131e6f03d901b4bfcd164a5174 Reviewed-on: http://gerrit.cloudera.org:8080/2846 Tested-by: Internal Jenkins Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2016-05-12 14:17:56 -07:00
Tim Armstrong	8e64273fee	Fix Kudu hole punch check to work if /tmp is on different fs /tmp isn't necessarily on the same filesystem as the Kudu data directory. Fix the check so that it checks the actual Kudu directory. Change-Id: Ic6aa27569a0650db7dcf5759952cd50c8e47f8c9 Reviewed-on: http://gerrit.cloudera.org:8080/2967 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:56 -07:00
Tim Armstrong	34c95c9590	IMPALA-2345,2991: test coverage for spilling and sorts Add missing coverage for sorting by CHAR and VARCHAR. Add more coverage for spilling sorts. Fix spilling tests: ensure that they actually reliably spill (many of them had memory limits high enough that they could run entirely in memory). I ran this in a loop for a while to flush out flaky tests. The tests should be fairly predictable given that they're not run concurrently with other tests and we allocate enough block manager memory so that each operator can obtain its reservation. Change-Id: Ia2d2627a2c327dcdf269ea3216385b1af9dfa305 Reviewed-on: http://gerrit.cloudera.org:8080/2877 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:55 -07:00
Sailesh Mukil	3093054e95	IMPALA-3460: test_grant_revoke: remove S3-specific workload Now that we functionally support writes to S3 via Impala, test_grant_revoke should not have a special case for S3, which until this patch ran the test without INSERTs. Change-Id: Id981e7f83bf86b32d1a5b267ad3781db02337e86 Reviewed-on: http://gerrit.cloudera.org:8080/2949 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:54 -07:00
Thomas Tauber-Marshall	8c2bf9769a	IMPALA-2805: Order conjuncts based on selectivity and cost Added costs to all Exprs, which estimate the relative cost of evaluating an expression and all of its children. Costs are calculated during analysis. For now, these costs are intended as a simple way to order expressions from cheap to expensive, not necessarily to be a precise reflection of running times. In general, expressions that deal with variable length types like strings will have higher cost than those dealing with fixed length types like numbers and booleans. Additionally, expressions with complicated subexpressions will have higher cost than simpler expressions. Also added PlanNode.orderConjunctsByCost, which takes a list of Exprs and returns a new list sorted according to an estimate of the cheapest order in which to evaluate the conjuncts, based on their cost and selectivity. The conjuncts are sorted by repeatedly iterating over them and choosing the conjunct that would result in the least total estimated work were it to be applied before the remaining conjuncts. Selectivities are exponentially backed off, and Exprs without selectivity estimates are given a reasonable default. Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034 Reviewed-on: http://gerrit.cloudera.org:8080/2598 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:53 -07:00
Alex Behm	a41710a0c8	Use unique_database fixture in test_compute_stats.py. This patch makes it a little easier to use the unique_database fixture with .test files. The RESULTS section can now contain $DATABASE which is replaced with the current database by the test framework. Testing: - ran the test locally on exhaustive - ran the test on hdfs and the local filesystem on Jenkins Change-Id: I8655eb769003f88c0e1ec1b254118e4ec3353b48 Reviewed-on: http://gerrit.cloudera.org:8080/2947 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:50 -07:00
Sailesh Mukil	ed7f5ebf53	IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and has the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be used interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS performance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Reviewed-on: http://gerrit.cloudera.org:8080/2574 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:49 -07:00
Alex Behm	bce6b2b422	IMPALA-2736: Basic column-wise slot materialization in Parquet scanner. This change is a first step towards a more efficient Parquet scanner. The focus is on presenting the new code flow that materializes the table-level slots in a column-wise fashion, without going deep into actually improving scan efficiency. After these changes there are several obvious places that should be optimized to realize efficiency gains. Summary of changes - the table-level tuples are materialized in a column-wise fashion with new ColumnReader::ReadValueBatch() functions - this is done by materializing a 'scratch' batch, and transferring scratch tuples that survive filters/conjuncts to the output batch - the tuples of nested collections are still materialized in a row-wise fashion using the ColumnReader::ReadValue() function, just as before Mini benchmark I ran the following queries on a single impalad before and after my change using a synthetic 'huge_lineitem' table. I modified hdfs-scan-node.cc to set the number of rows of any row batch to 0 to focus the measurement on the scan time. Query options: set num_scanner_threads=1; set disable_codegen=true; set num_nodes=1; select * from huge_lineitem; Before: 22.39s After: 18.50s select * from huge_lineitem where l_linenumber < 0; Before: 25.11s After: 20.56s select * from huge_lineitem where l_linenumber % 2 = 0; Before: 26.32s After: 21.82s Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa Reviewed-on: http://gerrit.cloudera.org:8080/2779 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:48 -07:00
Lars Volker	a09c80a33e	IMPALA-3458: Fix table creation to test insert with header lines For IMPALA-1740 we added a test to insert.test, which creates a table and inserts data. The table was created on HDFS by default and thus inserts with compression enabled did not work. This change adds the required table to the functional schema in the same way we do it for the other insert tests. Change-Id: Ie68e7067b7a16218d27935820d5d1ce7035d2e6c Reviewed-on: http://gerrit.cloudera.org:8080/2919 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:47 -07:00
Henry Robinson	6fd8faa718	IMPALA-3007: Adjust Bloom Filter size according to NDV estimate Instead of having a default Bloom Filter size for all runtime filters, adjust filter size according to desired FP-rate and expected NDV from join's build-side. Size of filter is still clipped to 4k < N < 16MB range. If NDV estimate from planner is -1 (i.e. no stats) the default filter size is used. The NDV of all filters produced by the same join is currently the same because the NDV is estimated from the cardinality of the input. In the future, the NDV should be estimated for each filter source expr. The BE changes anticipate this and can enable or disable individual filters if they have differing FP rates. Change-Id: I1fe37b8d4cfb3c52bb8e8cf0ca55e92665b87803 Reviewed-on: http://gerrit.cloudera.org:8080/2812 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:46 -07:00
Lars Volker	b5570da405	IMPALA-1740: Add support for skip.header.line.count. HIVE-5795 introduced a parameter skip.header.line.count to skip header lines from input files. This change introduces the capability to skip an arbitrary number of header lines from csv input files on hdfs. The size of the total file header must be smaller than max_scan_range_length, otherwise an error will be reported. This is necessary because scan ranges are not read in disk order, so there is no way of identifying header lines except by counting from the start of the first scan range. [localhost:21000] > alter table t1 set tblproperties('skip.header.line.count'='1'); Query: alter table t1 set tblproperties('skip.header.line.count'='1') [localhost:21000] > select * from t1; Query: select * from t1 +----+----+ \| c1 \| c2 \| +----+----+ \| 1 \| 1 \| \| 2 \| 2 \| \| 3 \| 3 \| +----+----+ Fetched 3 row(s) in 0.32s [localhost:21000] > alter table t1 set tblproperties('skip.header.line.count'='0'); Query: alter table t1 set tblproperties('skip.header.line.count'='0') [localhost:21000] > select * from t1; Query: select * from t1 +------+------+ \| c1 \| c2 \| +------+------+ \| NULL \| NULL \| \| 1 \| 1 \| \| 2 \| 2 \| \| 3 \| 3 \| +------+------+ WARNINGS: Error converting column: 0 TO INT (Data is: num1) Error converting column: 1 TO DOUBLE (Data is: num2) file: hdfs://localhost:20500/test-warehouse/t1/test.txt record: num1,num2 Fetched 4 row(s) in 0.41s Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543 Reviewed-on: http://gerrit.cloudera.org:8080/2110 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:46 -07:00
Tim Armstrong	6f9434217e	IMPALA-3412: fix CHAR codegen crash in tuple comparator Attempting to codegen a sort where the sort expr has a CHAR type as an intermediate result fails completely. The problem is that ScalarFnCall checked whether its input arguments were CHAR to disable codegen, but didn't check its output. This patch also replaces some incorrect codegen CHAR logic that should not be executed with DCHECKs. Testing: The test is a minimal reproduction of the issue. The test is executed both by the sorter and top-n nodes so covers both cases. Change-Id: I189073d46a10988803d572928a38f4a718690fa3 Reviewed-on: http://gerrit.cloudera.org:8080/2876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:45 -07:00
Henry Robinson	a0d71e8192	IMPALA-3245 / IMPALA-3305: Fix crash with global filters when NUM_NODES=1 Filter locality was not correctly set when NUM_NODES=1 (and therefore no distributed plan was created). The default locality should be 'local'. This patch also fixes a bug where the initial filter routing table wasn't printed when NUM_NODES=1. Change-Id: I7b9a6bcc64ca6ec5fd51d63815cea25de866ef93 Reviewed-on: http://gerrit.cloudera.org:8080/2721 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:44 -07:00
Skye Wanderman-Milne	9f366645ab	IMPALA-3378/IMPALA-3379: fix various JNI issues This patch: 1) Removes JniUtil::Cleanup() and JniUtil::global_refs_. We never called Cleanup(), and all the jobjects in global_refs_ are meant to have the lifetime of the impalad process. This makes JniUtil::GetGlobalClassRef() and JniUtil::LocalToGlobalRef() thread-safe (which fixes IMPALA-3379). 2) Introduces a new JniUtil::FreeGlobalRef() method, which is a wrapper around the JNI DeleteGlobalRef() method. 3) Changes JNI users to use the JniUtil methods instead of the JNI methods directly where appropriate. This makes error checking more consistent, and makes it easier to find all JNI uses. This is possible since GetGlobalClassRef() and LocalToGlobalRef() are now thread-safe and don't leak jobjects. 4) Removes HiveUdfCall::JniContext::cl, as well as other JNI constants, and replaces them with process-wide static singletons. It then moves the initialization to a new HiveUdfCall::Init() method which is called once in the main thread at the beginning of the process. This fixes IMPALA-3378. 5) Deletes the JniContext created for each HiveUdfCall. Unfortunately I am not able to repro IMPALA-3378 so there is no test case (I didn't attempt IMPALA-3379 but it's similar). Change-Id: I8cd089e355d2ee2d5ace81f05b214272c05cf941 Reviewed-on: http://gerrit.cloudera.org:8080/2820 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:41 -07:00
Anuj Phadke	a915293109	IMPALA-1850: Allow fs.defaultFS to be set to a non-HDFS filesystem This change whitelists the supported filesystems which can be set as Default FS for Impala to run on. This patch configures Impala to use S3 as the default filesystem, rather than a secondary filesystem as before. Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed Reviewed-on: http://gerrit.cloudera.org:8080/1121 Reviewed-by: anujphadke <aphadke@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:40 -07:00
Michael Ho	37a73ae9f7	IMPALA-3350: Add some missing StringVal.is_null checks Impala has a hardcoded limit of 1GB in size for StringVal. If the length of the string exceeds 1GB, Impala will simply mark the StringVal as NULL (i.e. is_null = true). It's important that string functions or built-in UDFs check this field before accessing the pointer or Impala may end up doing null pointer access, leading to crashes. Change-Id: I55777487fff15a521818e39b4f93a8a242770ec2 Reviewed-on: http://gerrit.cloudera.org:8080/2786 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:39 -07:00
Tim Armstrong	5b75601920	Query options not correctly reset after each test. The regex didn't match cases where the 'set' statement had whitespace between the preceding semicolon and the 'set'. E.g. if it is not the first statement in the block and is preceded by a newline. The resolution by name test implicitly relied on the bug, so it needed to be updated. Change-Id: Ic810b31c1ad7b2bcfd29413181bb81d1a0dbcb90 Reviewed-on: http://gerrit.cloudera.org:8080/2823 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:38 -07:00
casey	a27946e696	Improve mini-cluster usability (testdata/cluster/admin) Changes: 1) Previously when a service would fail, the user would have to find the log file and open it. Now the end of the log is dumped to stdout. 2) Add start, stop, and restart commands to the "admin" script. For example now you can run testdata/cluster/admin restart kudu 3) Wait up to 120 seconds for services to shut down. The timeout is the same as for the Impala processes. If the services fail to stop an error will be raised. Change-Id: I537ea5656df2081d4f1f27a9f3fcef4547fdc2fe Reviewed-on: http://gerrit.cloudera.org:8080/2751 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:37 -07:00
Harrison Sheinblatt	1058163f70	IMPALA-2276: Isilon and s3 builds must fail with stale snapshot If a stale snapshot is detected, the full data load proceeds even if the option to skip data load was set. A check is added to fail immediately if this happens for isilon or s3 because the full data load will not work on these filesystems currently. Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52 Reviewed-on: http://gerrit.cloudera.org:8080/2811 Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com> Tested-by: Internal Jenkins	2016-05-12 14:17:37 -07:00
Henry Robinson	6629e79f32	IMPALA-3077: Don't run spilling / nested tests without PHJ Change-Id: Ide5e20f05b14aa19a0f570398712ac9297b525eb Reviewed-on: http://gerrit.cloudera.org:8080/2822 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:36 -07:00
Tim Armstrong	31d4103416	IMPALA-3317: fix crash in sorter when spilling zero-length strings The sorter converts string pointers to block offsets when spilling. There was a subtle bug in the logic that assumed if the offset was past the end of the current block, the data must necessarily be in the next block. This is not true for zero-length strings, because there is no backing storage so the pointer can point to the byte after the end of the block. This patch fixes the bug by using a simpler offset encoding scheme that packs the block number into the upper 32 bits and the offset within the block into the lower 32 bits. It also slightly refactors the functions so that the method signatures and types are more consistent with the rest of the impala codebase. Also fix a bug with handling of multiple query options in tests. Change-Id: I5f64593e94d367d6b6efb61a8b86e35516f18839 Reviewed-on: http://gerrit.cloudera.org:8080/2780 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:36 -07:00
casey	6f4a5e6bb0	Kudu: Use -block_manager=file if "hole punching" isn't supported By default Kudu requires the underlying file system to support hole punching. If support isn't there Kudu will fail to start. People using such a file system can instead start Kudu with -block_manager=file. Before starting Kudu in the local mini-cluster, the "fallocate" command will be used to automatically determine if the special flag is needed. Note, users who need this must run bin/create-test-configuration.sh after pulling in this commit. This also fixes a bug in the delete_kudu_data() in the cluster admin script. A directory name was incorrect. Change-Id: I1ca7fedb367444c41e462b72b0b76091ee94e27c Reviewed-on: http://gerrit.cloudera.org:8080/2750 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:36 -07:00
Alex Behm	50314b5b2c	Set PYTHONUNBUFFERED in wait-for-* scripts. Before this fix our "Waiting for something to happen" print output would be buffered and dumped all at once when the event we were waiting for succeeded or we hit a timeout. After this fix the output of "print" is displayed on the console immediately, as was originally intended. Change-Id: Icf341e81d0d459504918ae7c9e88918fe5e16c59 Reviewed-on: http://gerrit.cloudera.org:8080/2810 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:35 -07:00
casey	cef87e39dc	Updates for new Kudu toolchain layout and upgrade Kudu The directory structure of the newer Kudu toolchain artifacts has changed. Now the root directory is split into /release and /debug. A few little updates are needed to the build and service scripts. Since the toolchain no longer provides stubs for platforms that Kudu doesn't support the stubs need to be generated. This will be done as part of the toolchain bootstrapping. Also this upgrades Kudu to 0.8 RC1. Developers will need to run bin/create-test-configuration.sh after pulling in this change. Otherwise the Kudu service will fail to start. Change-Id: I625903bd92afece0ad819a96fc275d5812b5eb2a Reviewed-on: http://gerrit.cloudera.org:8080/2720 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:35 -07:00
Henry Robinson	c14a6f11df	IMPALA-3077: Enable runtime filters when PHJ spills This patch changes when runtime filters are produced in the partitioned hash-join node to allow filters to be produced even when the PHJ spills. Filters are now produced during the level0 processing of the PHJ's build-side input in ProcessBuildBatch(). Since this function is codegen'ed, so now is filter production. We use constant-propagation via constant argument injection to disable filter production at no cost when it is not needed (including in level1+ repartitioning). I inspected the IR to confirm that the constant propagation works as expected. This change also allows us to send filters earlier during build-side processing. A tradeoff is that filters are still built even if the expected FP rate is too high, although any too-permissive filters are still not sent to the scan (see 'Performance impact' below). The restriction that prevented filters from being computed inside a sub-plan is removed as part of this cleanup (since the FE handles assigning filters correctly in subplans), and a test is added to confirm that one of the correct cases for filters in subplans works. This patch also fixes a bug where re-partitioning beyond level0 would not use the codegen'ed implementation of ProcessBuildBatch(). A new test is added to test_runtime_row_filters, for Parquet only, which spills and confirms that filtering still occurs. Finally, the legacy --enable_phj_probe_side_filtering / --enable_probe_side_filtering flags have been deprecated, as runtime filtering can be permanently disabled via setting RUNTIME_FILTER_MODE=OFF. The implementation that the old flags referred to has been removed. 
Performance impact ------------------ We benchmark the performance loss due to always computing runtime filters even when the FP-rate will turn out to be too high as follows: select STRAIGHT_JOIN count(*) from (select id from functional.alltypes LIMIT 1) a JOIN [BROADCAST] (select * FROM p LIMIT 100000000) b on a.id = -b.id and b.part_col > 0 ('p' is a two-column Parquet table with 1B rows). This builds a 100M row build table (benchmarks run on one node). When filtering is enabled, the filter is built but selects all rows from the probe side (so that there's no benefit to having the filter, to emphasise the cost of building the filter in the first place). RUNTIME_FILTER_MODE / Avg. time (s) over 5 runs: OFF 18.95; GLOBAL 19.55; Change +3% Change-Id: I59a2d9ee03ccea6b674392584e4c7f272233571e Reviewed-on: http://gerrit.cloudera.org:8080/2783 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-05-12 14:17:34 -07:00
Casey Ching	5387636140	IMPALA-3373: Computing stats on Kudu table duplicates the columns Computing stats caused the Kudu table to be reloaded in the catalog and the column definitions ended up getting appended to the existing ones. There was already a method to clear the column state, so now that is called during load(). Change-Id: I9ad42338750e9d8873a3584bc22a7cd7bd465c5d Reviewed-on: http://gerrit.cloudera.org:8080/2813 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:34 -07:00
Michael Brown	6d8c075d9c	IMPALA-1996: Start HBase per directions in documentation; Implement HBase startup retry I. Start HBase per directions 1. https://hbase.apache.org/book.html#_configuration_files mentions a 'regionservers' file that points to a list of hosts on which to start HBase RegionServers. When HBase starts in our mini-cluster there are messages printed like this: cat: /home/mikeb/Impala/fe/src/test/resources/regionservers: No such file or directory The presence of this file now starts a single RegionServer and takes the place of RegionServer 1 in the "additional region servers" startup, a separate call. 2. The additional RegionServers are started but now we only start 2 from index 2. See https://hbase.apache.org/book.html#quickstart_pseudo There are still 3 total RegionServers using the same ports as before. We are simply configuring our settings as directed in the documentation. There were mentions in testdata/bin/run-hbase.sh of a "hbase race". One possible such bug is https://issues.apache.org/jira/browse/HBASE-5780 which has been fixed for a while. I've removed the check to wait for that Master, though I have not removed the Python script that does the waiting. We could remove that later after we let this patch bake. Also, https://issues.apache.org/jira/browse/HBASE-4467 has been marked "not a problem", so I've removed references to that. II. Implement HBase start retry If starting either HBase Master or additional RegionServers fails, kill all of HBase and try again. Do this for some number of attempts. In order to keep errexit ("set -e") happy, we expect the possibility of some of the startup attempts failing. We use control flow in those cases. In the last case, errexit can fail on our behalf. There is some code duplication here, but because Bash can't give us a stack trace on failure, and only a line number, I chose not to use functions to handle reuse. We don't really have functions anywhere else at the moment, either. 
Testing: It's pretty difficult to try to trigger a real "HBase fails to start" situation. I tested my changes by faking HBase failures, both when starting up the Master and first RegionServer, and also starting subsequent RegionServers. Multiple private builds have passed. Change-Id: Ib1d055a8a9098ce24e2f31b969501b6e090eab19 Reviewed-on: http://gerrit.cloudera.org:8080/2804 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:33 -07:00
Jim Apple	1c16dd0cf8	IMPALA-2107: Add Base64 encoder/decoder Change-Id: I911451c5d68e8ae9d352abfcf4d5ff36484f0bf3 Reviewed-on: http://gerrit.cloudera.org:8080/2633 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:32 -07:00
Michael Ho	cbcda93dfb	IMPALA-3334: Fix some bugs in query options' parsing. This change fixes two problems: 1. The query options OPTIMIZE_PARTITION_KEY_SCANS and DISABLE_STREAMING_PREAGGREGATIONS are both boolean so they should accept 'true' and '1' as input values. Previously, these two options were treated as int, so values such as 'true' did not work with them. 2. The break statement in the case statement of the option SCAN_NODE_CODEGEN_THRESHOLD was 'stolen' by the option DISABLE_STREAMING_PREAGGREGATIONS when it was added. This change adds the missing break statement back for SCAN_NODE_CODEGEN_THRESHOLD. Change-Id: I5c74a1e5c49e3bda15a91b40740fc7310303207b Reviewed-on: http://gerrit.cloudera.org:8080/2776 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:31 -07:00
Jim Apple	83a1434bc9	IMPALA-2840: Don't store table location in partition location For a table with location "ABC", most partitions will have locations like "ABC/DEF=2". The "ABC" part of the location does not need to be stored in Catalog for each partition; we can compress it down to one int in the common case. This is done by stripping from each partition location the last N directories (where N is the number of clustering columns) and storing the resulting string in a cache of partition location prefixes. In the cache, this location prefix string is mapped to an int. Partition locations are then stored as a tuple consisting of that int and a suffix string; the partition location can be reconstructed as the concatenation of the prefix string (from the cache) and the suffix. Though this scheme was designed in the expectation that most partitions will be stored in directories like "/part_col_1=1.23/part_col_2=234/", it works even when that is not the case. TODO: Since each partition stores the literal values for the partitioning columns, we could also elide the column names and values when partitions are placed in directories like "/part_col_1=1.23/part_col_2=234/" Change-Id: I8c67b6ce0f83de2f5277a528a9ce67e47d638adb Reviewed-on: http://gerrit.cloudera.org:8080/2355 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:29 -07:00
Sailesh Mukil	c083c79888	IMPALA-3256: TestUdfs.test_libs_with_same_filenames failure There was an observed race between TestUdfs.test_java_udfs and TestUdfs.test_libs_with_same_filenames, because they both used the same database name. This patch just changes the name of the database used by test_libs_with_same_filenames. Change-Id: Icc38cbe720a3b9d864935416eb10612171132e17 Reviewed-on: http://gerrit.cloudera.org:8080/2767 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:29 -07:00
Lars Volker	ee8c309187	Fix typo in load-test-warehouse-snapshot.sh Change-Id: I2ef9b32cbc56819f80db864a6590a9a7b2732c9c Reviewed-on: http://gerrit.cloudera.org:8080/2310 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Todd Lipcon	4bdd0b976d	IMPALA-3148. Fix selectivity computation for pushed Kudu predicates This follows up on a TODO from the Kudu merge and also fixes a bug: IMPALA-976 changed the computation of selectivities for a combined list of conjuncts to better handle expressions with no selectivity estimate. The Kudu implementation was forked from before this change and thus did not have an equivalent change. This refactors the algorithm to a new static method and calls it from both PlanNode and KuduScanNode so that the selectivity estimate behavior is the same regardless of whether Kudu can evaluate the predicate server-side. Todd tested this on TPCH 3TB and verified that the plans are reasonable now where they used to be nonsense. Change-Id: Id507077b577ed5804fc80517f33ea185f2bff41a Reviewed-on: http://gerrit.cloudera.org:8080/2628 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Sailesh Mukil	86fd262dc9	IMPALA-3324: Hive server does not start for S3 builds. The hive server does not start for S3 builds because HDFS is marked as an unsupported service in testdata/cluster/admin; and so HDFS is not started at all, and so the Hive server is unable to start as well. Due to this, all our S3 builds fail. Currently our S3 builds need HDFS to run correctly. (This has to be reverted once IMPALA-1850 goes in, because then S3 can run as a default FS without HDFS) Change-Id: Ibda9dc3ef895c2aa4d39eb5694ac5f2dbd83bee4 Reviewed-on: http://gerrit.cloudera.org:8080/2741 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:43 -07:00
Casey Ching	9bb1b8a366	Kudu: Disable fsync in the mini-cluster The Kudu team recommended disabling this for testing purposes. This should help with timeouts in cloud machines (ec2/gce). Disabling fsyncs could lead to data loss if the system crashed before the OS had a chance to write the data to disk. Our test setups don't need that level of reliability. Change-Id: I72fd85ce5c4bc71f071b854ea6a9ebe60fc1305f Reviewed-on: http://gerrit.cloudera.org:8080/2734 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:43 -07:00
Casey Ching	9d43aac6ce	IMPALA-3274: Always start Kudu for testing Previously Kudu would only be started when the test configuration was the standard mini-cluster. That led to failures during data loading when testing without the mini-cluster (ex: local file system). Kudu doesn't require any other services so now it'll be started for all test environments. Change-Id: I92643ca6ef1acdbf4d4cd2fa5faf9ac97a3f0865 Reviewed-on: http://gerrit.cloudera.org:8080/2690 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:02:35 -07:00
Skye Wanderman-Milne	9b51b2b6e6	IMPALA-2835: introduce PARQUET_FALLBACK_SCHEMA_RESOLUTION query option This patch introduces a new query option, PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas to be resolved by either name or position. It's "fallback" because eventually field IDs will be the primary schema resolution scheme, and we don't want to create an option that we will have to change the name of later. The default is still by position. I chose to do a query option because it will make testing easier and also be easier to diagnose resolution problems quickly in the field. If users want to switch the default behavior to be by name (like Hive), they can use the --default_query_options flag. This patch also introduces a new test section, SHELL, which can be used to execute shell commands in a .test file. This is useful for copying files into test tables. Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6 Reviewed-on: http://gerrit.cloudera.org:8080/2384 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-04-02 04:04:25 +00:00
Skye Wanderman-Milne	2cbd327d41	Regenerate complextypestbl files to include nested_struct.g field This field was included in the schema and data files, but the checked-in generated parquet files didn't include it. It's not referenced in any tests so we didn't catch it. Change-Id: I5d394f074e7082fa12fafb7e57a144a83b3099a6 Reviewed-on: http://gerrit.cloudera.org:8080/2562 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-04-01 05:06:38 +00:00
Bharath Vissapragada	5cd7ada727	IMPALA-3194: Allow queries materializing scalar type columns in RC/sequence files This commit unblocks queries materializing only scalar typed columns on tables backed by RC/sequence files containing complex typed columns. This worked prior to 2.3.0 release. Change-Id: I3a89b211bdc01f7e07497e293fafd75ccf0500fe Reviewed-on: http://gerrit.cloudera.org:8080/2580 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-31 12:06:57 +00:00
Sailesh Mukil	49a73cd598	IMPALA-3249: Failed to mkdirs on core-local-filesystem build. This failure happens on filesystems other than HDFS because as a part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the new directories that the patch tries to create in create-load-data. Change-Id: I8de74db93893c5273ccc9c687f608959628f5004 Reviewed-on: http://gerrit.cloudera.org:8080/2644 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-30 00:03:45 +00:00
Casey Ching	39a28185e8	Re-enable Kudu in build using client stubs when needed The stubs in Impala broke during the merge commit. This commit removes the stubs in hopes of improving robustness of the build. The original problem (Kudu clients are only available for some OSs) is now addressed by moving the stubbing into a dummy Kudu client. The dummy client only allows linking to succeed; if any client method is called, Impala will crash. Before calling any such method, Kudu availability must be checked. Change-Id: I4bf1c964faf21722137adc4f7ba7f78654f0f712 Reviewed-on: http://gerrit.cloudera.org:8080/2585 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-03-29 23:57:54 +00:00
Alex Behm	b2ccb17c21	Print last 50 lines of log if data loading fails. The 20 lines we dump currently are often not enough to diagnose a failure quickly. Increasing to 50 lines. Printing 50 lines is also consistent with our run-step script which also prints 50 lines. Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73 Reviewed-on: http://gerrit.cloudera.org:8080/2632 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 20:22:18 +00:00
Alex Behm	7e76e92bef	Consolidate test and cluster logs under a single directory. All logs, test results and SQL files generated during data loading and testing are now consolidated under a single new directory $IMPALA_HOME/logs. The goal is to simplify archiving in Jenkins runs and debugging. The new structure is as follows: $IMPALA_HOME/logs/cluster - logs of Hadoop components and Impala $IMPALA_HOME/logs/data_loading - logs and SQL files produced in data loading $IMPALA_HOME/logs/fe_tests - logs and test output of Frontend unit tests $IMPALA_HOME/logs/be_tests - logs and test output of Backend unit tests $IMPALA_HOME/logs/ee_tests - logs and test output of end-to-end tests $IMPALA_HOME/logs/custom_cluster_tests - logs and test output of custom cluster tests I tested this change with a full data load which was successful. Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa Reviewed-on: http://gerrit.cloudera.org:8080/2456 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 19:23:22 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
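
The prefix/suffix scheme described in the IMPALA-2840 commit message above (strip the last N directories, map the remaining prefix to an int, store the int plus the suffix) can be sketched roughly as follows. This is only an illustration of the idea in Python, not the code from the patch; the class name PartitionLocationCompressor, its methods, and the example paths are hypothetical.

```python
class PartitionLocationCompressor:
    """Hypothetical sketch: store partition locations as (prefix_id, suffix)."""

    def __init__(self, num_clustering_cols):
        # N in the commit message: number of trailing directories to strip.
        self.num_clustering_cols = num_clustering_cols
        self._prefix_to_id = {}   # prefix string -> int
        self._prefixes = []       # int -> prefix string

    def _split(self, location):
        # Walk back over the last N '/' separators to find the split point.
        idx = len(location)
        for _ in range(self.num_clustering_cols):
            idx = location.rfind("/", 0, idx)
            if idx < 0:
                # Fewer than N directories: keep everything in the suffix.
                return "", location
        return location[:idx], location[idx:]

    def compress(self, location):
        prefix, suffix = self._split(location)
        prefix_id = self._prefix_to_id.get(prefix)
        if prefix_id is None:
            # First time this prefix is seen: assign the next int.
            prefix_id = len(self._prefixes)
            self._prefix_to_id[prefix] = prefix_id
            self._prefixes.append(prefix)
        return prefix_id, suffix

    def decompress(self, compressed):
        # Reconstruct the location as prefix (from the cache) + suffix.
        prefix_id, suffix = compressed
        return self._prefixes[prefix_id] + suffix


compressor = PartitionLocationCompressor(num_clustering_cols=2)
loc_a = "hdfs://nn/warehouse/t/year=2016/month=5"
loc_b = "hdfs://nn/warehouse/t/year=2016/month=6"
packed_a = compressor.compress(loc_a)
packed_b = compressor.compress(loc_b)
```

In this sketch both partitions share one cached prefix, so each stores only a small int and its own suffix. A location that does not follow the "col=value" directory layout still round-trips; it just ends up with its own prefix entry.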

1289 Commits