This puts all of the thrift-generated Python code into the
impala_thrift_gen package. This mirrors what Impyla
does for its thrift-generated Python code, except that Impala
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute-import
issues.
This patches all of Impala's thrift files to add the Python
namespace, and adds code to apply the same patching to the
thirdparty thrift files (hive_metastore.thrift, fb303.thrift).
Putting all the generated Python code into a package makes it easier
to understand where imports are resolved from. When the
subsequent change rearranges the shell code, the thrift-generated
code can stay in a separate directory.
This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.
Testing:
- Ran a core job
Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
The patch introduces a new test, TestTupleCacheComputeStats, to
verify that compute stats does not change the tuple cache key.
The test creates a simple table with one row, runs an explain
on a basic query, then inserts more rows, computes the stats,
and reruns the same explain query. It compares the two results
to ensure that the cache keys are identical in the planning
phase.
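A minimal sketch of the test's shape (helper names such as
get_cache_keys() are illustrative placeholders, not the actual code):
def test_compute_stats_does_not_change_key(self, unique_database):
    tbl = "{0}.cache_tbl".format(unique_database)
    self.execute_query("create table {0} (i int)".format(tbl))
    self.execute_query("insert into {0} values (1)".format(tbl))
    before = get_cache_keys(self.execute_query("explain select * from " + tbl))
    self.execute_query("insert into {0} values (2), (3)".format(tbl))
    self.execute_query("compute stats " + tbl)
    after = get_cache_keys(self.execute_query("explain select * from " + tbl))
    assert before == after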
Tests:
Passed the test.
Change-Id: I918232f0af3a6ab8c32823da4dba8f8cd31369d0
Reviewed-on: http://gerrit.cloudera.org:8080/21917
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, the frontend generates multiple scan ranges for a
Parquet file when the count star optimization is enabled. The backend
function HdfsParquetScanner::GetNextInternal() also calls NextRowGroup()
multiple times to find row groups and sum up RowGroup.num_rows. This
is inefficient because we only need to read the file metadata to
compute the count star result. This patch instead creates only one
scan range per Parquet file, containing just the file footer.
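The idea can be illustrated from Python with pyarrow, which likewise
answers a row count purely from the footer metadata (this sketches the
concept, not Impala's C++ code path):
import pyarrow.parquet as pq

# num_rows comes from the file footer alone; no data pages are read.
metadata = pq.ParquetFile("/path/to/file.parquet").metadata
print(metadata.num_rows)  # the sum of RowGroup.num_rows over all row groups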
The following table shows a performance comparison before and after
the patch. The primitive_count_star_multiblock query is a modified
primitive_count_star query that targets a multi-block
tpch10_parquet.lineitem table. The files of the table were generated
by the command `hdfs dfs -Ddfs.block.size=1048576 -cp -f -d`.
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED | parquet / none / none | 0.17 | 0.16 | +2.58% | * 29.53% * | * 27.16% * | 30 | +1.20% | 0.58 | 0.35 |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet / none / none | 0.27 | 0.26 | +2.96% | 8.97% | 9.94% | 30 | +0.16% | 0.44 | 1.19 |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT | parquet / none / none | 0.18 | 0.18 | -0.69% | 1.65% | 1.99% | 30 | -0.34% | -1.55 | -1.47 |
| TARGETED-PERF(10) | primitive_count_star_multiblock | parquet / none / none | 0.06 | 0.12 | I -49.88% | 4.11% | 3.53% | 30 | I -99.97% | -6.54 | -66.81 |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
Testing:
- Ran PlannerTest#testParquetStatsAgg
- Added new test cases to query_test/test_aggregation.py
Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Reviewed-on: http://gerrit.cloudera.org:8080/20804
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds ClientFetchWaitTimeStats to the runtime profile
to track the avg/min/max/number of samples for ClientFetchWaitTimer.
Here is some sample output:
- ClientFetchWaitTimeStats: (Avg: 161.554ms ; Min: 101.411ms ; Max: 461.728ms ; Number of samples: 6)
- ClientFetchWaitTimer: 969.326ms
This also fixes the definition of ClientFetchWaitTimer to avoid
including time after the end of the fetch. When the client is closing
the query, Finalize() gets called. The Finalize() call should
only add extra client wait time if the fetch has not completed.
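A hedged sketch of parsing that summary-stat line format (the regex is
illustrative; the real helper lives in the test utilities):
import re

SUMMARY_STAT_RE = re.compile(
    r"\(Avg: (?P<avg>.*?) ; Min: (?P<min>.*?) ; "
    r"Max: (?P<max>.*?) ; Number of samples: (?P<samples>\d+)\)")

line = ("- ClientFetchWaitTimeStats: (Avg: 161.554ms ; Min: 101.411ms ; "
        "Max: 461.728ms ; Number of samples: 6)")
m = SUMMARY_STAT_RE.search(line)
assert m.group("samples") == "6"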
Testing:
- Added test cases in query_test/test_fetch.py with specific
numbers of fetches and verification of the statistics.
- The test cases make use of a new function for parsing
summary stats for timers, and this also gets its own test
case.
Change-Id: I9ca525285e03c7b51b04ac292f7b3531e6178218
Reviewed-on: http://gerrit.cloudera.org:8080/19897
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
The end time in DML profiles is incorrect: it is actually the time
when admission control resources are released. That is correct for
normal queries, but for DMLs the coordinator still needs to invoke the
updateCatalog RPC of catalogd to finalize the HMS update. The end time
should be set only after that request finishes.
This patch fixes the DML end time by no longer setting it when the
admission control resources are released. Instead, it is set after
ClientRequestState::WaitInternal() finishes, which ensures that the
updateCatalog RPC has finished.
This patch also adds a duration field to the profile.
For testing, this patch also adds a new debug action in catalogd
(catalogd_insert_finish_delay) to inject delays in updateCatalog.
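A hedged sketch of the kind of check the debug action enables (the
helper parse_profile_time() and the profile labels are assumptions,
not taken from the patch):
def check_dml_end_time(result, injected_delay_s):
    # With catalogd_insert_finish_delay injected, the end time must be
    # set only after updateCatalog returns, so the reported duration
    # has to cover the injected delay.
    start = parse_profile_time(result.runtime_profile, "Start Time")
    end = parse_profile_time(result.runtime_profile, "End Time")
    assert (end - start).total_seconds() >= injected_delay_s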
Tests:
- Added e2e test to verify the end time of a DML profile
Change-Id: I9c5dc92c2f8576ceed374d447c0ac05022a2dee6
Reviewed-on: http://gerrit.cloudera.org:8080/19644
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
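A representative header (the exact per-file formatting may vary),
along with the division behavior it buys on Python 2:
from __future__ import absolute_import, division, print_function

# Without the import, Python 2 gives 7 / 2 == 3; with it, as in Python 3:
assert 7 / 2 == 3.5
# Code that genuinely needs an integer result uses '//' instead:
assert 7 // 2 == 3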
I scrutinized each old-division location and converted the locations
that needed an integer result (e.g. indices, counts of records, etc.)
to the integer division '//' operator. Some code was also using
relative imports and needed to be adjusted for absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Include the remaining TPC-DS queries in the testdata workload
definition. Q8 and Q38 were using non-standard variants; those have
been replaced by the official query versions. Q35 uses an official
variant. A table alias in Q90 had to be escaped because we treat 'AT'
as a reserved keyword.
Change-Id: Id5436689390f149694f14e6da1df624de4f5f7ad
Reviewed-on: http://gerrit.cloudera.org:8080/16280
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
INTERSECT and EXCEPT set operations are implemented as rewrites to
joins. Currently only the DISTINCT qualified operators are implemented,
not ALL qualified. The operator MINUS is supported as an alias for
EXCEPT.
We mimic Oracle and Hive's non-standard implementation, which treats
all operators with the same precedence, as opposed to the SQL standard
of giving INTERSECT higher precedence. For example, a UNION b INTERSECT c
is evaluated left to right as (a UNION b) INTERSECT c rather than as
a UNION (b INTERSECT c).
A new class SetOperationStmt was created to encompass the previous
UnionStmt behavior. UnionStmt is preserved as a special case for
union-only operands to ensure compatibility with previous union
planning behavior.
Tests:
* Added parser and analyzer tests.
* Ensured no test failures or plan changes for union tests.
* Added TPC-DS queries 14,38,87 to functional and planner tests.
* Added functional tests test_intersect and test_except.
* Added new planner test testSetOperationStmt.
Change-Id: I5be46f824217218146ad48b30767af0fc7edbc0f
Reviewed-on: http://gerrit.cloudera.org:8080/16123
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Implements the grouping() and grouping_id() builtins.
grouping_id() has both a no-arg version, which returns
a bit vector over all grouping exprs, and a varargs version,
which returns a bit vector over the provided arguments.
GROUPING is a keyword, so it needs special handling in the
parser to be accepted as a function name.
These functions are implemented in the transpose agg
with a CASE expression similar to other aggregate functions,
but returning the grouping() or grouping_id() value for that
aggregation class instead of an aggregated value.
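The bit-vector semantics can be sketched in Python (this mirrors the
standard grouping_id() definition, not Impala's internals):
def grouping_id(grouping_bits):
    # grouping_bits[i] is 1 when grouping expr i is rolled up to NULL
    # in this aggregation class, 0 when it holds a real grouping value.
    gid = 0
    for bit in grouping_bits:
        gid = (gid << 1) | bit
    return gid

assert grouping_id([0, 1]) == 1  # second expr rolled up
assert grouping_id([1, 1]) == 3  # grand-total row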
Testing:
* Added parser test for grouping keyword.
* Added analysis tests for the functions.
* Added basic planner test to show expressions generated
* Added some TPC-DS queries that use grouping() - queries
80, 70 and 86 using reference .test files from Fang-Yu
Rao. 27 and 36 were added with reference results from
https://github.com/cwida/tpcds-result-reproduction
* Add targeted end-to-end tests.
* Added view compatibility test with Hive.
Change-Id: If0b1640d606256c0fe9204d2a21a8f6d06abcdb6
Reviewed-on: http://gerrit.cloudera.org:8080/16140
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Integrates the parsing and analysis with plan generation.
Testing:
* Add analysis test to make sure we reject unsupported queries.
* Added targeted planner tests to ensure we generate the correct
aggregation classes for a variety of cases.
* Add targeted end-to-end functional tests.
Added five TPC-DS queries that use ROLLUP, building on some work done
by Fang-Yu Rao. Some tweaks were required for these tests.
* Add an extra ORDER BY clause to q77 to make it fully deterministic.
* Add backticks around `returns` to avoid reserved word.
* Add INTERVAL keyword to date/timestamp arithmetic.
We can run q80, too, but I haven't added or verified results yet -
that can be done in a follow-up.
Change-Id: Ie454c5bf7aee266321dee615548d7f2b71380197
Reviewed-on: http://gerrit.cloudera.org:8080/16128
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.
The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expressions of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
for each left input row, meaning that for every row
in the original query before the rewrite, we get
the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
the right side of the LEFT OUTER JOIN without affecting
semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
or <operator> (<subquery>) becomes <operator> <slot>.
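As a hedged illustration of the transformation (table, column, and
alias names are invented for the example):
# The disjunct containing the subquery ...
original = """
SELECT * FROM t
WHERE t.x IN (SELECT s.y FROM s WHERE s.k = t.k) OR t.z > 10
"""
# ... conceptually becomes a many-to-one LEFT OUTER JOIN against the
# distinct subquery result, with the IN replaced by an IS NOT NULL
# check on the joined slot:
rewritten = """
SELECT t.* FROM t
LEFT OUTER JOIN (SELECT DISTINCT y, k FROM s) sq
  ON sq.y = t.x AND sq.k = t.k
WHERE sq.y IS NOT NULL OR t.z > 10
"""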
This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.
Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
should generalize to multiple subqueries but other code needs
refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
  approach can generalize to them, but we need to add or pick a
  select list item from the subquery to check for NULL/IS NOT NULL,
  and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates plus GROUP BY are not supported, because
  we rely on adding DISTINCT to the select list and we don't
  support DISTINCT combined with aggregations (IMPALA-5098).
Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
  return correct results. These exercise a mix of the supported
  features - correlated/uncorrelated subqueries, aggregate functions,
  EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
of being rewritten during analysis.
Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Extend StmtRewriter with the ability to rewrite scalar subqueries in
the select list into cross joins. Currently the subquery must pass
plan-time checks proving that it returns a single row, which may miss
cases that would be valid at runtime or under more complex evaluation
of the predicate expressions in the planner. Support for correlated
subqueries will be a follow-on change.
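A hedged illustration of the rewrite (names invented):
# An uncorrelated scalar subquery in the select list ...
original = "SELECT t.x, (SELECT max(y) FROM s) FROM t"
# ... becomes a cross join against the single-row subquery result:
rewritten = "SELECT t.x, sq.m FROM t CROSS JOIN (SELECT max(y) AS m FROM s) sq"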
Testing:
* Added new analyzer tests, updated previous subquery tests
* test_queries.py::TestQueries::test_subquery
* Added test_tpcds_q9 to e2e and planner tests
Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch adds the following 12 TPCDS queries to the class of
TestTpcdsDecimalV2Query: Q26, Q30, Q31, Q47, Q48, Q57, Q58, Q59, Q63,
Q83, Q85, and Q89. All the queries except Q31 are added to the class
of TestTpcdsQuery as well; Q31 is excluded because Impala returns one
fewer row than expected for TestTpcdsQuery::test_tpcds_q31(), which
requires further investigation.
To verify whether the result set returned by Impala for a given query
is correct, we compare it with the result set produced by HiveServer2
(HS2) in Impala's mini-cluster. SQL statements can be executed in HS2
via Beeline, HS2's command-line shell, which can be launched by the
following command.
beeline -u "jdbc:hive2://localhost:11050/default"
We note that among these 12 queries, executing Q31, Q58, and Q83
results in a "Counters limit exceeded" error from Tez. To work around
this problem, for these 3 queries we have to execute the following
statement before running them to raise the default limit on the
number of counters, which is 120.
set tez.counters.max=1200
In addition, the 'reason' table is referenced by Q85. This table was
not referenced by any TPCDS query before this patch and thus had not
been created. This patch therefore also modifies
tpcds_schema_template.sql to create this additional table along with
its data.
Testing:
- Verified that this patch passes the exhaustive tests in the DEBUG
build.
Change-Id: Ib5f260e75a3803aabe9ccef271ba94036f96e5cf
Reviewed-on: http://gerrit.cloudera.org:8080/16119
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Problems with perf queries (run-workload.py):
- TPCH picks up stress test specific queries (TPCH-AGG1/2/3)
- TPCDS picks up queries that were intended just to validate that data
was loaded properly but that aren't interesting from a perf
perspective (TPCDS-COUNT-<table>)
- TPCDS picks up both decimal_v1 and decimal_v2 queries. This is
  mostly harmless, as for queries with matching names only one gets
  run, but it causes some queries with mismatched names to be run
  twice (TPCDS-Q39-1/2 vs. TPCDS-Q39.1/2)
Problems with stress queries (concurrent_select.py):
- TPCDS fails to pick up Q22A as it does not use the decimal_v2
queries, even though decimal_v2 is the default now.
This problem is exacerbated by the fact that the two scripts have
different code paths for selecting the queries, so in the past changes
that were made to one path were not always made to the other.
This patch merges the two paths to reduce code duplication and prevent
these sorts of issues in the future, and fixes the above issues.
One complication is that historically the stress test has used query
names in the form 'q1' whereas the perf test has used query names in
the form 'TPCH-Q1'. This patch standardizes on using 'TPCH-Q1'.
Testing:
- Added a test that checks that the perf tests pick up the expected
number of queries.
- Manually ran the scripts and verified that the correct queries are
selected.
Change-Id: Id1966d6ca8babdda07d47e089b75ba06d0318c0d
Reviewed-on: http://gerrit.cloudera.org:8080/12503
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A recent commit (IMPALA-6964) broke the stress test because it added
an import of a generated thrift value to a Python file that is
included by the stress test. The stress test is intended to be
runnable without doing a full build of Impala, but in that case the
generated thrift code isn't available, leading to an import error.
The solution is to import the thrift value only in the function where
it is used, which is not called by the stress test.
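The fix pattern, sketched (the module and symbol names are
placeholders):
# A top-level 'from GeneratedModule.ttypes import TSomeValue' would
# fail whenever the generated thrift code is absent. Deferring it
# means the import only runs when this function is called, which the
# stress test never does:
def function_that_needs_the_value():
    from GeneratedModule.ttypes import TSomeValue
    return TSomeValue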
Testing:
- Ran the stress test manually without doing a full build and
confirmed that it works now.
Change-Id: I7a3bd26d743ef6603fabf92f904feb4677001da5
Reviewed-on: http://gerrit.cloudera.org:8080/12472
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds the following new stats:
* ParquetCompressedPageSize - a summary (average, min, max) counter that
tracks the size of compressed pages read, if no compressed pages are
read then this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size
  of uncompressed pages read; it is updated in two places: (1) when a
  compressed page is decompressed, and (2) when a page that is not
  compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks the
amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks
the amount of uncompressed data read per column for a scan node
The PerColumn counters are calculated by aggregating the number of bytes
read for each column across all scan ranges processed by a scan node.
Each sample in the counter is the size of a single column.
Here is an example of what the updated HDFS scan profile looks like:
- ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ;
Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
- ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ;
Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ;
Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950)
; Max: 5.22 KB (5349) ; Number of samples: 102)
Testing:
* Added new tests to test_scanners.py that do some basic validation of
the new counters above
Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Reviewed-on: http://gerrit.cloudera.org:8080/11575
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-110 added support for queries with multiple DISTINCT aggregates
in a single select list. This patch adds queries to test this
functionality to our targeted-perf workloads and fixes some incorrect
return types in another targeted-perf aggregation query.
It also adds some targeted queries to the stress test by extending the
regex for stress test files to accept files of the form
'tpch-stress-*' and to allow for multiple tests per file.
Testing:
- Added an e2e test that runs the stress test file.
Change-Id: I400aaf6b6620b4001895eafff785956bffb312c9
Reviewed-on: http://gerrit.cloudera.org:8080/11805
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
- Remove Fabric and Paramiko as requirements. They aren't needed by
anything in buildall.sh.
- Add a means to install into the impala-python virtual environment by hand.
impala-pip is fine for this.
- Add another requirements file for extended testing. The dependency
situation is messy and untangling that out of impala-python and into
lib/python should be out of the scope of IMPALA-7460.
- Update core tests, which cover real regressions that have happened in
the past, to run against locations that don't require a Paramiko
import. This moves some logic out of concurrent_select.py into a
thinner module.
- Insulate ssh_util from globally-scoped imports so that it is only
  imported when needed.
Testing:
- This works in my development environment.
- This works in my downstream stress and query gen environments.
- This works when doing a full data load.
- Impala still builds on a variety of OSs.
Todo:
- A subsequent review will update the versions.
Change-Id: Ibf9010a0387b52c95b7bda5d1d4606eba1008b65
Reviewed-on: http://gerrit.cloudera.org:8080/11264
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The stress test never expected to see memory estimates on the order of
PB. Apparently this can happen with TPC-DS 10000, so update the pattern.
It's not clear how to quickly write a test to catch this, because it
involves crossing language boundaries and possibly having a
massively-scaled dataset. I think leaving a comment in both places is
good enough for now.
Change-Id: I317c271888584ed2a817ee52ad70267eae64d341
Reviewed-on: http://gerrit.cloudera.org:8080/9846
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
The stress test never expected to see memory estimates on the order of
PB. Apparently this can happen with TPC-DS 10000, so update the pattern.
It's not clear how to quickly write a test to catch this, because it
involves crossing language boundaries and possibly having a
massively-scaled dataset. I think leaving a comment in both places is
good enough for now.
Change-Id: I08976f261582b379696fd0e81bc060577e552309
- Add support for insert, upsert, update and delete statements.
- Add support for compute stats with mt_dop query options.
- Update impyla version in order to be able to have access to query
error text for DML queries.
- Made flake8 fixes. flake8 on this file is clean.
For every Kudu table in the databases, we make a copy with an
'_original' suffix added to the table name. The DML queries only make
modifications to the non-original table; the original table is never
modified. The original tables can be used to bring the non-original
tables back to their initial state. Two flags were added for doing this:
--reset-databases-before-binary-search and
--reset-databases-after-binary-search.
The DML queries are generated based on the mod values passed in with
the following flag: --dml-mod-values 11 13 17. For each mod value, 4
DML queries are generated. The DML operations touch the table rows
where primary_key % mod_value = 0, so the larger the mod value, the
fewer rows are affected. The DML queries are generated in such a way
that the data for the insert, upsert, and update queries is taken from
the table with the _original suffix. The stress test generates DML
queries only for Kudu databases. For example, --tpch-kudu-db=tpch_100_kudu
--tpch-db=tpch_100 --generate-dml-queries would only generate queries
for the tpch_100_kudu database.
Here's an example of a full call with the new options that runs the
stress test on the local mini cluster:
./concurrent_select.py \
--tpch-kudu-db=tpch_kudu \
--generate-dml-queries \
--dml-mod-values 11 13 17 \
--generate-compute-stats-queries \
--select-probability=0.5 \
--mem-limit-padding-pct=25 \
--mem-limit-padding-abs=50 \
--reset-databases-before-binary-search \
--reset-databases-after-binary-search
Change-Id: Ia2aafdc6851cc0e1677a3c668d3350e47c4bfe40
Reviewed-on: http://gerrit.cloudera.org:8080/5093
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The major changes are:
1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
remote or local cluster. This also moves and consolidates some
Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That code was getting
messy.
Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins