Commit Graph

11 Commits

Author SHA1 Message Date
Thomas Tauber-Marshall
f0a47ab2ca IMPALA-8199: Fix stress test: 'No module named RuntimeProfile.ttypes'
A recent commit (IMPALA-6964) broke the stress test because it added
an import of a generated thrift value to a Python file that is
included by the stress test. The stress test is intended to be runnable
without doing a full build of Impala, but in that case the generated
thrift isn't available, leading to an import error.

The solution is to import the thrift value only inside the function
where it is used; that function is not called by the stress test.
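
A minimal sketch of that deferred-import pattern (the function name and
the use of TRuntimeProfileTree are illustrative, not the exact
identifiers from the patch):

def summarize_profile(profile_blob):
    # Importing the generated thrift module inside the function, rather than
    # at module scope, lets the stress test import this file even when the
    # thrift bindings have not been built.
    from RuntimeProfile.ttypes import TRuntimeProfileTree  # needs a built Impala

    tree = TRuntimeProfileTree()
    # ... deserialize profile_blob into tree and summarize it ...
    return tree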

Testing:
- Ran the stress test manually without doing a full build and
  confirmed that it works now.

Change-Id: I7a3bd26d743ef6603fabf92f904feb4677001da5
Reviewed-on: http://gerrit.cloudera.org:8080/12472
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-15 02:15:40 +00:00
stakiar
8da44ce16b IMPALA-6964: Track stats about column and page sizes in Parquet reader
Adds the following new stats:

* ParquetCompressedPageSize - a summary (average, min, max) counter that
tracks the size of compressed pages read; if no compressed pages are
read, this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size
of uncompressed pages read; it is updated in two places: (1) when a
compressed page is decompressed, and (2) when a page that is not
compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks the
amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks
the amount of uncompressed data read per column for a scan node

The PerColumn counters are calculated by aggregating the number of bytes
read for each column across all scan ranges processed by a scan node.
Each sample in the counter is the size of a single column.
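
The counters themselves live in Impala's C++ runtime profile; the
Python sketch below only illustrates what a summary counter
accumulates, and every name in it is assumed for illustration:

class SummaryCounter:
    """Tracks average, min, max and number of samples, like the counters above."""

    def __init__(self):
        self.num_samples = 0
        self.total = 0
        self.min_value = None
        self.max_value = None

    def update(self, value):
        # One call per sample, e.g. one page size in bytes or the total
        # bytes read for one column.
        self.num_samples += 1
        self.total += value
        self.min_value = value if self.min_value is None else min(self.min_value, value)
        self.max_value = value if self.max_value is None else max(self.max_value, value)

    def average(self):
        return self.total / self.num_samples if self.num_samples else 0

# For the PerColumn counters: sum the bytes read for each column across all
# scan ranges of a scan node, then add one sample per column.
bytes_read_per_column = {"col_a": 233018, "col_b": 230540}  # made-up sizes
per_column = SummaryCounter()
for num_bytes in bytes_read_per_column.values():
    per_column.update(num_bytes)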

Here is an example of what the updated HDFS scan profile looks like:

- ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ; Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
- ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ; Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ; Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ; Max: 5.22 KB (5349) ; Number of samples: 102)

Testing:
* Added new tests to test_scanners.py that do some basic validation of
the new counters above

Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Reviewed-on: http://gerrit.cloudera.org:8080/11575
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-17 03:06:51 +00:00
Thomas Tauber-Marshall
cd26e807f1 IMPALA-7761: Add multiple DISTINCT to targeted perf and stress test
IMPALA-110 added support for queries with multiple DISTINCT aggregates
in a single select list. This patch adds queries to test this
functionality to our targeted-perf workloads and fixes some incorrect
return types in another targeted-perf aggregation query.

It also adds some targeted queries to the stress test by extending the
regex for stress test files to accept files of the form
'tpch-stress-*' and to allow for multiple tests per file.
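
A rough guess at the shape of such a pattern (the real regex in the
stress test is not reproduced here, and the file names below are
assumptions):

import re

# Accept both the existing per-query files and the new tpch-stress-* files.
STRESS_QUERY_FILE_RE = re.compile(r"tpch-(q\d+|stress-.+)\.test$")

assert STRESS_QUERY_FILE_RE.search("tpch-stress-multi-distinct.test")
assert STRESS_QUERY_FILE_RE.search("tpch-q1.test")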

Testing:
- Added an e2e test that runs the stress test file.

Change-Id: I400aaf6b6620b4001895eafff785956bffb312c9
Reviewed-on: http://gerrit.cloudera.org:8080/11805
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-13 23:25:02 +00:00
Michael Brown
971cf179f6 IMPALA-7460 part 1: require user to install Paramiko and Fabric
- Remove Fabric and Paramiko as requirements. They aren't needed by
  anything in buildall.sh.
- Add a means to install these packages into the impala-python virtual
  environment by hand; impala-pip is fine for this.
- Add another requirements file for extended testing. The dependency
  situation is messy and untangling that out of impala-python and into
  lib/python should be out of the scope of IMPALA-7460.
- Update core tests, which cover real regressions that have happened in
  the past, to run against locations that don't require a Paramiko
  import. This moves some logic out of concurrent_select.py into a
  thinner module.
- Insulate ssh_util from globally-scoped import so that it only imports
  when needed.
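
A minimal sketch of that kind of insulation (the structure of ssh_util
and its callers is assumed here, not copied from the patch):

def run_ssh_command(host, command):
    try:
        # The import happens only when SSH functionality is actually used,
        # so merely importing this module does not require Paramiko.
        import paramiko
    except ImportError:
        raise RuntimeError(
            "Paramiko is required for SSH-based features; install it into "
            "impala-python by hand.")
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host)
    return client.exec_command(command)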

Testing:
- This works in my development environment.
- This works in my downstream stress and query gen environments.
- This works when doing a full data load.
- Impala still builds on a variety of OSs.

Todo:
- A subsequent review will update the versions.

Change-Id: Ibf9010a0387b52c95b7bda5d1d4606eba1008b65
Reviewed-on: http://gerrit.cloudera.org:8080/11264
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-23 00:20:15 +00:00
Bikramjeet Vig
e8a669bf91 IMPALA-7279: Fix flakiness in test_rows_availability
This patch fixes a flaky time string parsing method in
test_rows_availability that fails on strings with microsecond precision.
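
The exact strings the test parses are not shown in this message; the
sketch below only illustrates the general fix of accepting time strings
with or without a fractional-seconds component:

from datetime import datetime

def parse_time(text):
    # Try the higher-precision format first, then fall back; the formats
    # here are assumptions, not the test's actual ones.
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized time string: %r" % text)

parse_time("2018-07-11 18:02:59.123456")  # microsecond precision
parse_time("2018-07-11 18:02:59")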

Change-Id: If7634869823d8cc4059048dd5d3c3a984744f3be
Reviewed-on: http://gerrit.cloudera.org:8080/10922
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-12 02:42:00 +00:00
Michael Brown
4028e9c5ec IMPALA-6759: align stress test memory estimation parse pattern
The stress test never expected to see memory estimates on the order of
PB. Apparently it can happen with TPC-DS 10000, so update the pattern.
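
Not the stress test's actual pattern, but a sketch of the shape of the
fix: the unit group has to include PB alongside the smaller units (the
explain line in the example is also illustrative):

import re

MEM_ESTIMATE_RE = re.compile(r"Memory=(\d+(?:\.\d+)?)(PB|TB|GB|MB|KB|B)")

SCALE = {"B": 1, "KB": 2 ** 10, "MB": 2 ** 20, "GB": 2 ** 30,
         "TB": 2 ** 40, "PB": 2 ** 50}

def parse_mem_estimate(explain_line):
    # Returns the estimate in bytes, or None if the line has no estimate.
    match = MEM_ESTIMATE_RE.search(explain_line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * SCALE[unit]

parse_mem_estimate("Per-Host Resource Estimates: Memory=1.23PB")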

It's not clear how to quickly write a test to catch this, because it
involves crossing language boundaries and possibly having a
massively-scaled dataset. I think leaving a comment in both places is
good enough for now.

Change-Id: I317c271888584ed2a817ee52ad70267eae64d341
Reviewed-on: http://gerrit.cloudera.org:8080/9846
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-29 03:27:25 +00:00
Michael Brown
2c0926e2de Revert "IMPALA-6759: align stress test memory estimation parse pattern"
This reverts commit 2521848753.
2018-03-28 15:28:48 -07:00
Michael Brown
2521848753 IMPALA-6759: align stress test memory estimation parse pattern
The stress test never expected to see memory estimates on the order of
PB. Apparently it can happen with TPC-DS 10000, so update the pattern.

It's not clear how to quickly write a test to catch this, because it
involves crossing language boundaries and possibly having a
massively-scaled dataset. I think leaving a comment in both places is
good enough for now.

Change-Id: I08976f261582b379696fd0e81bc060577e552309
2018-03-28 15:27:10 -07:00
Taras Bobrovytsky
2159beee89 IMPALA-4467: Add support for DML statements in stress test
- Add support for insert, upsert, update and delete statements.
- Add support for compute stats with mt_dop query options.
- Update impyla version in order to be able to have access to query
  error text for DML queries.
- Made flake8 fixes. flake8 on this file is clean.

For every Kudu table in the databases, we make a copy and add a
'_original' suffix to the table name. The DML queries only modify the
non-original table; the original table is never modified. The original
tables can be used to bring the non-original table back to its initial
state. Two flags were added for doing this:
--reset-databases-before-binary-search and
--reset-databases-after-binary-search.

The DML queries are generated based on the mod values passed in with the
following flag: --dml-mod-values 11 13 17. For each mod value, 4 DML
queries are generated. The DML operations will touch table rows where
primary_key % mod_value = 0. So, the larger the mod value, the more rows
would be affected. The DML queries are generated in such a way that the
data for the insert, upsert, and update queries is taken from the table
with the _original suffix. The stress test generates DML queries only
for Kudu databases. For example, --tpch-kudu-db=tpch_100_kudu
--tpch-db=tpch_100 --generate-dml-queries would only generate queries
for the tpch_100_kudu database.
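
A sketch of what one set of generated statements could look like for a
single mod value; the generator below and the table and column names
are illustrative, not the stress test's actual code:

def generate_dml_queries(table, primary_key, update_col, mod_value):
    original = table + "_original"  # pristine copy, never modified
    where = "{pk} % {m} = 0".format(pk=primary_key, m=mod_value)
    return [
        # INSERT and UPSERT take their data from the untouched _original copy.
        "INSERT INTO {t} SELECT * FROM {o} WHERE {w}".format(
            t=table, o=original, w=where),
        "UPSERT INTO {t} SELECT * FROM {o} WHERE {w}".format(
            t=table, o=original, w=where),
        "UPDATE {t} SET {c} = concat({c}, '_u') WHERE {w}".format(
            t=table, c=update_col, w=where),
        "DELETE FROM {t} WHERE {w}".format(t=table, w=where),
    ]

# Four DML statements against a Kudu copy of lineitem for mod value 11:
queries = generate_dml_queries("lineitem", "l_orderkey", "l_comment", 11)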

Here's an example of a full call with the new options that runs the
stress test on the local mini cluster:
./concurrent_select.py \
    --tpch-kudu-db=tpch_kudu \
    --generate-dml-queries \
    --dml-mod-values 11 13 17 \
    --generate-compute-stats-queries \
    --select-probability=0.5 \
    --mem-limit-padding-pct=25 \
    --mem-limit-padding-abs=50 \
    --reset-databases-before-binary-search \
    --reset-databases-after-binary-search

Change-Id: Ia2aafdc6851cc0e1677a3c668d3350e47c4bfe40
Reviewed-on: http://gerrit.cloudera.org:8080/5093
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2016-12-20 01:33:01 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Casey Ching
f288867833 Stress test: Various changes
The major changes are:

1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
   random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
   remote or local cluster. This also moves and consolidates some
   Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That stuff was getting
   messy.

Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-01-20 23:00:25 +00:00