impala

jprdonnelly/impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Commit Graph

Author	SHA1	Message	Date

Riza Suminto

f1cca6c767

IMPALA-12135: Deflake test_krpc_datastream_sender_shuffle

test_krpc_datastream_sender_shuffle has been failing with an OOM error in
the HDFS EC environment. This is because the fix introduced in IMPALA-12106
reduced the number of instances of a union fragment (F06) from 3 to 2.
Running the test with BATCH_SIZE=8 helps the test pass while still holding
the assertion (KrpcDataStreamSender claiming megabytes of memory for
RowBatchSerialization).

Testing:
- test_krpc_datastream_sender_shuffle passes in both the regular
  minicluster and the HDFS EC setup.

Change-Id: I8c7961ad8dd489a4d62e738d364d4da1fa44d0cc
Reviewed-on: http://gerrit.cloudera.org:8080/20011
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>

2023-06-09 20:50:36 +00:00

Joe McDonnell

82bd087fb1

IMPALA-11973: Add absolute_import, division to all eligible Python files

This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
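
As an illustration of the two division behaviors described above, here is a
minimal sketch. The function names midpoint_index and average are
hypothetical and do not come from the Impala code base; they only show where
'//' is needed for an integer result and where true division is appropriate.

```python
# Makes Python 2 follow Python 3 semantics; a no-op on Python 3.
from __future__ import absolute_import, division, print_function


def midpoint_index(items):
    """Return the middle index of a sequence (hypothetical helper).

    Indices must be integers, so floor division '//' is used.
    """
    return len(items) // 2


def average(values):
    """Return the arithmetic mean (hypothetical helper).

    With 'division' imported, '/' is true division on Python 2 as well.
    """
    return sum(values) / len(values)


if __name__ == "__main__":
    records = ["a", "b", "c", "d", "e"]
    print(midpoint_index(records))
    print(average([1, 2]))
```

Without the division import, Python 2 would evaluate sum(values) / len(values)
as integer division for integer inputs, which is the kind of silent
difference this change is meant to flush out.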

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>

2023-03-09 17:17:57 +00:00

TheOmid

e327a28757

IMPALA-6684: Fix untracked memory in KRPC

During serialization of a row batch header, a tuple_data_ is created
which will hold the compressed tuple data for an outbound row batch.
We would like this tuple data to be trackable as it is responsible for
a significant portion of untrackable memory from the krpc data stream
sender. By using MemTrackerAllocator, we can allocate tuple data and
compression scratch and account for it in the memory tracker of the
KrpcDataStreamSender. This solution replaces the type for tuple data
and compression scratch from std::string to TrackedString, a
std::basic_string with MemTrackerAllocator as the custom allocator.

This patch adds memory estimation in DataStreamSink.java to account
for OutboundRowBatch memory allocation. This patch also removes the
thrift-based serialization because the thrift RPC has been removed
in the prior commit.

Testing:
 - Passed core tests.
 - Ran a single node benchmark which shows no regression.
 - Updated row-batch-serialize-test and row-batch-serialize-benchmark
   to test the row-batch serialization used by KRPC.
 - Manually collected query-profile, heap growth, and memory usage log
   showing untracked memory decreased by 1/2.
 - Add test_datastream_sender.py to verify the peak memory of EXCHANGE
   SENDER node.
 - Raise mem_limit in two of the test_spilling_large_rows test cases.
 - Print test line number in PlannerTestBase.java

New row-batch serialization benchmark:

Machine Info: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
serialize:            10%   50%   90%     10%     50%     90%
                                        (rel)   (rel)   (rel)
-------------------------------------------------------------
   ser_no_dups_base  18.6  18.8  18.9      1X      1X      1X
        ser_no_dups  18.5  18.5  18.8  0.998X  0.988X  0.991X
   ser_no_dups_full  14.7  14.8  14.8  0.793X   0.79X  0.783X

  ser_adj_dups_base  28.2  28.4  28.8      1X      1X      1X
       ser_adj_dups  68.9  69.1  69.8   2.44X   2.43X   2.43X
  ser_adj_dups_full  56.2  56.7  57.1   1.99X      2X   1.99X

      ser_dups_base  20.7  20.9  20.9      1X      1X      1X
           ser_dups  20.6  20.8  20.9  0.994X  0.995X      1X
      ser_dups_full  39.8    40  40.5   1.93X   1.92X   1.94X

deserialize:          10%   50%   90%     10%     50%     90%
                                        (rel)   (rel)   (rel)
-------------------------------------------------------------
 deser_no_dups_base  75.9  76.6    77      1X      1X      1X
      deser_no_dups  74.9  75.6    76  0.987X  0.987X  0.987X

deser_adj_dups_base   127   128   129      1X      1X      1X
     deser_adj_dups   179   193   195   1.41X   1.51X   1.51X

    deser_dups_base   128   128   129      1X      1X      1X
         deser_dups   165   190   193   1.29X   1.48X   1.49X

Change-Id: I2ba2b907ce4f275a7a1fb8cf75453c7003eb4b82
Reviewed-on: http://gerrit.cloudera.org:8080/18798
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2022-10-17 21:59:57 +00:00

3 Commits