mirror of
https://github.com/apache/impala.git
This patch reduces the memory usage of scanners by adjusting how batch capacity is checked and handled, and by freeing unneeded memory.

Change RowBatch::AtCapacity(MemPool) so that batches with no rows cannot hold onto an unbounded amount of memory: instead, these batches are passed up the operator tree so that their resources can be freed.

The Parquet scanner also only checked capacity every 1024 rows. With large rows (e.g. nested collections), it could overrun the intended 8MB limit. It also did not include MemPool usage in its checks. After this change the scanner produces smaller batches if rows contain large nested collections or strings. I benchmarked this with a scan of the nested TPC-H customer table: the row batch size decreased from ~16MB to ~8MB. If the nested collections were larger, the effect would be more drastic.

Also propagate at-capacity up the tree if no rows passed the conjuncts in the DataSourceScanNode and the Parquet scanner, so that resources can be freed.

HdfsTableSink is modified to avoid the incorrect assumption that a batch has 0 rows only at eos. It is also refactored to pass the related flag as an argument to make the semantics clearer.

Two simple benchmarks (one column and many columns) show no change in scanner performance:

> set num_scanner_threads=1;
> select count(l_orderkey) from biglineitem;
> select count(l_orderkey), count(l_partkey), count(l_suppkey),
>     count(l_returnflag), count(l_quantity), count(l_linenumber),
>     count(l_extendedprice), count(l_linestatus), count(l_shipdate),
>     count(l_commitdate) from biglineitem;

Change-Id: I3b79671ffd3af50a2dc20c643b06cc353ba13503
Reviewed-on: http://gerrit.cloudera.org:8080/1239
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins