impala

mirror of https://github.com/apache/impala.git synced 2026-01-01 18:00:30 -05:00

Files

Skye Wanderman-Milne 9147cd7518 IMPALA-525: Adjust IO buffer size based on read length and other memory fixes

We were previously wasting memory by always reading into 8MB IO
buffers, even when the data read was much less than 8MB. With this
patch, the IO manager picks a buffer size closer to the actual amount
being read (we don't use the exact size so we can continue to recycle
buffers). The minimum IO buffer size is determined via the
--min_buffer_size flag, and the max IO buffer size via the --read_size
flag.

This technique also helps with IMPALA-652, since short columns will
not use as much memory as before (we will not use considerably more
memory than the size of the table).

This patch also changes StringBuffer to use a doubling strategy so it
doesn't end up allocating many large unused buffers, and has the
scanner context use the requested length as the sync read size if it's
larger than the size produced by read_past_size_cb(). These changes
help prevent the boundary buffer in the scanner context from
allocating excess memory.

Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/938
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins

2014-01-08 10:54:01 -08:00

functional

IMPALA-525: Adjust IO buffer size based on read length and other memory fixes

2014-01-08 10:54:01 -08:00

hive-benchmark

Parquet writer.

2014-01-08 10:48:44 -08:00

tpcds

Don't load tpcds.store_sales_unpartitioned into any file format except text.

2014-01-08 10:52:43 -08:00

tpch

Refactor testing framework to generate Avro tables.

2014-01-08 10:48:45 -08:00

README

Move functional data loading to new framework + initial changes for workload directory structure

2014-01-08 10:44:18 -08:00

README

This directory contains Impala test data sets. The directory layout is structured as follows:

datasets/
   <data set>/<data set>_schema_template.sql
   <data set>/<data files SF1>/data files
   <data set>/<data files SF2>/data files

Where SF is the scale factor controlling data size. This allows for scaling the same schema to
different sizes based on the target test environment.