mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
Previously, the only way to refresh metadata for a partition was to refresh
the whole table. This is relatively time-consuming, especially if
there are many partitions and only one needs to be refreshed.
This patch allows the client to refresh a single partition by using the
following syntax:
REFRESH [database_name.]table_name PARTITION (partition_spec)
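As an illustration of the syntax above, here is a minimal Python sketch that assembles such a statement from a partition spec. The helper name `build_refresh_stmt` is hypothetical and not part of Impala. The sketch assumes the partition spec is a comma-separated list of `key=value` pairs, and its string quoting is deliberately naive.

```python
def build_refresh_stmt(table, partition_spec=None, database=None):
    """Build a REFRESH statement string (hypothetical helper, not an Impala API)."""
    name = f"{database}.{table}" if database else table
    if not partition_spec:
        # No partition given: fall back to refreshing the whole table.
        return f"REFRESH {name}"
    parts = []
    for key, value in partition_spec.items():
        # Assumption: string values are single-quoted, numbers are left bare.
        if isinstance(value, str):
            literal = "'" + value.replace("'", "\\'") + "'"
        else:
            literal = str(value)
        parts.append(f"{key}={literal}")
    return f"REFRESH {name} PARTITION ({', '.join(parts)})"


print(build_refresh_stmt("sales", {"year": 2010, "month": 1}, database="db"))
# prints: REFRESH db.sales PARTITION (year=2010, month=1)
```

The table and column names are made up for the example. The resulting string would be submitted through whatever client is already in use.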
Testing:
Added parsing and authorization tests in ParserTest.java and
AuthorizationTest.java respectively. A new test file
"test_refresh_partition.py" was added for testing functionality.
Performance:
For a table with 10000 partitions and 1 file per partition:
                     execResetMetadata()    Total Execution Time
Refresh Table        3795 ms                4630 ms
Refresh Partition    42 ms                  680 ms
The time spent in execResetMetadata() improves by a factor of about 90, but
because of an overhead of about 640 ms outside that call in this case, the
end-to-end improvement is about 7x. As the size of the table and the number
of partitions increase, this improvement would be more significant.
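The ratios quoted above can be rechecked from the measured timings. This short Python sketch only redoes the arithmetic. The variable names are illustrative, and the numbers are copied from the performance figures in this message.

```python
# Timings in milliseconds, copied from the performance figures above.
refresh_table = {"exec_reset_metadata": 3795, "total": 4630}
refresh_partition = {"exec_reset_metadata": 42, "total": 680}

# Speedup of the metadata reset call alone.
catalog_speedup = (
    refresh_table["exec_reset_metadata"] / refresh_partition["exec_reset_metadata"]
)
# Speedup of the whole statement as seen by the client.
total_speedup = refresh_table["total"] / refresh_partition["total"]
# Time spent outside execResetMetadata() when refreshing one partition.
overhead_ms = refresh_partition["total"] - refresh_partition["exec_reset_metadata"]

print(f"execResetMetadata() speedup: {catalog_speedup:.1f}x")  # 90.4x
print(f"end-to-end speedup: {total_speedup:.1f}x")             # 6.8x
print(f"overhead for partition refresh: {overhead_ms} ms")     # 638 ms
```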
Change-Id: Ia9aa25d190ada367fbebaca47ae8b2cafbea16fb
Reviewed-on: http://gerrit.cloudera.org:8080/3813
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
658 B
# Manually created file.
file_format:text, dataset:tpch, compression_codec:none, compression_type:none
file_format:text, dataset:tpch, compression_codec:gzip, compression_type:block
file_format:seq, dataset:tpch, compression_codec:gzip, compression_type:block
file_format:seq, dataset:tpch, compression_codec:snap, compression_type:block
file_format:rc, dataset:tpch, compression_codec:none, compression_type:none
file_format:avro, dataset:tpch, compression_codec: none, compression_type: none
file_format:avro, dataset:tpch, compression_codec: snap, compression_type: block
file_format:parquet, dataset:tpch, compression_codec: none, compression_type: none
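Each non-comment entry in the file above is a comma-separated list of `key:value` pairs, with inconsistent spacing after the colons. The following Python sketch shows one way such entries could be read into dictionaries. The function name `parse_vector_line` is hypothetical, and this is not Impala's own loader.

```python
def parse_vector_line(line):
    """Parse one 'key:value, key:value' entry; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    vector = {}
    for field in line.split(","):
        # Split on the first colon only, then trim the stray spaces.
        key, _, value = field.partition(":")
        vector[key.strip()] = value.strip()
    return vector


sample = """# Manually created file.
file_format:text, dataset:tpch, compression_codec:none, compression_type:none
file_format:avro, dataset:tpch, compression_codec: snap, compression_type: block
"""

vectors = [v for v in map(parse_vector_line, sample.splitlines()) if v is not None]
for v in vectors:
    print(v["file_format"], v["compression_codec"], v["compression_type"])
```

Stripping whitespace around both keys and values means entries such as `compression_codec: snap` and `compression_codec:snap` parse to the same value.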