Daniel Becker 6b47c40e0d IMPALA-12159: Support ORDER BY for collections of variable length types in select list
IMPALA-12019 implemented support for collections of fixed length types
in the sorting tuple. This change implements it for collections of
variable length types.

Note that the limitation that structs that contain any type of
collection are not allowed in the sorting tuple is still in place (see
IMPALA-12160).

Note that sorting by complex types was not and still is not allowed;
this change only allows them to be present in the select list when
sorting by some other expression.

This change also allows collections of variable length types to be
non-passthrough children of UNION ALL nodes.

Testing:
 - Renamed the 'simple_arrays_big' table to 'arrays_big' and extended it
   with collections containing variable length types. This table is
   mainly used to test that spilling works during sorting.
 - Renamed
   test_sort.py::TestArraySort::{test_simple_arrays,
   test_simple_arrays_with_limit}
   to {test_array_sort,test_array_sort_with_limit}
 - Extended the tests run in test_queries.py::TestQueries::{test_sort,
   test_top_n,test_partitioned_top_n} with collections containing
   var-len types.
 - Added tests in sort-complex.test that assert that it is not allowed
   to sort by collections. For structs we already have such tests in
   struct-in-select-list.test.

Change-Id: Ic15b29393f260b572e11a8dbb9deeb8c02981852
Reviewed-on: http://gerrit.cloudera.org:8080/20108
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-12-06 22:09:05 +00:00

The two Parquet files (nullable.parq and nonnullable_orc.parq) were generated
as described in testdata/data/schemas/nested/README.

The two ORC files (nullable.orc and nonnullable.orc) were generated with orc-tools,
which can convert JSON files into ORC format. However, nullable.json and
nonnullable.json must first be modified to match the input format it requires: the
file as a whole must not be a JSON array. Instead, it must contain one JSON object
per row, joined by '\n'. The commands below assume the modified JSON files are named
nullable_orc.json and nonnullable_orc.json.
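
The JSON reshaping described above can be sketched in Python. This is only an
illustration, not necessarily how the checked-in files were produced. It assumes the
source file holds a single top-level JSON array of row objects, and the helper name
json_array_to_lines is hypothetical.

```python
import json


def json_array_to_lines(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line.

    Assumption: src_path contains a single top-level JSON array whose
    elements are the row objects. Returns the number of rows written.
    """
    with open(src_path) as src:
        rows = json.load(src)
    if not isinstance(rows, list):
        raise ValueError("expected a top-level JSON array in " + src_path)
    # json.dumps never emits raw newlines by default, so each row stays on
    # exactly one line; rows are joined by '\n' as the text above describes.
    with open(dst_path, "w") as dst:
        dst.write("\n".join(json.dumps(row) for row in rows))
    return len(rows)
```

For example, json_array_to_lines("nullable.json", "nullable_orc.json") would produce
the input file expected by the first convert command, and the same call with
nonnullable.json and nonnullable_orc.json would produce the second.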

The ORC files can be regenerated by running the following commands in the current directory:

wget "https://search.maven.org/remotecontent?filepath=org/apache/orc/orc-tools/1.5.4/orc-tools-1.5.4-uber.jar" \
  -O orc-tools-1.5.4-uber.jar

java -jar orc-tools-1.5.4-uber.jar convert \
  -s "struct<id:bigint,int_array:array<int>,int_array_Array:array<array<int>>,int_map:map<string,int>,int_Map_Array:array<map<string,int>>,nested_struct:struct<A:int,b:array<int>,C:struct<d:array<array<struct<E:int,F:string>>>>,g:map<string,struct<H:struct<i:array<double>>>>>>" \
  -o nullable.orc \
  nullable_orc.json

java -jar orc-tools-1.5.4-uber.jar convert \
  -s "struct<ID:bigint,Int_Array:array<int>,int_array_array:array<array<int>>,Int_Map:map<string,int>,int_map_array:array<map<string,int>>,nested_Struct:struct<a:int,B:array<int>,c:struct<D:array<array<struct<e:int,f:string>>>>,G:map<string,struct<h:struct<i:array<double>>>>>>" \
  -o nonnullable.orc \
  nonnullable_orc.json