impala

mirror of https://github.com/apache/impala.git synced 2026-02-03 09:00:39 -05:00

Author	SHA1	Message	Date
Daniel Becker	6b47c40e0d	IMPALA-12159: Support ORDER BY for collections of variable length types in select list IMPALA-12019 implemented support for collections of fixed length types in the sorting tuple. This change implements it for collections of variable length types. Note that the limitation that structs that contain any type of collection are not allowed in the sorting tuple is still in place (see IMPALA-12160). Note that it was not and still is not allowed to sort by complex types, this change only allows them to be present in the select list when sortin by some other expression. This change also allows collections of variable length types to be non-passthrough children of UNION ALL nodes. Testing: - Renamed the 'simple_arrays_big' table to 'arrays_big' and extended it with collections containing variable length types. This table is mainly used to test that spilling works during sorting. - Renamed test_sort.py::TestArraySort::{test_simple_arrays, test_simple_arrays_with_limit} to {test_array_sort,test_array_sort_with_limit} - Extended the tests run in test_queries.py::TestQueries::{test_sort, test_top_n,test_partitioned_top_n} with collections containing var-len types. - Added tests in sort-complex.test that assert that it is not allowed to sort by collections. For structs we already have such tests in struct-in-select-list.test. Change-Id: Ic15b29393f260b572e11a8dbb9deeb8c02981852 Reviewed-on: http://gerrit.cloudera.org:8080/20108 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-12-06 22:09:05 +00:00
Daniel Becker	ff3d0c7984	IMPALA-12019: Support ORDER BY for arrays of fixed length types in select list As a first stage of IMPALA-10939, this change implements support for including in the sorting tuple top-level collections that only contain fixed length types (including fixed length structs). For these types the implementation is almost the same as the existing handling of strings. Another limitation is that structs that contain any type of collection are not yet allowed in the sorting tuple. Also refactored the RawValue::Write*() functions to have a clearer interface. Testing: - Added a new test table that contains many rows with arrays. This is queried in a new test added in test_sort.py, to ensure that we handle spilling correctly. - Added tests that have arrays and/or maps in the sorting tuple in test_queries.py::TestQueries::{test_sort, test_top_n,test_partitioned_top_n}. Change-Id: Ic7974ef392c1412e8c60231e3420367bd189677a Reviewed-on: http://gerrit.cloudera.org:8080/19660 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-05-18 09:56:55 +00:00
Gabor Kaszab	df528fe2b1	IMPALA-10920: Zipping unnest for arrays This patch provides an unnest implementation for arrays where unnesting multiple arrays in one query results the items of the arrays being zipped together instead of joining. There are two different syntaxes introduced for this purpose: 1: ISO:SQL 2016 compliant syntax: SELECT a1.item, a2.item FROM complextypes_arrays t, UNNEST(t.arr1, t.arr2) AS (a1, a2); 2: Postgres compatible syntax: SELECT UNNEST(arr1), UNNEST(arr2) FROM complextypes_arrays; Let me show the expected behaviour through the following example: Inputs: arr1: {1,2,3}, arr2: {11, 12} After running any of the above queries we expect the following output: =============== \| arr1 \| arr2 \| =============== \| 1 \| 11 \| \| 2 \| 12 \| \| 3 \| NULL \| =============== Expected behaviour: - When unnesting multiple arrays with zipping unnest then the 'i'th item of one array will be put next to the 'i'th item of the other arrays in the results. - In case the size of the arrays is not the same then the shorter arrays will be filled with NULL values up to the size of the longest array. On a sidenote, UNNEST is added to Impala's SQL language as a new keyword. This might interfere with use cases where a resource (db, table, column, etc.) is named "UNNEST". Restrictions: - It is not allowed to have WHERE filters on an unnested item of an array in the same SELECT query. E.g. this is not allowed: SELECT arr1.item FROM complextypes_arrays t, UNNEST(t.arr1) WHERE arr1.item < 5; Note, that it is allowed to have an outer SELECT around the one doing unnests and have a filter there on the unnested items. - If there is an outer SELECT filtering on the unnested array's items from the inner SELECT then these predicates won't be pushed down to the SCAN node. They are rather evaluated in the UNNEST node to guarantee result correctness after unnesting. Note, this restriction is only active when there are multiple arrays being unnested, or in other words when zipping unnest logic is required to produce results. - It's not allowed to do a zipping and a (traditional) joining unnest together in one SELECT query. - It's not allowed to perform zipping unnests on arrays from different tables. Testing: - Added a bunch of E2E tests to the test suite to cover both syntaxes. - Did a manual test run on a table with 1000 rows, 3 array columns with size of around 5000 items in each array. I did an unnest on all three arrays in one query to see if there are any crashes or suspicious slowness when running on this scale. Change-Id: Ic58ff6579ecff03962e7a8698edfbe0684ce6cf7 Reviewed-on: http://gerrit.cloudera.org:8080/17983 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-11-23 07:03:10 +00:00
Gabor Kaszab	1e21aa6b96	IMPALA-9495: Support struct in select list for ORC tables This patch implements the functionality to allow structs in the select list of inline views, topmost blocks. When displaying the value of a struct it is formatted into a JSON value and returned as a string. An example of such a value: SELECT struct_col FROM some_table; '{"int_struct_member":12,"string_struct_member":"string value"}' Another example where we query a nested struct: SELECT outer_struct_col FROM some_table; '{"inner_struct":{"string_member":"string value","int_member":12}}' Note, the conversion from struct to JSON happens on the server side before sending out the value in HS2 to the client. However, HS2 is capable of handling struct values as well so in a later change we might want to add a functionality to send the struct in thrift to the client so that the client can use the struct directly. -- Internal representation of a struct: When scanning a struct the rowbatch will hold the values of the struct's children as if they were queried one by one directly in the select list. E.g. Taking the following table: CREATE TABLE tbl (id int, s struct<a:int,b:string>) STORED AS ORC And running the following query: SELECT id, s FROM tbl; After scanning a row in a row batch will hold the following values: (note the biggest size comes first) 1: The pointer for the string in s.b 2: The length for the string in s.b 3: The int value for s.a 4: The int value of id 5: A single null byte for all the slots: id, s, s.a, s.b The size of a struct has an effect on the order of the memory layout of a row batch. The struct size is calculated by summing the size of its fields and then the struct gets a place in the row batch to precede all smaller slots by size. Note, all the fields of a struct are consecutive to each other in the row batch. Inside a struct the order of the fields is also based on their size as it does in a regular case for primitives. When evaluating a struct as a SlotRef a newly introduced StructVal will be used to refer to the actual values of a struct in the row batch. This StructVal holds a vector of pointers where each pointer represents a member of the struct. Following the above example the StructVal would keep two pointers, one to point to an IntVal and one to point to a StringVal. -- Changes related to tuple and slot descriptors: When providing a struct in the select list there is going to be a SlotDescriptor for the struct slot in the topmost TupleDescriptor. Additionally, another TupleDesriptor is created to hold SlotDescriptors for each of the struct's children. The struct SlotDescriptor points to the newly introduced TupleDescriptor using 'itemTupleId'. The offsets for the children of the struct is calculated from the beginning of the topmost TupleDescriptor and not from the TupleDescriptor that directly holds the struct's children. The null indicator bytes as well are stored on the level of the topmost TupleDescriptor. -- Changes related to scalar expressions: A struct in the select list is translated into an expression tree where the top of this tree is a SlotRef for the struct itself and its children in the tree are SlotRefs for the members of the struct. When evaluating a struct SlotRef after the null checks the evaluation is delegated to the children SlotRefs. -- Restrictions: - Codegen support is not included in this patch. - Only ORC file format is supported by this patch. - Only HS2 client supports returning structs. Beeswax support is not implemented as it is going to be deprecated anyway. Currently we receive an error when trying to query a struct through Beeswax. -- Tests added: - The ORC and Parquet functional databases are extended with 3 new tables: 1: A small table with one level structs, holding different kind of primitive types as members. 2: A small table with 2 and 3 level nested structs. 3: A bigger, partitioned table constructed from alltypes where all the columns except the 'id' column are put into a struct. - struct-in-select-list.test and nested-struct-in-select-list.test uses these new tables to query structs directly or through an inline view. Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a Reviewed-on: http://gerrit.cloudera.org:8080/17638 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-09-14 21:21:47 +00:00
stiga-huang	9686545bfd	IMPALA-6503: Support reading complex types from ORC We've supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in io buffer read from DiskIoMgr). The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch). Then we transform values in orc::ColumnVectorBatch into impala::Tuples in hdfs-orc-scanner. We don't need to do anything about decoding/decompression since they are handled by the ORC lib. Fortunately, the ORC lib already supports complex types, we can still leverage it to support complex types. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need in the form required by the ORC lib (Get list of ORC type ids from tuple descriptors) 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala's representation (Slots/Tuples/RowBatches) To format the materialization, we implement several ORC column readers in hdfs-orc-scanner. Each kind of reader treats a column type and transforms outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. * Run exhaustive tests in DEBUG and RELEASE builds. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Reviewed-on: http://gerrit.cloudera.org:8080/12168 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-03-08 04:39:08 +00:00
Nathan Salmon	34353218ce	IMPALA-4675: Case-insensitive matching of Parquet fields. The query option PARQUET_FALLBACK_SCHEMA_RESOLUTION allows matching of Parquet fields by name instead of by index (the default). Parquet column names are case sensitive, but Impala treats db/table/column/field names as case-insensitive. Today, there is no way today to select Parquet columns with mixed casing via SQL using the name-based field resolution policy. This patch changes the matching of Parquet fields to be case-insensitive. Testing: - Modified the data files backing complextypestbl to contain fields with mixed casing. - Several existing tests run against this table, including the test for name-based resolution. - I confirmed that without this fix, the existing name-based resolution tests fail on the modified data files. - I locally ran test_scanners.py and test_nested_types.py on exhaustive with this fix. Change-Id: I87395f84ba29b4c3d8e41be1ea4e89e500b8a9f4 Reviewed-on: http://gerrit.cloudera.org:8080/5891 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-03-03 10:20:07 +00:00
Thomas Tauber-Marshall	b2c2fe7813	IMPALA-3786: Replace "cloudera" with "apache" (part 2) As part of the ASF transition, we need to replace references to Cloudera in Impala with references to Apache. This primarily means changing Java package names from com.cloudera.impala.* to org.apache.impala.* A prior patch renamed all the files as necessary, and this patch performs the actual code changes. Most of the changes in this patch were generated with some commands of the form: find . \| grep "\.java\\|\.py\\|\.h\\|\.cc" \| \ xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g along with some manual fixes. After this patch, the remaining references to Cloudera in the repo mostly fall into the categories: - External components that have cloudera in their own package names, eg. com.cloudera.kudu/llama - URLs, eg. https://repository.cloudera.com/ Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2 Reviewed-on: http://gerrit.cloudera.org:8080/3937 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 21:14:13 +00:00
Skye Wanderman-Milne	2cbd327d41	Regenerate complextypestbl files to include nested_struct.g field This field was included in the schema and data files, but the checked-in generated parquet files didn't include it. It's not referenced in any tests so we didn't catch it. Change-Id: I5d394f074e7082fa12fafb7e57a144a83b3099a6 Reviewed-on: http://gerrit.cloudera.org:8080/2562 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-04-01 05:06:38 +00:00
Skye Wanderman-Milne	bcc73a36da	Nested types: read and materialize nested types in Parquet scanner This patch modifies the Parquet scanner to resolve nested schemas, and read and materialize collection types. The high-level modification is to create a CollectionColumnReader that recursively materializes map- and array-type slots. This patch also adds many tests, most of which query a new table called complextypestbl. This table contains hand-generated data that is meant to expose edge cases in the scanner. The tests mostly test the scanner, with a few tests of other functionality (e.g. array serialization). I ran a local benchmark comparing this scanner code to the original scanner code on an expanded version of tpch_parquet.lineitem with 48009720 rows. My benchmark involved selecting different numbers of columns with a single scanner thread, and I looked at the HDFS scan node time in the query profiles. This code introduces a 10%-20% regression in single-threaded scan time. Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a Reviewed-on: http://gerrit.cloudera.org:8080/576 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-02 19:23:54 +00:00

9 Commits