This patch adds support for BINARY columns for all table formats with
the exception of Kudu.
In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can only be cast to and from STRING.
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.
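As an illustration of the base64 point above, here is a minimal Python
sketch of how a BINARY value could be represented in a text file. The
helper names are hypothetical and are not part of Impala or Hive; the
sketch only shows why raw bytes (which may contain delimiters such as
a newline) are encoded before being written as text.

```python
import base64


def encode_binary_for_text(value: bytes) -> str:
    """Hypothetical helper: base64 text form of a BINARY value."""
    return base64.b64encode(value).decode("ascii")


def decode_binary_from_text(field: str) -> bytes:
    """Hypothetical helper: recover the raw bytes from a text field."""
    return base64.b64decode(field, validate=True)


# A value containing a NUL byte, a non-UTF-8 byte and a newline, which
# would break a newline-delimited text row if written raw.
raw = b"\x00\xff\n"
encoded = encode_binary_for_text(raw)
decoded = decode_binary_from_text(encoded)
```

The encoded field contains only ASCII characters, so it is safe to
place between field and row delimiters.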
As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of the backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows them to add
special handling for BINARY.
Only a few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can only be used by casting to STRING first.
Adding support for more of these functions is easy: the BINARY type
only has to be "connected" to the existing STRING function's
signature. Functions whose result depends on utf8_mode must always
behave as if utf8_mode=0 when given BINARY (for example, length() is
mapped to bytes(), because length() counts UTF-8 characters if
utf8_mode=1).
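The difference between counting characters and counting bytes can be
sketched in Python as follows. The two functions are hypothetical
stand-ins for the behavior of length() with utf8_mode=1 and of bytes();
they are not Impala code.

```python
def length_utf8(value: bytes) -> int:
    """Stand-in for length() with utf8_mode=1: counts UTF-8 characters."""
    return len(value.decode("utf-8"))


def length_bytes(value: bytes) -> int:
    """Stand-in for bytes(): counts raw bytes, valid for any byte array."""
    return len(value)


# "cafe" with an accented e: 4 characters, 5 bytes in UTF-8.
text = "caf\u00e9".encode("utf-8")

# An arbitrary byte array that is not valid UTF-8.
blob = b"\xff\xfe\x00"
```

For the arbitrary byte array, character counting is not even defined
(decoding it as UTF-8 raises an error), which is why a byte count is
the only meaningful length for BINARY.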
All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY.
For legacy Hive UDFs, BINARY is only supported if the argument and
return types are set explicitly, to ensure backward compatibility.
See IMPALA-11340 for details.
The original plan was to behave as closely to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, so
explicit STRING<->BINARY casts are necessary in more cases in Impala.
The stricter rules are needed to disallow passing a BINARY to functions
that expect a STRING argument. An example of the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
to test scanning.
- Removed functional.unsupported_types and related tests, as now
Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests, mainly based on the ones added for the DATE type.
Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive has two types of UDFs. This commit adds limited support for the
second-generation UDFs, called GenericUDFs.
The main limitations are as follows:
- Decimal types are not supported. The Impala framework determines
the precision and scale of the decimal return type, but a Hive
GenericUDF can choose its own return type based on its parameters.
Until this can be resolved, it is safer to forbid decimals. Note
that this limitation currently also exists in the first generation
of Hive Java UDFs.
- Complex types are not supported.
- Functions are not extracted from the jar file. The first generation
of Hive UDFs allowed this because the method prototypes are
explicitly defined and can be determined at function creation time.
For GenericUDFs, the return types are determined from the parameters
passed in when running a query.
- For the same reason, GenericUDFs cannot be made permanent. They
need to be recreated every time the server is restarted. This is a
severe limitation and will be resolved in the near future.
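The contrast between the two UDF generations described above can be
sketched in Python. This is only a model under stated assumptions: the
type names are plain strings, and LEGACY_OVERLOADS,
extract_legacy_signatures and generic_return_type are hypothetical
names. In Hive's Java API the query-time step corresponds to
GenericUDF.initialize(), which receives the argument ObjectInspectors
and returns the one describing the result.

```python
from typing import Dict, List, Tuple

# Legacy-style UDF: every overload is declared explicitly, so all
# signatures can be listed when the function is created.
LEGACY_OVERLOADS: Dict[Tuple[str, ...], str] = {
    ("INT", "INT"): "INT",
    ("STRING", "STRING"): "STRING",
}


def extract_legacy_signatures() -> List[Tuple[Tuple[str, ...], str]]:
    """Enumerate (argument types, return type) pairs at creation time."""
    return sorted(LEGACY_OVERLOADS.items())


def generic_return_type(arg_types: List[str]) -> str:
    """Generic-style UDF: the return type is computed from the actual
    argument types, so it is only known once a query supplies them."""
    if not arg_types:
        raise ValueError("at least one argument is required")
    if all(t == "INT" for t in arg_types):
        return "INT"
    if all(t in ("INT", "DOUBLE") for t in arg_types):
        return "DOUBLE"
    return "STRING"
```

Because the generic variant has no fixed list of signatures, nothing
can be enumerated from the jar ahead of time, which matches the
limitations listed above.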
Change-Id: Ie6fd09120db413fade94410c83ebe8ff104013cd
Reviewed-on: http://gerrit.cloudera.org:8080/18295
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>