impala/common
Csaba Ringhofer 85a2211bfb IMPALA-10349: Support constant folding for non ascii strings
Before this patch, constant folding only converted the result of an
expression to StringLiteral if all characters were ASCII. This
change allows both UTF8 strings with non-ASCII characters and
byte arrays that are not valid UTF8 strings. The latter can
occur when constant folding is applied to BINARY columns,
for example in geospatial functions like st_polygon().
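
To illustrate the distinction between valid and invalid UTF8 byte
arrays, here is a minimal Java sketch of a strict validity check. The
class and method names (Utf8Check, isValidUtf8) are hypothetical and
are not taken from the patch; the patch may detect invalid input
differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Impala code base.
class Utf8Check {
  // Returns true if the bytes decode as UTF-8 without any malformed
  // sequence. REPORT makes the decoder throw instead of silently
  // substituting U+FFFD, which is what new String(bytes, UTF_8) does.
  static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      decoder.decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // "\u00e1" is the character from the col="á" example.
    byte[] accented = "\u00e1".getBytes(StandardCharsets.UTF_8);
    // 0xFF can never appear in well-formed UTF-8.
    byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00};
    System.out.println(isValidUtf8(accented)); // true
    System.out.println(isValidUtf8(binary));   // false
  }
}
```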

The main goal is to be able to push down more predicates. For example,
before this patch a filter like col="á" couldn't be pushed down
to Iceberg/Kudu/Parquet stat filtering, as all of these expect literals.

Main changes:
- TStringLiteral uses a binary instead of a string member.
  This doesn't affect BE, as in C++ both types are compiled
  to std::string. In Java a java.nio.ByteBuffer is used instead of
  String.
- StringLiteral uses a byte[] member to store the value of
  the literal in case it is not valid UTF8 and cannot be
  represented as a Java String. In other cases a String is still
  used to keep the change minimal, though it may be more
  optimal to use UTF8 byte[] due to the smaller size. Always
  converting from byte[] to String may be costly in the catalog,
  as partition values are stored as *Literals and the rest of the
  catalog operates on String.
- StringLiteral#compareTo() is switched to bytewise compare on byte[]
  to be consistent with BE. This was not needed for ASCII strings,
  as Java String behaves the same way in that case, but non-ASCII
  strings can be ordered differently (note that Impala does not
  support collations).
- When an invalid UTF8 StringLiteral is printed, for example in
  EXPLAIN output, it is printed as
  unhex("<byte array in hexadecimal>"). This is a non-lossy way to
  represent it, but it may be too verbose in some cases, e.g. for
  large polygons. A follow-up commit may refine this, e.g. by
  limiting the max size printed.
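
The last two points can be sketched in Java as follows. This is only
an illustration under stated assumptions: LiteralBytesSketch,
compareBytes and toUnhex are hypothetical names, not the patch's
actual code, and the uppercase hex digits are an assumption about the
printed form. The example shows a pair of strings where
String#compareTo() (UTF-16 code unit order) and an unsigned bytewise
comparison of the UTF8 encoding disagree.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Impala code base.
class LiteralBytesSketch {
  // Lexicographic comparison that treats each byte as unsigned (0..255).
  // Java bytes are signed, so masking with 0xFF is required.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xFF;
      int y = b[i] & 0xFF;
      if (x != y) return Integer.compare(x, y);
    }
    // All shared bytes are equal: the shorter array sorts first.
    return Integer.compare(a.length, b.length);
  }

  // Renders a byte array in the unhex("...") form described above.
  // Uppercase hex digits are an assumption of this sketch.
  static String toUnhex(byte[] bytes) {
    StringBuilder sb = new StringBuilder("unhex(\"");
    for (byte b : bytes) sb.append(String.format("%02X", b & 0xFF));
    return sb.append("\")").toString();
  }

  public static void main(String[] args) {
    String tilde = "\uFF5E";       // U+FF5E, UTF-8 bytes: EF BD 9E
    String emoji = "\uD83D\uDE00"; // U+1F600, UTF-8 bytes: F0 9F 98 80

    // UTF-16 order: 0xFF5E > 0xD83D, so tilde sorts after emoji.
    System.out.println(tilde.compareTo(emoji) > 0); // true

    // Bytewise UTF-8 order: 0xEF < 0xF0, so tilde sorts before emoji.
    byte[] t = tilde.getBytes(StandardCharsets.UTF_8);
    byte[] e = emoji.getBytes(StandardCharsets.UTF_8);
    System.out.println(compareBytes(t, e) < 0); // true

    // A byte array that is not valid UTF-8.
    System.out.println(toUnhex(new byte[] {(byte) 0xFF, 0x00, 0x41}));
    // prints unhex("FF0041")
  }
}
```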

An issue found while implementing this is that INSERT does not
handle invalid UTF8 partition values correctly, see IMPALA-14096.
This behavior is not changed in the patch.

Testing:
- Added a few tests that push down non-ASCII constant expressions in
  predicates (both with utf8_mode=true and false).

Change-Id: I70663457a0b0a3443e586350f0a5996bb75ba64a
Reviewed-on: http://gerrit.cloudera.org:8080/22603
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-25 18:22:31 +00:00