mirror of
https://github.com/apache/impala.git
synced 2025-12-19 18:12:08 -05:00
IMPALA-13627: Handle legacy Hive timezone conversion
After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.
Some dates convert differently between the two methods, such as
Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
1. Hive prior to 3.1 used what’s now called “legacy conversion” using
SimpleDateFormat.
2. Hive 3.1.2 (with HIVE-21290) used a new Java API that’s based on
tzdata and added metadata to identify the timezone.
3. Hive 4 supports both, and added new file metadata to identify which
one was used.
Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
1. if writer.zone.conversion.legacy is present (Hive 4), use it to
determine whether to use a legacy conversion method compatible with
Hive's legacy behavior, or convert using tzdata.
2. if writer.zone.conversion.legacy is not present but writer.time.zone
is, we can infer it was written by Hive 3.1.2+ using new APIs.
3. otherwise it was likely written by an earlier Hive version.
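
The three cases above can be sketched as a small decision function. This
is only an illustration of the described rules, not Impala's actual
implementation (which is C++). The class, enum, and method names are
hypothetical; only the two metadata keys come from the description above.
The sketch assumes the caller has already established that the file's
created_by identifies parquet-mr and that
convert_legacy_hive_parquet_utc_timestamps is true.

```java
import java.util.Map;

class HiveTimestampConversionChooser {
    /** Hypothetical names for the two conversion methods described above. */
    enum Method { LEGACY, TZDATA }

    static final String LEGACY_KEY = "writer.zone.conversion.legacy";
    static final String ZONE_KEY = "writer.time.zone";

    /**
     * Picks a conversion method from Parquet key/value file metadata.
     * useLegacyOption stands in for use_legacy_hive_timestamp_conversion
     * and only matters in the third case.
     */
    static Method choose(Map<String, String> fileMetadata, boolean useLegacyOption) {
        // Case 1: Hive 4 records explicitly which method the writer used.
        String legacy = fileMetadata.get(LEGACY_KEY);
        if (legacy != null) {
            return Boolean.parseBoolean(legacy) ? Method.LEGACY : Method.TZDATA;
        }
        // Case 2: only the writer time zone is present, so infer Hive 3.1.2+.
        if (fileMetadata.containsKey(ZONE_KEY)) {
            return Method.TZDATA;
        }
        // Case 3: likely an older Hive release; defer to the option.
        return useLegacyOption ? Method.LEGACY : Method.TZDATA;
    }

    public static void main(String[] args) {
        System.out.println(choose(Map.of(LEGACY_KEY, "true"), false));
        System.out.println(choose(Map.of(ZONE_KEY, "Asia/Singapore"), true));
        System.out.println(choose(Map.of(), false));
    }
}
```

Checking the explicit key first means the option can never override what
a Hive 4 writer recorded; the option only decides the ambiguous case.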
Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use in the 3rd case above, when
Impala determines that the file was written by Hive older than 3.1.2.
Defaults to false to minimize changes in Impala's behavior and because
going through JNI is ~50x slower even when the results would not differ;
Hive defaults to true for its equivalent setting:
hive.parquet.timestamp.legacy.conversion.enabled.
Hive legacy-compatible conversion uses a Java method that would be
complicated to mimic in C++, doing
  DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
  java.util.Date date = formatter.parse(date_time_string);
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
  return formatter.format(date);
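
For experimentation, the following self-contained sketch wraps the
SimpleDateFormat steps shown above next to a java.time conversion. Using
java.time as a stand-in for the newer tzdata-based method is an
assumption made for illustration, and the class and method names are
hypothetical. Whether the two disagree for a given historical date
depends on the JDK and its bundled tzdata, so no output is claimed for
pre-1982 inputs.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;

class LocalToUtcComparison {
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    /** Legacy-style conversion: parse in the given zone, format in UTC. */
    static String legacyLocalToUtc(String dateTime, String zone) {
        SimpleDateFormat formatter = new SimpleDateFormat(PATTERN);
        formatter.setTimeZone(TimeZone.getTimeZone(zone));
        Date date;
        try {
            date = formatter.parse(dateTime);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + dateTime, e);
        }
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date);
    }

    /** java.time conversion (assumed stand-in for the newer method). */
    static String javaTimeLocalToUtc(String dateTime, String zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(PATTERN);
        return LocalDateTime.parse(dateTime, fmt)
            .atZone(ZoneId.of(zone))
            .withZoneSameInstant(ZoneOffset.UTC)
            .format(fmt);
    }

    public static void main(String[] args) {
        String zone = "Asia/Singapore";
        // One date before the 1982 offset change and one after it.
        for (String ts : new String[] {"1970-01-01 00:00:00", "2020-01-01 00:00:00"}) {
            System.out.println(ts + " legacy=" + legacyLocalToUtc(ts, zone)
                + " java.time=" + javaTimeLocalToUtc(ts, zone));
        }
    }
}
```

Running main prints both results side by side, which makes it easy to
probe other zones and dates for disagreements on a particular JVM.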
IMPALA-9385 added a check against a Timezone pointer in
FromUnixTimestamp. That dominates the time in FromUnixTimeNanos,
overriding any benchmark gains from IMPALA-7417. Moves FromUnixTime to
allow inlining, and switches to using UTCPTR in the benchmark - as
IMPALA-9385 did in most other code - to restore benchmark results.
Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.
Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Reviewed-on: http://gerrit.cloudera.org:8080/22293
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
@@ -263,6 +263,43 @@ DATE_ADD (<varname>timestamp</varname>, INTERVAL <varname>interval</varname> <va
Parquet files written by Hive.
</p>

<p>
Hive versions prior to 3.1 wrote Parquet files in local time using Java's
SimpleDateFormat. That method handles some cases differently from both
Impala's method and the default method used in Hive 3.1.2+, which are both
based on the
<xref href="https://www.iana.org/time-zones" format="html" scope="external">
IANA Time Zone Database</xref>. Hive 4 added the
<codeph>writer.zone.conversion.legacy</codeph> Parquet file metadata property
to identify which method was used to write the file (controlled by
<codeph>hive.parquet.timestamp.write.legacy.conversion.enabled</codeph>). When
the Parquet file was written by Parquet Java (<codeph>parquet-mr</codeph>),
Hive's behavior, and Impala's behavior when
<codeph>convert_legacy_hive_parquet_utc_timestamps</codeph> is
<codeph>true</codeph>, is as follows:
<ul>
<li>
If <codeph>writer.zone.conversion.legacy</codeph> is present, use the legacy
conversion method if it is <codeph>true</codeph> and the newer method if it
is <codeph>false</codeph>.
</li>
<li>
If <codeph>writer.zone.conversion.legacy</codeph> is not present but
<codeph>writer.time.zone</codeph> is, infer that the file was written by
Hive 3.1.2+ using the new APIs, and use the newer method.
</li>
<li>
Otherwise assume it was written by an earlier Hive release. In that case
Hive selects the conversion method based on
<codeph>hive.parquet.timestamp.legacy.conversion.enabled</codeph> (defaults
to <codeph>true</codeph>). <keyword keyref="impala45"/> adds the query
option <codeph>use_legacy_hive_timestamp_conversion</codeph> to select this
behavior. It defaults to <codeph>false</codeph> because the legacy conversion
is ~50x slower than Impala's default conversion method, and the two produce
the same results for modern time periods (post 1970, and in most instances
before that).
</li>
</ul>
</p>

<p>
Hive currently cannot write <codeph>INT64</codeph> <codeph>TIMESTAMP</codeph> values.
</p>