Since version 0.14.0, Hive allows creating Avro tables without an explicit Avro schema.
For such tables, the Avro schema is inferred from the column definitions
and is not stored in the metadata at all (no Avro schema literal or Avro schema file).
This patch adds support for loading the metadata of such tables, although Impala
currently cannot create such tables (expect a follow-on patch).
Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95
Reviewed-on: http://gerrit.cloudera.org:8080/538
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch corrects a mistake in the Parquet magic number verification
and adds a test for it. Note that with this patch Impala may fail to read
Parquet files with an incorrect magic number that it previously accepted.
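A minimal sketch of the kind of check involved, assuming the standard Parquet
layout where the file both starts and ends with the 4-byte magic "PAR1" (this
is an illustration, not Impala's actual scanner code):

#include <cstdint>
#include <cstring>

// Returns true if the last four bytes of the file match the Parquet magic.
// 'file_end' points one past the last byte; the caller must ensure the file
// is at least large enough to hold a footer.
bool ValidParquetFooterMagic(const uint8_t* file_end) {
  static const char kParquetMagic[4] = {'P', 'A', 'R', '1'};
  return memcmp(file_end - sizeof(kParquetMagic), kParquetMagic,
                sizeof(kParquetMagic)) == 0;
}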
Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5
Reviewed-on: http://gerrit.cloudera.org:8080/515
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
The hardcoded timezone information is from Java version 1.7.0_76.
Change-Id: I32c40d0036473079e5bfd4d0252a648cbb0e7c23
Reviewed-on: http://gerrit.cloudera.org:8080/393
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
When running with a release build, NULL would be returned when
reading values from required fields in parquet files (with a debug
build a DCHECK would be hit).
Previously, when the max definition level for a field was 0 (which
happens if a field is required), the definition level for a value was
incorrectly set to 1. The max definition level is related to nested
data and is defined to be the number of nullable fields that will be
encountered when traversing a path to reach the desired end field.
For example, if a nested schema has a path a.b.c.d where b and d are
nullable then the max def level is 2. A def level is attached to each
value to indicate the number of optional fields that are present (in
the previous example a def level of 2 means both b and d are not
null). So a def level greater than the max def level for a field
should never occur.
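As a hedged illustration of the rule above (not Impala's actual code), the max
def level of a path is simply the count of its nullable fields, and a stored
def level can never exceed it:

#include <vector>

struct PathField { bool nullable; };

// For a.b.c.d with only b and d nullable this returns 2. A required
// top-level column therefore has max def level 0, and all of its values
// must carry def level 0 (the bug treated them as 1).
int MaxDefLevel(const std::vector<PathField>& path) {
  int level = 0;
  for (const PathField& f : path) {
    if (f.nullable) ++level;  // required fields never add a definition level
  }
  return level;
}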
Change-Id: Ia91a97cf79e672c420d10416c6817f0930dcc920
(cherry picked from commit cdd67e4c7fd62d5b08adfaa303d7bb2382e6932c)
Reviewed-on: http://gerrit.cloudera.org:8080/386
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes an issue where an uninitialized, empty row is falsely
added to the row batch. The uninitialized data inside this row later
leads to a crash when the null byte is checked together with the
offsets (which contain garbage).
The fix is to check not only the number of materialized columns, but
also the number of materialized partition key columns. Only if both are
zero and the parser has an unfinished tuple is the empty row added.
To accommodate the last row, FinishScanRange() now checks whether there is an
unfinished tuple with materialized slots or materialized partition keys, and
writes the fields if necessary.
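A hedged sketch of the resulting condition (parameter names are illustrative,
not the scanner's actual members):

// Add an empty row only when there is truly nothing to materialize: no
// materialized columns, no materialized partition keys, and the parser is
// sitting on an unfinished tuple.
bool ShouldAddEmptyRow(int num_materialized_cols,
                       int num_materialized_partition_keys,
                       bool has_unfinished_tuple) {
  return num_materialized_cols == 0 &&
         num_materialized_partition_keys == 0 &&
         has_unfinished_tuple;
}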
Change-Id: I2808cc228e62d048d917d3a6352d869d117597ab
(cherry picked from commit c1795a8b40d10fbb32d9051a0e7de5ebffc8a6bd)
Reviewed-on: http://gerrit.cloudera.org:8080/364
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This patch adds analysis support for the following new constructs.
Struct-field references:
Dot-separated paths such as "tblalias.col.field1.field2..." can now be used
to access fields of a struct. Such struct-field SlotRefs show up as new slots
in the tuple descriptor of the parent table.
Collection table references:
A collection-typed column or field can be referenced as a table in the FROM
clause using a dot-separated path, e.g., "FROM db.tbl.array_col". This exposes
the collection's fields as a table whose columns can be referenced as usual.
Collections referenced this way are flattened during execution. The following
subsections explain the analysis behavior in more detail:
Aliases:
Collection table references can have either an explicit alias or a single
implicit alias. Examples:
SELECT ... FROM db.tbl.array_col AS a
-> explicit alias 'a'
SELECT ... FROM db.tbl.array_col
-> implicit alias 'array_col'
Flattening of collections:
Collection table refs can refer to table aliases defined to their left
in the same select block. References linked in this way indicate an implicit
parent/child join. Example:
SELECT parent.id, child.item FROM db.tbl parent, parent.array_col child
The aliases 'parent' and 'child' can be used to access the columns of 'tbl' and
the fields of 'array_col', respectively. In the flattened result set, collection
items implicitly join with parent rows iff the collection item is nested within
the parent row.
The following stmt is not an implicit join, but a scan of a nested collection.
SELECT db.tbl.map_col.value FROM db.tbl.map_col
The following stmt is not a parent/child join but a Cartesian product
because the collection does not reference an existing table alias.
SELECT a.id, map_col.value FROM db.tbl a, db.tbl.map_col
Collection 'table' field names:
Structs have explicit field names, but collections generally do not. This patch
uses fixed implicit field names such as "item" for arrays and "key"/"value"
for maps. If an array or map has a struct element/value then the struct's fields
are also accessible without the "item"/"value" indirection.
Analysis prevents complex-typed exprs from appearing in the select list of any
query statement because it's not yet clear how to present them to a consumer
of the query results.
To keep things simple, I've deliberately left these work items for
follow-on patches:
1. analysis of correlated inline views to handle queries like:
SELECT cnt FROM customers c, (SELECT COUNT(1) cnt FROM c.orders) v
2. generalized expression substitution to handle cases like:
SELECT x.c.d FROM (SELECT a.b x FROM tbl)
Change-Id: I5f52f083bcf7056d5bfd21cc784133edb7e82ef2
Reviewed-on: http://gerrit.cloudera.org:8080/43
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
While cherry-picking commit c289974 from cdh5-2.2.0_5.4.x, a line that causes a dataset
change was somehow missed by git's cherry-pick. This patch adds that line back in.
Change-Id: Id163643c55d44fe6a9503b54397f1f370b198554
I did a local benchmark and there's minimal performance impact (<1%).
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
We used mkdir without a -p to load a hive13 parquet table. This failed when trying to load
the metadata, which is necessary for changes that mutate the warehouse snapshot. With this
change, it now passes.
Change-Id: Id163643c55d44fe6a9503b54397f1f370b198554
No changes to writing were made. No changes to reading Impala-written
files were made.
Hive writes TIMESTAMP values to parquet files differently than Impala
does. Hive converts the value from local time to UTC before writing;
Impala does not. This change adds a startup flag that will convert UTC
to local when reading files written by Hive.
The Hive-file detection actually checks for "parquet-mr" (which is the
library Hive uses) in the file metadata. A slight possibility exists
that TIMESTAMP values written by something other than Hive but also
using parquet-mr may become incorrect. The possibility should be very
small because TIMESTAMP values are stored and encoded in a non-standard
way that other applications are unlikely to be aware of.
Flags from be/src/exec/hdfs-parquet-scanner.cc:
-convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs
read from files written by Parquet-MR (used by Hive) will be
converted from UTC to local time. Writes are unaffected.) type: bool
default: false
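A hedged sketch of the decision the flag controls (the function name is
illustrative; the created_by string comes from the Parquet file metadata):

#include <string>

// Convert only when the user opted in via the flag and the file was written
// by parquet-mr, the writer library Hive uses.
bool NeedsHiveUtcToLocalConversion(bool convert_flag,
                                   const std::string& created_by) {
  return convert_flag && created_by.find("parquet-mr") != std::string::npos;
}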
Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a
Reviewed-on: http://gerrit.cloudera.org:8080/35
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition. That leads to a scary-looking message in the
query warnings.
So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:
1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.
2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess. We were already doing this
in some places but not everywhere.
3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors. This will
give better diagnostics for corrupt files.
Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
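A hedged sketch of the clamping idea (not the actual AllocateScanRange()
signature):

#include <algorithm>
#include <cstdint>

// Never construct a scan range extending past what the HdfsFileDesc says is
// the end of the file; callers still have to tolerate a stale file_len.
int64_t ClampedRangeLen(int64_t offset, int64_t requested_len,
                        int64_t file_len) {
  return std::min(requested_len, std::max<int64_t>(0, file_len - offset));
}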
Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups. The first time through InitColumns(),
stream_ was set to NULL. But stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.
Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
Compressed text formats currently require that entire compressed
files be read into memory and decompressed in a single call
to the decompression codec. This change makes the HdfsTextScanner
drive gzip in streaming mode, i.e., producing partial output
as input is consumed.
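A hedged sketch of what streaming decompression looks like with zlib (an
illustration of the approach, not the HdfsTextScanner's actual code):

#include <zlib.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Feed one chunk of compressed input and append whatever decompressed output
// is ready, instead of requiring the whole file up front.
bool StreamingInflate(z_stream* stream, const uint8_t* input, size_t input_len,
                      std::vector<uint8_t>* output) {
  stream->next_in = const_cast<Bytef*>(reinterpret_cast<const Bytef*>(input));
  stream->avail_in = static_cast<uInt>(input_len);
  uint8_t buf[64 * 1024];
  int ret = Z_OK;
  do {
    stream->next_out = buf;
    stream->avail_out = sizeof(buf);
    ret = inflate(stream, Z_NO_FLUSH);  // produces partial output as it goes
    if (ret != Z_OK && ret != Z_STREAM_END) return false;
    output->insert(output->end(), buf, buf + (sizeof(buf) - stream->avail_out));
  } while (stream->avail_in > 0 && ret != Z_STREAM_END);
  return true;
}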
Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385
Fixed a bug in setting the length when reading/writing text files for CHAR(N).
Also added a chars_tiny table for testing CHAR(N) and VARCHAR(N).
Change-Id: If5d5db30afa4b00cf03c68c6a845f182970329f4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4415
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Changes include:
* Fix compile errors due to the new column stats API and other stats-related
fixes.
* Temporarily disable JDBC tests due to new serialization format in Hive .13
* Disable view compatibility tests until we can get them to work in Hive .13
* Test fixes due to Hive's type checking for partition column values
Change-Id: I05cc6a95976e0e037be79d91bc330a06d2fdc46c
The DDL statements for adding the partitions of alltypesaggmultifiles
did not include an ALTER TABLE stmt for one of the partitions, thereby
causing the planner tests to fail when test data were loaded from a
snapshot.
Change-Id: Id4b078cd334d816d6eb8eb15e5856189701a4bca
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3305
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3310
The following changes are included in this commit:
1. Modified the alltypesagg table to include an additional partition key
that has nulls.
2. Added a number of tests in hdfs.test that exercise the partition
pruning logic (see IMPALA-887).
3. Modified all the tests that are affected by the change in alltypesagg.
Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236
This commit fixes IMPALA-1040, where inserting an invalid value into a
decimal partitioned column through Hive results in an uninformative
error message and, in some cases, causes the associated table to
disappear from Impala's catalog. With this fix, Impala always throws a
more informative error message indicating the insertion of an invalid
partition key value.
Change-Id: I2855ea69944e269fb7e02b3825f44e64352151e7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3062
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3200
Floats/doubles are lossy, so using them as the default literal type
is problematic.
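A hedged illustration of the lossiness (plain C++, just to show why a binary
floating-point default is risky for decimal literals):

#include <cstdio>

int main() {
  // 0.1 has no exact binary64 representation, so the stored value differs
  // slightly from the literal the user wrote.
  double d = 0.1;
  printf("%.20f\n", d);  // prints something like 0.10000000000000000555
  return 0;
}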
Change-Id: I5a619dd931d576e2e6cd7774139e9bafb9452db9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2758
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
This commit fixes issue CDH-18969 where Impala returns wrong results
when querying an HBase table. This issue is triggered when a column family
sorts lexicographically before ":key", which is the column family of the
row key, thereby causing the wrong column to be used as the row key by the
backend.
The following changes are included:
1. Modified the load function in HBaseTable.java to make sure the
catalog object of an HBase table always stores the row key column first.
Change-Id: Icd7ebc973d81672c04d5c7c8bbabd813338d5eac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2513
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2602
Allows reading decimal columns with or without codegen. Includes tests
based on a data file posted on HIVE-5823.
Change-Id: Ie541c6b98bd24543691850cb45a434af60b5a5a6
(cherry picked from commit 6983dcefdf70cce14724e17d03bc061ffb8f671c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2596
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.
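A hedged sketch of the idea: an Avro-style zig-zag varint decoder explicitly
kept out of line (an illustration, not Impala's actual ReadZLong()):

#include <cstdint>

// Keeping the body out of line avoids it being duplicated many times within
// a single codegen'd function for very wide tables.
__attribute__((noinline))
int64_t ReadZLongSketch(const uint8_t** buf) {
  uint64_t encoded = 0;
  int shift = 0;
  uint8_t byte;
  do {
    byte = *(*buf)++;
    encoded |= static_cast<uint64_t>(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  // Undo zig-zag encoding: 0,1,2,3,... maps back to 0,-1,1,-2,...
  return static_cast<int64_t>(encoded >> 1) ^ -static_cast<int64_t>(encoded & 1);
}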
This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands in
functional_schema_template.sql, which are executed by
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins
This patch modifies DelimitedTextParser and StringValue to work with
data containing null characters by using SSE instructions that take a
length, rather than expecting null-terminated strings. It also adds
some other minor changes to correctly handle data with nulls and to
facilitate testing. I checked the execution time of a count(*) and a
select * limit 1 query locally, and saw no difference for either text
or sequence files.
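A hedged sketch of the explicit-length approach with SSE4.2: _mm_cmpestri
takes explicit lengths, unlike the NUL-terminated _mm_cmpistri. This is an
illustration, not the DelimitedTextParser's actual code; a real implementation
must also avoid loading past the end of the buffer.

#include <nmmintrin.h>  // SSE4.2 string instructions

// Find the first occurrence of 'delim' in data[0, len), treating '\0' as
// ordinary data rather than a terminator. Returns -1 if not found.
int FindDelimiter(const char* data, int len, char delim) {
  const __m128i needle = _mm_set1_epi8(delim);
  for (int i = 0; i < len; i += 16) {
    int chunk_len = (len - i < 16) ? (len - i) : 16;
    __m128i haystack =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
    // Explicit lengths: only the first 'chunk_len' bytes are considered.
    int idx = _mm_cmpestri(needle, 1, haystack, chunk_len, _SIDD_CMP_EQUAL_ANY);
    if (idx < chunk_len) return i + idx;
  }
  return -1;
}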
Change-Id: Ia920b35bea7048aa286f39ec83e313c2a39251d1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2110
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2181
This fixes how we validate delimiters to be in line with Hive. A delimiter must
fit in a single byte and can be specified in the following formats, as far as I can
tell (there is no documentation):
- A single ASCII or unicode character (ex. '|')
- An escape character in octal format (ex. \001. Stored in the metastore as a
unicode character: \u0001).
- A signed decimal integer in the range [-128, 127]. Used to support delimiters
for ASCII character values between 128 and 255 (-2 maps to ASCII 254).
Previously, we were not handling the "signed integer" case so there was no way
to specify a delimiter in the "extended" ASCII range of 128-255.
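A hedged sketch of the single-character and signed-integer cases (the
octal-escape case is omitted; this is an illustration, not Impala's actual
validation code):

#include <cstdint>
#include <cstdlib>
#include <string>

// Accept either a single character (e.g. '|') or a signed decimal integer in
// [-128, 127]; -2 becomes byte 254, covering the extended ASCII range.
bool ParseDelimiter(const std::string& spec, uint8_t* delim) {
  if (spec.size() == 1) {
    *delim = static_cast<uint8_t>(spec[0]);
    return true;
  }
  char* end = nullptr;
  long val = std::strtol(spec.c_str(), &end, 10);
  if (end != spec.c_str() && *end == '\0' && val >= -128 && val <= 127) {
    *delim = static_cast<uint8_t>(static_cast<int8_t>(val));
    return true;
  }
  return false;
}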
To support result validation, the test infrastructure had to be updated to support
reading/writing different character encodings.
Change-Id: Ie3c4d444dc9c6e60192093ed0c0f6f151eab16bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1848
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1888
With Hive .12, the default RC file format can be configured to be ColumnarSerDe or
LazyBinaryColumnarSerDe. Impala does not yet support the LazyBinaryColumnarSerDe. This change
verifies it is properly disabled.
Change-Id: Ia84495868237ce2c89a9706ad75e0f7eb8499057
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1416
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1423
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).
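A hedged sketch of the new termination check (row counting instead of relying
on end-of-scan-range; names are illustrative):

#include <cstdint>

// With the scan range possibly padded past the column's true end, the column
// reader stops once it has produced the row count declared in the row-group
// metadata rather than waiting for eosr.
bool ColumnReadComplete(int64_t rows_returned, int64_t rows_in_row_group) {
  return rows_returned >= rows_in_row_group;
}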
Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
We were previously wasting memory by always reading into 8MB IO
buffers, even when the data read was much less than 8MB. With this
patch, the IO manager picks a buffer size closer to the actual amount
being read (we don't use the exact size so we can continue to recycle
buffers). The minimum IO buffer size is determined via the
--min_buffer_size flag, and the max IO buffer size via the --read_size
flag.
This technique also helps with IMPALA-652, since short columns will
not use as much memory as before (we will not use considerably more
memory than the size of the table).
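A hedged sketch of the size selection, stepping in powers of two between the
two flags so buffers remain recyclable (not the IO manager's actual code):

#include <algorithm>
#include <cstdint>

// Pick the smallest power-of-two multiple of --min_buffer_size that covers
// the bytes actually needed, capped at --read_size.
int64_t ChooseBufferSize(int64_t bytes_needed, int64_t min_buffer_size,
                         int64_t max_read_size) {
  int64_t size = min_buffer_size;
  while (size < bytes_needed && size < max_read_size) size *= 2;
  return std::min(size, max_read_size);
}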
This patch also changes StringBuffer to use a doubling strategy so it
doesn't end up allocating many large unused buffers, and has the
scanner context use the requested length as the sync read size if it's
larger than the size produced by read_past_size_cb(). These changes
help prevent the boundary buffer in the scanner context from
allocating excess memory.
Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/938
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
The Hive HBase spec specifies that the key column mapping can either be
defined explicitly (using the :key syntax) or left out completely, in
which case a mapping to the first table column is implied. This change
updates Impala to support implicit key mappings and also adds some
checks in our ALTER TABLE DDL to ensure we cannot get into this state by
dropping a column from an HBase table (a similar restriction to the one Hive
puts in place).
Change-Id: I920d642261659ee3e881da2553ffe83300923af8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/554
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Our data generation doesn't appear to fully support comments in the
base table name section. This fixes the data generation, but we should
follow on by improving our comment support in the framework.
Change-Id: I68274bd98d5b0d54868d6c80b7137a59e7329229
Reviewed-on: http://gerrit.ent.cloudera.com:8080/465
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>