I re-created the original patch for IMPALA-6068, but only
performed what I believe to be the limited legal transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.
Any place that directly uploads via hadoop or hdfs commands
was left alone as changing it can't be proven to be correct.
Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load. I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
This reverts commit e4f585240a.
Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes, though, the data load script is run against a remote cluster,
and in those cases the data load process is now broken.
Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.
Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.
This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.
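The engine choice described above can be sketched roughly as follows. This is
only an illustration: choose_insert_engine, force_hive and
IMPALA_INSERT_FORMATS are hypothetical names, not identifiers from the
dataload scripts.

```python
# Hypothetical sketch; names are illustrative and not taken from the
# dataload scripts.
IMPALA_INSERT_FORMATS = frozenset(["parquet", "kudu"])


def choose_insert_engine(file_format, force_hive=False):
    """Return the engine that should run the insert populating a table."""
    if force_hive:
        # The table definition asked for Hive explicitly, e.g. because the
        # insert writes nested types that Impala cannot write.
        return "hive"
    if file_format.lower() in IMPALA_INSERT_FORMATS:
        return "impala"
    return "hive"
```

With this shape, a Parquet table defaults to Impala, and a table such as
complextypes_fileformat would pass force_hive=True.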
This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.
Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables, for which text_delims_table.py now
prints the appropriate LOAD DATA commands.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the DEPENDENT_LOAD_HIVE section.
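As a rough illustration of the kind of statement involved, the sketch below
builds a Hive LOAD DATA LOCAL INPATH statement in Python. The helper name,
the file path and the table name are made up for the example and are not
what text_delims_table.py actually emits.

```python
def load_data_local_stmt(local_path, table, overwrite=True):
    """Build a Hive statement that loads a local file into a table."""
    overwrite_clause = "OVERWRITE " if overwrite else ""
    return "LOAD DATA LOCAL INPATH '%s' %sINTO TABLE %s;" % (
        local_path, overwrite_clause, table)


# Hypothetical path and table name, for illustration only.
print(load_data_local_stmt("/tmp/example.parquet", "functional_parquet.example"))
```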
Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace the existing ASF license text with the one given on the
website, or add it where it is missing.
Much of this change was automatically generated via:
  git grep -li 'Copyright.*Cloudera' > modified_files.txt
  cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
  cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
cgroups.py was using the "except <Exception> as <var>" syntax, which
Python 2.4 does not support. generate_metrics.py was using the json
module, which is not available in Python 2.4; simplejson is available
there and provides the same functionality.
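The commit text does not show how the two scripts were rewritten. The sketch
below is one common way to make both constructs portable, assuming
simplejson is installed on the old interpreter. parse_or_none is a
hypothetical helper, not a function from either script.

```python
import sys

try:
    import json  # in the standard library from Python 2.6 onward
except ImportError:
    import simplejson as json  # third-party fallback with the same API


def parse_or_none(text):
    """Parse JSON text, returning None and logging on malformed input."""
    try:
        return json.loads(text)
    except ValueError:
        # Portable replacement for "except ValueError as e".
        e = sys.exc_info()[1]
        sys.stderr.write("bad JSON: %s\n" % e)
        return None
```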
Change-Id: If2c176c15a9573dd2a2acf5ee459ff24ce891ce3
Reviewed-on: http://gerrit.cloudera.org:8080/396
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Allows the base cgroup hierarchy path used by the impala
test cluster to be specified with the environment variable
IMPALA_CGROUP_BASE_PATH. This is needed to support older
kernels that do not use the proper default cgroup path
and do not even support finding the hierarchy via mount.
This will be used in Jenkins test runs with RM enabled,
which run on CentOS 6 images.
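A minimal sketch of such an override is shown below. The default path and
the function name are assumptions made for the example; the actual default
used by cgroups.py is not stated here.

```python
import os

# Hypothetical default; the real default in cgroups.py may differ.
DEFAULT_CGROUP_BASE_PATH = "/sys/fs/cgroup/cpu"


def get_cgroup_base_path(environ=None):
    """Return the base cgroup path, preferring IMPALA_CGROUP_BASE_PATH."""
    if environ is None:
        environ = os.environ
    # An unset or empty variable falls back to the default.
    return environ.get("IMPALA_CGROUP_BASE_PATH") or DEFAULT_CGROUP_BASE_PATH
```

Passing the environment as a parameter keeps the lookup easy to test without
modifying os.environ.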
Change-Id: I30984a58fbcf990410f75f7feb5c1d549afa6ddd
Reviewed-on: http://gerrit.cloudera.org:8080/397
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Context managers are not supported before Python 2.7. Removes the
use of the 'with' clause in cgroups.py because this code is
executed on CentOS 6 packaging boxes with an older version of
Python.
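The usual replacement for a 'with' block is an explicit try/finally, as in
the sketch below. read_first_line is a hypothetical helper and is not taken
from cgroups.py.

```python
def read_first_line(path):
    """Return the first line of a file without using a 'with' block."""
    f = open(path)
    try:
        return f.readline().rstrip("\n")
    finally:
        # Runs whether or not readline() raised, like the exit of a 'with'.
        f.close()
```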
Change-Id: Ic6bcf161086f671ec2010df16f9bb23534c57697
Reviewed-on: http://gerrit.cloudera.org:8080/385
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Adds a flag to start-impala-cluster.py (--enable_rm) to set up the
mini Impala cluster using Yarn and Llama. This hides a number of
flags that must be set on the impalads:
  -enable_rm
  -llama_addresses: set to the local llama service
  -fair_scheduler_allocation_path: set to the path of the fair-scheduler.xml
    in each node's hadoop conf directory
  -cgroup_hierarchy_path: set to a path in the CPU cgroup hierarchy which
    has the correct permissions for Impala to manage a child cgroup. The
    path comes from cgroups.py.
The new module cgroups.py was added to contain cgroups-related
utilities. Right now it provides paths to the CPU controller
hierarchy root and a path within the hierarchy that can be used
for impalads (i.e. have the proper permissions, one for each
cluster node).
Change-Id: Ic2181ec5613c180f240958c84f885c6b136a64d4
Reviewed-on: http://gerrit.cloudera.org:8080/369
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
I did a local benchmark and there's minimal performance impact (<1%).
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.
This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
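The backtick mechanism described above can be sketched as follows. This is
an assumption-laden illustration, not the code in
generate-schema-statements.py: eval_section is a hypothetical name, and the
sketch assumes only a leading backtick marks a shell command.

```python
import subprocess


def eval_section(section_text):
    """Return the section text, or the output of its shell command."""
    stripped = section_text.strip()
    if not stripped.startswith("`"):
        # Ordinary section: use the text unchanged.
        return section_text
    # Everything after the leading backtick is run through the shell and
    # its standard output becomes the section text.
    output = subprocess.check_output(stripped[1:], shell=True)
    return output.decode("utf-8")
```

A wide-table section could then contain a one-line shell command that emits
many column definitions instead of listing them by hand.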
Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins