The schema file allows specifying a commandline command in
several of the sections (LOAD, DEPENDENT_LOAD, etc). These
are execute by testdata/bin/generate-schema-statements.py
when it is creating the SQL files that are later executed
for dataload. A fair number of tables use this flexibility
to execute hdfs mkdir and copy commands via the command line.
Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.
This converts the hdfs command line statements to equivalent
Hive LOAD DATA LOCAL statements. These are doing the copy
from an already running JVM, so they do not need JVM startup.
They also run in the parallel part of dataload, speeding up
the SQL generation part.
This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 8m8.068s
user 10m11.218s
sys 0m44.932s
After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real 0m35.800s
user 0m42.536s
sys 0m5.210s
This is currently a long-pole in dataload, so it translates directly to
an overall speedup of about 7 minutes.
Testing:
- Ran debug tests
Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I re-created the original patch for IMPALA-6068, but only
performed what I believe to be the limited legal transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.
Any place that directly uploads via hadoop or hdfs commands
was left alone as changing it can't be proven to be correct.
Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load. I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
This reverts commit e4f585240a.
Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes though, the data load script is against a remote cluster,
and those cases, the data load process is now broken.
Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.
Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.
This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.
This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the DATA LOAD LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.
Any other location that could use DATA LOAD LOCAL is
also switched over to use it. This includes the
testescape* tables which now print the appropriate
DATA LOAD commands as a result of text_delims_table.py.
Any location that already uses DATA LOAD LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the LOAD_DEPENDENT_HIVE section.
Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
cgroups.py was using unsupported "except <Exception> as <var>" syntax.
generate_metrics.py was using the json module which is not available
in Python 2.4, but contains simplejson which provides the same
functionality.
Change-Id: If2c176c15a9573dd2a2acf5ee459ff24ce891ce3
Reviewed-on: http://gerrit.cloudera.org:8080/396
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Allows the base cgroup hierarchy path used by the impala
test cluster to be specified with the environment variable
IMPALA_CGROUP_BASE_PATH. This is needed to support older
kernels that do not use the proper default cgroup path
and do not even support finding the hierarchy via mount.
This will be used in jenkins test runs with RM enabled
which run on Centos6 images.
Change-Id: I30984a58fbcf990410f75f7feb5c1d549afa6ddd
Reviewed-on: http://gerrit.cloudera.org:8080/397
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Context managers are not supported before Python 2.7. Removes the
use of the 'with' clause in cgroups.py because this code is
executed on Centos 6 packaging boxes with an older version of
Python.
Change-Id: Ic6bcf161086f671ec2010df16f9bb23534c57697
Reviewed-on: http://gerrit.cloudera.org:8080/385
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Adds a flag to start-impala-cluster.py (--enable_rm) to set up the
mini Impala cluster using Yarn and Llama. This hides a number of
flags that must be set on the impalads:
-enable_rm
-llama_addressess: set to the local llama service
-fair_scheduler_allocation_path: set to the path of the fair-scheduler.xml
in each node's hadoop conf directory
-cgroup_hierarchy_path: set to a path in the CPU cgroup hierarchy which
has the correct permissions for Impala to manage a child cgroup. The
path comes from cgroups.py.
The new module cgroups.py was added to contain cgroups-related
utilities. Right now it provides paths to the CPU controller
hierarchy root and a path within the hierarchy that can be used
for impalads (i.e. have the proper permissions, one for each
cluster node).
Change-Id: Ic2181ec5613c180f240958c84f885c6b136a64d4
Reviewed-on: http://gerrit.cloudera.org:8080/369
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
I did a local benchmark and there's minimal performance impact(<1%)
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.
This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins