The problem was that were were deleting the version.info file because the default
of gen_build_version.py recently changed from --noclean to --clean.
Also fixed a bug in the shell version generation and made debugging a bit easier
by dumping the contents of version.info whenever it is generated.
Change-Id: I764d01c9e46eed1bd39de79bf076c15afa599486
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1901
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
(cherry picked from commit fa673b4d3342fc825ee7fa942bd254234d222906)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1910
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This change updates the Impala performance schema and test vector generation
techniques. It also migrates the existing benchmark scripts that were Ruby over
to use Python. The changes has a few parts:
1) Conversion of test vector generation and benchmark statement generation from
Ruby to Python. A result of this was also to update the benchmark test vector
and dimension files to be written in CSV format (python doesn't have built-in
YAML support)
2) Standardize on the naming for benchmark tables to (somewhat match Query
tests). In general the form is:
* If file_format=text and compression=none, do not use a table suffix
* Abbreviate sequence file as (seq) rc file as (rc) etc
* If using BLOCK compression don't append anything to table name, if using
'record' append 'record'
3) Created a new way to adding new schemas. this is the
benchmark_schema_template.sql file. The generate_benchmark_statements.py script
reads this in and breaks up the sections. The section format is:
====
Data Set Name
---
BASE table name
---
CREATE STATEMENT Template
---
INSERT ... SELECT * format
---
LOAD Base statement
---
LOAD STATEMENT Format
Where BASE Table is a table the other file formats/compression types can be
generated from. This would generally be a local file.
The thinking is that if the files already exist in HDFS then we can just load
the file directly rather than issue an INSERT ... SELECT * statement. The
generate_benchmark_statements.py script has been updated to use this new
template as well as query HDFS for each table to determine how it should be
created. It then outputs an ideal file call load-benchmark-*-generated.sql.
Since this file is geneated dynamically we can remove the old benchmark
statement files.
4) This has been hooked into load-benchmark-data.sh and run_query has been
updated to use the new format as well
We want to expose issues in an distributed env locally.
We already have 3 data nodes running locally in the MiniDFS. However, the
planner does not distinguish data nodes
on the same host, even though they're running on a different port. So, we're
effectively only running a single node
all the time.
First, we make the change in FE to identify data location as "host/port" instead
of just "host". Then, in
TQueryExecRequest, we list the host/port that serves the data, instead of just
using "host".
The result is that PlannerTest and QueryTest exposes distributed planning issue.
Plans are still correct when the
number of node is 1 or 2. So, to make all the tests passes, I've forced
Planner/Query test to execute with at most 2
nodes.
To see the faulty plan, we simply have to change the number of node back to 0
(all nodes).
o
We've discussed randomizing the SimpleScheduler but I choose not to do it
because we don't need randomization to
expose the distributed planning issue.
I also discovered that exchange node (BE) does not respect the "limit". I fixed
it.
One of the limit test (QueryTest) is completely unstable. It doesn't really test
much. I removed it.
generate and execute the benchmark queries.
Updated to demove Lzo compression and add coverage of 'DefaultCodec'
Fixed up make_release to more cleanly list queries.
- Fixed issue with SSE file parse.
- Moved build scripts to impala/bin. Rebuilding from just BE does not work.
- Cleanedup a few compiler warnings.
- Add option to disable automatic counters for profilers.