Python 3 changed some object model methods:
- __nonzero__ was removed in favor of __bool__
- func_dict / func_name were removed in favor of __dict__ / __name__
- The next() method on iterators was renamed to __next__
(Code locations should use next(iter) rather than iter.next())
- metaclasses are specified a different way
- Locations that specify __eq__ should also specify __hash__
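As an illustration of the object model changes listed above, here is a
minimal sketch of a class that works on both Python 2 and Python 3. The
Countdown class and its attributes are hypothetical and are not taken from
the Impala code.

```python
class Countdown(object):
    """Hypothetical iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.start = start
        self.remaining = start

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 spelling of the iterator protocol.
        if self.remaining <= 0:
            raise StopIteration
        value = self.remaining
        self.remaining -= 1
        return value

    next = __next__  # Python 2 spelling of the same method

    def __bool__(self):
        # Python 3 spelling of the truth test.
        return self.remaining > 0

    __nonzero__ = __bool__  # Python 2 spelling of the same method

    def __eq__(self, other):
        return isinstance(other, Countdown) and self.start == other.start

    def __ne__(self, other):
        # Python 2 does not derive != from __eq__.
        return not self == other

    def __hash__(self):
        # Defining __eq__ alone makes the class unhashable on Python 3.
        return hash(self.start)


counter = Countdown(3)
first = next(counter)  # the builtin next() works on both versions
rest = list(counter)
```

Calling the builtin next(counter) instead of counter.next() is what keeps
the call site portable.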
Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.
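A minimal sketch of the two escape patterns, assuming hypothetical helper
names (parse_port and first_even are illustrations only): copy what is
needed out of the except block, and bind comprehension results to a name
instead of reading the loop variable afterwards.

```python
def parse_port(text):
    """Hypothetical helper: return (port, error_message)."""
    error = None
    try:
        port = int(text)
    except ValueError as e:
        # Copy what is needed; Python 3 unbinds `e` when this block ends.
        error = str(e)
        port = None
    return port, error


def first_even(values):
    # Bind the result explicitly rather than reading the comprehension's
    # loop variable afterwards; on Python 3 that variable is local to the
    # comprehension.
    evens = [v for v in values if v % 2 == 0]
    return evens[0] if evens else None
```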
This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape
Testing:
- Ran core tests
- Ran release exhaustive tests
Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.
Python 2:
  range(0,5) == [0,1,2,3,4]
  True
Python 3:
  range(0,5) == [0,1,2,3,4]
  False
The fix is to wrap locations with list(). i.e.
Python 3:
  list(range(0,5)) == [0,1,2,3,4]
  True
Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
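A small sketch of what converted code tends to look like. The variable
names are illustrative. On Python 3 the builtins module is part of the
standard library; on Python 2 it is assumed to come from the future
package.

```python
from builtins import range  # stdlib on Python 3, `future` on Python 2
from functools import reduce  # reduce is not a builtin on Python 3

# Wrap lazy results in list() wherever a real list is required.
squares = list(map(lambda x: x * x, range(5)))
evens = list(filter(lambda x: x % 2 == 0, range(10)))
pairs = list(zip("abc", range(3)))

# Consumers that only iterate once do not need the list() wrapper.
total = reduce(lambda a, b: a + b, range(5))
```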
Most of the changes were done via these futurize fixes:
- libfuturize.fixes.fix_xrange_with_import
- lib2to3.fixes.fix_map
- lib2to3.fixes.fix_filter
This eliminates the pylint warnings:
- xrange-builtin
- range-builtin-not-iterating
- map-builtin-not-iterating
- zip-builtin-not-iterating
- filter-builtin-not-iterating
- reduce-builtin
- deprecated-itertools-function
Testing:
- Ran core job
Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 removes implicit relative imports within packages, so
imports are absolute unless written with an explicit leading dot.
This can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
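A minimal sketch of the division split, assuming two hypothetical helpers
(midpoint_index and average are not from the Impala code). The __future__
import must be the first statement in the module.

```python
from __future__ import absolute_import, division, print_function


def midpoint_index(items):
    # An index must be an integer, so use floor division.
    return len(items) // 2


def average(values):
    # A mean should keep its fractional part, so use true division.
    return sum(values) / len(values)
```

With the import in place, "/" behaves the same on both Python versions, so
any location that still needs an integer has to say so with "//".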
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
UPSERTs are very similar to INSERTs, so the UPSERT support is simply
folded into that of INSERT. We do this by adding another "conflict
action", CONFLICT_ACTION_UPDATE. The object responsible for holding the
conflict_action attribute is now the InsertClause. This is needed here
because the SqlWriter now needs to know the conflict_action both when
writing the InsertClause (Impala) and at the tail end of the
InsertStatement (PostgreSQL). We also add a few properties to the
InsertStatement interface so that the PostgresqlSqlWriter can form the
correct "DO UPDATE" conflict action, in which primary key columns and
updatable columns must be known. More information on that here:
https://www.postgresql.org/docs/9.5/static/sql-insert.html
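To show why the primary key and updatable columns must be known, here is a
rough sketch of forming the PostgreSQL statement. The write_upsert function
and its parameters are hypothetical and do not reflect the actual
PostgresqlSqlWriter interface.

```python
def write_upsert(table, columns, primary_keys, rows_sql):
    """Hypothetical helper: emulate an UPSERT in PostgreSQL 9.5+ syntax."""
    # Updatable columns are assumed to be every non-primary-key column.
    updatable = [c for c in columns if c not in primary_keys]
    sql = "INSERT INTO {0} ({1}) VALUES {2} ON CONFLICT ({3})".format(
        table, ", ".join(columns), rows_sql, ", ".join(primary_keys))
    if not updatable:
        # Nothing to update when every column is part of the primary key.
        return sql + " DO NOTHING"
    assignments = ", ".join(
        "{0} = EXCLUDED.{0}".format(c) for c in updatable)
    return sql + " DO UPDATE SET " + assignments
```

The primary key columns name the conflict target, and EXCLUDED refers to
the row that was proposed for insertion.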
By default, we will tend to generate 3 UPSERTs for every 1 INSERT.
In addition to adding unit tests to make sure UPSERTs are properly
written, I used discrepancy_searcher.py --profile dmlonly, both with and
without --explain-only, to run tests. I made sure we were generating
syntactically valid UPSERT statements, and that the INSERT/UPSERT ratio
was roughly 1/3 after 100 statements.
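The 3-to-1 default can be pictured as a weighted random choice. This is
only a sketch: the weight table and the choose_statement_kind function are
assumptions, not the query profile's real structure.

```python
import random

# Hypothetical weights reflecting the 3:1 UPSERT-to-INSERT default.
CONFLICT_ACTION_WEIGHTS = {"UPSERT": 3, "INSERT": 1}


def choose_statement_kind(rng, weights=CONFLICT_ACTION_WEIGHTS):
    """Pick a key with probability proportional to its weight."""
    threshold = rng.uniform(0, sum(weights.values()))
    running = 0
    for kind, weight in sorted(weights.items()):
        running += weight
        if threshold <= running:
            return kind
    return kind  # guards against floating point edge cases


rng = random.Random(0)  # seeded so that runs are repeatable
kinds = [choose_statement_kind(rng) for _ in range(10000)]
upsert_share = kinds.count("UPSERT") / float(len(kinds))
```

Over a large sample, upsert_share should settle near 0.75.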
Change-Id: I6382f6ab22ba29c117e39a5d90592d3637df4b25
Reviewed-on: http://gerrit.cloudera.org:8080/5795
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
- Rework the discrepancy searcher to run DML statements. We do this by
using the query profile to choose a table, copy that table, and
generate a statement that will INSERT into that copy. We chose a slow
copy over other methods because INSERTing into a copy is a more
reliable test: it keeps table sizes from getting out of hand and
avoids a time-consuming replay to reproduce a particular statement.
- Introduce a statement generator stub. The real generator work is
tracked in IMPALA-4351 and IMPALA-4353. Here we simply generate a
basic INSERT INTO ... VALUES statement to make sure our general query
execution flow is working.
- Add query profile stub for DML statements (INSERT-only at this time).
Since we'll want INSERT INTO ... SELECT very soon, this inherits from
DefaultProfile. Also add building blocks for choosing random
statements in the DefaultProfile.
- Improve the concept of an "execution mode" and add new modes. Before,
we had "RAW", "CREATE_TABLE_AS", and "CREATE_VIEW_AS". The idea here
is that some random SELECT queries could be generated as "CREATE
TABLE|VIEW AS" at execution time, based on weights in the query
profile. First, we remove the use of raw string literals for this,
since raw string literals can be error-prone, and introduce a
StatementExecutionMode class to contain a namespace for the enumerated
statement execution modes. Second, we introduce a couple new execution
modes. The first is DML_SETUP: this is a DML statement that needs to
be run in both the test and reference databases concurrently. For our
purposes, it's the INSERT ... SELECT that copies data from the chosen
random table into the table copy. The second is DML_TEST: this is a
randomly-generated DML statement.
- Switch to using absolute imports in many places. There was a mix of
absolute and relative imports happening here, and they were causing
problems, especially when comparing data types. In Python,
<class 'db_types.Int'> != <class 'tests.comparison.db_types.Int'>.
Using
from __future__ import absolute_import
didn't seem to catch the relative import usage anyway, so I haven't
employed that.
- Rename some, but not nearly all, names from "query" to "statement".
Doing this is a rather large undertaking leading to much larger diffs
and testing (IMPALA-4602).
- Fix a handful of flake8 warnings. There are a bunch that went unfixed
for over- and under-indentation.
- Testing
o ./discrepancy_searcher.py runs with and without --explain-only, and
with --profile default and --profile dmlonly. For tpch_kudu data, it
seems sufficient to use a --timeout of about 300.
o Leopard run to make sure standard SELECT-only generation still works
o Generated random stress queries locally
o Generated random data locally
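The data type mismatch described in the import bullet above can be
reproduced by executing one source file under two module names. This
sketch is Python 3 only (it relies on importlib.util), and the temporary
db_types.py file is a hypothetical stand-in for the real module.

```python
import importlib.util
import os
import tempfile

# Write a tiny stand-in for db_types.py; the file content is hypothetical.
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, "db_types.py")
with open(module_path, "w") as f:
    f.write("class Int(object):\n    pass\n")


def load_as(name):
    """Execute the same source file under the given module name."""
    spec = importlib.util.spec_from_file_location(name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# One file, two module names: this mimics mixing import styles.
relative_style = load_as("db_types")
absolute_style = load_as("tests.comparison.db_types")

same_class = relative_style.Int is absolute_style.Int
is_instance = isinstance(relative_style.Int(), absolute_style.Int)
```

Both same_class and is_instance come out False, because each module name
gets its own class object even though the source is identical.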
Change-Id: Ia4c63a2223185d0e056cc5713796772e5d1b8414
Reviewed-on: http://gerrit.cloudera.org:8080/5387
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This patch adds support to the random query generator infrastructure to
model and write SQL INSERTs. It does not actually randomly generate
INSERTs at this time (tracked in IMPALA-4353 and umbrella task
IMPALA-3740) but does provide necessary building blocks to do so.
First, it's necessary to model the INSERTs as part of our data model.
This was done by taking the current notion of a Query and making it a
SelectQuery. We also then create an abstract Query containing some of
the more common methods and attributes. We then model an INSERT query,
INSERT clause, and VALUES clause (IMPALA-4343).
Second, it's necessary to test the basics of this data model. It made
sense to go ahead and implement the necessary SqlWriter methods to write
the SQL for these clauses (IMPALA-4354).
I could then use this writer with some existing and new tests that take
a query written into our data model and write the SQL, verifying they're
correct.
For INSERT into Kudu tables, the equivalent PostgreSQL queries need to
use "ON CONFLICT DO NOTHING", so all existing and new query tests verify
they can be written as PostgreSQL as well.
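A simplified sketch of how a modeled INSERT might be written for both
systems. The class names echo those used in these messages, but every
field, constructor argument, and method signature here (including the
ignore_conflicts flag) is an assumption made for illustration.

```python
class ValuesClause(object):
    """Rows of already-rendered SQL literals, e.g. [("1", "'a'")]."""

    def __init__(self, rows):
        self.rows = rows


class InsertStatement(object):
    def __init__(self, table, columns, values_clause,
                 ignore_conflicts=False):
        self.table = table
        self.columns = columns
        self.values_clause = values_clause
        # Assumed flag: set when duplicate keys should be skipped.
        self.ignore_conflicts = ignore_conflicts


class SqlWriter(object):
    def write_values(self, clause):
        return ", ".join(
            "({0})".format(", ".join(row)) for row in clause.rows)

    def write_insert(self, stmt):
        return "INSERT INTO {0} ({1}) VALUES {2}".format(
            stmt.table, ", ".join(stmt.columns),
            self.write_values(stmt.values_clause))


class PostgresqlSqlWriter(SqlWriter):
    def write_insert(self, stmt):
        sql = super(PostgresqlSqlWriter, self).write_insert(stmt)
        if stmt.ignore_conflicts:
            # The conflict action goes at the tail of the statement.
            sql += " ON CONFLICT DO NOTHING"
        return sql


stmt = InsertStatement("kudu_t", ["id", "name"],
                       ValuesClause([("1", "'a'"), ("2", "'b'")]),
                       ignore_conflicts=True)
impala_sql = SqlWriter().write_insert(stmt)
postgres_sql = PostgresqlSqlWriter().write_insert(stmt)
```

Keeping the statement model separate from the writers is what lets one
modeled query be rendered in more than one dialect.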
Testing:
- all the query generator tests pass
- I can run Leopard front_end.py and load older query generator reports,
browse them, and re-run failed queries
- I can run Leopard controller.py to actually do a query generator
run
- discrepancy_searcher.py --explain-only ran for hundreds of queries.
There were no problems writing the SELECT queries
Change-Id: I38e24da78c49e908449b35f0a6276ebe4236ddba
Reviewed-on: http://gerrit.cloudera.org:8080/5162
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.
For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.
Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace the existing ASF license text with the one given on the
website, or add it where it is missing.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Some recent commits broke the query generator leopard framework, for
example QueryResultComparator requires a different number of arguments.
Additional changes:
- Added better support for running the query generator in nested types
mode
- Keeping track of the number of queries that returned data
- Made it easier to control behavior from a central place by adding
flags to controller.py
Change-Id: I8f47c52097ccd53df4233b88eea887ce5fab1955
Reviewed-on: http://gerrit.cloudera.org:8080/1968
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
The major changes are:
1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
remote or local cluster. This also moves and consolidates some
Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That stuff was getting
messy.
Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
- Parsing of describe statements with nested types
- Random query generation that involves nested types
- Query flattening (converts a query for a dataset with nested types
to an equivalent query for a flattened dataset)
Change-Id: If013d104fb90864dcf0934ef92157b95e917e7e8
Reviewed-on: http://gerrit.cloudera.org:8080/1375
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Summary of changes:
1) Simplified type system. The old system was overly complicated for
the task of query generation. The modeling of types used to mirror
the types used in Impala. For simplicity, the new system uses only a
subset of types: Boolean, Char, Decimal, Float, Int, and Timestamp.
2) Functions now have fully typed signatures. Previously you had to
know which functions accepted which inputs, now arbitrary
permutations of functions can be generated. The chance of being
able to add a new function without needing to change the query
generation logic is much higher now.
3) Query generation profiles. The randomness of the previous version
was hardcoded in various places throughout the query generator.
Now there is a profile to determine which SQL features should be
used. There is still a lot of room for improvement in terms of
intuitiveness and documentation for configuring the profiles.
4) Greater diversity of queries. Besides the function permutations,
various restrictions to simplify query generation have been
removed. Also constants are used in queries.
5) Eliminate spinning and infinite loops. The old version would
sometimes "hope" that a generated SQL element would be compatible
with the context and, if not, it would try again, which would lead
to noticeable spinning and/or infinite loops.
6) Catch up with Impala 2.0 features: subqueries, analytics, and
Char/VarChar.
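As a rough illustration of how typed signatures let functions be combined
freely without retries, consider the sketch below. The SIGNATURES
catalogue, the CONSTANTS table, and generate_expr are all hypothetical and
far simpler than the real generator.

```python
import random

# Hypothetical catalogue: function name -> (return type, argument types).
SIGNATURES = {
    "abs": ("Int", ("Int",)),
    "length": ("Int", ("Char",)),
    "concat": ("Char", ("Char", "Char")),
    "upper": ("Char", ("Char",)),
}
# A constant of each type guarantees that generation can always finish.
CONSTANTS = {"Int": "1", "Char": "'a'"}


def generate_expr(return_type, rng, depth=2):
    """Return (sql_text, type) for a random expression of return_type."""
    candidates = sorted(
        name for name, (ret, _) in SIGNATURES.items() if ret == return_type)
    if depth == 0 or not candidates:
        return CONSTANTS[return_type], return_type
    name = rng.choice(candidates)
    # Each argument is generated to match the declared argument type, so
    # the result is type-correct by construction and never needs a retry.
    args = [generate_expr(arg_type, rng, depth - 1)[0]
            for arg_type in SIGNATURES[name][1]]
    return "{0}({1})".format(name, ", ".join(args)), return_type


sql, sql_type = generate_expr("Int", random.Random(1))
```

Adding a new function in this sketch only requires a new SIGNATURES entry,
which mirrors the goal stated in item 2.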
Change-Id: Ia25f4e85d6a06f7958a906aa42d9f90d63675bc0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5640
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins