10 Commits

Author SHA1 Message Date
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(). i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
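
As a hedged illustration of the laziness described above (not code from
this change), the sketch below runs on Python 3, where "builtins" is part
of the standard library; on Python 2 the same import is supplied by the
future package. All variable names are made up for the example.

```python
from builtins import range  # stdlib on Python 3; provided by 'future' on Python 2

values = [1, 2, 3, 4]

# map() returns a one-shot iterator on Python 3, so a second pass sees nothing.
lazy_squares = map(lambda v: v * v, values)
first_pass = list(lazy_squares)
second_pass = list(lazy_squares)

# Wrapping in list() restores the eager, reusable Python 2 behavior.
evens = list(filter(lambda v: v % 2 == 0, values))
pairs = list(zip(values, first_pass))
indices = list(range(0, 5))
```

Code that indexes, measures with len(), or iterates twice over the result
of map/filter/zip is the kind of location that needs the list() wrapper.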

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
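
A minimal sketch of the division distinction, assuming Python 3 semantics
(on Python 2 the same behavior is what "from __future__ import division"
at the top of the file enables). The function names are hypothetical and
do not come from the Impala tree.

```python
def midpoint_index(num_records):
    # An index must stay an integer, so use floor division explicitly.
    return num_records // 2


def average_row_size(total_bytes, num_records):
    # A ratio should keep its fractional part, so true division is correct here.
    return total_bytes / num_records
```

For example, midpoint_index(7) gives 3, while average_row_size(7, 2) gives
3.5 rather than the truncated 3 that old-style division would produce.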

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Greg Rahn
ba9b78c103 IMPALA-7759: Add Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate Levenshtein edit
distance. It is implemented as levenshtein() to match PostgreSQL in
both functionality and name, with a le_dst() alias added for Netezza
compatibility. Note that levenshtein() differs in functionality in
that if either value is NULL or both values are NULL, levenshtein()
returns NULL, where Netezza's le_dst() returns the length of the not
NULL value or 0 if both values are NULL.
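
The NULL handling described above can be sketched in Python, with None
standing in for SQL NULL. This is an illustrative model only, not the C++
built-in added by the patch; le_dst_netezza_style is a hypothetical name
for the Netezza behavior as described in this message.

```python
def levenshtein(a, b):
    """Edit distance; returns None if either input is None (SQL NULL)."""
    if a is None or b is None:
        return None
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def le_dst_netezza_style(a, b):
    """Model of Netezza's NULL handling as described in the commit message."""
    if a is None and b is None:
        return 0
    if a is None:
        return len(b)
    if b is None:
        return len(a)
    return levenshtein(a, b)
```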

Testing:
- Added unit tests to expr-test.cc
- Manual test on 966289 string pairs and results match PostgreSQL
- Added changes to qgen tests for PostgreSQL comparison

Change-Id: I549d33ab7cebfa10db2934461c8ec91e2cc1cdcb
Reviewed-on: http://gerrit.cloudera.org:8080/11793
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-12-02 10:39:44 +00:00
Michael Brown
db7facdee0 IMPALA-4351,IMPALA-4353: [qgen] randomly generate INSERT statements
- Generate INSERT statements that are either INSERT ... VALUES or INSERT
  ... SELECT

- On both types of INSERTs, we either insert into all columns, or into
  some column list. If the column list exists, all primary keys will be
  present, and 0 or more additional columns will also be in the list.
  The ordering of the column list is random.

- For INSERT ... SELECT, occasionally generate a WITH clause

- For INSERT ... VALUES, generate non-null constants for the primary
  keys, but for the non-primary keys, randomly generate a value
  expression.
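
The column-list rule above can be sketched as follows. This is a
simplified illustration, not the generator's actual code; the function
name and its parameters are assumptions.

```python
import random


def choose_insert_columns(primary_keys, other_columns, rng=random):
    """Pick all primary keys plus 0 or more other columns, in random order."""
    extra_count = rng.randint(0, len(other_columns))
    chosen = list(primary_keys) + rng.sample(list(other_columns), extra_count)
    rng.shuffle(chosen)
    return chosen
```

Passing a seeded random.Random instance as rng makes a run reproducible,
which helps when a generated statement needs to be regenerated.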

The type system in the random statement/query generator isn't
sophisticated enough to determine the implicit type of a SELECT item or a
value expression. It knows it will be some INT-based type, but not if it's
going to be a SMALLINT or a BIGINT. To get around this, the easiest
thing seems to be to explicitly cast the SELECT items or value
expressions to the columns' so-called exact_type attribute.
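
A rough sketch of the explicit-cast workaround, assuming each column is
represented as a (name, exact_type) pair. The helper names and this
representation are hypothetical; the real generator works on its own
expression objects rather than SQL strings.

```python
def cast_to_exact_type(expr_sql, exact_type):
    # Wrap the expression so its type no longer depends on implicit typing.
    return "CAST({0} AS {1})".format(expr_sql, exact_type)


def build_insert_values(table, columns, exprs):
    """columns: list of (name, exact_type); exprs: SQL text, one per column."""
    names = ", ".join(name for name, _ in columns)
    values = ", ".join(
        cast_to_exact_type(expr, exact_type)
        for (_, exact_type), expr in zip(columns, exprs))
    return "INSERT INTO {0} ({1}) VALUES ({2})".format(table, names, values)
```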

Much of the testing here involved running discrepancy_searcher.py
--explain-only on both tpch_kudu and a random HDFS table, using both the
default profile and DML-only profile. This was done to quickly find bugs
in the statement generation, as they tend to bubble up as analysis
errors. I expect to make other changes as follow on patches and more
random statements find small test issues.

For actual use against Kudu data, you need to migrate data from Kudu
into PostgreSQL 5 (instructions in tests/comparison/POSTGRES.txt) and run
something like:

tests/comparison/discrepancy_searcher.py \
  --use-postgresql \
  --postgresql-port 5433 \
  --profile dmlonly \
  --timeout 300 \
  --db-name tpch_kudu \
  --query-count 10

Change-Id: I842b41f0eed07ab30ec76d8fc3cdd5affb525af6
Reviewed-on: http://gerrit.cloudera.org:8080/5486
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-13 01:31:47 +00:00
Michael Brown
54665120cb IMPALA-4355: random query generator: modify statement execution flow to support DML
- Rework the discrepancy searcher to run DML statements. We do this by
  using the query profile to choose a table, copy that table, and
  generate a statement that will INSERT into that copy. We chose a slow
  copy over other methods because INSERTing into a copy is a more
  reliable test: it keeps table sizes from getting out of hand and avoids
  a time-consuming replay to reproduce a particular statement.

- Introduce a statement generator stub. The real generator work is
  tracked in IMPALA-4351 and IMPALA-4353. Here we simply generate a
  basic INSERT INTO ... VALUES statement to make sure our general query
  execution flow is working.

- Add query profile stub for DML statements (INSERT-only at this time).
  Since we'll want INSERT INTO ... SELECT very soon, this inherits from
  DefaultProfile. Also add building blocks for choosing random
  statements in the DefaultProfile.

- Improve the concept of an "execution mode" and add new modes. Before,
  we had "RAW", "CREATE_TABLE_AS", and "CREATE_VIEW_AS". The idea here
  is that some random SELECT queries could be generated as "CREATE
  TABLE|VIEW AS" at execution time, based on weights in the query
  profile. First, we remove the use of raw string literals for this,
  since raw string literals can be error-prone, and introduce a
  StatementExecutionMode class to contain a namespace for the enumerated
  statement execution modes. Second, we introduce a couple new execution
  modes. The first is DML_SETUP: this is a DML statement that needs to
  be run in both the test and reference databases concurrently. For our
  purposes, it's the INSERT ... SELECT that copies data from the chosen
  random table into the table copy. The second is DML_TEST: this is a
  randomly-generated DML statement.

- Switch to using absolute imports in many places. There was a mix of
  absolute and relative imports happening here, and they were causing
  problems, especially when comparing data types. In Python,
  <class 'db_types.Int'> != <class 'tests.comparison.db_types.Int'>.
  Using
    from __future__ import absolute_import
  didn't seem to catch the relative import usage anyway, so I haven't
  employed that.

- Rename some, but not nearly all, names from "query" to "statement".
  Doing this is a rather large undertaking leading to much larger diffs
  and testing (IMPALA-4602).

- Fix a handful of flake8 warnings. There are a bunch that went unfixed
  for over- and under-indentation.

- Testing
  o ./discrepancy_searcher.py runs with and without --explain-only, and
  with --profile default and --profile dmlonly. For tpch_kudu data, it
  seems sufficient to use a --timeout of about 300.
  o Leopard run to make sure standard SELECT-only generation still works
  o Generated random stress queries locally
  o Generated random data locally
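
The StatementExecutionMode idea described above, a class used as a
namespace instead of raw string literals, might look roughly like the
sketch below. The member names follow the modes named in this message,
but the exact layout and the is_dml_mode helper are assumptions.

```python
class StatementExecutionMode(object):
    """Namespace of execution modes, replacing raw strings such as "RAW"."""
    (
        RAW,
        CREATE_TABLE_AS,
        CREATE_VIEW_AS,
        DML_SETUP,
        DML_TEST,
    ) = range(5)


def is_dml_mode(mode):
    # Hypothetical helper: true for the two DML-related modes.
    return mode in (StatementExecutionMode.DML_SETUP,
                    StatementExecutionMode.DML_TEST)
```

With this layout, a misspelled mode such as StatementExecutionMode.DML_TSET
fails immediately with an AttributeError, whereas a misspelled string
literal would silently compare unequal.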

Change-Id: Ia4c63a2223185d0e056cc5713796772e5d1b8414
Reviewed-on: http://gerrit.cloudera.org:8080/5387
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-12 21:40:39 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace the existing ASF license text with the one given on the
   website, or add it if the file has none.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Michael Brown
e0fb432b82 IMPALA-3864: qgen: reduce likelihood of create_query() exceptions
1. Fix a bug in which the computation to produce the string for an
   exception was raising a TypeError. We fix the bug by changing how the
   string is built.

2. Fix a bug in which we tried to choose a relational function (defined
   as taking in more than one argument and returning a Boolean) and were
   looking for its weight in QueryProfile.weights.RELATIONAL_FUNCS, but
   the function wasn't defined in that dictionary. We fix the bug by
   defining weights for those functions.

3. Fix a bug in which QueryProfile.choose_func_signatures() was choosing
   a function without taking into account the set of functions in the
   signatures given to it. We fix the bug by pruning off the weights of
   functions that aren't included in provided signatures. We also add a
   note explaining how the weights defined are "best effort", since
   sometimes functions will be pruned.

4. Add Char signatures to LessThan, GreaterThan, LessThanOrEquals,
   GreaterThanOrEquals. Debugging #3 above brought this to my attention.

5. Make changes to aid in debugging or testing:

   a. Add funcs.Signature representation.
   b. Move the code in query_generator.__main__ to
      generate_queries_for_manual_inspection(); call it from __main__.
   c. Increase the number of fake columns when calling
      generate_queries_for_manual_inspection(), which is useful for
      testing.
   d. Rename a few variables. For some reason in the query_generator
      module there are a lot of overwritten variables, which makes
      debugging difficult.
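
The pruning described in item 3 can be sketched as below. This is a
simplified model, assuming signatures are (function name, argument types)
tuples and weights are a plain dict; the real QueryProfile structures
differ, and both function names are hypothetical.

```python
import random


def prune_weights(func_weights, signatures):
    """Keep only weights for functions that appear in the given signatures."""
    available = set(sig[0] for sig in signatures)
    return dict((name, weight) for name, weight in func_weights.items()
                if name in available)


def choose_weighted(weights, rng=random):
    """Pick a key with probability proportional to its weight."""
    if not weights:
        raise ValueError("no functions left to choose from after pruning")
    threshold = rng.uniform(0, sum(weights.values()))
    cumulative = 0
    for name, weight in sorted(weights.items()):
        cumulative += weight
        if threshold <= cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge
```

Because pruning removes entries, the remaining weights no longer sum to
what the profile author intended, which is why the weights are only
"best effort".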

Testing:

1. impala-python tests/comparison/query_generator.py produces far fewer
   exceptions. The ones for this bug are fixed, but see IMPALA-3890.

2. Full 3 and 6 hour runs of the query generator system don't show any
   obvious regressions in behavior.

Change-Id: Idd9434a92973176aefb99e11e039209cac3cea65
Reviewed-on: http://gerrit.cloudera.org:8080/3720
Tested-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
2016-07-22 11:03:33 -07:00
Casey Ching
177ac96e84 IMPALA-2857: Add "IS [NOT] DISTINCT FROM" to query generator
This also adds the operator version "<=>". Some bug fixes are also
included. I ran the query generator for a while and didn't see anything
unusual. Over 100 queries ran in a row without crashing. I verified that
all three of the new functions showed up in the generated query log.
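
For readers unfamiliar with these operators, the semantics can be modeled
in Python with None standing in for SQL NULL. This is an explanatory
sketch, not query generator code, and the function names are invented.

```python
def sql_equals(a, b):
    # Ordinary "=": any comparison involving NULL yields NULL.
    if a is None or b is None:
        return None
    return a == b


def is_not_distinct_from(a, b):
    # Null-safe equality, also spelled "<=>": two NULLs compare as equal.
    if a is None or b is None:
        return a is None and b is None
    return a == b


def is_distinct_from(a, b):
    return not is_not_distinct_from(a, b)
```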

Change-Id: I5df1165293ef22a680275dbcef13aaddd42c72bf
Reviewed-on: http://gerrit.cloudera.org:8080/1944
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-02-02 04:30:15 +00:00
Casey Ching
f288867833 Stress test: Various changes
The major changes are:

1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
   random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
   remote or local cluster. This also moves and consolidates some
   Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That stuff was getting
   messy.

Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-01-20 23:00:25 +00:00
casey
b013495e1d Misc updates to the query generator (part 1 of 2)
Summary of changes:

  1) Simplified type system. The old system was overly complicated for
     the task of query generation. The modeling of types used to mirror
     the types used in Impala. For simplicity, the new system uses only a
     subset of types: Boolean, Char, Decimal, Float, Int, and Timestamp.

  2) Functions now have fully typed signatures. Previously you had to
     know which functions accepted which inputs; now arbitrary
     permutations of functions can be generated. The chance of being
     able to add a new function without needing to change the query
     generation logic is much higher now.

  3) Query generation profiles. The randomness of the previous version
     was hardcoded in various places throughout the query generator.
     Now there is a profile to determine which SQL features should be
     used. There is still a lot of room for improvement in terms of
     intuitiveness and documentation for configuring the profiles.

  4) Greater diversity of queries. Besides the function permutations,
     various restrictions to simplify query generation have been
     removed. Also constants are used in queries.

  5) Eliminate spinning and infinite loops. The old version would
     sometimes "hope" that a generated SQL element would be compatible
     with the context and, if not, try again, which could lead to
     noticeable spinning and/or infinite loops.

  6) Catch up with Impala 2.0 features: subqueries, analytics, and
     Char/VarChar.
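
To illustrate item 2, fully typed signatures let the generator look up
candidate functions by type instead of hardcoding which functions fit
where. The sketch below is a toy model; the Signature tuple, the sample
entries, and signatures_returning are assumptions for illustration only.

```python
from collections import namedtuple

# A function signature: its name, return type, and argument types.
Signature = namedtuple("Signature", ["func_name", "return_type", "arg_types"])

SIGNATURES = [
    Signature("LessThan", "Boolean", ("Int", "Int")),
    Signature("LessThan", "Boolean", ("Char", "Char")),
    Signature("Length", "Int", ("Char",)),
    Signature("Plus", "Int", ("Int", "Int")),
]


def signatures_returning(return_type, signatures=SIGNATURES):
    """All signatures that could fill a slot needing the given type."""
    return [sig for sig in signatures if sig.return_type == return_type]
```

Adding a new function then means adding entries to the signature list,
and any slot that needs its return type can pick it up without changes to
the generation logic.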

Change-Id: Ia25f4e85d6a06f7958a906aa42d9f90d63675bc0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5640
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
2014-12-19 03:30:44 -08:00