IMPALA-14514: Handle serializing bytes in bin/run-workload.py

On Python 3, when Impyla receives a string result that is not
valid UTF-8, it returns the value as bytes. TPC-DS Q30 at scale 20
produces such a result, so bin/run-workload.py can fail when it
dumps the results to JSON.

This modifies CustomJSONEncoder to serialize bytes by decoding
them as UTF-8 with errors="backslashreplace", so invalid byte
sequences are written out as backslash escapes.

Testing:
 - Ran bin/run-workload.py against TPC-DS scale 20

Change-Id: Ibe31c656de4fc65f8580c7b3b49bf655b8a5ecea
Reviewed-on: http://gerrit.cloudera.org:8080/23602
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Joe McDonnell
2025-10-26 13:39:28 -07:00
parent c4c9adf592
commit 001263f58a

@@ -145,6 +145,11 @@ class CustomJSONEncoder(json.JSONEncoder):
     if isinstance(obj, datetime):
       # Convert datetime into an standard iso string
       return obj.isoformat()
+    if isinstance(obj, bytes):
+      # Impyla can leave a string value as bytes when it is unable to decode it to UTF-8.
+      # TPC-DS has queries that produce non-UTF-8 results (e.g. Q30 on scale 20)
+      # Convert bytes to strings to make JSON encoding work
+      return obj.decode(encoding="utf-8", errors="backslashreplace")
     elif isinstance(obj, (Query, HiveQueryResult, QueryExecConfig, TableFormatInfo)):
       # Serialize these objects manually by returning their __dict__ methods.
       return obj.__dict__
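
The bytes branch in this hunk can be tried in isolation. The following is a minimal sketch, not the code from bin/run-workload.py: BytesSafeEncoder and the sample row are hypothetical stand-ins, and only the decode call is taken from the change.

```python
import json


class BytesSafeEncoder(json.JSONEncoder):
  """Hypothetical stand-in for CustomJSONEncoder, reduced to the bytes branch."""

  def default(self, obj):
    if isinstance(obj, bytes):
      # Same call as in the change: bytes that are not valid UTF-8 are
      # replaced with literal backslash escapes such as \xff.
      return obj.decode(encoding="utf-8", errors="backslashreplace")
    return super().default(obj)


# Hypothetical result value: ASCII text followed by a byte that is not valid UTF-8.
row = {"c_last_name": b"Smith\xff"}

# The stock encoder rejects bytes values with a TypeError.
try:
  json.dumps(row)
  plain_dump_failed = False
except TypeError:
  plain_dump_failed = True

# With the bytes branch in place, the value is written as an ordinary string.
encoded = json.dumps(row, cls=BytesSafeEncoder)
decoded = json.loads(encoded)
```

After a round trip through json.loads, the value is the text Smith followed by the four characters \xff rather than the original byte, so the conversion is meant for reporting, not for recovering the exact bytes.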