IMPALA-14144: Make pip_download.py more tolerant with PEP 503 simple pages

Recent package updates on PyPI have introduced package description pages that contain extra newlines, in addition to the newline character separating the complete URLs for the different package versions. These extra newlines usually show up before the closing angle bracket character ('>') of the opening half of the anchor tag.

This broke pip_download.py, because it uses a regex to extract various data items (file name, download path, hash algorithm and hash value) from the download page. The regex attempts to match the whole anchor element up to and including the closing '</a>' tag, which fails because the '.' in a regex matches any character except a newline. This failure causes all lines in the package description page to be rejected as not matching the search pattern, so a package with a page in this format can never be recognized.

This patch works around the formatting issue by adding the flag re.DOTALL to the regex search call, which makes the regex '.' character match the newline as well, so that the regex can match the complete anchor element across a line break.

Change-Id: Ia56f87c54e0d9cad97b7e0ffbcce8f4c0f715c44
Reviewed-on: http://gerrit.cloudera.org:8080/23026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-12-19 18:12:08 -05:00 · 2025-06-13 23:01:14 +02:00
parent 2560487700
commit fae42323da
1 changed files with 1 additions and 1 deletions
--- a/infra/python/deps/pip_download.py
+++ b/infra/python/deps/pip_download.py
@@ -102,7 +102,7 @@ def get_package_info(pkg_name, pkg_version, is_canceled=None):
  pkg_info = subprocess.check_output(
      ["wget", "-q", "-O", "-", url], universal_newlines=True)
  regex = r'<a .*?href=\".*?packages/(.*?)#(.*?)=(.*?)\".*?>(.*?)<\/a>'
-  for match in re.finditer(regex, pkg_info):
+  for match in re.finditer(regex, pkg_info, flags=re.DOTALL):
    path = match.group(1)
    hash_algorithm = match.group(2)
    digest = match.group(3)
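
The effect of the flag can be reproduced outside pip_download.py. The following is a minimal sketch, not part of the patch: the regex is copied from the hunk above, while SAMPLE_PAGE, its example.invalid URLs, and the extract() helper are hypothetical and exist only for illustration. The second anchor in the sample has a newline before the closing '>', which is the formatting described in the commit message.

```python
import re

# Pattern copied from pip_download.py (the \" and \/ escapes are redundant but harmless).
REGEX = r'<a .*?href=\".*?packages/(.*?)#(.*?)=(.*?)\".*?>(.*?)<\/a>'

# Hypothetical page fragment (made-up host, paths and digests).
# The second anchor has a newline before the '>' that closes the opening tag.
SAMPLE_PAGE = (
    '<a href="https://files.example.invalid/packages/ab/cd/demo-1.0.tar.gz'
    '#sha256=0123abcd">demo-1.0.tar.gz</a><br/>\n'
    '<a href="https://files.example.invalid/packages/ef/01/demo-1.1.tar.gz'
    '#sha256=4567ef89"\n'
    '>demo-1.1.tar.gz</a><br/>\n'
)


def extract(page, flags=0):
    """Return (path, hash_algorithm, digest, file_name) tuples found in page."""
    return [m.groups() for m in re.finditer(REGEX, page, flags=flags)]


# Without re.DOTALL, '.' cannot cross the newline, so the second anchor is skipped.
without_dotall = extract(SAMPLE_PAGE)

# With re.DOTALL, '.' also matches '\n', so both anchors are found.
with_dotall = extract(SAMPLE_PAGE, flags=re.DOTALL)

print(len(without_dotall), len(with_dotall))
for path, hash_algorithm, digest, file_name in with_dotall:
    print(file_name, hash_algorithm, digest, path)
```

One design consequence worth keeping in mind: with re.DOTALL the lazy '.*?' groups may also span several lines when an anchor does not fit the expected shape, so the sketch relies on every anchor carrying a "packages/...#algorithm=digest" href, as the sample does.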