After recent updates to congress-legislator historical start/end dates, we began getting:
Multiple matches of name Slaughter (VA-R; 1991-01-03) to legislators (excludes set([])).
[h1-102.1991] Missing bioguide ID and name lookup failed for Slaughter (VA-R on 1991-01-03 12:02:00)
Exception: No bioguide ID for Slaughter (VA-R)
But there weren't really multiple legislators matching, just multiple terms of the same legislator.
(There are now also other, genuinely new cases of multiple legislators matching.)
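A minimal sketch of the idea, not the actual change: collect matches keyed by legislator rather than by term, so that two overlapping terms of one person count as a single match. The record layout, the function names and the ID "X000001" are hypothetical, and the term dates are made up for illustration. Written for Python 3.

```python
from datetime import date

def term_matches(term, state, party, when):
    """True if a single term covers the given state, party and date."""
    return (term["state"] == state and term["party"] == party
            and term["start"] <= when <= term["end"])

def match_legislators(legislators, last_name, state, party, when):
    """Return the distinct legislators with at least one matching term."""
    matches = {}
    for leg in legislators:
        if leg["last_name"] != last_name:
            continue
        if any(term_matches(t, state, party, when) for t in leg["terms"]):
            # Key on the legislator ID so several matching terms count once.
            matches[leg["bioguide"]] = leg
    return list(matches.values())

# Hypothetical data: one legislator whose two terms both touch the vote date.
legislators = [
    {"bioguide": "X000001", "last_name": "Slaughter", "terms": [
        {"state": "VA", "party": "R",
         "start": date(1989, 1, 3), "end": date(1991, 1, 3)},
        {"state": "VA", "party": "R",
         "start": date(1991, 1, 3), "end": date(1993, 1, 3)},
    ]},
]
vote_date = date(1991, 1, 3)

# Counting per term reports two hits for the same person.
term_hits = [t for leg in legislators for t in leg["terms"]
             if term_matches(t, "VA", "R", vote_date)]

# Counting per legislator reports one.
found = match_legislators(legislators, "Slaughter", "VA", "R", vote_date)
```

Cases where several different legislators really do match still come back as more than one result and need separate handling.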
There is no longer a separate amendments scraper. Amendments are saved as a part of importing bills. Amendments to treaties are no longer available.
Some of this work was done by @crdunwel.
When --diff is specified (for bills & votes), instead of writing output files to
disk, we run a diff over the existing file and the new content and display the
diff. This is handy for testing.
At the same time, I'm removing my previous preserve_update_time flag, which
I had been using for a similar purpose; the new method is much easier.
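A rough sketch of the --diff behaviour described above, using only the standard library. The helper name write_or_diff and its show_diff parameter are assumptions for illustration and are not the names used in the scraper. Written for Python 3.

```python
import difflib
import os
import tempfile

def write_or_diff(path, new_content, show_diff=False):
    """Write new_content to path, or only print a diff against the file on disk."""
    if not show_diff:
        with open(path, "wb") as f:
            f.write(new_content.encode("utf-8"))
        return None

    # Diff mode: read what is on disk (if anything) and leave it untouched.
    old_content = ""
    if os.path.exists(path):
        with open(path, "rb") as f:
            old_content = f.read().decode("utf-8")

    diff = difflib.unified_diff(
        old_content.splitlines(keepends=True),
        new_content.splitlines(keepends=True),
        fromfile=path, tofile=path + " (new)")
    text = "".join(diff)
    print(text, end="")
    return text

# Usage: the first call writes the file, the second only shows the difference.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    write_or_diff(path, '{"a": 1}\n')
    shown = write_or_diff(path, '{"a": 2}\n', show_diff=True)
    with open(path, "rb") as f:
        on_disk = f.read().decode("utf-8")
```

Because diff mode never writes, a test run can be repeated against the same output directory without changing it.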
Reverts 5122ad6f966ba5899a0758ed92d81ca779314c7f
In db1d414e47 I added a fix for parsing "House Oversight" in the
105th Congress, but that broke the separate handling of the same problem for the 104th
Congress, which I had added earlier in c70396dbd3.
Not sure how this broke just one bill, but here it is:
[hres1-104] Exception:
Traceback (most recent call last):
  File "tasks/utils.py", line 144, in process_set
    results = fetch_func(id, options, *extra_args)
  File "tasks/bill_info.py", line 27, in fetch_bill
    if not utils.committee_names: utils.fetch_committee_names(congress, options)
  File "tasks/utils.py", line 559, in fetch_committee_names
    committee_names["House Oversight"] = committee_names["House House Oversight"]
KeyError: 'House House Oversight'
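One way to avoid this kind of KeyError is to create the alias only when the source key exists. This is a sketch and not necessarily the fix made here; alias_committee is a hypothetical helper and "XXXX" is a placeholder value, not a real committee code.

```python
def alias_committee(committee_names, alias, candidates):
    """Point alias at the first candidate key that exists; otherwise do nothing."""
    for key in candidates:
        if key in committee_names:
            committee_names[alias] = committee_names[key]
            return True
    return False

# A name table that has the doubled "House House Oversight" key.
with_doubled = {"House House Oversight": "XXXX"}
added = alias_committee(with_doubled, "House Oversight",
                        ["House House Oversight"])

# A name table that lacks it: the call is a no-op instead of raising.
without_doubled = {"House Oversight": "XXXX"}
skipped = alias_committee(without_doubled, "House Oversight",
                          ["House House Oversight"])
```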
When downloading non-binary files, scrapelib/requests automatically does character-set detection,
but we've been ignoring it and using the raw bytes. It's probably incorrect to pass raw bytes
to JSON output, and it started causing errors creating XML output (lxml complained about
non-XML-compatible characters). Instead, we should always work in unicode when dealing with text.
This fix changes utils.download() to return unicode instances rather than str instances when
'binary' is not set to True.
Files cached to disk are now always stored in UTF-8, because when we load cached
files we no longer know the character set that scrapelib/requests detected. Unfortunately
this means that some existing cache files may be invalid: those where the original URL
returned non-UTF-8 (probably ISO-8859-1) content that actually contained non-UTF-8
characters, which does not seem to be very common. Best to just clear your cache directory.
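To illustrate the caching rule, here is a small sketch that decodes a response body with the detected character set and always writes the cache as UTF-8. The helpers to_text, write_cache and read_cache are hypothetical and do not call scrapelib. Written for Python 3, where text is str and raw data is bytes.

```python
import os
import tempfile

def to_text(raw_bytes, detected_charset):
    """Decode a response body using the charset detected at download time."""
    return raw_bytes.decode(detected_charset)

def write_cache(path, body, binary=False):
    """Binary bodies are stored as-is; text bodies are always stored as UTF-8."""
    data = body if binary else body.encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)

def read_cache(path, binary=False):
    """Text is always read back as UTF-8, so the original charset is not needed."""
    with open(path, "rb") as f:
        data = f.read()
    return data if binary else data.decode("utf-8")

# A body served as ISO-8859-1 containing a non-ASCII character.
raw = "Pe\u00f1a".encode("iso-8859-1")
text = to_text(raw, "iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    write_cache(path, text)
    cached = read_cache(path)
    with open(path, "rb") as f:
        stored = f.read()

# An old cache file holding the raw ISO-8859-1 bytes would not decode as UTF-8.
try:
    raw.decode("utf-8")
    old_cache_valid = True
except UnicodeDecodeError:
    old_cache_valid = False
```

The last check shows why cache files written before this change can be invalid and are best cleared.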
The scraper has always worked in bytes. download() was originally written to get the
text content using str(response), which was implicitly encoding the unicode response
back to bytes using the ASCII encoding, which can fail. In a1da46b78f
I fixed that by getting the text content using response.bytes directly, but I wish
I had just made it 'response' alone, which gives the unicode content.
Reverts 11a504cedc (recent work-around for char set issue)
Fixes #82
Before the 108th Congress, the vote XML never had a bioguide ID.
Starting with the 108th, there are sporadic gaps. This commit
creates a fallback that looks up the legislator by name/state/date
when no bioguide ID is present.
So far the 107th Congress maps names to IDs completely. Coverage
for earlier Congresses may not be as good yet. Members with changed
names or changed parties cause some problems.
This also replaces the hard-coded fix for GK Butterfield's "000000"
bioguide ID in several votes, since this is a more generic solution.
See issue #46.
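The fallback can be sketched roughly as follows. This is an illustration under assumptions, not the code in this commit: resolve_bioguide, the voter record layout, the lookup callback and the ID "X000002" are all hypothetical, and treating "000000" as a missing ID is an assumption about how the generic fallback covers that case.

```python
from datetime import date

def resolve_bioguide(voter, lookup_by_name, placeholders=("000000",)):
    """Use the ID from the vote XML if usable, else fall back to a name lookup."""
    bioguide = voter.get("bioguide")
    if bioguide and bioguide not in placeholders:
        return bioguide

    # Fallback: name/state/date lookup must yield exactly one legislator ID.
    matches = lookup_by_name(voter["name"], voter["state"], voter["date"])
    if len(matches) != 1:
        raise LookupError("No unique bioguide ID for %s (%s)"
                          % (voter["name"], voter["state"]))
    return matches[0]

def fake_lookup(name, state, when):
    """Stand-in for the real name index; returns candidate IDs."""
    table = {("Butterfield", "NC"): ["X000002"]}
    return table.get((name, state), [])

with_id = resolve_bioguide(
    {"bioguide": "X000009", "name": "Someone", "state": "NC",
     "date": date(2005, 1, 4)}, fake_lookup)
placeholder_id = resolve_bioguide(
    {"bioguide": "000000", "name": "Butterfield", "state": "NC",
     "date": date(2005, 1, 4)}, fake_lookup)
missing_id = resolve_bioguide(
    {"name": "Butterfield", "state": "NC", "date": date(2005, 1, 4)},
    fake_lookup)
```

Raising when the lookup is not unique keeps ambiguous names from being silently mapped to the wrong member.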