108 Commits

Author SHA1 Message Date
Joshua Tauberer
dc041db5f6 fix parsing of historical vote legislator lookup to not mind if a legislator has two terms on Jan 3
After recent updates to congress-legislator historical start/end dates, we began getting:

Multiple matches of name Slaughter (VA-R; 1991-01-03) to legislators (excludes set([])).
[h1-102.1991] Missing bioguide ID and name lookup failed for Slaughter (VA-R on 1991-01-03 12:02:00)
Exception: No bioguide ID for Slaughter (VA-R)

But there weren't really multiple legislators matching, just multiple terms.

(There are other new cases of multiple legislators matching now though too.)
2016-12-25 10:16:32 -05:00
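The difference between "multiple terms" and "multiple legislators" in the commit above can be sketched in Python 3 as follows. This is an illustration only, not the project's code: the record layout (a `bioguide` key, a `last_name` key, and `terms` with ISO `start`/`end` dates), the function name `match_legislators`, and the ID `X000000` are all assumptions made for the example.

```python
from datetime import date


def match_legislators(legislators, last_name, state, party, when):
    """Return the distinct legislators with a term covering `when`.

    Matches are keyed by ID, so a legislator whose old term ends and whose
    new term starts on the same day (e.g. Jan 3) is counted only once.
    """
    matches = {}
    for leg in legislators:
        if leg["last_name"] != last_name:
            continue
        for term in leg["terms"]:
            start = date.fromisoformat(term["start"])
            end = date.fromisoformat(term["end"])
            if (term["state"] == state and term["party"] == party
                    and start <= when <= end):
                matches[leg["bioguide"]] = leg
                break  # one matching term is enough for this legislator
    return list(matches.values())


# Hypothetical record: two terms that both touch 1991-01-03.
slaughter = {
    "bioguide": "X000000",  # made-up ID for illustration
    "last_name": "Slaughter",
    "terms": [
        {"state": "VA", "party": "R", "start": "1989-01-03", "end": "1991-01-03"},
        {"state": "VA", "party": "R", "start": "1991-01-03", "end": "1993-01-03"},
    ],
}
found = match_legislators([slaughter], "Slaughter", "VA", "R", date(1991, 1, 3))
print(len(found))
```

Counting matching terms instead of matching people would report two hits for this record, which is the false "multiple matches" error quoted in the commit message.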
Joshua Tauberer
8d56a630dd drop THOMAS IDs from output, replace with bioguide in XML outputs 2016-12-03 14:12:53 -05:00
David Cook
deab2f384d Use urlretrieve() instead of wget, speed up FDsys 2016-08-29 18:57:20 -05:00
Joshua Tauberer
d58df048d3 replace THOMAS scraper with USGPO bill status XML importer
There is no longer a separate amendments scraper. Amendments are saved as a part of importing bills. Amendments to treaties are no longer available.

some of this work was done by @crdunwel
2016-07-01 08:47:57 -04:00
Joshua Tauberer
79f1994089 re-write the FDSys scraper
It now can download bulk data files using the bulk data sitemaps.

It's also a bit cleaner / more maintainable.

Dropped some code no one was using.
2016-03-22 09:53:27 -04:00
Joshua Tauberer
452e56cab5 when a THOMAS ID is not found, correct the congress.gov URL that the script suggests to figure out who it is 2015-01-17 17:09:36 -05:00
Joshua Tauberer
fa07fa32a3 Merge pull request #140 from unitedstates/scrapelib
update scrapelib to 0.10, remove follow_robots kwarg
2015-01-17 17:08:59 -05:00
Joshua Tauberer
4272b83bb4 improvements for my --diff command
Fast-path if output hasn't changed. Make my run script chmod'd executable.
2014-12-10 12:50:22 -05:00
Eric Mill
5d1272506e update scrapelib to 0.10, remove non-backwards-compatible follow_robots arg. 2014-07-31 23:05:23 -04:00
Joshua Tauberer
960f0c7957 add a --diff option for bills and votes to test if output has changed
When --diff is specified (for bills & votes), instead of writing output files to
disk, we run a diff over the existing file and the new content and display the
diff. This is handy for testing.

At the same time I'm removing my previous preserve_update_time flag which
I had been using for a similar purpose, but this new method is much easier.

reverts 5122ad6f966ba5899a0758ed92d81ca779314c7f
2014-06-18 09:59:06 -04:00
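A --diff mode of the kind described in the commit above could look roughly like this in Python 3. The function name `write_or_diff` and its signature are hypothetical; the sketch uses the standard-library `difflib` module and does not claim to match how the project's option is implemented.

```python
import difflib
import os
import tempfile


def write_or_diff(path, new_content, diff=False):
    """Write new_content to path, or, when diff is True, leave the file
    alone and return a unified diff of the existing file vs. new_content."""
    if diff:
        old_content = ""
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                old_content = f.read()
        if old_content == new_content:
            return ""  # nothing changed, so skip computing a diff
        return "".join(difflib.unified_diff(
            old_content.splitlines(keepends=True),
            new_content.splitlines(keepends=True),
            fromfile=path, tofile=path + " (new)"))
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_content)
    return ""


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.json")
    write_or_diff(target, '{"status": "REFERRED"}\n')
    report = write_or_diff(target, '{"status": "PASSED"}\n', diff=True)
    print(report)
```

Because diff mode never writes, the existing output stays untouched and the same run can be repeated while testing a parser change.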
Eric Mill
bdf6a9aade Some refactoring and cleanup, extra console output 2014-05-09 12:35:49 -04:00
Eric Mill
9e67097514 mkdir_p the dir before trying to write a pickle file 2014-05-09 11:59:33 -04:00
Will Van Wazer
04d494856f Automated PEP8 refactoring with autopep8. 2014-04-28 22:39:50 -04:00
Richard
71b90796dc Windows 'which wget' command fixed in utils.py 2014-04-14 18:25:37 -05:00
Richard
c76b3aba87 Fixed Issue #120, Windows problem with 'which wget' in utils.py 2014-04-14 18:12:26 -05:00
Drew Vogel
c72245f363 Added a 30 second default HTTP read timeout. 2014-03-31 17:25:01 -04:00
Drew Vogel
376cc734e0 Added a timeout option for http reads.
This requires scrapelib==0.9.1 or higher.
2014-03-31 13:57:23 -04:00
Joshua Tauberer
6f8168fd13 I broke parsing hres1-104 because of "House House Oversight"
In db1d414e47 I added a fix for parsing "House Oversight" in the
105th Congress, but this broke a different way of handling the same problem for the 104th
Congress which I had previously done in c70396dbd3.

Not sure how this just broke one bill, but here it is:

[hres1-104] Exception:

Traceback (most recent call last):
  File "tasks/utils.py", line 144, in process_set
    results = fetch_func(id, options, *extra_args)
  File "tasks/bill_info.py", line 27, in fetch_bill
    if not utils.committee_names: utils.fetch_committee_names(congress, options)
  File "tasks/utils.py", line 559, in fetch_committee_names
    committee_names["House Oversight"] = committee_names["House House Oversight"]
KeyError: 'House House Oversight'
2014-03-20 19:42:40 -04:00
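The KeyError above comes from aliasing in one direction only. A direction-agnostic version might look like the following Python 3 sketch. The function name `alias_committee_names` and the placeholder committee ID are assumptions for illustration, not the code in tasks/utils.py.

```python
def alias_committee_names(committee_names):
    """Make 'House Oversight' and 'House House Oversight' interchangeable,
    whichever of the two spellings the source data happened to use."""
    pairs = [("House Oversight", "House House Oversight")]
    for a, b in pairs:
        if a in committee_names and b not in committee_names:
            committee_names[b] = committee_names[a]
        elif b in committee_names and a not in committee_names:
            committee_names[a] = committee_names[b]
        # If neither (or both) are present there is nothing to alias.
    return committee_names


# One Congress lists only the short name, another only the doubled one.
short_only = alias_committee_names({"House Oversight": "example-committee-id"})
doubled_only = alias_committee_names({"House House Oversight": "example-committee-id"})
print(sorted(short_only), sorted(doubled_only))
```

Checking membership before copying means the same code path works for both the 104th-style and 105th-style names without raising.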
Eric Mill
62bf2df2bc assure wget exists somewhere, anywhere 2014-03-13 11:59:10 -04:00
Joshua Tauberer
7c21b296e4 don't try to use wget to download from FDSys if wget isn't available
On systems without wget, FDSys downloads were broken by 9b01c418c2.

Hopefully fixes the second issue in #113.
2014-03-13 10:43:53 -04:00
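One way to make wget optional is to probe for it and fall back to the standard library. The Python 3 sketch below is illustrative: `have_wget` and `fetch` are hypothetical names, the wget flags are a plausible choice and not the project's actual invocation, and the `which` parameter exists only so the check can be exercised without touching the real PATH.

```python
import os
import shutil
import subprocess
import urllib.request


def have_wget(which=shutil.which):
    """True if a wget executable can be found on PATH."""
    return which("wget") is not None


def fetch(url, dest, which=shutil.which):
    """Download url to dest, using wget only when it is installed."""
    if have_wget(which):
        result = subprocess.run(["wget", "-q", "-O", dest, url])
        if result.returncode != 0:
            # wget -O can leave an empty file behind on failure; remove it
            # so a later run does not mistake it for a cached download.
            if os.path.exists(dest) and os.path.getsize(dest) == 0:
                os.remove(dest)
            return False
        return True
    # No wget available: fall back to the standard library.
    urllib.request.urlretrieve(url, dest)
    return True
```

`shutil.which` also works on Windows, which avoids shelling out to a `which` command that may not exist there.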
Joshua Tauberer
f7171bf813 for parsing old House votes where bioguide ID is missing, also use other_names in congress-legislators 2014-02-25 08:06:06 -05:00
Joshua Tauberer
b4d96c43a9 for parsing old House votes where bioguide ID is missing, ignore party in congress-legislators for more Members that we know switched party mid-term 2014-02-25 08:06:06 -05:00
Joshua Tauberer
db1d414e47 House House Oversight (=House Admin) appears as just 'House Oversight' in the 105th Congress 2014-02-25 08:06:06 -05:00
Joshua Tauberer
521688ea51 create cache/congress-legislators if the directory doesn't exist and we are about to write a pickle file there 2014-02-25 08:06:06 -05:00
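Creating the directory just before writing the pickle can be sketched in Python 3 like this. `write_pickle` is a hypothetical helper name used for illustration; the project's own helper may differ.

```python
import os
import pickle
import tempfile


def write_pickle(path, obj):
    """Pickle obj to path, creating the parent directory first if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cache", "congress-legislators", "example.pickle")
    write_pickle(target, {"ok": True})
    with open(target, "rb") as f:
        restored = pickle.load(f)
print(restored)
```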
Joshua Tauberer
d538ee36ba utils: code cleanup in get_person_id 2014-01-30 11:48:24 -05:00
Joshua Tauberer
d3a0cf105f utils: before updating congress-legislators from the network, check if the environment variable UPDATE_CONGRESS_LEGISLATORS is not set to NO 2014-01-30 11:48:24 -05:00
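The environment-variable check described above amounts to something like the following Python 3 sketch. The function name `should_update_legislators` is hypothetical, and the exact, case-sensitive comparison with "NO" is an assumption based on the commit message.

```python
import os


def should_update_legislators(environ=os.environ):
    """Allow a network update unless UPDATE_CONGRESS_LEGISLATORS is 'NO'."""
    return environ.get("UPDATE_CONGRESS_LEGISLATORS") != "NO"


print(should_update_legislators({}))
print(should_update_legislators({"UPDATE_CONGRESS_LEGISLATORS": "NO"}))
```

Passing the environment as a parameter keeps the check easy to test; callers would normally rely on the `os.environ` default.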
Gordon P. Hemsley
53ab972c9b Fix #98: Use the stricter and more reliable cache format that congress-legislators does. 2013-11-15 18:13:49 -05:00
Gordon P. Hemsley
59cf2bc3d6 Revamp the DeepBills scraper to actually allow partial requests. 2013-11-14 17:32:45 -05:00
Joshua Tauberer
7ae6e2060e when downloading from FDSys using my 'wget' fast path, delete the empty file that results when wget fails to download a resource 2013-11-07 17:18:33 -05:00
Joshua Tauberer
20346a1749 yaml changes hadn't been tested when a config.yml file is present, wasn't working 2013-11-05 09:31:22 -05:00
Gordon P. Hemsley
38df8d9242 Rename yaml_load() to direct_yaml_load() and rename cached_yaml_load() to yaml_load(). 2013-11-01 16:01:03 -04:00
Gordon P. Hemsley
a45acddb69 Add documentation comments for new utils functions. 2013-11-01 13:44:07 -04:00
Gordon P. Hemsley
e1045ec473 Improve caching of YAML files. Simplify creation of new legislator maps and create person-congresses and congress-persons maps. Mark get_govtrack_person_id() and UnmatchedIdentifer as deprecated in favor of get_person_id(). 2013-11-01 12:40:14 -04:00
Eric Mill
275bf4944d Merge pull request #87 from dcloud/master
Pull committee names, ids from bill actions; simplify detection of referral action types
2013-10-07 13:11:12 -07:00
Daniel Cloud
4f01988cf7 Refactor bill_info to look for committee names separately from identifying referrals. Updated utils.fetch_committee_names to alias so we still get matches when parentheticals are omitted. 2013-10-07 10:03:37 -04:00
Joshua Tauberer
7f28281d16 voteview: make sure we're using the right .ord/.dtl URLs by scraping the .htm page listed on voteview.com for <a> hrefs, and report date parsing problems 2013-10-06 19:08:37 -04:00
Joshua Tauberer
6ea1daa7fe voteview: determine which session of Congress the vote occurs in using GovTrack's sessions.tsv, hook into vote_info.output_vote() for writing JSON and XML, and separate the president's position from Member votes 2013-10-06 18:24:31 -04:00
Eric Mill
e479f25c28 Handle nomination numbers with sub-numbers 2013-09-09 17:28:42 -04:00
Gordon P. Hemsley
3126407e74 Remove return value from generate_person_id_map(). 2013-09-07 15:42:47 -04:00
Gordon P. Hemsley
12063e8fb4 Get any person ID from any other person ID. 2013-09-07 15:06:01 -04:00
Joshua Tauberer
c70396dbd3 in the 104th Congress, 'House Oversight' and 'House House Oversight' are the same 2013-08-19 13:33:42 -04:00
Joshua Tauberer
1876e932a8 forcing some attribute orders in GovTrack-style bill XML to make diffs easier 2013-08-19 13:33:42 -04:00
Eric Mill
ad746608a6 Remove the timestamp from the email subject; it prevented conversation threading 2013-08-10 13:23:44 -04:00
Joshua Tauberer
23c92fa22e when reading from the cache directory, look for .zip files containing cached directories and load cached files from the .zip if they exist 2013-07-22 09:36:19 -04:00
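Reading a cached file out of a zipped directory might be sketched in Python 3 as below. The layout is an assumption made for the example: a directory `cache/113/bills/` may be replaced by `cache/113/bills.zip` whose member names are relative to that directory. The function name `read_cached` is hypothetical.

```python
import os
import zipfile


def read_cached(cache_dir, rel_path):
    """Return the cached bytes for rel_path, or None if it is not cached.

    A plain file on disk wins. Otherwise each parent directory of rel_path
    is checked for a "<dir>.zip" archive standing in for that directory.
    """
    full = os.path.join(cache_dir, rel_path)
    if os.path.exists(full):
        with open(full, "rb") as f:
            return f.read()
    parts = rel_path.split("/")
    for i in range(len(parts) - 1, 0, -1):
        archive = os.path.join(cache_dir, *parts[:i]) + ".zip"
        if os.path.exists(archive):
            member = "/".join(parts[i:])
            with zipfile.ZipFile(archive) as zf:
                if member in zf.namelist():
                    return zf.read(member)
    return None
```

With this layout, old Congresses can be archived into a single file each while lookups keep using the same relative paths.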
Joshua Tauberer
98dbecd4b5 Keep non-binary downloaded content in unicode
When downloading non-binary files, scrapelib/requests automatically does character-set detection,
but we've been ignoring it and using the raw bytes. It's probably incorrect to pass raw bytes
to JSON output, and it started causing errors creating XML output (lxml complained about
non-XML-compatible characters). Instead, we should always work in unicode when dealing with text.

This fix changes utils.download() to return unicode instances rather than str instances when
'binary' is not set to True.

Files cached to disk are now stored on disk in UTF-8 always, because when we load up cached
files we no longer know the character set that scrapelib/requests detected. Unfortunately
this means that some cache files may be invalid (just in cases where the original URL
returned non-UTF8 (probably ISO8859-1) content with actual non-UTF8 characters, which
seems to not be entirely common). Best to just clear your cache directory.

The scraper has always worked in bytes. download() was originally written to get the
text content using str(response), which was implicitly encoding the unicode response
back to bytes using the ASCII encoding, which can fail. In a1da46b78f
I fixed that by getting the text content using response.bytes directly, but I wish
I had just made it 'response' alone, which gives the unicode content.

Reverts 11a504cedc (recent work-around for char set issue)

Fixes #82
2013-07-18 06:57:50 -04:00
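The policy described in the commit above (text in memory, UTF-8 on disk, raw bytes only for binary downloads) can be illustrated with a short Python 3 sketch. The helper names `decode_body`, `write_cache`, and `read_cache` are hypothetical, and in Python 3 the `str` type plays the role the commit calls "unicode".

```python
import os
import tempfile


def decode_body(raw, detected_encoding=None, binary=False):
    """Return raw bytes for binary downloads, decoded text otherwise."""
    if binary:
        return raw
    return raw.decode(detected_encoding or "utf-8")


def write_cache(path, text):
    # Always UTF-8 on disk: the charset detected at download time is not
    # stored alongside the file, so it cannot be recovered later.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_cache(path):
    with open(path, encoding="utf-8") as f:
        return f.read()


# A response served as ISO-8859-1 is decoded once, then cached as UTF-8.
body = decode_body(b"Jos\xe9", detected_encoding="iso-8859-1")
with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "page.html")
    write_cache(cached, body)
    round_tripped = read_cache(cached)
print(round_tripped == body)
```

A cache file written earlier as raw ISO-8859-1 bytes would fail to decode as UTF-8 here, which is why the commit recommends clearing the cache directory.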
Joshua Tauberer
99b4373210 in utils.download(), renamed the 'xml' option to 'binary' 2013-07-17 15:30:31 -04:00
Joshua Tauberer
05d1f8dc78 from the 106th-108th Congress, Pat Toomey's THOMAS ID was 01594; map it to the ID we find in the 109th-113th Congress 2013-07-17 08:21:48 -04:00
Joshua Tauberer
11a504cedc in samdt1819-108, the file download didn't convert bytes to unicode but when trying to do entity replacement we insert unicode characters without explicitly decoding body content, which results in an error decoding from ascii. in this case, insert the unicode characters directly as utf-8. this might indicate a broader issue with character encodings. 2013-07-17 08:21:38 -04:00
Joshua Tauberer
cebffaeb05 In the 101st Congress, 1st session (1989), votes 133 through 136 lack lis_member_id nodes. Fall back to name lookup. 2013-07-16 09:39:58 -04:00
Joshua Tauberer
08f4025fab In historical House vote XML when the bioguide ID field is not present fall back to matching by name.
Before the 108th Congress, the vote XML never had a bioguide ID.
Starting with the 108th, there are sporadic gaps. This commit
creates a fallback to look-up by name/state/date when no bioguide
ID is present.

So far the 107th Congress maps names to IDs completely. The coverage
may not be as good before that yet. Members with changed names or
changed parties cause some problems.

This also replaces the hard-coded fix for GK Butterfield's "000000"
bioguide ID in several votes, since this is a more generic solution.
See issue #46.
2013-07-15 18:33:43 -04:00
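The fallback described in the commit above can be sketched in Python 3 as follows. Everything named here is hypothetical: `resolve_bioguide`, the `voter` dictionary layout, and the `lookup_by_name` callback are illustration only. Treating "000000" as a missing ID is an assumption drawn from the commit's note about GK Butterfield.

```python
def resolve_bioguide(voter, lookup_by_name):
    """Return the voter's bioguide ID, falling back to a name lookup.

    lookup_by_name(name, state, date) must return a list of candidate IDs.
    """
    bioguide = voter.get("bioguide")
    if bioguide and bioguide != "000000":
        return bioguide
    candidates = lookup_by_name(voter["name"], voter["state"], voter["date"])
    if len(candidates) != 1:
        raise LookupError(
            "No unique match for %s (%s) on %s: %d candidates"
            % (voter["name"], voter["state"], voter["date"], len(candidates)))
    return candidates[0]


def fake_lookup(name, state, date):
    # Stand-in for a search over legislator data; returns made-up IDs.
    return ["X000000"] if (name, state) == ("Example", "NC") else []


with_id = resolve_bioguide({"bioguide": "Y000001"}, fake_lookup)
without_id = resolve_bioguide(
    {"bioguide": "000000", "name": "Example", "state": "NC", "date": "2004-07-21"},
    fake_lookup)
print(with_id, without_id)
```

Raising when the lookup is ambiguous or empty keeps bad matches out of the output, at the cost of having to handle renamed Members and party switchers explicitly.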