congress

mirror of https://github.com/unitedstates/congress.git synced 2025-12-20 09:36:59 -05:00

Author	SHA1	Message	Date
Joshua Tauberer	dc041db5f6	fix parsing of historical vote legislator lookup to not mind if a legislator has two terms on Jan 3 After recent updates to congress-legislator historical start/end dates, we began getting: Multiple matches of name Slaughter (VA-R; 1991-01-03) to legislators (excludes set([])). [h1-102.1991] Missing bioguide ID and name lookup failed for Slaughter (VA-R on 1991-01-03 12:02:00) Exception: No bioguide ID for Slaughter (VA-R) But there weren't really multiple legislators matching, just multiple terms. (There are other new cases of multiple legislators matching now though too.)	2016-12-25 10:16:32 -05:00
Joshua Tauberer	8d56a630dd	drop THOMAS IDs from output, replace with bioguide in XML outputs	2016-12-03 14:12:53 -05:00
David Cook	deab2f384d	Use urlretrieve() instead of wget, speed up FDsys	2016-08-29 18:57:20 -05:00
Joshua Tauberer	d58df048d3	replace THOMAS scraper with USGPO bill status XML importer There is no longer a separate amendments scraper. Amendments are saved as a part of importing bills. Amendments to treaties are no longer available. some of this work was done by @crdunwel	2016-07-01 08:47:57 -04:00
Joshua Tauberer	79f1994089	re-write the FDSys scraper It now can download bulk data files using the bulk data sitemaps. It's also a bit cleaner / more maintainable. Dropped some code no one was using.	2016-03-22 09:53:27 -04:00
Joshua Tauberer	452e56cab5	when a THOMAS ID is not found, correct the congress.gov URL that the script suggests to figure out who it is	2015-01-17 17:09:36 -05:00
Joshua Tauberer	fa07fa32a3	Merge pull request #140 from unitedstates/scrapelib update scrapelib to 0.10, remove follow_robots kwarg	2015-01-17 17:08:59 -05:00
Joshua Tauberer	4272b83bb4	improvements for my --diff command Fast-path if output hasn't changed. Make my run script chmod'd executable.	2014-12-10 12:50:22 -05:00
Eric Mill	5d1272506e	update scrapelib to 0.10, remove non-backwards-compatible follow_robots arg.	2014-07-31 23:05:23 -04:00
Joshua Tauberer	960f0c7957	add a --diff option for bills and votes to test if output has changed When --diff is specified (for bills & votes), instead of writing output files to disk, we run a diff over the existing file and the new content and display the diff. This is handy for testing. At the same time I'm removing my previous preserve_update_time flag which I had been using for a similar purpose, but this new method is much easier. reverts 5122ad6f966ba5899a0758ed92d81ca779314c7f	2014-06-18 09:59:06 -04:00
Eric Mill	bdf6a9aade	Some refactoring and cleanup, extra console output	2014-05-09 12:35:49 -04:00
Eric Mill	9e67097514	mkdir_p the dir before trying to write a pickle file	2014-05-09 11:59:33 -04:00
Will Van Wazer	04d494856f	Automated PEP8 refactoring with autopep8.	2014-04-28 22:39:50 -04:00
Richard	71b90796dc	Windows 'which wget' command fixed in utils.py	2014-04-14 18:25:37 -05:00
Richard	c76b3aba87	Fixed Issue #120, Windows problem with 'which wget' in utils.py	2014-04-14 18:12:26 -05:00
Drew Vogel	c72245f363	Added a 30 second default HTTP read timeout.	2014-03-31 17:25:01 -04:00
Drew Vogel	376cc734e0	Added a timeout option for http reads. This requires scrapelib==0.9.1 or higher.	2014-03-31 13:57:23 -04:00
Joshua Tauberer	6f8168fd13	I broke parsing hres1-104 because of "House House Oversight" In `db1d414e47` I added a fix for parsing "House Oversight" in the 105th Congress, but this broke a different way of handling the same problem for the 104th Congress which I had previously done in `c70396dbd3`. Not sure how this just broke one bill, but here it is: [hres1-104] Exception: Traceback (most recent call last): File "tasks/utils.py", line 144, in process_set results = fetch_func(id, options, *extra_args) File "tasks/bill_info.py", line 27, in fetch_bill if not utils.committee_names: utils.fetch_committee_names(congress, options) File "tasks/utils.py", line 559, in fetch_committee_names committee_names["House Oversight"] = committee_names["House House Oversight"] KeyError: 'House House Oversight'	2014-03-20 19:42:40 -04:00
Eric Mill	62bf2df2bc	assure wget exists somewhere, anywhere	2014-03-13 11:59:10 -04:00
Joshua Tauberer	7c21b296e4	don't try to use wget to download from FDSys if wget isn't available If the system doesn't have wget, was broken by `9b01c418c2`. Hopefully fixes the second issue in #113.	2014-03-13 10:43:53 -04:00
Joshua Tauberer	f7171bf813	for parsing old House votes where bioguide ID is missing, also use other_names in congress-legislators	2014-02-25 08:06:06 -05:00
Joshua Tauberer	b4d96c43a9	for parsing old House votes where bioguide ID is missing, ignore party in congress-legislators for more Members that we know switched party mid term	2014-02-25 08:06:06 -05:00
Joshua Tauberer	db1d414e47	House House Oversight (=House Admin) appears as just 'House Oversight' in the 105th Congress	2014-02-25 08:06:06 -05:00
Joshua Tauberer	521688ea51	create cache/congress-legislators if the directory doesn't exist and we are about to write a pickle file there	2014-02-25 08:06:06 -05:00
Joshua Tauberer	d538ee36ba	utils: code cleanup in get_person_id	2014-01-30 11:48:24 -05:00
Joshua Tauberer	d3a0cf105f	utils: before updating congress-legislators from the network, check if the environment variable UPDATE_CONGRESS_LEGISLATORS is not set to NO	2014-01-30 11:48:24 -05:00
Gordon P. Hemsley	53ab972c9b	Fix #98 : Use the stricter and more reliable cache format that congress-legislators does.	2013-11-15 18:13:49 -05:00
Gordon P. Hemsley	59cf2bc3d6	Revamp the DeepBills scraper to actually allow partial requests.	2013-11-14 17:32:45 -05:00
Joshua Tauberer	7ae6e2060e	when downloading from FDSys using my 'wget' fast path, delete the empty file that results when wget fails to download a resource	2013-11-07 17:18:33 -05:00
Joshua Tauberer	20346a1749	yaml changes hadn't been tested when a config.yml file is present, wasn't working	2013-11-05 09:31:22 -05:00
Gordon P. Hemsley	38df8d9242	Rename yaml_load() to direct_yaml_load() and rename cached_yaml_load() to yaml_load().	2013-11-01 16:01:03 -04:00
Gordon P. Hemsley	a45acddb69	Add documentation comments for new utils functions.	2013-11-01 13:44:07 -04:00
Gordon P. Hemsley	e1045ec473	Improve caching of YAML files. Simplify creation of new legislator maps and create person-congresses and congress-persons maps. Mark get_govtrack_person_id() and UnmatchedIdentifer as deprecated in favor of get_person_id().	2013-11-01 12:40:14 -04:00
Eric Mill	275bf4944d	Merge pull request #87 from dcloud/master Pull committee names, ids from bill actions; simplify detection of referral action types	2013-10-07 13:11:12 -07:00
Daniel Cloud	4f01988cf7	Refactor bill_info to look for committee names separately from identifying referrals. Updated utils.fetch_committee_names to alias so we still get matches when parentheticals are omitted.	2013-10-07 10:03:37 -04:00
Joshua Tauberer	7f28281d16	voteview: make sure we're using the right .ord/.dtl URLs by scraping the .htm page listed on voteview.com for <a> hrefs, and report date parsing problems	2013-10-06 19:08:37 -04:00
Joshua Tauberer	6ea1daa7fe	voteview: determine which session of Congress the vote occurs in using GovTrack's sessions.tsv, hook into vote_info.output_vote() for writing JSON and XML, and separate the president's position from Member votes	2013-10-06 18:24:31 -04:00
Eric Mill	e479f25c28	Handle nomination numbers with sub-numbers	2013-09-09 17:28:42 -04:00
Gordon P. Hemsley	3126407e74	Remove return value from generate_person_id_map().	2013-09-07 15:42:47 -04:00
Gordon P. Hemsley	12063e8fb4	Get any person ID from any other person ID.	2013-09-07 15:06:01 -04:00
Joshua Tauberer	c70396dbd3	in the 104th Congress, 'House Oversight' and 'House House Oversight' are the same	2013-08-19 13:33:42 -04:00
Joshua Tauberer	1876e932a8	forcing some attribute orders in GovTrack-style bill XML to make diffs easier	2013-08-19 13:33:42 -04:00
Eric Mill	ad746608a6	Remove the timestamp from the email subject, prevents conversation threading	2013-08-10 13:23:44 -04:00
Joshua Tauberer	23c92fa22e	when reading from the cache directory, look for .zip files containing cached directories and load cached files from the .zip if they exist	2013-07-22 09:36:19 -04:00
Joshua Tauberer	98dbecd4b5	Keep non-binary downloaded content in unicode When downloading non-binary files, scrapelib/requests automatically does character-set detection, but we've been ignoring it and using the raw bytes. It's probably incorrect to pass raw bytes to JSON output, and it started causing errors creating XML output (lxml complained about non-XML-compatible characters). Instead, we should always work in unicode when dealing with text. This fix changes utils.download() to return unicode instances rather than str instances when 'binary' is not set to True. Files cached to disk are now stored on disk in UTF-8 always, because when we load up cached files we no longer know the character set that scrapelib/requests detected. Unfortunately this means that some cache files may be invalid (just in cases where the original URL returned non-UTF8 (probably ISO8859-1) content with actual non-UTF8 characters, which seems to not be entirely common). Best to just clear your cache directory. The scraper has always worked in bytes. download() was originally written to get the text content using str(response), which was implicitly encoding the unicode response back to bytes using the ASCII encoding, which can fail. In `a1da46b78f` I fixed that by getting the text content using response.bytes directly, but I wish I had just made it 'response' alone, which gives the unicode content. Reverts `11a504cedc` (recent work-around for char set issue) Fixes #82	2013-07-18 06:57:50 -04:00
Joshua Tauberer	99b4373210	in utils.download(), renamed the 'xml' option to 'binary'	2013-07-17 15:30:31 -04:00
Joshua Tauberer	05d1f8dc78	from the 106th-108th Congress, Pat Toomey's THOMAS ID was 01594; map it to the ID we find in the 109th-113th Congress	2013-07-17 08:21:48 -04:00
Joshua Tauberer	11a504cedc	in samdt1819-108, the file download didn't convert bytes to unicode but when trying to do entity replacement we insert unicode characters without explicitly decoding body content, which results in an error decoding from ascii. in this case, insert the unicode characters directly as utf-8. this might indicate a broader issue with character encodings.	2013-07-17 08:21:38 -04:00
Joshua Tauberer	cebffaeb05	In the 101st Congress, 1st session (1989), votes 133 through 136 lack lis_member_id nodes. Fall back to name lookup.	2013-07-16 09:39:58 -04:00
Joshua Tauberer	08f4025fab	In historical House vote XML when the bioguide ID field is not present fall back to matching by name. Before the 108th Congress, the vote XML never had a bioguide ID. Starting with the 108th, there are sporadic gaps. This commit creates a fallback to look-up by name/state/date when no bioguide ID is present. So far the 107th Congress maps names to IDs completely. The coverage may not be as good before that yet. Members with changed names or changed parties cause some problems. This also replaces the hard-coded fix for GK Butterfield's "000000" bioguide ID in several votes, since this is a more generic solution. See issue #46.	2013-07-15 18:33:43 -04:00


108 Commits