sitescripts/subscriptions/combineSubscriptions.py - Issue 9170210: Subscription downloads: Ignore stated charset for remote downloads, always assume UTF-8

Unified Diff: sitescripts/subscriptions/combineSubscriptions.py

Issue 9170210: Subscription downloads: Ignore stated charset for remote downloads, always assume UTF-8 (Closed)

Patch Set: Created Jan. 16, 2013, 3:49 p.m.

Index: sitescripts/subscriptions/combineSubscriptions.py
===================================================================
--- a/sitescripts/subscriptions/combineSubscriptions.py
+++ b/sitescripts/subscriptions/combineSubscriptions.py
@@ -152,21 +152,20 @@ def resolveIncludes(sourceName, sourceDi
       error = None
       break
     except urllib2.URLError, e:
       error = e
       time.sleep(5)
   if error:
     raise error
-  charset = 'utf-8'
-  contentType = request.headers.get('content-type', '')
-  if contentType.find('charset=') >= 0:
-    charset = contentType.split('charset=', 1)[1]
-  newLines = unicode(request.read(), charset).split('\n')
+  # We should really get the charset from the headers rather than assuming
+  # that it is UTF-8. However, some of the Google Code mirrors are
+  # misconfigured and will return ISO-8859-1 as charset instead of UTF-8.
+  newLines = unicode(request.read(), 'utf-8').split('\n')
   newLines = map(lambda l: re.sub(r'[\r\n]', '', l), newLines)
   newLines = filter(lambda l: not re.search(r'^\s*!.*?\bExpires\s*(?::|after)\s*(\d+)\s*(h)?', l, re.M | re.I), newLines)
   newLines = filter(lambda l: not re.search(r'^\s*!\s*(Redirect|Homepage|Title)\s*:', l, re.M | re.I), newLines)
 else:
   result.append('! *** %s ***' % file)
   includeSource = sourceName
   if file.find(':') >= 0:
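
To illustrate the reasoning in the new comment, here is a minimal sketch of what goes wrong when the stated charset is trusted. It is written for Python 3 so it can be run standalone, whereas the patch itself targets Python 2 (`urllib2`, `unicode()`). The helper names, the sample filter-list body, and the header value are all hypothetical and are not taken from combineSubscriptions.py.

```python
# Python 3 sketch; names and sample data below are illustrative assumptions.

def charset_from_content_type(content_type, default="utf-8"):
    """Mimic the removed logic: use whatever charset the header states."""
    if "charset=" in content_type:
        return content_type.split("charset=", 1)[1]
    return default


def decode_trusting_header(body, content_type):
    """Old behaviour: decode with the charset from the Content-Type header."""
    return body.decode(charset_from_content_type(content_type))


def decode_assuming_utf8(body):
    """New behaviour: ignore the header and always decode as UTF-8."""
    return body.decode("utf-8")


# Hypothetical response from a misconfigured mirror: the bytes are UTF-8,
# but the header claims ISO-8859-1.
body = "! Liste fran\u00e7aise\n||example.com^\n".encode("utf-8")
stated_content_type = "text/plain; charset=ISO-8859-1"

old_lines = decode_trusting_header(body, stated_content_type).split("\n")
new_lines = decode_assuming_utf8(body).split("\n")

print(repr(old_lines[0]))  # mojibake: the two UTF-8 bytes of the cedilla become two characters
print(repr(new_lines[0]))  # the intended text
```

The trade-off is the one the comment already states: a mirror that really does serve another encoding would now fail to decode or decode incorrectly, so this only holds as long as every remote source is in fact UTF-8.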
