abp/filters/parser.py - Issue 29845767: Issue 6685 - Offer incremental filter list downloads

Side by Side Diff: abp/filters/parser.py

Issue 29845767: Issue 6685 - Offer incremental filter list downloads (Closed) Base URL: https://hg.adblockplus.org/python-abp/

Patch Set: Use iterables instead of str, stop repeating code Created Aug. 20, 2018, 6:18 p.m.

OLD	NEW
1 # This file is part of Adblock Plus <https://adblockplus.org/>,	1 # This file is part of Adblock Plus <https://adblockplus.org/>,

2 # Copyright (C) 2006-present eyeo GmbH	2 # Copyright (C) 2006-present eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

(...skipping 122 matching lines...)
133	133

134	134

135 Header = _line_type('Header', 'version', '[{.version}]')	135 Header = _line_type('Header', 'version', '[{.version}]')

136 EmptyLine = _line_type('EmptyLine', '', '')	136 EmptyLine = _line_type('EmptyLine', '', '')

137 Comment = _line_type('Comment', 'text', '! {.text}')	137 Comment = _line_type('Comment', 'text', '! {.text}')

138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')	138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')

139 Filter = _line_type('Filter', 'text selector action options', '{.text}')	139 Filter = _line_type('Filter', 'text selector action options', '{.text}')

140 Include = _line_type('Include', 'target', '%include {0.target}%')	140 Include = _line_type('Include', 'target', '%include {0.target}%')

141	141

142	142

143 METADATA_REGEXP = re.compile(r'!\s*(\w+)\s*:\s*(.*)')	143 METADATA_REGEXP = re.compile(r'!\s*([\w-]+)\s*:\s*(.*)')

144 METADATA_KEYS = {'Homepage', 'Title', 'Expires', 'Checksum', 'Redirect',	144 METADATA_KEYS = {'Homepage', 'Title', 'Expires', 'Checksum', 'Redirect',

145 'Version'}	145 'Version', 'Diff-URL', 'Diff-Expires'}

Sebastian Noack 2018/08/21 19:42:45
I would prefer if python-abp would be agnostic of the supported metadata keys. Otherwise, we have to keep this set in sync with what is implemented in Adblock Plus, which seems unpractical.

However, note that the regexp above would have to be changed like following so that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)'

Vasily Kuznetsov 2018/08/22 13:10:50
Makes sense and I support it. I actually thought it was already implemented this way before I started reviewing this change :)

BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we can still have dashes in metadata field names.

Sebastian Noack 2018/08/22 14:31:06
You are right, well spotted.

rhowell 2018/08/27 22:06:26
It appears this was done to prevent mistaking a comment with a colon for metadata, like in this test:

    def test_parse_nonmeta():
        line = parse_line('! WrongHeader: something')
        assert line.type == 'comment'

It probably doesn't matter too much, since the parser wouldn't be able to parse invalid metadata anyway. I'll update the test so it doesn't use a colon.

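The key-agnostic regexp discussed in this thread can be tried in isolation. What follows is a minimal sketch rather than the code that was eventually committed: `KEY_AGNOSTIC_METADATA` and `classify_comment` are hypothetical names, and the pattern is the one suggested by Vasily Kuznetsov, assuming the `*` quantifiers shown there. The `(?!//)` lookahead is what keeps a comment that consists of a URL from being read as a key/value pair.

```python
import re

# Pattern suggested in the review thread; the constant name is hypothetical.
KEY_AGNOSTIC_METADATA = re.compile(r'!\s*([\w-]+)\s*:(?!//)\s*(.*)')


def classify_comment(text):
    """Classify a '!' line as metadata or as a plain comment.

    Hypothetical helper: it returns tuples instead of the Metadata and
    Comment line objects used in parser.py.
    """
    match = KEY_AGNOSTIC_METADATA.match(text)
    if match:
        return ('metadata', match.group(1), match.group(2))
    return ('comment', text[1:].strip())


# A dashed key is accepted without consulting a whitelist of keys.
print(classify_comment('! Diff-URL: https://example.com/diff'))
# A bare URL is not metadata, because ':' is followed by '//'.
print(classify_comment('! http://example.com/list.txt'))
# Any 'Key: value' comment now counts as metadata, as rhowell notes.
print(classify_comment('! WrongHeader: something'))
```

The last call illustrates the trade-off raised by rhowell: without a set of known keys, a comment that merely contains a colon after its first word is classified as metadata.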
146 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')	146 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')

147 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)	147 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)

148 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')	148 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')

149 FILTER_OPTIONS_REGEXP = re.compile(	149 FILTER_OPTIONS_REGEXP = re.compile(

150 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'	150 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'

151 )	151 )

152	152

153	153

154 def _parse_comment(text):	154 def _parse_comment(text):

155 match = METADATA_REGEXP.match(text)	155 match = METADATA_REGEXP.match(text)

156 if match and match.group(1) in METADATA_KEYS:	156 if match and match.group(1) in METADATA_KEYS:

Sebastian Noack 2018/08/21 19:42:46
Note that metadata keys are case-insensitive. If we go with my above suggestion, this won't matter here anymore. But once non-capitalized keys are recognized, this needs to be accounted for in _remove_duplicates() in renderer.py as well.

Vasily Kuznetsov 2018/08/22 13:10:50
Should we then maybe adopt some canonical capitalization scheme (for example Like-This) and then convert them here to follow it? Otherwise we will have to remember this in any code that works with metadata keys.

Sebastian Noack 2018/08/22 14:31:06
Initially, I had a similar thought, but on the other hand I'd prefer to not unnecessarily have the output diverge from the input where (easily) possible, and it seems the only place where we specifically check for a particular metadata key is where we strip checksums. Besides, we'd have to account for case-insensitivity as well when comparing metadata for the diff.

Vasily Kuznetsov 2018/08/24 13:03:19
The output that doesn't diverge from the input unnecessarily seems like a useful feature. In principle, if we normalize the headers, eventually all "our" filter lists will have normalized headers and they won't diverge anyway, but this might not be the case just yet. Anyway, for the time being it's not a big deal and it's unlikely to result in problems in the future, so I don't feel too strongly about this.

rhowell 2018/08/27 22:06:26
Acknowledged.

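The option Sebastian Noack prefers in this thread, comparing keys case-insensitively while leaving their spelling untouched in the output, could look roughly like the sketch below. It is an illustration only: `first_of_each_key` is a hypothetical helper, it works on plain `(key, value)` tuples, and it makes no claim about how `_remove_duplicates()` in renderer.py is actually written (in particular, whether that function keeps the first or the last duplicate).

```python
def first_of_each_key(pairs):
    """Yield the first (key, value) pair for each key, ignoring case.

    Keys are compared in lowercase but yielded with their original
    spelling, so the output does not diverge from the input.
    """
    seen = set()
    for key, value in pairs:
        folded = key.lower()
        if folded not in seen:
            seen.add(folded)
            yield key, value


metadata = [('Title', 'Example list'),
            ('title', 'Duplicate title'),
            ('Diff-URL', 'https://example.com/diff')]
# The lowercase 'title' entry is dropped; 'Diff-URL' keeps its spelling.
print(list(first_of_each_key(metadata)))
```

A canonical `Like-This` scheme, as floated by Vasily Kuznetsov, would instead rewrite the keys themselves; a naive per-part capitalization would turn `Diff-URL` into `Diff-Url`, which is one way the output could diverge from the input.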
157 return Metadata(match.group(1), match.group(2))	157 return Metadata(match.group(1), match.group(2))

158 return Comment(text[1:].strip())	158 return Comment(text[1:].strip())

159	159

160	160

161 def _parse_header(text):	161 def _parse_header(text):

162 match = HEADER_REGEXP.match(text)	162 match = HEADER_REGEXP.match(text)

163 if not match:	163 if not match:

164 raise ParseError('Malformed header', text)	164 raise ParseError('Malformed header', text)

165 return Header(match.group(1))	165 return Header(match.group(1))

166	166

(...skipping 115 matching lines...)
282 line_text = line_text.decode('utf-8')	282 line_text = line_text.decode('utf-8')

283	283

284 content = line_text.strip()	284 content = line_text.strip()

285	285

286 if content == '':	286 if content == '':

287 line = EmptyLine()	287 line = EmptyLine()

288 elif content.startswith('!'):	288 elif content.startswith('!'):

289 line = _parse_comment(content)	289 line = _parse_comment(content)

290 elif content.startswith('%') and content.endswith('%'):	290 elif content.startswith('%') and content.endswith('%'):

291 line = _parse_instruction(content)	291 line = _parse_instruction(content)

292 elif content.startswith('[') and content.endswith(']'):	292 elif content.startswith('[') and content.endswith(']'):

Sebastian Noack 2018/08/21 19:42:45
Somewhat unrelated to these changes, but this is inaccurate too. Only the first line (if encapsulated in brackets) is to be recognized as header. Otherwise it would be considered a filter.

Vasily Kuznetsov 2018/08/22 13:10:50
Makes sense. I created a separate ticket for it: https://issues.adblockplus.org/ticket/6877

rhowell 2018/08/27 22:06:26
Acknowledged.

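The behaviour Sebastian Noack describes, where a bracketed line is a header only if it is the first line of the list, requires the parser to know a line's position. The sketch below shows one possible shape of that check. It is a simplified stand-in, not the fix tracked in the ticket above: `classify_lines` and its string labels are hypothetical, and include instructions and real filter parsing are left out.

```python
def classify_lines(lines):
    """Yield (kind, content) pairs for each line of a filter list.

    Hypothetical, simplified classifier: a bracketed line counts as a
    header only at position 0; anywhere else it is treated as a filter.
    """
    for position, raw in enumerate(lines):
        content = raw.strip()
        bracketed = content.startswith('[') and content.endswith(']')
        if content == '':
            yield ('emptyline', content)
        elif content.startswith('!'):
            yield ('comment', content)
        elif position == 0 and bracketed:
            yield ('header', content)
        else:
            yield ('filter', content)


sample = ['[Adblock Plus 2.0]', '! Title: Example', '[ads]']
# The first bracketed line is a header; the later one is a filter.
print(list(classify_lines(sample)))
```

Because `parse_line()` receives a single line with no context, a change along these lines would presumably have to live in `parse_filterlist()` or pass the position down explicitly.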
293 line = _parse_header(content)	293 line = _parse_header(content)

294 else:	294 else:

295 line = parse_filter(content)	295 line = parse_filter(content)

296	296

297 assert line.to_string().replace(' ', '') == content.replace(' ', '')	297 assert line.to_string().replace(' ', '') == content.replace(' ', '')

298 return line	298 return line

299	299

300	300

301 def parse_filterlist(lines):	301 def parse_filterlist(lines):

302 """Parse filter list from an iterable.	302 """Parse filter list from an iterable.

(...skipping 11 matching lines...)
314 Raises	314 Raises

315 ------	315 ------

316 ParseError	316 ParseError

317 Thrown during iteration for invalid filter list lines.	317 Thrown during iteration for invalid filter list lines.

318 TypeError	318 TypeError

319 If `lines` is not iterable.	319 If `lines` is not iterable.

320	320

321 """	321 """

322 for line in lines:	322 for line in lines:

323 yield parse_line(line)	323 yield parse_line(line)
