abp/filters/parser.py - Issue 29845767: Issue 6685 - Offer incremental filter list downloads

Delta Between Two Patch Sets: abp/filters/parser.py

Issue 29845767: Issue 6685 - Offer incremental filter list downloads (Closed) Base URL: https://hg.adblockplus.org/python-abp/

Left Patch Set: Use iterables instead of str, stop repeating code Created Aug. 20, 2018, 6:18 p.m.

Right Patch Set: Address comments on PS8 Created Aug. 30, 2018, 5:37 p.m.

LEFT	RIGHT
1 # This file is part of Adblock Plus <https://adblockplus.org/>,	1 # This file is part of Adblock Plus <https://adblockplus.org/>,

2 # Copyright (C) 2006-present eyeo GmbH	2 # Copyright (C) 2006-present eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

(...skipping 122 matching lines...)
133	133

134	134

135 Header = _line_type('Header', 'version', '[{.version}]')	135 Header = _line_type('Header', 'version', '[{.version}]')

136 EmptyLine = _line_type('EmptyLine', '', '')	136 EmptyLine = _line_type('EmptyLine', '', '')

137 Comment = _line_type('Comment', 'text', '! {.text}')	137 Comment = _line_type('Comment', 'text', '! {.text}')

138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')	138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')

139 Filter = _line_type('Filter', 'text selector action options', '{.text}')	139 Filter = _line_type('Filter', 'text selector action options', '{.text}')

140 Include = _line_type('Include', 'target', '%include {0.target}%')	140 Include = _line_type('Include', 'target', '%include {0.target}%')

141	141

142	142

143 METADATA_REGEXP = re.compile(r'!\s*([\w-]+)\s*:\s*(.*)')	143 METADATA_REGEXP = re.compile(r'!\s*([\w-]+)\s*:(?!//)\s*(.*)')

144 METADATA_KEYS = {'Homepage', 'Title', 'Expires', 'Checksum', 'Redirect',

145 'Version', 'Diff-URL', 'Diff-Expires'}
Sebastian Noack 2018/08/21 19:42:45
I would prefer if python-abp were agnostic of the supported metadata keys. Otherwise, we have to keep this set in sync with what is implemented in Adblock Plus, which seems impractical. However, note that the regexp above would have to be changed as follows so that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)'

Vasily Kuznetsov 2018/08/22 13:10:50
Makes sense and I support it. I actually thought it was already implemented this way before I started reviewing this change :) BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we can still have dashes in metadata field names.

Sebastian Noack 2018/08/22 14:31:06
You are right, well spotted.

rhowell 2018/08/27 22:06:26
It appears this was done to prevent mistaking a comment with a colon for metadata, like in this test:

def test_parse_nonmeta():
    line = parse_line('! WrongHeader: something')
    assert line.type == 'comment'

It probably doesn't matter too much, since the parser wouldn't be able to parse invalid metadata anyway. I'll update the test so it doesn't use a colon.
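The effect of the two regexp suggestions in this thread can be checked in isolation. The sketch below is not the parser itself: the constant names and the `classify` helper are made up for illustration, and the quantifiers in the patterns are written out as an assumption about what the reviewers intended. It shows that the `(?!//)` lookahead keeps a commented-out URL from being read as metadata with key `https`, and that `[\w-]+` is needed for keys such as `Diff-URL`.

```python
import re

# Three variants from the thread (names are hypothetical).
PERMISSIVE = re.compile(r'!\s*([\w-]+)\s*:\s*(.*)')      # no lookahead
NO_DASHES = re.compile(r'!\s*(\w+)\s*:(?!//)\s*(.*)')    # first suggestion
AGREED = re.compile(r'!\s*([\w-]+)\s*:(?!//)\s*(.*)')    # dashes allowed


def classify(text, regexp):
    """Return ('metadata', key, value) on a match, else ('comment', text)."""
    match = regexp.match(text)
    if match:
        return ('metadata', match.group(1), match.group(2))
    return ('comment', text[1:].strip())


url_comment = '! https://example.com/list.txt'
dashed_key = '! Diff-URL: https://example.com/diff'

print(classify(url_comment, PERMISSIVE))  # URL scheme mistaken for a key
print(classify(url_comment, AGREED))      # lookahead rejects '://'
print(classify(dashed_key, NO_DASHES))    # \w+ cannot span the dash
print(classify(dashed_key, AGREED))       # parsed as metadata
```

With the key whitelist gone, a line such as '! WrongHeader: something' is classified as metadata by the agreed pattern, which is why the test quoted above had to change.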
146 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')	144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')

147 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)	145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)

148 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')	146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')

149 FILTER_OPTIONS_REGEXP = re.compile(	147 FILTER_OPTIONS_REGEXP = re.compile(

150 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'	148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'

151 )	149 )

152	150

153	151

154 def _parse_comment(text):	152 def _parse_comment(text):

155 match = METADATA_REGEXP.match(text)	153 match = METADATA_REGEXP.match(text)

156 if match and match.group(1) in METADATA_KEYS:	154 if match:
Sebastian Noack 2018/08/21 19:42:46
Note that metadata keys are case-insensitive. If we go with my above suggestion, this won't matter here anymore. But once non-capitalized keys are recognized, this needs to be accounted for in _remove_duplicates() in renderer.py as well.

Vasily Kuznetsov 2018/08/22 13:10:50
Should we then maybe adopt some canonical capitalization scheme (for example Like-This) and then convert them here to follow it? Otherwise we will have to remember this in any code that works with metadata keys.

Sebastian Noack 2018/08/22 14:31:06
Initially, I had a similar thought, but on the other hand I'd prefer not to have the output diverge unnecessarily from the input where (easily) possible. It seems the only place where we specifically check for a particular metadata key is where we strip checksums. Besides, we'd have to account for case-insensitivity as well when comparing metadata for the diff.

Vasily Kuznetsov 2018/08/24 13:03:19
Output that doesn't diverge from the input unnecessarily seems like a useful feature. In principle, if we normalize the headers, eventually all "our" filter lists will have normalized headers and they won't diverge anyway, but this might not be the case just yet. Anyway, for the time being it's not a big deal and it's unlikely to result in problems in the future, so I don't feel too strongly about this.

rhowell 2018/08/27 22:06:26
Acknowledged.
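One way to read the outcome of this thread: keys keep the spelling they had in the input, and any code that looks for a specific key compares case-insensitively. The sketch below illustrates that idea only. The `Metadata` tuple is a stand-in for the parser's line type, and `strip_checksum` and `metadata_by_key` are hypothetical helpers, not functions from python-abp.

```python
from collections import namedtuple

# Stand-in for the parser's Metadata line type (assumption for this sketch).
Metadata = namedtuple('Metadata', 'key value')


def strip_checksum(lines):
    """Yield lines unchanged, dropping checksum metadata in any capitalization."""
    for line in lines:
        if isinstance(line, Metadata) and line.key.lower() == 'checksum':
            continue
        yield line


def metadata_by_key(lines):
    """Map lowercased key -> original Metadata line, for comparing two lists."""
    return {line.key.lower(): line
            for line in lines if isinstance(line, Metadata)}


lines = [
    Metadata('Title', 'Example list'),
    Metadata('checksum', 'abc123'),
    Metadata('diff-url', 'https://example.com/diff'),
]

kept = list(strip_checksum(lines))
index = metadata_by_key(kept)
print(kept)           # original key spelling is preserved in the output
print(sorted(index))  # lookups use the lowercased form
```

The trade-off discussed above is visible here: the output is not normalized, but each consumer has to remember to lowercase before comparing.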
157 return Metadata(match.group(1), match.group(2))	155 return Metadata(match.group(1), match.group(2))

158 return Comment(text[1:].strip())	156 return Comment(text[1:].strip())

159	157

160	158

161 def _parse_header(text):	159 def _parse_header(text):

162 match = HEADER_REGEXP.match(text)	160 match = HEADER_REGEXP.match(text)

163 if not match:	161 if not match:

164 raise ParseError('Malformed header', text)	162 raise ParseError('Malformed header', text)

165 return Header(match.group(1))	163 return Header(match.group(1))

166	164

(...skipping 115 matching lines...)
282 line_text = line_text.decode('utf-8')	280 line_text = line_text.decode('utf-8')

283	281

284 content = line_text.strip()	282 content = line_text.strip()

285	283

286 if content == '':	284 if content == '':

287 line = EmptyLine()	285 line = EmptyLine()

288 elif content.startswith('!'):	286 elif content.startswith('!'):

289 line = _parse_comment(content)	287 line = _parse_comment(content)

290 elif content.startswith('%') and content.endswith('%'):	288 elif content.startswith('%') and content.endswith('%'):

291 line = _parse_instruction(content)	289 line = _parse_instruction(content)

292 elif content.startswith('[') and content.endswith(']'):	290 elif content.startswith('[') and content.endswith(']'):
Sebastian Noack 2018/08/21 19:42:45
Somewhat unrelated to these changes, but this is inaccurate too. Only the first line (if enclosed in brackets) is to be recognized as a header. Otherwise it would be considered a filter.

Vasily Kuznetsov 2018/08/22 13:10:50
Makes sense. I created a separate ticket for it: https://issues.adblockplus.org/ticket/6877

rhowell 2018/08/27 22:06:26
Acknowledged.
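The behaviour described in this comment depends on a line's position, which a per-line function like parse_line() cannot see. The following is a simplified sketch of a position-aware classifier, not the fix tracked in ticket 6877: `classify_lines` is a hypothetical name, it returns only a type label, and it leaves out include instructions.

```python
def classify_lines(lines):
    """Yield (kind, content); a bracketed line is a header only in first position."""
    for position, raw in enumerate(lines):
        content = raw.strip()
        if content == '':
            kind = 'emptyline'
        elif content.startswith('!'):
            kind = 'comment'
        elif (position == 0 and content.startswith('[')
              and content.endswith(']')):
            kind = 'header'
        else:
            # A bracketed line anywhere else is treated as a filter.
            kind = 'filter'
        yield kind, content


sample = ['[Adblock Plus 2.0]', '! Title: Example', '[foo]', '||example.com^']
result = list(classify_lines(sample))
print(result)
```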
293 line = _parse_header(content)	291 line = _parse_header(content)

294 else:	292 else:

295 line = parse_filter(content)	293 line = parse_filter(content)

296	294

297 assert line.to_string().replace(' ', '') == content.replace(' ', '')	295 assert line.to_string().replace(' ', '') == content.replace(' ', '')

298 return line	296 return line

299	297

300	298

301 def parse_filterlist(lines):	299 def parse_filterlist(lines):

302 """Parse filter list from an iterable.	300 """Parse filter list from an iterable.

(...skipping 11 matching lines...)
314 Raises	312 Raises

315 ------	313 ------

316 ParseError	314 ParseError

317 Thrown during iteration for invalid filter list lines.	315 Thrown during iteration for invalid filter list lines.

318 TypeError	316 TypeError

319 If `lines` is not iterable.	317 If `lines` is not iterable.

320	318

321 """	319 """

322 for line in lines:	320 for line in lines:

323 yield parse_line(line)	321 yield parse_line(line)
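Because parse_filterlist() is a generator, the errors listed in its docstring appear while iterating, not when the function is called. The sketch below shows this with stand-ins: the `ParseError` class and the `parse_line` body are placeholders written for this example (the placeholder rejects the literal line 'BAD'), and only the two-line loop mirrors the function in the diff.

```python
class ParseError(Exception):
    """Placeholder for the parser's exception type."""


def parse_line(text):
    """Placeholder parser: rejects the literal line 'BAD', strips the rest."""
    content = text.strip()
    if content == 'BAD':
        raise ParseError('Cannot parse line: ' + content)
    return content


def parse_filterlist(lines):
    """Lazily parse each line, mirroring the generator in the diff."""
    for line in lines:
        yield parse_line(line)


# Calling the generator function does no work yet, so nothing is raised here.
parsed = parse_filterlist(['||example.com^ ', 'BAD', 'never reached'])
first = next(parsed)

try:
    next(parsed)
    error = None
except ParseError as exc:
    error = exc

# The same applies to a non-iterable argument: TypeError comes on first use.
not_iterable = parse_filterlist(5)
try:
    next(not_iterable)
    type_error_raised = False
except TypeError:
    type_error_raised = True
```

A caller that wants all errors up front can wrap the call in list(), at the cost of reading the whole input before processing any of it.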
