Rietveld Code Review Tool

Side by Side Diff: abp/filters/parser.py

Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list (Closed)
Patch Set: Correct behavior, add comments, improve naming, add tests Created Sept. 18, 2018, 12:37 p.m.
OLD | NEW
1 # This file is part of Adblock Plus <https://adblockplus.org/>, 1 # This file is part of Adblock Plus <https://adblockplus.org/>,
2 # Copyright (C) 2006-present eyeo GmbH 2 # Copyright (C) 2006-present eyeo GmbH
3 # 3 #
4 # Adblock Plus is free software: you can redistribute it and/or modify 4 # Adblock Plus is free software: you can redistribute it and/or modify
5 # it under the terms of the GNU General Public License version 3 as 5 # it under the terms of the GNU General Public License version 3 as
6 # published by the Free Software Foundation. 6 # published by the Free Software Foundation.
7 # 7 #
8 # Adblock Plus is distributed in the hope that it will be useful, 8 # Adblock Plus is distributed in the hope that it will be useful,
9 # but WITHOUT ANY WARRANTY; without even the implied warranty of 9 # but WITHOUT ANY WARRANTY; without even the implied warranty of
10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
(...skipping 122 matching lines...)
133 133
134 134
135 Header = _line_type('Header', 'version', '[{.version}]') 135 Header = _line_type('Header', 'version', '[{.version}]')
136 EmptyLine = _line_type('EmptyLine', '', '') 136 EmptyLine = _line_type('EmptyLine', '', '')
137 Comment = _line_type('Comment', 'text', '! {.text}') 137 Comment = _line_type('Comment', 'text', '! {.text}')
138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}') 138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')
139 Filter = _line_type('Filter', 'text selector action options', '{.text}') 139 Filter = _line_type('Filter', 'text selector action options', '{.text}')
140 Include = _line_type('Include', 'target', '%include {0.target}%') 140 Include = _line_type('Include', 'target', '%include {0.target}%')
141 141
142 142
143 METADATA_REGEXP = re.compile(r'(.*?)\s*:\s*(.*)') 143 METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')
144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%') 144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')
145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I) 145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)
Sebastian Noack 2018/09/18 15:19:08 Is this regexp missing a $ at the end? I didn
146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$') 146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')
147 FILTER_OPTIONS_REGEXP = re.compile( 147 FILTER_OPTIONS_REGEXP = re.compile(
148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$' 148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'
149 ) 149 )
150 150
151 151
152 def _parse_header(text): 152 def _parse_header(text):
153 match = HEADER_REGEXP.match(text) 153 match = HEADER_REGEXP.match(text)
154 if not match: 154 if not match:
155 raise ParseError('Malformed header', text) 155 raise ParseError('Malformed header', text)
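The reviewer's question about a trailing `$` on HEADER_REGEXP can be illustrated with a minimal sketch. The pattern is copied from the listing above; `ANCHORED_HEADER_REGEXP` and the sample line are hypothetical and are not part of the patch. Since `re.match` only anchors at the start of the string, the unanchored pattern accepts a line that merely begins with a valid header.

```python
import re

# Copied verbatim from the listing above.
HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)

# Hypothetical anchored variant (an assumption for illustration only).
ANCHORED_HEADER_REGEXP = re.compile(HEADER_REGEXP.pattern + r'$', flags=re.I)

# Sample line that starts with '[' and ends with ']' but is not a clean header.
line = '[Adblock Plus 2.0][x]'

# match() anchors only at the beginning, so the valid prefix is accepted.
unanchored = HEADER_REGEXP.match(line)

# With '$' appended, the whole line has to be a header, so this is None.
anchored = ANCHORED_HEADER_REGEXP.match(line)

# A well-formed header is accepted by both patterns.
clean = ANCHORED_HEADER_REGEXP.match('[Adblock Plus 2.0]')
```

Whether the stricter behavior is wanted is a design decision for the patch author; `HEADER_REGEXP.fullmatch(line)` would be another way to get the same effect without changing the pattern.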
(...skipping 88 matching lines...)
244 Parsed filter. 244 Parsed filter.
245 245
246 """ 246 """
247 if '#' in text: 247 if '#' in text:
248 match = HIDING_FILTER_REGEXP.search(text) 248 match = HIDING_FILTER_REGEXP.search(text)
249 if match: 249 if match:
250 return _parse_hiding_filter(text, *match.groups()) 250 return _parse_hiding_filter(text, *match.groups())
251 return _parse_blocking_filter(text) 251 return _parse_blocking_filter(text)
252 252
253 253
254 def parse_line(line_text): 254 def parse_line(line_text, mode='body'):
Sebastian Noack 2018/09/18 15:19:08 We probably should call the argument here "positio
Vasily Kuznetsov 2018/09/18 18:11:44 I originally left it as "mode" on purpose, but now
255 """Parse one line of a filter list. 255 """Parse one line of a filter list.
256 256
257 Note that parse_line() doesn't handle special comments, hence never returns 257 The types of lines that the parser recognizes depend on the mode. In
258 a Metadata() object, Adblock Plus only considers metadata when parsing the 258 body mode the parser only recognizes filters, comments, processing
259 whole filter list and only if they are given at the top of the filter list. 259 instructions and empty lines. In metadata mode it in addition recognizes
260 metadata. In start mode it also recognizes headers.
261
262 Note: checksum metadata lines are recognized in all modes for backwards
Sebastian Noack 2018/09/18 15:19:07 Typo: Capitalize the first word after a colon.
Vasily Kuznetsov 2018/09/18 18:11:44 Done.
263 compatibility. Historically, checksums can occur at the bottom of the
264 filter list. They are no longer used by Adblock Plus, but in order to
265 strip them (in abp.filters.renderer), we have to make sure to still parse
266 them regardless of their position in the filter list.
260 267
261 Parameters 268 Parameters
262 ---------- 269 ----------
263 line_text : str 270 line_text : str
264 Line of a filter list. 271 Line of a filter list.
272 mode : str
273 Parsing mode, one of "start", "metadata" or "body" (default).
265 274
266 Returns 275 Returns
267 ------- 276 -------
268 namedtuple 277 namedtuple
269 Parsed line (see `_line_type`). 278 Parsed line (see `_line_type`).
270 279
271 Raises 280 Raises
272 ------ 281 ------
273 ParseError 282 ParseError
274 ParseError: If the line can't be parsed. 283 ParseError: If the line can't be parsed.
284
275 """ 285 """
286 MODES = {'body', 'start', 'metadata'}
287 if mode not in MODES:
288 raise ValueError('mode should be one of {}'.format(MODES))
289
276 if isinstance(line_text, type(b'')): 290 if isinstance(line_text, type(b'')):
277 line_text = line_text.decode('utf-8') 291 line_text = line_text.decode('utf-8')
278 292
279 content = line_text.strip() 293 content = line_text.strip()
Sebastian Noack 2018/09/18 15:19:07 Given the only difference between line_text and co
280 294
281 if content == '': 295 if content == '':
282 line = EmptyLine() 296 return EmptyLine()
283 elif content.startswith('!'):
284 line = Comment(content[1:].lstrip())
285 elif content.startswith('%') and content.endswith('%'):
286 line = _parse_instruction(content)
287 elif content.startswith('[') and content.endswith(']'):
288 line = _parse_header(content)
289 else:
290 line = parse_filter(content)
291 297
292 assert line.to_string().replace(' ', '') == content.replace(' ', '') 298 if content.startswith('!'):
293 return line 299 match = METADATA_REGEXP.match(line_text)
300 if match:
301 key, value = match.groups()
302 # Metadata is only parsed in start or metadata modes but there's
303 # an exception for checksums (see docstring).
Sebastian Noack 2018/09/18 15:19:08 Nit: IMO, this comment is redundant. That we speci
Vasily Kuznetsov 2018/09/18 18:11:43 Done.
304 if mode != 'body' or key.lower() == 'checksum':
305 return Metadata(key, value)
306 return Comment(content[1:].lstrip())
307
308 if content.startswith('%') and content.endswith('%'):
309 return _parse_instruction(content)
310
311 if mode == 'start' and (line_text.startswith('[') and
312 line_text.endswith(']')):
Sebastian Noack 2018/09/18 15:19:08 Nit: Perhaps just rename the variable (e.g. to "li
Vasily Kuznetsov 2018/09/18 18:11:43 Done.
313 return _parse_header(content)
Sebastian Noack 2018/09/18 15:19:08 The result should be the same, since by asserting
Vasily Kuznetsov 2018/09/18 18:11:43 Done.
314
315 return parse_filter(content)
294 316
295 317
296 def parse_filterlist(lines): 318 def parse_filterlist(lines):
297 """Parse filter list from an iterable. 319 """Parse filter list from an iterable.
298 320
299 Parameters 321 Parameters
300 ---------- 322 ----------
301 lines: iterable of str 323 lines: iterable of str
302 Lines of the filter list. 324 Lines of the filter list.
303 325
304 Returns 326 Returns
305 ------- 327 -------
306 iterator of namedtuple 328 iterator of namedtuple
307 Parsed lines of the filter list. 329 Parsed lines of the filter list.
308 330
309 Raises 331 Raises
310 ------ 332 ------
311 ParseError 333 ParseError
312 Thrown during iteration for invalid filter list lines. 334 Thrown during iteration for invalid filter list lines.
313 TypeError 335 TypeError
314 If `lines` is not iterable. 336 If `lines` is not iterable.
315 337
316 """ 338 """
317 metadata_closed = False 339 position = 'start'
318 340
319 for line in lines: 341 for line in lines:
320 result = parse_line(line) 342 parsed_line = parse_line(line, position)
343 yield parsed_line
321 344
322 if result.type == 'comment': 345 if position != 'body' and parsed_line.type in {'header', 'metadata'}:
323 match = METADATA_REGEXP.match(result.text) 346 # Continue parsing metadata until it's over...
324 if match: 347 position = 'metadata'
325 key, value = match.groups() 348 else:
326 349 # ...then switch to parsing the body.
327 # Historically, checksums can occur at the bottom of the 350 position = 'body'
328 # filter list. Checksums are no longer used by Adblock Plus,
329 # but in order to strip them (in abp.filters.renderer),
330 # we have to make sure to still parse them regardless of
331 # their position in the filter list.
332 if not metadata_closed or key.lower() == 'checksum':
333 result = Metadata(key, value)
334
335 if result.type not in {'header', 'metadata'}:
336 metadata_closed = True
337
338 yield result
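The position handling in the new parse_filterlist() can be sketched as a small, self-contained state machine. This is a simplified illustration, not the patch itself: `classify` and `classify_list` are hypothetical names, they return only the line type as a string rather than the namedtuples from `_line_type`, and header validation and filter parsing are left out.

```python
import re

# New-side pattern from the diff: the leading '!' is part of the match.
METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')


def classify(line, position='body'):
    """Return the line type that the given position allows (hypothetical)."""
    content = line.strip()
    if content == '':
        return 'emptyline'
    if content.startswith('!'):
        match = METADATA_REGEXP.match(content)
        # Metadata counts only before the body, except for checksums.
        if match and (position != 'body'
                      or match.group(1).lower() == 'checksum'):
            return 'metadata'
        return 'comment'
    if content.startswith('%') and content.endswith('%'):
        return 'include'
    # Headers are only recognized on the very first line.
    if (position == 'start' and content.startswith('[')
            and content.endswith(']')):
        return 'header'
    return 'filter'


def classify_list(lines):
    """Yield line types, moving from start to metadata to body."""
    position = 'start'
    for line in lines:
        kind = classify(line, position)
        yield kind
        if position != 'body' and kind in {'header', 'metadata'}:
            position = 'metadata'  # still inside the metadata block
        else:
            position = 'body'      # anything else closes it for good


sample = [
    '[Adblock Plus 2.0]',
    '! Title: Example',
    '! plain comment',
    '! Version: 1',
    '||example.com^',
    '! Checksum: abc',
    '[Adblock]',
]
kinds = list(classify_list(sample))
```

In this sketch the plain comment on the third line closes the metadata block, so the later `! Version: 1` is classified as a comment, the checksum is still classified as metadata, and the bracketed last line falls through to the filter branch because it is not on the first line.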