Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code

Delta Between Two Patch Sets: abp/filters/parser.py

Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list (Closed)
Left Patch Set: Correct behavior, add comments, improve naming, add tests Created Sept. 18, 2018, 12:37 p.m.
Right Patch Set: Fix header parsing, improve argument naming and documentation Created Sept. 18, 2018, 6:06 p.m.
Left:
Right:
Use n/p to move between diff chunks; N/P to move between comments.
Jump to:
Left: Side by side diff | Download
Right: Side by side diff | Download
« no previous file with change/comment | « no previous file | abp/filters/rpy.py » ('j') | no next file with change/comment »
Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
LEFTRIGHT
1 # This file is part of Adblock Plus <https://adblockplus.org/>, 1 # This file is part of Adblock Plus <https://adblockplus.org/>,
2 # Copyright (C) 2006-present eyeo GmbH 2 # Copyright (C) 2006-present eyeo GmbH
3 # 3 #
4 # Adblock Plus is free software: you can redistribute it and/or modify 4 # Adblock Plus is free software: you can redistribute it and/or modify
5 # it under the terms of the GNU General Public License version 3 as 5 # it under the terms of the GNU General Public License version 3 as
6 # published by the Free Software Foundation. 6 # published by the Free Software Foundation.
7 # 7 #
8 # Adblock Plus is distributed in the hope that it will be useful, 8 # Adblock Plus is distributed in the hope that it will be useful,
9 # but WITHOUT ANY WARRANTY; without even the implied warranty of 9 # but WITHOUT ANY WARRANTY; without even the implied warranty of
10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
(...skipping 124 matching lines...) Expand 10 before | Expand all | Expand 10 after
135 Header = _line_type('Header', 'version', '[{.version}]') 135 Header = _line_type('Header', 'version', '[{.version}]')
136 EmptyLine = _line_type('EmptyLine', '', '') 136 EmptyLine = _line_type('EmptyLine', '', '')
137 Comment = _line_type('Comment', 'text', '! {.text}') 137 Comment = _line_type('Comment', 'text', '! {.text}')
138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}') 138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')
139 Filter = _line_type('Filter', 'text selector action options', '{.text}') 139 Filter = _line_type('Filter', 'text selector action options', '{.text}')
140 Include = _line_type('Include', 'target', '%include {0.target}%') 140 Include = _line_type('Include', 'target', '%include {0.target}%')
141 141
142 142
143 METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)') 143 METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')
144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%') 144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')
145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I) 145 HEADER_REGEXP = re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]',
146 flags=re.I)
146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$') 147 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')
147 FILTER_OPTIONS_REGEXP = re.compile( 148 FILTER_OPTIONS_REGEXP = re.compile(
148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$' 149 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'
149 ) 150 )
150
151
152 def _parse_header(text):
153 match = HEADER_REGEXP.match(text)
154 if not match:
155 raise ParseError('Malformed header', text)
156 return Header(match.group(1))
157 151
158 152
159 def _parse_instruction(text): 153 def _parse_instruction(text):
160 match = INCLUDE_REGEXP.match(text) 154 match = INCLUDE_REGEXP.match(text)
161 if not match: 155 if not match:
162 raise ParseError('Unrecognized instruction', text) 156 raise ParseError('Unrecognized instruction', text)
163 return Include(match.group(1)) 157 return Include(match.group(1))
164 158
165 159
166 def _parse_option(option): 160 def _parse_option(option):
(...skipping 77 matching lines...) Expand 10 before | Expand all | Expand 10 after
244 Parsed filter. 238 Parsed filter.
245 239
246 """ 240 """
247 if '#' in text: 241 if '#' in text:
248 match = HIDING_FILTER_REGEXP.search(text) 242 match = HIDING_FILTER_REGEXP.search(text)
249 if match: 243 if match:
250 return _parse_hiding_filter(text, *match.groups()) 244 return _parse_hiding_filter(text, *match.groups())
251 return _parse_blocking_filter(text) 245 return _parse_blocking_filter(text)
252 246
253 247
254 def parse_line(line_text, mode='body'): 248 def parse_line(line, position='body'):
Sebastian Noack 2018/09/18 15:19:08 We probably should call the argument here "positio
Vasily Kuznetsov 2018/09/18 18:11:44 I originally left it as "mode" on purpose, but now
255 """Parse one line of a filter list. 249 """Parse one line of a filter list.
256 250
257 The types of lines that that the parser recognizes depend on the mode. In 251 The types of lines that that the parser recognizes depend on the position.
258 body mode the parser only recognizes filters, comments, processing 252 If position="body", the parser only recognizes filters, comments,
259 instructions and empty lines. In medata mode it in addition recognizes 253 processing instructions and empty lines. If position="metadata", it in
260 metadata. In start mode it also recognizes headers. 254 addition recognizes metadata. If position="start", it also recognizes
261 255 headers.
262 Note: checksum metadata lines are recognized in all modes for backwards 256
Sebastian Noack 2018/09/18 15:19:07 Typo: Capitalize the first word after a colon.
Vasily Kuznetsov 2018/09/18 18:11:44 Done.
257 Note: Checksum metadata lines are recognized in all positions for backwards
263 compatibility. Historically, checksums can occur at the bottom of the 258 compatibility. Historically, checksums can occur at the bottom of the
264 filter list. They are are no longer used by Adblock Plus, but in order to 259 filter list. They are are no longer used by Adblock Plus, but in order to
265 strip them (in abp.filters.renderer), we have to make sure to still parse 260 strip them (in abp.filters.renderer), we have to make sure to still parse
266 them regardless of their position in the filter list. 261 them regardless of their position in the filter list.
267 262
268 Parameters 263 Parameters
269 ---------- 264 ----------
270 line_text : str 265 line : str
271 Line of a filter list. 266 Line of a filter list.
272 mode : str 267 position : str
273 Parsing mode, one of "start", "metadata" or "body" (default). 268 Position in the filter list, one of "start", "metadata" or "body"
269 (default is "body").
274 270
275 Returns 271 Returns
276 ------- 272 -------
277 namedtuple 273 namedtuple
278 Parsed line (see `_line_type`). 274 Parsed line (see `_line_type`).
279 275
280 Raises 276 Raises
281 ------ 277 ------
282 ParseError 278 ParseError
283 ParseError: If the line can't be parsed. 279 ParseError: If the line can't be parsed.
284 280
285 """ 281 """
286 MODES = {'body', 'start', 'metadata'} 282 POSITIONS = {'body', 'start', 'metadata'}
287 if mode not in MODES: 283 if position not in POSITIONS:
288 raise ValueError('mode should be one of {}'.format(MODES)) 284 raise ValueError('position should be one of {}'.format(POSITIONS))
289 285
290 if isinstance(line_text, type(b'')): 286 if isinstance(line, type(b'')):
291 line_text = line_text.decode('utf-8') 287 line = line.decode('utf-8')
292 288
293 content = line_text.strip() 289 stripped = line.strip()
294 290
295 if content == '': 291 if stripped == '':
296 return EmptyLine() 292 return EmptyLine()
297 293
298 if content.startswith('!'): 294 if position == 'start':
299 match = METADATA_REGEXP.match(line_text) 295 match = HEADER_REGEXP.search(line)
296 if match:
297 version = match.group(1)
298 if not version:
299 raise ParseError('Malformed header', line)
300 return Header(version)
301
302 if stripped.startswith('!'):
303 match = METADATA_REGEXP.match(line)
300 if match: 304 if match:
301 key, value = match.groups() 305 key, value = match.groups()
302 # Metadata is only parsed in start or metadata modes but there's 306 if position != 'body' or key.lower() == 'checksum':
303 # an exception for checksums (see docstring).
Sebastian Noack 2018/09/18 15:19:08 Nit: IMO, this comment is redundant. That we speci
Vasily Kuznetsov 2018/09/18 18:11:43 Done.
304 if mode != 'body' or key.lower() == 'checksum':
305 return Metadata(key, value) 307 return Metadata(key, value)
306 return Comment(content[1:].lstrip()) 308 return Comment(stripped[1:].lstrip())
307 309
308 if content.startswith('%') and content.endswith('%'): 310 if stripped.startswith('%') and stripped.endswith('%'):
309 return _parse_instruction(content) 311 return _parse_instruction(stripped)
310 312
311 if mode == 'start' and (line_text.startswith('[') and 313 return parse_filter(stripped)
312 line_text.endswith(']')):
Sebastian Noack 2018/09/18 15:19:08 Nit: Perhaps just rename the variable (e.g. to "li
Vasily Kuznetsov 2018/09/18 18:11:43 Done.
313 return _parse_header(content)
Sebastian Noack 2018/09/18 15:19:08 The result should be the same, since by asserting
Vasily Kuznetsov 2018/09/18 18:11:43 Done.
314
315 return parse_filter(content)
316 314
317 315
318 def parse_filterlist(lines): 316 def parse_filterlist(lines):
319 """Parse filter list from an iterable. 317 """Parse filter list from an iterable.
320 318
321 Parameters 319 Parameters
322 ---------- 320 ----------
323 lines: iterable of str 321 lines: iterable of str
324 Lines of the filter list. 322 Lines of the filter list.
325 323
(...skipping 15 matching lines...) Expand all
341 for line in lines: 339 for line in lines:
342 parsed_line = parse_line(line, position) 340 parsed_line = parse_line(line, position)
343 yield parsed_line 341 yield parsed_line
344 342
345 if position != 'body' and parsed_line.type in {'header', 'metadata'}: 343 if position != 'body' and parsed_line.type in {'header', 'metadata'}:
346 # Continue parsing metadata until it's over... 344 # Continue parsing metadata until it's over...
347 position = 'metadata' 345 position = 'metadata'
348 else: 346 else:
349 # ...then switch to parsing the body. 347 # ...then switch to parsing the body.
350 position = 'body' 348 position = 'body'
LEFTRIGHT
« no previous file | abp/filters/rpy.py » ('j') | Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Toggle Comments ('s')

Powered by Google App Engine
This is Rietveld