abp/filters/parser.py - Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list

Side by Side Diff: abp/filters/parser.py

Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list (Closed)

Patch Set: Correct behavior, add comments, improve naming, add tests Created Sept. 18, 2018, 12:37 p.m.

OLD	NEW
1 # This file is part of Adblock Plus <https://adblockplus.org/>,	1 # This file is part of Adblock Plus <https://adblockplus.org/>,

2 # Copyright (C) 2006-present eyeo GmbH	2 # Copyright (C) 2006-present eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

(...skipping 122 matching lines...)
133	133

134	134

135 Header = _line_type('Header', 'version', '[{.version}]')	135 Header = _line_type('Header', 'version', '[{.version}]')

136 EmptyLine = _line_type('EmptyLine', '', '')	136 EmptyLine = _line_type('EmptyLine', '', '')

137 Comment = _line_type('Comment', 'text', '! {.text}')	137 Comment = _line_type('Comment', 'text', '! {.text}')

138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')	138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')

139 Filter = _line_type('Filter', 'text selector action options', '{.text}')	139 Filter = _line_type('Filter', 'text selector action options', '{.text}')

140 Include = _line_type('Include', 'target', '%include {0.target}%')	140 Include = _line_type('Include', 'target', '%include {0.target}%')

141	141

142	142

143 METADATA_REGEXP = re.compile(r'(.*?)\s*:\s*(.*)')	143 METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')

144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')	144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')

145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\sPlus\s[\d\.]+?)?)\]', flags=re.I)	145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\sPlus\s[\d\.]+?)?)\]', flags=re.I)
Sebastian Noack 2018/09/18 15:19:08 Is this regexp missing a $ at the end? I didn't test it, but to me it seems we would incorrectly consider "[Adblock]]" a valid header.
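The concern can be checked in isolation. The following is a minimal sketch, assuming the pattern exactly as it is displayed in the diff above; `HEADER_REGEXP_ANCHORED` and `accepts()` are hypothetical names introduced only for this illustration and are not part of the patch.

```python
import re

# Pattern copied from the diff as displayed.
HEADER_REGEXP = re.compile(r'\[(Adblock(?:\sPlus\s[\d\.]+?)?)\]', flags=re.I)

# Hypothetical variant with the trailing "$" the reviewer asks about.
HEADER_REGEXP_ANCHORED = re.compile(
    r'\[(Adblock(?:\sPlus\s[\d\.]+?)?)\]$', flags=re.I
)


def accepts(regexp, text):
    """Return True if regexp.match() succeeds at the start of text."""
    return regexp.match(text) is not None


for candidate in ['[Adblock]', '[Adblock Plus 2.0]', '[Adblock]]']:
    # Print whether each variant accepts the candidate header line.
    print(candidate,
          accepts(HEADER_REGEXP, candidate),
          accepts(HEADER_REGEXP_ANCHORED, candidate))
```

Because `match()` only anchors at the start of the string, the unanchored pattern accepts the valid prefix of "[Adblock]]" and ignores the extra bracket, while the anchored variant rejects it.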
146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')	146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')

147 FILTER_OPTIONS_REGEXP = re.compile(	147 FILTER_OPTIONS_REGEXP = re.compile(

148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'	148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'

149 )	149 )

150	150

151	151

152 def _parse_header(text):	152 def _parse_header(text):

153 match = HEADER_REGEXP.match(text)	153 match = HEADER_REGEXP.match(text)

154 if not match:	154 if not match:

155 raise ParseError('Malformed header', text)	155 raise ParseError('Malformed header', text)

(...skipping 88 matching lines...)
244 Parsed filter.	244 Parsed filter.

245	245

246 """	246 """

247 if '#' in text:	247 if '#' in text:

248 match = HIDING_FILTER_REGEXP.search(text)	248 match = HIDING_FILTER_REGEXP.search(text)

249 if match:	249 if match:

250 return _parse_hiding_filter(text, *match.groups())	250 return _parse_hiding_filter(text, *match.groups())

251 return _parse_blocking_filter(text)	251 return _parse_blocking_filter(text)

252	252

253	253

254 def parse_line(line_text):	254 def parse_line(line_text, mode='body'):
	Sebastian Noack 2018/09/18 15:19:08 We probably should call the argument here "position" too.
	Vasily Kuznetsov 2018/09/18 18:11:44 I originally left it as "mode" on purpose, but now it seems that "position" is also a good name and it's more consistent with the use in parse_filterlist(). Done.
255 """Parse one line of a filter list.	255 """Parse one line of a filter list.

256	256

257 Note that parse_line() doesn't handle special comments, hence never returns	257 The types of lines that that the parser recognizes depend on the mode. In

258 a Metadata() object, Adblock Plus only considers metadata when parsing the	258 body mode the parser only recognizes filters, comments, processing

259 whole filter list and only if they are given at the top of the filter list.	259 instructions and empty lines. In medata mode it in addition recognizes

	260 metadata. In start mode it also recognizes headers.

	261

	262 Note: checksum metadata lines are recognized in all modes for backwards
	Sebastian Noack 2018/09/18 15:19:07 Typo: Capitalize the first word after a colon.
	Vasily Kuznetsov 2018/09/18 18:11:44 Done.
	263 compatibility. Historically, checksums can occur at the bottom of the

	264 filter list. They are are no longer used by Adblock Plus, but in order to

	265 strip them (in abp.filters.renderer), we have to make sure to still parse

	266 them regardless of their position in the filter list.

260	267

261 Parameters	268 Parameters

262 ----------	269 ----------

263 line_text : str	270 line_text : str

264 Line of a filter list.	271 Line of a filter list.

	272 mode : str

	273 Parsing mode, one of "start", "metadata" or "body" (default).

265	274

266 Returns	275 Returns

267 -------	276 -------

268 namedtuple	277 namedtuple

269 Parsed line (see `_line_type`).	278 Parsed line (see `_line_type`).

270	279

271 Raises	280 Raises

272 ------	281 ------

273 ParseError	282 ParseError

274 ParseError: If the line can't be parsed.	283 ParseError: If the line can't be parsed.

	284

275 """	285 """

	286 MODES = {'body', 'start', 'metadata'}

	287 if mode not in MODES:

	288 raise ValueError('mode should be one of {}'.format(MODES))

	289

276 if isinstance(line_text, type(b'')):	290 if isinstance(line_text, type(b'')):

277 line_text = line_text.decode('utf-8')	291 line_text = line_text.decode('utf-8')

278	292

279 content = line_text.strip()	293 content = line_text.strip()
Sebastian Noack 2018/09/18 15:19:07 Given the only difference between line_text and content is that the latter is stripped, and we now have logic that, depending on the case, needs either the stripped or the unstripped string, perhaps the code is easier to understand if we rename this variable to "stripped".
280	294

281 if content == '':	295 if content == '':

282 line = EmptyLine()	296 return EmptyLine()

283 elif content.startswith('!'):

284 line = Comment(content[1:].lstrip())

285 elif content.startswith('%') and content.endswith('%'):

286 line = _parse_instruction(content)

287 elif content.startswith('[') and content.endswith(']'):

288 line = _parse_header(content)

289 else:

290 line = parse_filter(content)

291	297

292 assert line.to_string().replace(' ', '') == content.replace(' ', '')	298 if content.startswith('!'):

293 return line	299 match = METADATA_REGEXP.match(line_text)

	300 if match:

	301 key, value = match.groups()

	302 # Metadata is only parsed in start or metadata modes but there's

	303 # an exception for checksums (see docstring).
	Sebastian Noack 2018/09/18 15:19:08 Nit: IMO, this comment is redundant. That we special-case "checksum" is obvious from the code; the reason why is now covered in the docstring.
	Vasily Kuznetsov 2018/09/18 18:11:43 Done.
	304 if mode != 'body' or key.lower() == 'checksum':

	305 return Metadata(key, value)

	306 return Comment(content[1:].lstrip())

	307

	308 if content.startswith('%') and content.endswith('%'):

	309 return _parse_instruction(content)

	310

	311 if mode == 'start' and (line_text.startswith('[') and

	312 line_text.endswith(']')):
	Sebastian Noack 2018/09/18 15:19:08 Nit: Perhaps just rename the variable (e.g. to "line"), so that you don't need to wrap here.
	Vasily Kuznetsov 2018/09/18 18:11:43 Done.
	313 return _parse_header(content)
	Sebastian Noack 2018/09/18 15:19:08 The result should be the same, since by asserting that the line starts and ends with square brackets, it's implied that there is no leading or trailing whitespace, but it might be easier to verify the code without having to make that assumption by passing in line_text instead of content here.
	Vasily Kuznetsov 2018/09/18 18:11:43 Done.
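The implication the reviewer relies on can be illustrated with a small sketch. `looks_like_header()` is a hypothetical helper that only reproduces the bracket check from lines 311-312 of the new side; the sample strings are made up for this example.

```python
def looks_like_header(line_text):
    """Bracket check from the patch, applied to the unstripped line."""
    return line_text.startswith('[') and line_text.endswith(']')


samples = [
    '[Adblock Plus 2.0]',    # no surrounding whitespace
    ' [Adblock Plus 2.0]',   # leading space
    '[Adblock Plus 2.0] ',   # trailing space
    '[Adblock Plus 2.0]\n',  # trailing newline
]

# Only lines without leading or trailing whitespace pass the check.
passing = [s for s in samples if looks_like_header(s)]
print(passing)
```

Every string that passes the check is equal to its own stripped form, so passing either `content` or `line_text` to `_parse_header()` gives the same result. Any input with surrounding whitespace, including a trailing newline, fails the check and is not treated as a header by this condition.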
	314

	315 return parse_filter(content)

294	316

295	317

296 def parse_filterlist(lines):	318 def parse_filterlist(lines):

297 """Parse filter list from an iterable.	319 """Parse filter list from an iterable.

298	320

299 Parameters	321 Parameters

300 ----------	322 ----------

301 lines: iterable of str	323 lines: iterable of str

302 Lines of the filter list.	324 Lines of the filter list.

303	325

304 Returns	326 Returns

305 -------	327 -------

306 iterator of namedtuple	328 iterator of namedtuple

307 Parsed lines of the filter list.	329 Parsed lines of the filter list.

308	330

309 Raises	331 Raises

310 ------	332 ------

311 ParseError	333 ParseError

312 Thrown during iteration for invalid filter list lines.	334 Thrown during iteration for invalid filter list lines.

313 TypeError	335 TypeError

314 If `lines` is not iterable.	336 If `lines` is not iterable.

315	337

316 """	338 """

317 metadata_closed = False	339 position = 'start'

318	340

319 for line in lines:	341 for line in lines:

320 result = parse_line(line)	342 parsed_line = parse_line(line, position)

	343 yield parsed_line

321	344

322 if result.type == 'comment':	345 if position != 'body' and parsed_line.type in {'header', 'metadata'}:

323 match = METADATA_REGEXP.match(result.text)	346 # Continue parsing metadata until it's over...

324 if match:	347 position = 'metadata'

325 key, value = match.groups()	348 else:

326	349 # ...then switch to parsing the body.

327 # Historically, checksums can occur at the bottom of the	350 position = 'body'

328 # filter list. Checksums are no longer used by Adblock Plus,

329 # but in order to strip them (in abp.filters.renderer),

330 # we have to make sure to still parse them regardless of

331 # their position in the filter list.

332 if not metadata_closed or key.lower() == 'checksum':

333 result = Metadata(key, value)

334

335 if result.type not in {'header', 'metadata'}:

336 metadata_closed = True

337

338 yield result
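The position handling on the new side can be traced with a simplified, self-contained sketch. It is an illustration under stated assumptions, not the patch itself: `classify()` and `walk()` are hypothetical stand-ins for `parse_line()` and `parse_filterlist()` that return only the line type, skip instructions and header validation, and use an assumed reconstruction of `METADATA_REGEXP` (the quantifiers are not visible in the diff as displayed). The example filter list is made up.

```python
import re

# Assumed reconstruction of the metadata pattern from the new side.
METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')


def classify(line, position):
    """Hypothetical stand-in for parse_line(): return only the line type."""
    stripped = line.strip()
    if stripped == '':
        return 'emptyline'
    if stripped.startswith('!'):
        match = METADATA_REGEXP.match(line)
        # Metadata counts only outside the body, except for checksums.
        if match and (position != 'body'
                      or match.group(1).lower() == 'checksum'):
            return 'metadata'
        return 'comment'
    # Headers are only recognized at the very start of the list.
    if position == 'start' and line.startswith('[') and line.endswith(']'):
        return 'header'
    return 'filter'


def walk(lines):
    """Yield (position, type) pairs using the patch's transition rule."""
    position = 'start'
    for line in lines:
        line_type = classify(line, position)
        yield position, line_type
        if position != 'body' and line_type in {'header', 'metadata'}:
            position = 'metadata'  # keep reading metadata
        else:
            position = 'body'      # anything else closes the metadata block


example = [
    '[Adblock Plus 2.0]',
    '! Title: Example list',
    '! Plain comment',
    '! Version: 123',
    '||example.com^',
    '! Checksum: abcdef',
]
trace = list(walk(example))
for entry in trace:
    print(entry)
```

In this trace the plain comment closes the metadata block, so the later "Version" line is classified as an ordinary comment, while the trailing checksum line is still reported as metadata because of the backwards-compatibility exception described in the docstring.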
