Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list (Closed)

Created:
Sept. 14, 2018, 4:37 p.m. by Vasily Kuznetsov

Modified:
Sept. 19, 2018, 1:30 p.m.

Reviewers:
Sebastian Noack

Visibility:
Public.

Description

Issue 6877 - Only parse headers in the first line of the filter list

This is an alternative version of https://codereview.adblockplus.org/29880555/
Based on this review: https://codereview.adblockplus.org/29879650/

Patch Set 1: Initial

Total comments: 15

Patch Set 2: Correct behavior, add comments, improve naming, add tests

Total comments: 14

Patch Set 3: Fix header parsing, improve argument naming and documentation

Created: Sept. 18, 2018, 6:06 p.m.


Stats: +111 lines, -55 lines

M abp/filters/parser.py (3 chunks, +58 lines, -48 lines, 0 comments)
M abp/filters/rpy.py (1 chunk, +4 lines, -2 lines, 0 comments)
M tests/test_parser.py (1 chunk, +39 lines, -4 lines, 0 comments)
M tests/test_rpy.py (2 chunks, +10 lines, -1 line, 0 comments)

Messages

Total messages: 8


Vasily Kuznetsov

Here's an implementation of the idea from https://issues.adblockplus.org/ticket/6877#comment:3 The logic is the same as in ...

Sept. 14, 2018, 4:52 p.m. (2018-09-14 16:52:50 UTC) #1

Sebastian Noack

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (left):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py#oldcode279
abp/filters/parser.py:279: content = line_text.strip()
Adblock Plus doesn't strip the line ...

Sept. 15, 2018, 4:08 p.m. (2018-09-15 16:08:33 UTC) #2

Vasily Kuznetsov


Sept. 17, 2018, 10:40 a.m. (2018-09-17 10:40:27 UTC) #3

Hi Sebastian,

Thanks for the comments, I will address them. See also some follow up questions
below.

What do you think about the whole approach? Are you more convinced or do you
still prefer doing more parsing in parse_filterlist() as in your version of this
review?

Cheers,
Vasily

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (left):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:279: content = line_text.strip()
On 2018/09/15 16:08:32, Sebastian Noack wrote:
> Adblock Plus doesn't strip the line before processing headers and metadata,
> i.e. a line with leading and/or trailing whitespaces isn't considered a valid
> header, and trailing whitespaces in metadata values are preserved.

The behavior of ABP for the headers seems right. I will adjust the code here.

However I'm not so sure about preserving the trailing space. Do you think it is
desirable? I mean do you think ABP is doing the right thing in this case -- I
agree that python-abp should behave the same.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:145: HEADER_REGEXP =
re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$',
On 2018/09/15 16:08:32, Sebastian Noack wrote:
> I changed this regular expression like this in my patch so that I don't have
> to first check whether the line starts and ends with square brackets. With
> your implementation this seems redundant.

Yeah, you're right. I think the logic of parse_line() is more clear the way it
is so I will undo the regexp change.
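
For illustration, the two variants of the pattern quoted in this thread can be compared side by side. This is only a minimal sketch, not the python-abp code: the names STRICT_HEADER, LENIENT_HEADER and header_version() are made up here, and flags=re.I on the lenient variant is an assumption because the quoted line is cut off before the flags.

```python
import re

# Variant that only matches "[Adblock ...]" headers, so the caller has to
# check for the surrounding square brackets separately.
STRICT_HEADER = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)

# Variant from patch set 1: any bracketed line matches, and group 1 is only
# set when the content looks like an Adblock header.
# Assumption: the flags are the same as for the strict variant.
LENIENT_HEADER = re.compile(
    r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$', flags=re.I)


def header_version(line):
    """Return the Adblock version, '' for another header, None otherwise."""
    match = LENIENT_HEADER.match(line)
    if match is None:
        return None
    return match.group(1) or ''
```

With the lenient variant a single match answers both questions (is the line bracketed, and is it an Adblock header), which is why a separate bracket check in parse_line() would make it redundant.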

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:256: def parse_line(line_text, mode='body'):
On 2018/09/15 16:08:32, Sebastian Noack wrote:
> Having the "mode" as part of the public API requires documenting it (you did
> that below), and probably also calls for more tests than we currently have. By
> not exposing this implementation detail, the public API (and its documentation
> and tests) can be simpler. Also we wouldn't need to validate the value below.

I would like to keep the mode in the public API. This way the full functionality
of line parsing is available to the users and I think it's a good thing. You're
right about the tests, though; I will add them.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum':
On 2018/09/15 16:08:32, Sebastian Noack wrote:
> We probably should keep the comment why we treat checksums special here.

I would also like to keep the note about checksums in the docstring. Do you
think it would be ok to just refer to the docstring here?
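
The condition under discussion can be sketched roughly as follows. This is a simplified stand-in, not the reviewed code: METADATA and classify_comment() are hypothetical names, the regular expression is an assumption, and only the "mode != 'body' or key.lower() == 'checksum'" test is taken from the diff.

```python
import re

# Hypothetical metadata pattern, used only for this sketch.
METADATA = re.compile(r'!\s*(\w+)\s*:\s*(.*)')


def classify_comment(line, mode='body'):
    """Classify a '!' line as ('metadata', key, value) or ('comment', text)."""
    match = METADATA.match(line)
    if match:
        key, value = match.groups()
        # Metadata is only recognized outside of the body, except for
        # checksums, which are recognized in all modes.
        if mode != 'body' or key.lower() == 'checksum':
            return ('metadata', key, value)
    # Everything else that starts with '!' is a plain comment.
    return ('comment', line[1:].lstrip())
```

Under these assumptions, '! Title: Example' is metadata in 'start' mode but a comment in 'body' mode, while '! Checksum: abc' is metadata in both.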

Sebastian Noack


Sept. 17, 2018, 6:11 p.m. (2018-09-17 18:11:52 UTC) #4

On 2018/09/17 10:40:27, Vasily Kuznetsov wrote:
> What do you think about the whole approach? Are you more convinced or do you
> still prefer doing more parsing in parse_filterlist() as in your version of
> this review?

Let's see what the changes look like after the outstanding comments have
been addressed. But anyway it seems your approach makes things more complex.
Another aspect we haven't looked into yet is performance.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (left):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:279: content = line_text.strip()
On 2018/09/17 10:40:27, Vasily Kuznetsov wrote:
> However I'm not so sure about preserving the trailing space. Do you think it is
> desirable? I mean do you think ABP is doing the right thing in this case -- I
> agree that python-abp should behave the same.

Adblock Plus extracts metadata (and the header) before parsing the remaining
filter list content (so does my patch for python-abp). The regular expression
used to identify metadata and extract the key and value would be more
complicated if we'd want to strip trailing whitespaces from the value, and in
practice it doesn't seem to matter whether trailing spaces are stripped.
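
To make the trade-off concrete, here is a small sketch of the two options. Both patterns are hypothetical and written only for this example; they are not the expressions used by Adblock Plus or python-abp.

```python
import re

# Simple pattern: whatever follows the colon (and leading spaces) is the
# value, so trailing whitespace is preserved.
KEEP_TRAILING = re.compile(r'!\s*(\w+)\s*:\s*(.*)')

# Stripping variant: needs a lazy group plus an explicit end anchor so that
# trailing whitespace falls outside of the captured value.
STRIP_TRAILING = re.compile(r'!\s*(\w+)\s*:\s*(.*?)\s*$')


def metadata_value(pattern, line):
    """Return the metadata value captured by `pattern`, or None."""
    match = pattern.match(line)
    return match.group(2) if match else None
```

For the line '! Title: My list  ' the first pattern captures 'My list  ' (two trailing spaces kept) and the second captures 'My list'.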

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum':
On 2018/09/17 10:40:27, Vasily Kuznetsov wrote:
> On 2018/09/15 16:08:32, Sebastian Noack wrote:
> > We probably should keep the comment why we treat checksums special here.
> 
> I would also like to keep the note about checksums in the docstring. Do you
> think it would be ok to just refer to the docstring here?

I didn't notice that you moved that note to the docstring. Personally, I would
rather have kept it here, but fair enough.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:338: mode = 'start'
Maybe "position" would be a more accurate name for this variable? Also, instead
of "start", maybe "header" would be more in line with the other values.

Vasily Kuznetsov


Sept. 18, 2018, 12:41 p.m. (2018-09-18 12:41:15 UTC) #5

Hi Sebastian,

I've addressed / replied to your comments and added a couple of tests to cover
the extended behavior of parse_line().

Cheers,
Vasily

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (left):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:279: content = line_text.strip()
On 2018/09/17 18:11:52, Sebastian Noack wrote:
> On 2018/09/17 10:40:27, Vasily Kuznetsov wrote:
> > However I'm not so sure about preserving the trailing space. Do you think it is
> > desirable? I mean do you think ABP is doing the right thing in this case -- I
> > agree that python-abp should behave the same.
> 
> Adblock Plus extracts metadata (and the header) before parsing the remaining
> filter list content (so does my patch for python-abp). The regular expression
> used to identify metadata and extract the key and value would be more
> complicated if we'd want to strip trailing whitespaces from the value, and in
> practice it doesn't seem to matter whether trailing spaces are stripped.

Acknowledged.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:145: HEADER_REGEXP =
re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$',
On 2018/09/17 10:40:27, Vasily Kuznetsov wrote:
> On 2018/09/15 16:08:32, Sebastian Noack wrote:
> > I changed this regular expression like this in my patch so that I don't have
> > to first check whether the line starts and ends with square brackets. With
> > your implementation this seems redundant.
> 
> Yeah, you're right. I think the logic of parse_line() is more clear the way it
> is so I will undo the regexp change.

Done.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum':
On 2018/09/17 18:11:52, Sebastian Noack wrote:
> On 2018/09/17 10:40:27, Vasily Kuznetsov wrote:
> > On 2018/09/15 16:08:32, Sebastian Noack wrote:
> > > We probably should keep the comment why we treat checksums special here.
> > 
> > I would also like to keep the note about checksums in the docstring. Do you
> > think it would be ok to just refer to the docstring here?
> 
> I didn't notice that you moved that note to the docstring. Personally, I would
> rather have kept it here, but fair enough.

It needs to be in the docstring because it's part of the external behavior /
API. I agree that mentioning it here is helpful, so I'll do both.

https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser....
abp/filters/parser.py:338: mode = 'start'
On 2018/09/17 18:11:52, Sebastian Noack wrote:
> Maybe "position" would be a more accurate name for this variable? Also, instead
> of "start", maybe "header" would be more in line with the other values.

Yeah, "position" is a better name. I changed it.

As for "start" vs. "header" -- I have chosen the former because there's not
always a header in the file, so giving the impression that we're only looking
for a header seems misleading.
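
The naming question is easier to follow with the state transitions written out. The sketch below is an assumption about how such a loop might look, not the code under review: walk_positions() is a made-up helper, the intermediate 'metadata' value is assumed, and the line classification is deliberately crude.

```python
def walk_positions(lines):
    """Yield (position, line) pairs while tracking the parsing position."""
    position = 'start'
    for line in lines:
        yield position, line
        # A header can only appear in the very first line.
        is_header = (position == 'start' and
                     line.startswith('[') and line.endswith(']'))
        # Crude stand-in for a real metadata check.
        is_metadata = line.startswith('!') and ':' in line
        if position != 'body' and (is_header or is_metadata):
            # Keep looking for metadata until the first other line...
            position = 'metadata'
        else:
            # ...then treat everything that follows as body.
            position = 'body'
```

A list without a header still begins in 'start' and moves directly to 'metadata' or 'body', which is the case that the name "start" is meant to cover.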

Sebastian Noack

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.py
File abp/filters/parser.py (left):

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.py#oldcode145
abp/filters/parser.py:145: HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)
Does this regexp is missing ...

Sept. 18, 2018, 3:19 p.m. (2018-09-18 15:19:08 UTC) #6

Vasily Kuznetsov


Sept. 18, 2018, 6:11 p.m. (2018-09-18 18:11:44 UTC) #7

Hi Sebastian,

Thanks for all the comments and for the discussion about header parsing in the
#python chat. I've implemented all that.

Cheers,
Vasily

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser....
abp/filters/parser.py:254: def parse_line(line_text, mode='body'):
On 2018/09/18 15:19:08, Sebastian Noack wrote:
> We probably should call the argument here "position" too.

I originally left it as "mode" on purpose, but now it seems that "position" is
also a good name and it's more consistent with the use in parse_filterlist().

Done.

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser....
abp/filters/parser.py:262: Note: checksum metadata lines are recognized in all
modes for backwards
On 2018/09/18 15:19:07, Sebastian Noack wrote:
> Typo: Capitalize the first word after a colon.

Done.

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser....
abp/filters/parser.py:303: # an exception for checksums (see docstring).
On 2018/09/18 15:19:08, Sebastian Noack wrote:
> Nit: IMO, this comment is redundant. That we special case "checksum" is obvious
> from the code, the reason why is now covered in the docstring.

Done.

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser....
abp/filters/parser.py:312: line_text.endswith(']')):
On 2018/09/18 15:19:08, Sebastian Noack wrote:
> Nit: Perhaps just rename the variable (e.g. to "line"), so that you don't need
> to wrap here.

Done.

https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser....
abp/filters/parser.py:313: return _parse_header(content)
On 2018/09/18 15:19:08, Sebastian Noack wrote:
> The result should be the same, since by asserting that the line starts and
> ends with square brackets, it's implied that there are no leading or trailing
> whitespaces, but it might be easier to verify the code without having to make
> that assumption by passing in line_text instead of content here.

Done.
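
The implication mentioned in the comment above can be checked with a few lines. This is a tiny sketch with a hypothetical helper name, looks_like_header(), rather than anything from the patch.

```python
def looks_like_header(line):
    """Return True if the unstripped line is wrapped in square brackets."""
    return line.startswith('[') and line.endswith(']')


# Lines with leading or trailing whitespace fail the check, so any line that
# passes it is already equal to its stripped form.
samples = ['[Adblock Plus 2.0]', ' [Adblock Plus 2.0]', '[Adblock Plus 2.0] ']
accepted = [line for line in samples if looks_like_header(line)]
```

Since every accepted line equals line.strip(), passing the raw line to the header parser gives the same result as passing the stripped content.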

https://codereview.adblockplus.org/29880577/diff/29884555/tests/test_rpy.py
File tests/test_rpy.py (right):

https://codereview.adblockplus.org/29880577/diff/29884555/tests/test_rpy.py#n...
tests/test_rpy.py:139: data = line2dict(_TEST_EXAMPLES[line_type]['in'], mode)
On 2018/09/18 15:19:08, Sebastian Noack wrote:
> Note that I removed the test case for metadata here. We might want to add it
> back and add a similar special case as for headers.

Done.

LGTM
