|
|
Created:
Sept. 14, 2018, 4:37 p.m. by Vasily Kuznetsov Modified:
Sept. 19, 2018, 1:30 p.m. Reviewers:
Sebastian Noack Visibility:
Public. |
DescriptionIssue 6877 - Only parse headers in the first line of the filter list
This is an alternative version of https://codereview.adblockplus.org/29880555/
Based on this review: https://codereview.adblockplus.org/29879650/
Patch Set 1 : Initial #
Total comments: 15
Patch Set 2 : Correct behavior, add comments, improve naming, add tests #
Total comments: 14
Patch Set 3 : Fix header parsing, improve argument naming and documentation #
MessagesTotal messages: 8
Here's an implementation of the idea from https://issues.adblockplus.org/ticket/6877#comment:3 The logic is the same as in https://codereview.adblockplus.org/29880555/ as far as I can tell (or do we need more tests?). The advantages of this approach in my opinion are (1) the separation of responsibilities where `parse_line` does all the text processing and regexp matching and `parse_filterlist` only works at the high level and (2) fully functional `parse_line` that can still parse headers and metadata (and this is already useful even inside python-abp because of the R integration). Let me know what you think.
https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (left): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:279: content = line_text.strip() Adblock Plus doesn't strip the line before processing headers and metadata, i.e. a line with leading and/or trailing whitespaces isn't considered a valid header, and trailing whitespaces in metadata values are preserved. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:145: HEADER_REGEXP = re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$', I changed this regular epxressions like this in my patch so that I don't have to first check whether the line starts and ends with square brackets. With your implementation this seems redundant. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:256: def parse_line(line_text, mode='body'): Having the "mode" as part of the public API, requires to document it (you did that below), and probably also calls for more tests than we currently have. By not exposing this implementation detail, the public API (and it's documentation and tests) can be simpler. Also we wouldn't need to validate the value below. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum': We probably should keep the comment why we treat checksums special here.
Hi Sebastian, Thanks for the comments, I will address them. See also some follow up questions below. What do you think about the whole approach? Are you more convinced or do you still prefer doing more parsing in parse_filterlist() as in your version of this review? Cheers, Vasily https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (left): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:279: content = line_text.strip() On 2018/09/15 16:08:32, Sebastian Noack wrote: > Adblock Plus doesn't strip the line before processing headers and metadata, i.e. > a line with leading and/or trailing whitespaces isn't considered a valid header, > and trailing whitespaces in metadata values are preserved. The behavior of ABP for the headers seems right. I will adjust the code here. However I'm not so sure about preserving the trailing space. Do you think is desirable? I mean do you think ABP is doing the right thing in this case -- I agree that python-abp should behave the same. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:145: HEADER_REGEXP = re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$', On 2018/09/15 16:08:32, Sebastian Noack wrote: > I changed this regular epxressions like this in my patch so that I don't have to > first check whether the line starts and ends with square brackets. With your > implementation this seems redundant. Yeah, you're right. I think the logic of parse_line() is more clear the way it is so I will undo the regexp change. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:256: def parse_line(line_text, mode='body'): On 2018/09/15 16:08:32, Sebastian Noack wrote: > Having the "mode" as part of the public API, requires to document it (you did > that below), and probably also calls for more tests than we currently have. By > not exposing this implementation detail, the public API (and it's documentation > and tests) can be simpler. Also we wouldn't need to validate the value below. I would like to keep the mode in the public API. This way full functionality or line parsing is available to the users and I think it's a good thing. You're right about the tests, however, I will add them. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum': On 2018/09/15 16:08:32, Sebastian Noack wrote: > We probably should keep the comment why we treat checksums special here. I would like to also keep the note about checksums in the docstring also. Do you think it would be ok to just refer to the docstring here?
On 2018/09/17 10:40:27, Vasily Kuznetsov wrote: > What do you think about the whole approach? Are you more convinced or do you > still prefer doing more parsing in parse_filterlist() as in your version of this > review? Let's see how the changes will look like after the outstanding comments have been addressed. But anyway it seems your approach make things more complex. Another aspect we didn't look into yet would be performance. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (left): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:279: content = line_text.strip() On 2018/09/17 10:40:27, Vasily Kuznetsov wrote: > However I'm not so sure about preserving the trailing space. Do you think is > desirable? I mean do you think ABP is doing the right thing in this case -- I > agree that python-abp should behave the same. Adblock Plus extracts metadata (and the header) before parsing the remaining filter list content (so does my patch for python-abp). The regular expression used to identify metadata and extract the key and value would be more complicated if we'd want to strip trailing whitespaces from the value, and in practice it doesn't seem to matter whether trailing spaces are stripped. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum': On 2018/09/17 10:40:27, Vasily Kuznetsov wrote: > On 2018/09/15 16:08:32, Sebastian Noack wrote: > > We probably should keep the comment why we treat checksums special here. > > I would like to also keep the note about checksums in the docstring also. Do you > think it would be ok to just refer to the docstring here? I didn't notice that you moved that note to the docstring. I personally, would rather kept it here, but fair enough. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:338: mode = 'start' Maybe "position" would be more accurate name for this variable? Also instead of "start" maybe "header" would be more in line with the other values.
Hi Sebastian, I've addressed / replied to your comments and added a couple of tests to cover the extended behavior of parse_line(). Cheers, Vasily https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (left): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:279: content = line_text.strip() On 2018/09/17 18:11:52, Sebastian Noack wrote: > On 2018/09/17 10:40:27, Vasily Kuznetsov wrote: > > However I'm not so sure about preserving the trailing space. Do you think is > > desirable? I mean do you think ABP is doing the right thing in this case -- I > > agree that python-abp should behave the same. > > Adblock Plus extracts metadata (and the header) before parsing the remaining > filter list content (so does my patch for python-abp). The regular expression > used to identify metadata and extract the key and value would be more > complicated if we'd want to strip trailing whitespaces from the value, and in > practice it doesn't seem to matter whether trailing spaces are stripped. Acknowledged. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:145: HEADER_REGEXP = re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$', On 2018/09/17 10:40:27, Vasily Kuznetsov wrote: > On 2018/09/15 16:08:32, Sebastian Noack wrote: > > I changed this regular epxressions like this in my patch so that I don't have > to > > first check whether the line starts and ends with square brackets. With your > > implementation this seems redundant. > > Yeah, you're right. I think the logic of parse_line() is more clear the way it > is so I will undo the regexp change. Done. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:304: if mode != 'body' or key.lower() == 'checksum': On 2018/09/17 18:11:52, Sebastian Noack wrote: > On 2018/09/17 10:40:27, Vasily Kuznetsov wrote: > > On 2018/09/15 16:08:32, Sebastian Noack wrote: > > > We probably should keep the comment why we treat checksums special here. > > > > I would like to also keep the note about checksums in the docstring also. Do > you > > think it would be ok to just refer to the docstring here? > > I didn't notice that you moved that note to the docstring. I personally, would > rather kept it here, but fair enough. It needs to be in the docstring because it's part of the external behavior / API. I agree that mentioning it here is helpful, so I'll do both. https://codereview.adblockplus.org/29880577/diff/29880583/abp/filters/parser.... abp/filters/parser.py:338: mode = 'start' On 2018/09/17 18:11:52, Sebastian Noack wrote: > Maybe "position" would be more accurate name for this variable? Also instead of > "start" maybe "header" would be more in line with the other values. Yeah, "position" is a better name. I changed it. As for "start" vs. "header" -- I have chosen the former because there's not always a header in the file, so giving the impression that we're only looking for a header seems misleading.
https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.py File abp/filters/parser.py (left): https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:145: HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I) Does this regexp is missing a $ at the end? I didn't test it, but to me it seems we would incorrectly consider "[Adblock]]" a valid header. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:279: content = line_text.strip() Given the only difference between line_text and content is that the latter is stripped, and we now have logic that dependent on the case either needs the stripped or unstripped string, perhaps the code is easier to understand if we rename this variable to "stripped". https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:254: def parse_line(line_text, mode='body'): We probably should call the argument here "position" too. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:262: Note: checksum metadata lines are recognized in all modes for backwards Typo: Capitalize the first word after a colon. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:303: # an exception for checksums (see docstring). Nit: IMO, this comment is redundant. That we special case "checksum" is obvious from the code, the reason why is no covered in the docstring. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:312: line_text.endswith(']')): Nit: Perhaps just rename the variable (e.g. to "line"), so that you don't need to wrap here. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:313: return _parse_header(content) The result should be the same, since by asserting that the lines starts end ends with square brackets, it's implied that there are no leading or trailing whitespaces, but it might be easier to verify the code without having to make that assumption by passing in line_text instead of content here. https://codereview.adblockplus.org/29880577/diff/29884555/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29880577/diff/29884555/tests/test_rpy.py#n... tests/test_rpy.py:139: data = line2dict(_TEST_EXAMPLES[line_type]['in'], mode) Note that I removed the test case for metadata here. We might want to add it back and add a similar special case as for headers.
Hi Sebastian, Thanks for all the comments and for the discussion about header parsing in the #python chat. I've implemented all that. Cheers, Vasily https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:254: def parse_line(line_text, mode='body'): On 2018/09/18 15:19:08, Sebastian Noack wrote: > We probably should call the argument here "position" too. I originally left it as "mode" on purpose, but now it seems that "position" is also a good name and it's more consistent with the use in parse_filterlist(). Done. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:262: Note: checksum metadata lines are recognized in all modes for backwards On 2018/09/18 15:19:07, Sebastian Noack wrote: > Typo: Capitalize the first word after a colon. Done. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:303: # an exception for checksums (see docstring). On 2018/09/18 15:19:08, Sebastian Noack wrote: > Nit: IMO, this comment is redundant. That we special case "checksum" is obvious > from the code, the reason why is no covered in the docstring. Done. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:312: line_text.endswith(']')): On 2018/09/18 15:19:08, Sebastian Noack wrote: > Nit: Perhaps just rename the variable (e.g. to "line"), so that you don't need > to wrap here. Done. https://codereview.adblockplus.org/29880577/diff/29884555/abp/filters/parser.... abp/filters/parser.py:313: return _parse_header(content) On 2018/09/18 15:19:08, Sebastian Noack wrote: > The result should be the same, since by asserting that the lines starts end ends > with square brackets, it's implied that there are no leading or trailing > whitespaces, but it might be easier to verify the code without having to make > that assumption by passing in line_text instead of content here. Done. https://codereview.adblockplus.org/29880577/diff/29884555/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29880577/diff/29884555/tests/test_rpy.py#n... tests/test_rpy.py:139: data = line2dict(_TEST_EXAMPLES[line_type]['in'], mode) On 2018/09/18 15:19:08, Sebastian Noack wrote: > Note that I removed the test case for metadata here. We might want to add it > back and add a similar special case as for headers. Done.
LGTM |