|
|
Created:
Sept. 3, 2018, 5:57 p.m. by Sebastian Noack Modified:
Sept. 13, 2018, 3:40 p.m. Visibility:
Public. |
DescriptionIssue 6920 - Only parse metadata from the top of the file
Patch Set 1 : #
Total comments: 8
Patch Set 2 : Fixed typo and moved logic to parse_filterlist() #Patch Set 3 : Documented behavior for parse_line(), simplified end-of-metadata semantics #
Total comments: 2
Patch Set 4 : Close metadata on everything but headers or comments #
Total comments: 2
Patch Set 5 : Test 'Last modified' case #
MessagesTotal messages: 14
Hi Sebastian, Thanks for working on this. The code looks good but I had a couple of thoughts about the API and general functionality. I think it would be preferable to keep the behavior of `parse_line` unchanged and to allow the caller to choose what they want to do with metadata-like comments. I also think that it would be better to make decisions about when to consider metadata closed in `parse_filterlist` rather than in `_parse_comment`, where some parts of the context are missing. Also the logic would be more obvious if it would all be in `parse_filterlist`. Finally, what do you think about closing metadata on a filter or an empty line instead of on a non-metadata comment (see supporting examples below)? Cheers, Vasily https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.... abp/filters/parser.py:315: return _parse_line(line_text, True)[0] Here we're changing the behavior of `parse_line` -- it will never parse comments as metadata now. This could create problems for people who use `parse_line` directly (and it's part of the external API). I also think it would be good to still be able to parse metadata comments using `parse_line`. I would propose that we have `metadata_closed` parameter in this function with a default of `False`. This way the behavior will be the same as before for people who use it directly (and they can choose if they want to disable metadata comment parsing as they like). We would need to move the logic of determining when to consider metadata closed into `parse_filterlist` (now it's in `_parse_comment`) but arguably that's where it belongs anyway. https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.... abp/filters/parser.py:338: metadata_closed = False What do you think about filter lists like this? [Adblock Plus 1.1] ! Homepage: https://foo.bar/ ! This is the filter list that foos the bar. ! Expires: 1 day ||example.com^ It seems that allowing comments inside of the "header" part could be useful for the filter authors and it looks kind of natural. Maybe we should declare the metadata closed on the first filter instead of on the first obviously non-metadata comment? Or maybe (also) on an empty line? Another example, that could be handled not the way we like: [Adblock Plus 1.1] ! Homepage: https://foo.bar ! WARNING: The following filters will drink your beer ||beer.de https://codereview.adblockplus.org/29873561/diff/29873574/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29873561/diff/29873574/tests/test_parser.p... tests/test_parser.py:169: '||exmample.com^', I guess it doesn't really matter but "exmample.com" looks kind of funny :D
On 2018/09/04 10:03:31, Vasily Kuznetsov wrote: > It seems that allowing comments inside of the "header" part could be useful for > the filter authors and it looks kind of natural. Maybe we should declare the > metadata closed on the first filter instead of on the first obviously > non-metadata comment? Or maybe (also) on an empty line? > > Another example, that could be handled not the way we like: > > [Adblock Plus 1.1] > ! Homepage: https://foo.bar > > ! WARNING: The following filters will drink your beer > ||beer.de > > https://codereview.adblockplus.org/29873561/diff/29873574/tests/test_parser.py > File tests/test_parser.py (right): > > https://codereview.adblockplus.org/29873561/diff/29873574/tests/test_parser.p... > tests/test_parser.py:169: '||exmample.com^', > I guess it doesn't really matter but "exmample.com" looks kind of funny :D I just looked at the changes in https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.py again and I see that actually you do close metadata on non-comment lines as well (and the behavior in https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js is the same). So basically the behavior I propose is quite similar to what you've implemented, the only difference would be to allow normal comments between metadata comments. I don't feel too strong about this and we should probably base the decisions on real filter lists and/or what the filter authors prefer (if we can get their feedback). This also makes me feel stronger about having all the metadata closing logic inside `parse_filterlist` -- it would be much easier to follow it this way.
https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.... abp/filters/parser.py:315: return _parse_line(line_text, True)[0] On 2018/09/04 10:03:30, Vasily Kuznetsov wrote: > Here we're changing the behavior of `parse_line` -- it will never parse comments > as metadata now. This could create problems for people who use `parse_line` > directly (and it's part of the external API). I also think it would be good to > still be able to parse metadata comments using `parse_line`. > > I would propose that we have `metadata_closed` parameter in this function with a > default of `False`. This way the behavior will be the same as before for people > who use it directly (and they can choose if they want to disable metadata > comment parsing as they like). We would need to move the logic of determining > when to consider metadata closed into `parse_filterlist` (now it's in > `_parse_comment`) but arguably that's where it belongs anyway. I moved the logic as requested to parse_filterlist(). But IMO, this makes it even more clear why parse_line() should never return a Metadata() object. Metadata (and headers but there is a separate issue for them) only have any meaning in the context of the whole filter list (and only if they occur in a particular position). This is consistent with how the parser in Adblock Plus is implemented, i.e. right after downloading a filter list, the header and metadata are processed, and then the remaining lines are parsed individually. https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.... abp/filters/parser.py:338: metadata_closed = False On 2018/09/04 10:03:30, Vasily Kuznetsov wrote: > What do you think about filter lists like this? > > [Adblock Plus 1.1] > ! Homepage: https://foo.bar/ > ! This is the filter list that foos the bar. > ! Expires: 1 day > ||example.com^ > > It seems that allowing comments inside of the "header" part could be useful for > the filter authors and it looks kind of natural. Maybe we should declare the > metadata closed on the first filter instead of on the first obviously > non-metadata comment? Or maybe (also) on an empty line? In practice nobody seems to put comments in between metadata. But what you see (a lot) in practice are comments right after metadata (with no blank line in between), e.g. EasyList is doing that. So I think the logic I implemented here makes more sense. https://codereview.adblockplus.org/29873561/diff/29873574/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29873561/diff/29873574/tests/test_parser.p... tests/test_parser.py:169: '||exmample.com^', On 2018/09/04 10:03:31, Vasily Kuznetsov wrote: > I guess it doesn't really matter but "exmample.com" looks kind of funny :D Done.
Hi Sebastian! I'm not too happy about the complicated behavior where we close metadata on a first comment that doesn't look like metadata or some kind of malformed metadata. It seems like it would be better to either allow comments in between metadata (and close it with the first empty string or filter) or to consider malformed metadata comments to also be metadata comments and close metadata on the first non-metadata comment. Both of these behaviors are simpler and we could them implement them in a more elegant way. However, if you think that those two approaches don't work and we need to do this thing with malformed metadata comments being parsed as comments but considered metadata for the purposes of closing metadata, then I think the current implementation is good. We should probably document that parse_line() doesn't know about metadata comments but otherwise the code looks good. Cheers, Vasily https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.... abp/filters/parser.py:315: return _parse_line(line_text, True)[0] On 2018/09/04 16:01:03, Sebastian Noack wrote: > On 2018/09/04 10:03:30, Vasily Kuznetsov wrote: > > Here we're changing the behavior of `parse_line` -- it will never parse > comments > > as metadata now. This could create problems for people who use `parse_line` > > directly (and it's part of the external API). I also think it would be good to > > still be able to parse metadata comments using `parse_line`. > > > > I would propose that we have `metadata_closed` parameter in this function with > a > > default of `False`. This way the behavior will be the same as before for > people > > who use it directly (and they can choose if they want to disable metadata > > comment parsing as they like). We would need to move the logic of determining > > when to consider metadata closed into `parse_filterlist` (now it's in > > `_parse_comment`) but arguably that's where it belongs anyway. > > I moved the logic as requested to parse_filterlist(). But IMO, this makes it > even more clear why parse_line() should never return a Metadata() object. > Metadata (and headers but there is a separate issue for them) only have any > meaning in the context of the whole filter list (and only if they occur in a > particular position). This is consistent with how the parser in Adblock Plus is > implemented, i.e. right after downloading a filter list, the header and metadata > are processed, and then the remaining lines are parsed individually. What I had in mind was to keep _parse_comment() with metadata_closed parameter and to also have it in parse_line() -- this way the caller could control what behavior they want. Then the logic in parse_filterlist() would be high level, something like parse_line(..., metadata_closed=False) until we want to stop accepting metadata and after that parse_line(..., metadata_close=True). Unfortunately my plan didn't take into account malformed keys like "Last modified". The logic we have for them is rather special, we consider them regular comments but we still don't close metadata. I suppose we could consider them regular comments in the middle of metadata if we allow that -- you can judge better whether this would work in practice or not. https://codereview.adblockplus.org/29873561/diff/29873574/abp/filters/parser.... abp/filters/parser.py:338: metadata_closed = False On 2018/09/04 16:01:03, Sebastian Noack wrote: > On 2018/09/04 10:03:30, Vasily Kuznetsov wrote: > > What do you think about filter lists like this? > > > > [Adblock Plus 1.1] > > ! Homepage: https://foo.bar/ > > ! This is the filter list that foos the bar. > > ! Expires: 1 day > > ||example.com^ > > > > It seems that allowing comments inside of the "header" part could be useful > for > > the filter authors and it looks kind of natural. Maybe we should declare the > > metadata closed on the first filter instead of on the first obviously > > non-metadata comment? Or maybe (also) on an empty line? > > In practice nobody seems to put comments in between metadata. But what you see > (a lot) in practice are comments right after metadata (with no blank line in > between), e.g. EasyList is doing that. So I think the logic I implemented here > makes more sense. Well, "Last modified" is considered a comment and it is in the middle of metadata, so this does happen. Maybe we should just consider such comments still metadata and get more consistent behavior?
On 2018/09/04 16:49:51, Vasily Kuznetsov wrote: > I'm not too happy about the complicated behavior where we close metadata on a > first comment that doesn't look like metadata or some kind of malformed > metadata. It seems like it would be better to either allow comments in between > metadata (and close it with the first empty string or filter) or to consider > malformed metadata comments to also be metadata comments and close metadata on > the first non-metadata comment. Both of these behaviors are simpler and we could > them implement them in a more elegant way. How about this: We allow comments along metadata, unless the comment is empty. So we would consider the metadata closed, when we either see a filter, empty line, or a line just consisting of "!" (ignoring whitespaces). I implemented this now, and this seems to address all real world scenarios, while removing some complexity. > We should probably document that parse_line() > doesn't know about metadata comments but otherwise the code looks good. Done.
Cool, I like this new version! What do you think about includes (see comment below)? https://codereview.adblockplus.org/29873561/diff/29874645/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29873561/diff/29874645/abp/filters/parser.... abp/filters/parser.py:338: elif isinstance(result, (EmptyLine, Filter)): I just thought that since this could also be used to parse fragments of filter lists, we should probably also close metadata on an include. Does this make sense?
https://codereview.adblockplus.org/29873561/diff/29874645/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29873561/diff/29874645/abp/filters/parser.... abp/filters/parser.py:338: elif isinstance(result, (EmptyLine, Filter)): On 2018/09/04 19:50:56, Vasily Kuznetsov wrote: > I just thought that since this could also be used to parse fragments of filter > lists, we should probably also close metadata on an include. Does this make > sense? Done.
Looks pretty good to me. Just one minor suggestion for the test.
https://codereview.adblockplus.org/29873561/diff/29874658/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29873561/diff/29874658/tests/test_parser.p... tests/test_parser.py:167: result = parse_filterlist(['[Adblock Plus 1.1]', It might be good to include the 'Last modified' comment here, to make sure this doesn't break in the future.
https://codereview.adblockplus.org/29873561/diff/29874658/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29873561/diff/29874658/tests/test_parser.p... tests/test_parser.py:167: result = parse_filterlist(['[Adblock Plus 1.1]', On 2018/09/05 02:50:06, rhowell wrote: > It might be good to include the 'Last modified' comment here, to make sure this > doesn't break in the future. Done.
LGTM
LGTM
LGTM |