|
|
Created:
June 14, 2017, 5:32 p.m. by Vasily Kuznetsov Modified:
Oct. 24, 2017, 4:17 p.m. Reviewers:
mathias Visibility:
Public. |
DescriptionFixes 4969 - Add parsing of filters
repository: https://hg.adblockplus.org/python-abp
base revision: f41b6ebd236a
Patch Set 1 #
Total comments: 14
Patch Set 2 : remove all interpretation and keep only parsing, add support for element hiding emulation filters, … #
Total comments: 17
Patch Set 3 : Address review comments on patch set 2 #
Total comments: 6
Patch Set 4 : Address review comments on patch set 3 #
Total comments: 1
Patch Set 5 : Rebase to 1f5d7ead9bff #MessagesTotal messages: 17
In the end the implementation changed quite a bit from what I've originally programmed one year ago. The current API came out of discussions with Kirill and Thomas and has been tested on real data somewhat. Perhaps it was not such a bad idea indeed to not implement filter parsing back then ;) The documentation is coming in another review since it's covered by a separate issue (#4970).
Hi Matze! The conversation in the ticket seems to be resolved enough to move forward with the review, as far as I'm concerned. You are one of the main stakeholders of python-abp and this ticket in particular and you can review Python, so I would appreciate your feedback on this change. Cheers, Vasily
https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:25: """Internal exception used by the parser to signal invalid input.""" Removing the custom __init__ function looks like an unrelated change to me. Don't you think it should be part of a separate patch-set? https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:34: :param format_string: A format specifier for converting this line type Fixing the missing format_string parameter documentation looks like an unrelated change as well. So what about this one? https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:102: raise ParseError('Malformed header') Please explain why you don't include the malformed text in the error message here (and the similar functions below). I was under the impression that such kind of information is highly appreciated because Python exception trace output does not include invocation parameters, hence reproduction without can is often quite tricky. https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:203: action = 'allow' I think we should have symbols like BFILTER_ACTION_ALLOW and BFILTER_ACTION_BLOCK for actions asslociated with blocking filters, analogous to HFILTER_ACTION_HIDE and HFILTER_ACTION_SHOW for element hiding filters above. https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:214: selector = {'type': 'url-regexp', 'value': selector[1:-1]} I also think we should have symbols like SELECTOR_TYPE_REGEXP and SELECTOR_TYPE_PATTERN. And be it just to have a place to document them like # http://link/to/explanation or an actual definition or something, or for IDE's to recognize them symbols, or plain old helping humans with the association by creating an official item. https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:227: match = HFILTER_REGEXP.match(text) if '#' in text else False Call me old-fashioned but I seriously dislike changing the type of a variable (especially when the former and new value types quack quite differently). You could just use None instead of False here, so it'll be match or no match, not match or untruth. https://codereview.adblockplus.org/29465715/diff/29465716/setup.py File setup.py (right): https://codereview.adblockplus.org/29465715/diff/29465716/setup.py#newcode46 setup.py:46: version='0.0.2', I wonder how the patch-version upgrade is already considered in this patch-set (one might consider this something unknown until the last minute before the commit/push), but the README / documentation update is a separate review? https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.py File tests/test_parser.py (left): https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.p... tests/test_parser.py:80: def test_exception_timing(): This test vanishing seems unrelated as well.
Hi Matze, Thanks for the comments, I agree with every one of them. Will make a new patch addressing these comments and the parsing vs. interpretation distinction that we've discussed yesterday. Cheers, Vasily https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:102: raise ParseError('Malformed header') On 2017/07/26 20:37:15, mathias wrote: > Please explain why you don't include the malformed text in the error message > here (and the similar functions below). I was under the impression that such > kind of information is highly appreciated because Python exception trace output > does not include invocation parameters, hence reproduction without can is often > quite tricky. My reasoning was that you can get to this place in two ways: 1. by calling `parse_line(stuff)` directly: in this case the caller should expect to get the exception (it's documented in the docstring of `parse_line`) and handle it more or less in the same scope, where they know what was passed to `parse_line`. 2. by calling `parse_filterlist(list_of_stuff)`: in this case the exception gets caught inside of `parse_filterlist` and the result is an `InvalidLine` object, that contains the original line. Looking at it now, it seems to me that the assumption that the user will handle the exception before they lose track of what was passed to `parse_line`, that I made in considering case 1, is too optimistic. I will change the exception to be more user-friendly. This should also resolve your concern about the disappearing `__init__` in line 25. https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:203: action = 'allow' On 2017/07/26 20:37:15, mathias wrote: > I think we should have symbols like BFILTER_ACTION_ALLOW and > BFILTER_ACTION_BLOCK for actions asslociated with blocking filters, analogous to > HFILTER_ACTION_HIDE and HFILTER_ACTION_SHOW for element hiding filters above. Probably not these exact names for the constants, but in general I agree, constants are better than magic strings. https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:214: selector = {'type': 'url-regexp', 'value': selector[1:-1]} On 2017/07/26 20:37:15, mathias wrote: > I also think we should have symbols like SELECTOR_TYPE_REGEXP and > SELECTOR_TYPE_PATTERN. And be it just to have a place to document them like # > http://link/to/explanation or an actual definition or something, or for IDE's to > recognize them symbols, or plain old helping humans with the association by > creating an official item. Acknowledged. https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.... abp/filters/parser.py:227: match = HFILTER_REGEXP.match(text) if '#' in text else False On 2017/07/26 20:37:15, mathias wrote: > Call me old-fashioned but I seriously dislike changing the type of a variable > (especially when the former and new value types quack quite differently). You > could just use None instead of False here, so it'll be match or no match, not > match or untruth. Completely agree about changing the type of the variable. Here, as far as we're concerned (the variable will only be used in the following `if`, if it's not a proper match), the quacking is the same, but the following code could potentially change, leading to subtle bugs. Thanks for catching this, I will fix it.
Oops, looks like I missed these two comments. https://codereview.adblockplus.org/29465715/diff/29465716/setup.py File setup.py (right): https://codereview.adblockplus.org/29465715/diff/29465716/setup.py#newcode46 setup.py:46: version='0.0.2', On 2017/07/26 20:37:15, mathias wrote: > I wonder how the patch-version upgrade is already considered in this patch-set > (one might consider this something unknown until the last minute before the > commit/push), but the README / documentation update is a separate review? Good point. Actually, this version is not even right (at least if we assume SemVer), it should be 0.1.0. https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.py File tests/test_parser.py (left): https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.p... tests/test_parser.py:80: def test_exception_timing(): On 2017/07/26 20:37:15, mathias wrote: > This test vanishing seems unrelated as well. `parse_filterlist` doesn't throw exceptions anymore, it now returns invalid lines, so this test makes no sense. You could probably say that this change is also unrelated, and it could be a separate review.
On 2017/07/27 12:20:58, Vasily Kuznetsov wrote: > https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.py > File tests/test_parser.py (left): > > https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.p... > tests/test_parser.py:80: def test_exception_timing(): > On 2017/07/26 20:37:15, mathias wrote: > > This test vanishing seems unrelated as well. > > `parse_filterlist` doesn't throw exceptions anymore, it now returns invalid > lines, so this test makes no sense. > > You could probably say that this change is also unrelated, and it could be a > separate review. Well that would be enough explanation for me, no need to separate that one.
On 2017/07/27 12:20:58, Vasily Kuznetsov wrote: > https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.py > File tests/test_parser.py (left): > > https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.p... > tests/test_parser.py:80: def test_exception_timing(): > On 2017/07/26 20:37:15, mathias wrote: > > This test vanishing seems unrelated as well. > > `parse_filterlist` doesn't throw exceptions anymore, it now returns invalid > lines, so this test makes no sense. > > You could probably say that this change is also unrelated, and it could be a > separate review. Well that would be enough explanation for me, no need to separate that one.
Hi Matze, Here's a more minimalist changeset following your advice. The delta for `parser.py` might look a bit confusing since the new changeset is based on line_type rename which just landed and the old one doesn't have those changes. Perhaps you can look at the side-by-side diffs (from the original) instead, they look cleaner. Cheers, Vasily
I like the simplified patch-set a lot! Just missing some more """info""" in the source. Or maybe include the README updates here as well? https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:38: class ST: Why abbreviating here (ST) and below (FA)? https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:86: HFILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$') Why abbreviating? What's wrong about HIDING_FILTER_REGEXP here and BLOCKING_FILTER_REGEXP below? https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:87: BFILTER_REGEXP_REGEXP = re.compile( I was wondering about the *_REGEXP_REGEXP name, but then where is this actually used anyway? https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:125: if name == 'domain': Wouldn't it make sense to enumerate recognized OPTION_$NAME symbols as well? https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:126: name, value = 'domains', _parse_options(value, '|') Why using a different / plural key for the parsed version? https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:183: :returns: filter object. I think this should be upper-case "Filter".
Thanks for the comments again, Matze. I just replied to some of your points for now, will upload the updated patch later. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:38: class ST: On 2017/07/28 16:43:29, mathias wrote: > Why abbreviating here (ST) and below (FA)? To be completely honest, the reason is kind of stupid: to make flake8 happy :/ This class was called SELECTOR_TYPE originally and the following one FILTER_ACTION. So then you'd end up with constants like SELECTOR_TYPE.CSS and FILTER_ACTION.ALLOW. And then if you (as a user of the library) like shortness, you can import them `from abp.filters import SELECTOR_TYPE as ST` and get your short ST.CSS. But flake8 doesn't like class names that are not CamelCased. I thought about rolling a small special class to hold constants, kind of like: SELECTOR_TYPE = Namespace( URL_PATTERN='url-pattern', URL_REGEXP='url-regexp', ... ) But then you need to implement this `Namespace` class (or it can be a function), which is annoying and inelegant. Another option is to just disable N801 in flake8, that could probably also do it. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:86: HFILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$') On 2017/07/28 16:43:29, mathias wrote: > Why abbreviating? What's wrong about HIDING_FILTER_REGEXP here and > BLOCKING_FILTER_REGEXP below? It's shorter this way. But I don't feel very strong about it, will rename them. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:87: BFILTER_REGEXP_REGEXP = re.compile( On 2017/07/28 16:43:30, mathias wrote: > I was wondering about the *_REGEXP_REGEXP name, but then where is this actually > used anyway? It's the regular expression for blocking filters which use regular expressions, so the name is legit. However, you're right that I'm not actually using it. Just copied it from ABP source. The funny thing, it seems that there it's not used either :) (see https://hg.adblockplus.org/adblockpluscore/file/tip/lib/filterClasses.js) I will remove it. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:125: if name == 'domain': On 2017/07/28 16:43:30, mathias wrote: > Wouldn't it make sense to enumerate recognized OPTION_$NAME symbols as well? Yeah, probably makes sense to make some kind of enum for these things. It will be needed anyway later, for interpretation, and might also be useful for the users of the library. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:126: name, value = 'domains', _parse_options(value, '|') On 2017/07/28 16:43:30, mathias wrote: > Why using a different / plural key for the parsed version? Because semantically it's a list, always, so calling it 'domain' is a bit confusing. I see your point, however, that it might be not very obvious if the option is named differently than what it's called in the unparsed filter. Perhaps this could be solved in a mutually beneficial way if the key will be 'domain' but the constant will be called `OPTION.DOMAINS`. Although this is also confusing :/ https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:183: :returns: filter object. On 2017/07/28 16:43:29, mathias wrote: > I think this should be upper-case "Filter". Yes.
Addressed all your comments. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:38: class ST: On 2017/07/28 17:38:10, Vasily Kuznetsov wrote: > On 2017/07/28 16:43:29, mathias wrote: > > Why abbreviating here (ST) and below (FA)? > > To be completely honest, the reason is kind of stupid: to make flake8 happy :/ > > This class was called SELECTOR_TYPE originally and the following one > FILTER_ACTION. So then you'd end up with constants like SELECTOR_TYPE.CSS and > FILTER_ACTION.ALLOW. And then if you (as a user of the library) like shortness, > you can import them `from abp.filters import SELECTOR_TYPE as ST` and get your > short ST.CSS. But flake8 doesn't like class names that are not CamelCased. I > thought about rolling a small special class to hold constants, kind of like: > > SELECTOR_TYPE = Namespace( > URL_PATTERN='url-pattern', > URL_REGEXP='url-regexp', > ... > ) > > But then you need to implement this `Namespace` class (or it can be a function), > which is annoying and inelegant. Another option is to just disable N801 in > flake8, that could probably also do it. Done. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:86: HFILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$') On 2017/07/28 17:38:10, Vasily Kuznetsov wrote: > On 2017/07/28 16:43:29, mathias wrote: > > Why abbreviating? What's wrong about HIDING_FILTER_REGEXP here and > > BLOCKING_FILTER_REGEXP below? > > It's shorter this way. But I don't feel very strong about it, will rename them. Done. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:87: BFILTER_REGEXP_REGEXP = re.compile( On 2017/07/28 16:43:30, mathias wrote: > I was wondering about the *_REGEXP_REGEXP name, but then where is this actually > used anyway? Done. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:125: if name == 'domain': On 2017/07/28 16:43:30, mathias wrote: > Wouldn't it make sense to enumerate recognized OPTION_$NAME symbols as well? Done. https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.... abp/filters/parser.py:126: name, value = 'domains', _parse_options(value, '|') On 2017/07/28 16:43:30, mathias wrote: > Why using a different / plural key for the parsed version? Done. https://codereview.adblockplus.org/29465715/diff/29500688/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29465715/diff/29500688/tests/test_parser.p... tests/test_parser.py:113: def test_parse_bad_option(): Now we can also detect unknown options. Yaaay!
Almost there I think ;) https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.... abp/filters/parser.py:165: if name not in ALL_OPTIONS: I don't think this part of the code should validate whether an option is recognized, at least not unconditionally, for the sake of compatibility with future versions but also because parsing and validating is not necessarily the same step. https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.... abp/filters/parser.py:177: def _parse_filter_options(options, separator=','): Why is the separator a parameter? The only place where this private function is invoked is _parse_blocking_filter, and that one does not bother passing a custom separator.
Hi Matze, Thank you for the comments. Agreed and addressed. Cheers, Vasily https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.... abp/filters/parser.py:165: if name not in ALL_OPTIONS: On 2017/08/01 06:31:35, mathias wrote: > I don't think this part of the code should validate whether an option is > recognized, at least not unconditionally, for the sake of compatibility with > future versions but also because parsing and validating is not necessarily the > same step. Following our conversation, I agree. Done https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.... abp/filters/parser.py:177: def _parse_filter_options(options, separator=','): On 2017/08/01 06:31:35, mathias wrote: > Why is the separator a parameter? The only place where this private function is > invoked is _parse_blocking_filter, and that one does not bother passing a custom > separator. This is left-over from an earlier version that used this parameter. Now removed. https://codereview.adblockplus.org/29465715/diff/29500688/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29465715/diff/29500688/tests/test_parser.p... tests/test_parser.py:113: def test_parse_bad_option(): On 2017/07/28 18:57:49, Vasily Kuznetsov wrote: > Now we can also detect unknown options. Yaaay! I remove this test since we've agreed to remove the related functionality. https://codereview.adblockplus.org/29465715/diff/29503616/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29503616/abp/filters/parser.... abp/filters/parser.py:21: __all__ = [ I updated my formatting rules to include "prefer one item per line notation for lists, sets and dictionary literals that don't fit on one line" after our conversation. This is updated accordingly.
LGTM.
FYI, I uploaded the commit after rebasing to the current trunk, so I can be sure land what's in the review. There are no visible changes introduced by the rebase so I assume LGTM holds.
On 2017/10/24 16:03:31, Vasily Kuznetsov wrote: > FYI, I uploaded the commit after rebasing to the current trunk, so I can be sure > land what's in the review. There are no visible changes introduced by the rebase > so I assume LGTM holds. Yeah I figured something along these lines. Works for me.
LGTM. |