Issue 29465715: Fixes 4969 - Add parsing of filters (Closed)

Created:
June 14, 2017, 5:32 p.m. by Vasily Kuznetsov

Modified:
Oct. 24, 2017, 4:17 p.m.

Reviewers:
mathias

Visibility:
Public.

Description

Fixes 4969 - Add parsing of filters
repository: https://hg.adblockplus.org/python-abp
base revision: f41b6ebd236a

Patch Set 2 : remove all interpretation and keep only parsing, add support for element hiding emulation filters, … #

Total comments: 17

Patch Set 3 : Address review comments on patch set 2 #

Total comments: 6

Patch Set 4 : Address review comments on patch set 3 #

Total comments: 1

Patch Set 5 : Rebase to 1f5d7ead9bff #

Created: Oct. 24, 2017, 3:58 p.m.

Stats (+231 lines, -8 lines)
M	abp/filters/parser.py	4 chunks	+148 lines, -3 lines	0 comments
M	tests/test_parser.py	1 chunk	+83 lines, -5 lines	0 comments

Messages

Total messages: 17

Vasily Kuznetsov

In the end the implementation changed quite a bit from what I've originally programmed one ...

June 14, 2017, 5:44 p.m. (2017-06-14 17:44:00 UTC) #1

Vasily Kuznetsov

Hi Matze! The conversation in the ticket seems to be resolved enough to move forward ...

July 26, 2017, 2:52 p.m. (2017-07-26 14:52:26 UTC) #2

mathias

https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.py#newcode25 abp/filters/parser.py:25: """Internal exception used by the parser to signal invalid ...

July 26, 2017, 8:37 p.m. (2017-07-26 20:37:16 UTC) #3

Vasily Kuznetsov

July 27, 2017, 11:05 a.m. (2017-07-27 11:05:03 UTC) #4

Hi Matze,

Thanks for the comments, I agree with every one of them.

Will make a new patch addressing these comments and the parsing vs.
interpretation distinction that we discussed yesterday.

Cheers,
Vasily

https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser....
abp/filters/parser.py:102: raise ParseError('Malformed header')
On 2017/07/26 20:37:15, mathias wrote:
> Please explain why you don't include the malformed text in the error message
> here (and the similar functions below). I was under the impression that such
> kind of information is highly appreciated because Python exception trace
> output does not include invocation parameters, hence reproduction without it
> is often quite tricky.

My reasoning was that you can get to this place in two ways:
1. by calling `parse_line(stuff)` directly: in this case the caller should
expect to get the exception (it's documented in the docstring of `parse_line`)
and handle it more or less in the same scope, where they know what was passed to
`parse_line`.
2. by calling `parse_filterlist(list_of_stuff)`: in this case the exception gets
caught inside of `parse_filterlist` and the result is an `InvalidLine` object,
that contains the original line.

Looking at it now, it seems to me that the assumption that the user will handle
the exception before they lose track of what was passed to `parse_line`, that I
made in considering case 1, is too optimistic. I will change the exception to be
more user-friendly. This should also resolve your concern about the disappearing
`__init__` in line 25.
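
For illustration, a minimal sketch of what such a more user-friendly exception could look like. The constructor signature, the `parse_header` helper and the header regexp below are assumptions made for the example, not the actual python-abp code:

```python
import re


class ParseError(Exception):
    """Raised on invalid input; keeps the offending text for diagnostics."""

    def __init__(self, error, text):
        # Put the text into the message so it shows up in tracebacks.
        super().__init__('{} in "{}"'.format(error, text))
        self.error = error
        self.text = text


# Hypothetical header pattern, used for illustration only.
HEADER_REGEXP = re.compile(r'^\[(Adblock(?:\s*Plus\s*[\d.]+)?)\]$', re.I)


def parse_header(text):
    """Return the header version string or raise ParseError with the text."""
    match = HEADER_REGEXP.match(text)
    if match is None:
        raise ParseError('Malformed header', text)
    return match.group(1)
```

With this shape, a caller that has lost track of the input can still recover it from the `text` attribute of the exception.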

https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser....
abp/filters/parser.py:203: action = 'allow'
On 2017/07/26 20:37:15, mathias wrote:
> I think we should have symbols like BFILTER_ACTION_ALLOW and
> BFILTER_ACTION_BLOCK for actions associated with blocking filters, analogous
> to HFILTER_ACTION_HIDE and HFILTER_ACTION_SHOW for element hiding filters
> above.

Probably not these exact names for the constants, but in general I agree,
constants are better than magic strings.

https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser....
abp/filters/parser.py:214: selector = {'type': 'url-regexp', 'value': selector[1:-1]}
On 2017/07/26 20:37:15, mathias wrote:
> I also think we should have symbols like SELECTOR_TYPE_REGEXP and
> SELECTOR_TYPE_PATTERN. And be it just to have a place to document them like #
> http://link/to/explanation or an actual definition or something, or for IDE's
> to recognize them symbols, or plain old helping humans with the association by
> creating an official item.

Acknowledged.

https://codereview.adblockplus.org/29465715/diff/29465716/abp/filters/parser....
abp/filters/parser.py:227: match = HFILTER_REGEXP.match(text) if '#' in text else False
On 2017/07/26 20:37:15, mathias wrote:
> Call me old-fashioned but I seriously dislike changing the type of a variable
> (especially when the former and new value types quack quite differently). You
> could just use None instead of False here, so it'll be match or no match, not
> match or untruth.

Completely agree about changing the type of the variable. Here, as far as we're
concerned (the variable will only be used in the following `if`, if it's not a
proper match), the quacking is the same, but the following code could
potentially change, leading to subtle bugs. Thanks for catching this, I will fix
it.
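
A small sketch of the fix, using the regexp quoted elsewhere in this review; the wrapper function `match_hiding_filter` is a hypothetical name introduced only for this example:

```python
import re

HFILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')


def match_hiding_filter(text):
    """Return a match object for element hiding filters, or None."""
    # The cheap '#' check avoids running the regexp on most blocking filters.
    # Using None keeps the result "match or no match", never a boolean.
    return HFILTER_REGEXP.match(text) if '#' in text else None
```

Any later code can then test `match is None` without having to care whether a boolean slipped in.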

Vasily Kuznetsov

Oops, looks like I missed these two comments. https://codereview.adblockplus.org/29465715/diff/29465716/setup.py File setup.py (right): https://codereview.adblockplus.org/29465715/diff/29465716/setup.py#newcode46 setup.py:46: version='0.0.2', ...

July 27, 2017, 12:20 p.m. (2017-07-27 12:20:58 UTC) #5

mathias

On 2017/07/27 12:20:58, Vasily Kuznetsov wrote: > https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.py > File tests/test_parser.py (left): > > https://codereview.adblockplus.org/29465715/diff/29465716/tests/test_parser.py#oldcode80 ...

July 27, 2017, 12:36 p.m. (2017-07-27 12:36:12 UTC) #6

mathias

July 27, 2017, 12:36 p.m. (2017-07-27 12:36:13 UTC) #7

Vasily Kuznetsov

Hi Matze, Here's a more minimalist changeset following your advice. The delta for `parser.py` might ...

July 27, 2017, 7:23 p.m. (2017-07-27 19:23:46 UTC) #8

mathias

I like the simplified patch-set a lot! Just missing some more """info""" in the source. ...

July 28, 2017, 4:43 p.m. (2017-07-28 16:43:30 UTC) #9

Vasily Kuznetsov

July 28, 2017, 5:38 p.m. (2017-07-28 17:38:10 UTC) #10

Thanks for the comments again, Matze. I just replied to some of your points for
now, will upload the updated patch later.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:38: class ST:
On 2017/07/28 16:43:29, mathias wrote:
> Why abbreviating here (ST) and below (FA)?

To be completely honest, the reason is kind of stupid: to make flake8 happy :/

This class was called SELECTOR_TYPE originally and the following one
FILTER_ACTION. So then you'd end up with constants like SELECTOR_TYPE.CSS and
FILTER_ACTION.ALLOW. And then if you (as a user of the library) like shortness,
you can import them `from abp.filters import SELECTOR_TYPE as ST` and get your
short ST.CSS. But flake8 doesn't like class names that are not CamelCased. I
thought about rolling a small special class to hold constants, kind of like:

SELECTOR_TYPE = Namespace(
    URL_PATTERN='url-pattern',
    URL_REGEXP='url-regexp',
    ...
)

But then you need to implement this `Namespace` class (or it can be a function),
which is annoying and inelegant. Another option is to just disable N801 in
flake8, that could probably also do it.
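
For reference, a rough sketch of how the `Namespace` helper mentioned above could be written as a function on top of `collections.namedtuple`. This is only one possible approach, not what the patch does, and only the two constants spelled out above are included:

```python
from collections import namedtuple


def Namespace(**constants):
    """Build an immutable holder whose attributes are the given constants."""
    # Sorting makes the field order deterministic on every Python version.
    holder = namedtuple('Namespace', sorted(constants))
    return holder(**constants)


SELECTOR_TYPE = Namespace(
    URL_PATTERN='url-pattern',
    URL_REGEXP='url-regexp',
)
```

The resulting object reads like `SELECTOR_TYPE.URL_PATTERN` and rejects attribute assignment, at the cost of carrying this extra helper around.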

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:86: HFILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')
On 2017/07/28 16:43:29, mathias wrote:
> Why abbreviating? What's wrong about HIDING_FILTER_REGEXP here and
> BLOCKING_FILTER_REGEXP below?

It's shorter this way. But I don't feel very strongly about it, will rename them.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:87: BFILTER_REGEXP_REGEXP = re.compile(
On 2017/07/28 16:43:30, mathias wrote:
> I was wondering about the *_REGEXP_REGEXP name, but then where is this
> actually used anyway?

It's the regular expression for blocking filters which use regular expressions,
so the name is legit. However, you're right that I'm not actually using it. I
just copied it from the ABP source. The funny thing is that it doesn't seem to
be used there either :) (see
https://hg.adblockplus.org/adblockpluscore/file/tip/lib/filterClasses.js)

I will remove it.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:125: if name == 'domain':
On 2017/07/28 16:43:30, mathias wrote:
> Wouldn't it make sense to enumerate recognized OPTION_$NAME symbols as well?

Yeah, probably makes sense to make some kind of enum for these things. It will
be needed anyway later, for interpretation, and might also be useful for the
users of the library.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:126: name, value = 'domains', _parse_options(value, '|')
On 2017/07/28 16:43:30, mathias wrote:
> Why using a different / plural key for the parsed version?

Because semantically it's always a list, so calling it 'domain' is a bit
confusing. I see your point, however, that it might not be very obvious if the
option is named differently from what it's called in the unparsed filter.

Perhaps this could be solved in a mutually beneficial way if the key is
'domain' but the constant is called `OPTION.DOMAINS`. Although this is also
confusing :/
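
To make the trade-off concrete, here is a minimal sketch of parsing a single option while keeping the original 'domain' key and turning its value into a list. The function name `parse_option` and the use of `True` for value-less options are assumptions made for this example, not the code under review:

```python
def parse_option(option):
    """Split one filter option into a (name, value) pair."""
    if '=' in option:
        name, value = option.split('=', 1)
    else:
        # Assumption: options without a value are represented as True.
        name, value = option, True
    if name == 'domain':
        # The value is always a list, even though the key stays singular.
        value = value.split('|')
    return name, value
```

Here `parse_option('domain=a.com|~b.com')` keeps the key the user wrote in the filter, and the list type of the value signals the plural.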

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:183: :returns: filter object.
On 2017/07/28 16:43:29, mathias wrote:
> I think this should be upper-case "Filter".

Yes.

Vasily Kuznetsov

July 28, 2017, 6:57 p.m. (2017-07-28 18:57:49 UTC) #11

Addressed all your comments.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:38: class ST:
On 2017/07/28 17:38:10, Vasily Kuznetsov wrote:
> On 2017/07/28 16:43:29, mathias wrote:
> > Why abbreviating here (ST) and below (FA)?
> 
> To be completely honest, the reason is kind of stupid: to make flake8 happy :/
> 
> This class was called SELECTOR_TYPE originally and the following one
> FILTER_ACTION. So then you'd end up with constants like SELECTOR_TYPE.CSS and
> FILTER_ACTION.ALLOW. And then if you (as a user of the library) like shortness,
> you can import them `from abp.filters import SELECTOR_TYPE as ST` and get your
> short ST.CSS. But flake8 doesn't like class names that are not CamelCased. I
> thought about rolling a small special class to hold constants, kind of like:
> 
> SELECTOR_TYPE = Namespace(
>     URL_PATTERN='url-pattern',
>     URL_REGEXP='url-regexp',
>     ...
> )
> 
> But then you need to implement this `Namespace` class (or it can be a function),
> which is annoying and inelegant. Another option is to just disable N801 in
> flake8, that could probably also do it.

Done.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:86: HFILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')
On 2017/07/28 17:38:10, Vasily Kuznetsov wrote:
> On 2017/07/28 16:43:29, mathias wrote:
> > Why abbreviating? What's wrong about HIDING_FILTER_REGEXP here and
> > BLOCKING_FILTER_REGEXP below?
> 
> It's shorter this way. But I don't feel very strongly about it, will rename them.

Done.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:87: BFILTER_REGEXP_REGEXP = re.compile(
On 2017/07/28 16:43:30, mathias wrote:
> I was wondering about the *_REGEXP_REGEXP name, but then where is this
> actually used anyway?

Done.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:125: if name == 'domain':
On 2017/07/28 16:43:30, mathias wrote:
> Wouldn't it make sense to enumerate recognized OPTION_$NAME symbols as well?

Done.

https://codereview.adblockplus.org/29465715/diff/29499749/abp/filters/parser....
abp/filters/parser.py:126: name, value = 'domains', _parse_options(value, '|')
On 2017/07/28 16:43:30, mathias wrote:
> Why using a different / plural key for the parsed version?

Done.

https://codereview.adblockplus.org/29465715/diff/29500688/tests/test_parser.py
File tests/test_parser.py (right):

https://codereview.adblockplus.org/29465715/diff/29500688/tests/test_parser.p...
tests/test_parser.py:113: def test_parse_bad_option():
Now we can also detect unknown options. Yaaay!
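
Roughly, the check behind this test can be sketched as follows. The option subset, the `check_option_name` helper and the use of `ValueError` are hypothetical stand-ins chosen for the example; the patch itself may name and raise things differently:

```python
# Hypothetical subset of recognized option names, for illustration only.
KNOWN_OPTIONS = frozenset(['domain', 'match-case', 'script', 'third-party'])


def check_option_name(name):
    """Return the bare option name or raise if it is not recognized."""
    # A leading '~' negates an option, so strip it before the lookup.
    bare = name[1:] if name.startswith('~') else name
    if bare not in KNOWN_OPTIONS:
        raise ValueError('Unknown option: {!r}'.format(name))
    return bare
```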

mathias

Almost there I think ;) https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.py#newcode165 abp/filters/parser.py:165: if name not in ...

Aug. 1, 2017, 6:31 a.m. (2017-08-01 06:31:36 UTC) #12

Vasily Kuznetsov

Hi Matze, Thank you for the comments. Agreed and addressed. Cheers, Vasily https://codereview.adblockplus.org/29465715/diff/29500688/abp/filters/parser.py File abp/filters/parser.py ...

Aug. 2, 2017, 4:21 p.m. (2017-08-02 16:21:18 UTC) #13

Vasily Kuznetsov

FYI, I uploaded the commit after rebasing to the current trunk, so I can be ...

Oct. 24, 2017, 4:03 p.m. (2017-10-24 16:03:31 UTC) #15

mathias

On 2017/10/24 16:03:31, Vasily Kuznetsov wrote: > FYI, I uploaded the commit after rebasing to ...

Oct. 24, 2017, 4:12 p.m. (2017-10-24 16:12:18 UTC) #16

LGTM.