Issue 30031558: Issue 7391 - Let rpy recursively parse filter options to dicts

Issue 30031558: Issue 7391 - Let rpy recursively parse filter options to dicts (Closed)

Created:
March 21, 2019, 11:12 p.m. by rhowell

Modified:
April 16, 2019, 1:34 a.m.

Reviewers:
Vasily Kuznetsov, sporz

Base URL:
https://hg.adblockplus.org/python-abp

Visibility:
Public.

Description

Issue 7391 - Let rpy recursively parse filter options to dicts Repository: https://hg.adblockplus.org/python-abp Base revision: 43e9e9e7ecdd

Patch Set 2 : Address comments on Patch Set 1 and bug fixes #

Total comments: 2

Patch Set 3 : Removed OrderedDict #

Patch Set 4 : Remove all OrderedDicts #

Created: April 12, 2019, 6:30 p.m.

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+42 lines, -6 lines)			Patch
	M	abp/filters/rpy.py	View	1 2	3 chunks	+24 lines, -2 lines	0 comments	Download
	M	tests/test_rpy.py	View	1 2 3	3 chunks	+18 lines, -4 lines	0 comments	Download

Messages

Total messages: 9

Expand All Messages | Collapse All Messages

Vasily Kuznetsov

Hi Rosie, The general direction looks good. I have some comments on the details (see ...

March 22, 2019, 12:17 p.m. (2019-03-22 12:17:05 UTC) #2

sporz

Feedback about tuple2dict. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#newcode45 abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) Thank you Vasily ...

March 25, 2019, 5:30 p.m. (2019-03-25 17:30:03 UTC) #3

rhowell

Hi Vasily and Stephan! Thanks for the feedback! Hope this looks a bit better now. ...

March 27, 2019, 7:14 p.m. (2019-03-27 19:14:07 UTC) #4

Hi Vasily and Stephan!

Thanks for the feedback! Hope this looks a bit better now. I've got a few more
comments/questions below.

Hope all is well!  :)

Cheers,
Rosie

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py
File abp/filters/rpy.py (left):

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:65: if isinstance(data, list):
On 2019/03/22 12:17:04, Vasily Kuznetsov wrote:
> Unfortunately we can't remove the list branch here because sitekey option also
> has a list value (and that one probably shouldn't be converted to a dict).

Ah, gotcha. Since we didn't have a test using the sitekey option, these lines
weren't being tested, so tox was failing. I added a test for sitekey, so test
coverage is back at 100%.

But, sitekey options are not turned into a dict because line 44/46 only converts
'domain' options into a nested dict. So that should be ok?

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py
File abp/filters/rpy.py (right):

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:29: def parse_options(options):
On 2019/03/22 12:17:04, Vasily Kuznetsov wrote:
> Maybe we should change this name to something like `option_list_to_dict`?
> Because we're not really doing parsing here.

Done.

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:32: Parameters
On 2019/03/22 12:17:04, Vasily Kuznetsov wrote:
> Nit: The indentation of this line seems to be off.

Done.

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:34: options: A list of tuples or namedtuples
On 2019/03/22 12:17:04, Vasily Kuznetsov wrote:
> We're treating them as just tuples so we probably don't need to mention
> nametuples at all. Or do we?

Done.

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain'))
On 2019/03/25 17:30:03, sporz wrote:
> Interesting choice to put this as a function on it's own. Can you elaborate?
> 
> I'd have incorporated it into the tuple2dict function (using a check for list
of
> tuples as value instead of (domain, options) as key as deciding factor for
> recursion.

I ran into a few issues trying to call tuple2dict recursively.

1) A list of tuples doesn't have the function `_asdict` (it seems only
namedtuples have this function built in):
    *** AttributeError: 'list' object has no attribute '_asdict'

2) If we try to convert the list of tuples into a namedtuple, we run into issues
with character encoding, since some of the keys are unicode:
    *** ValueError: Type names and field names can only contain alphanumeric
characters and underscores: "(u'a', True)"

3) If we try to get really hacky and switch from Unicode to Str, we might lose
some unicode characters, and also tuples are immutable:
    TypeError: "'tuple' object does not support item assignment"

It was just getting really hacky and messy, so I figured it would be better to
make a helper function.

But I'd like to clarify - do we want only the 'domains' in the 'options' to be
sub-dicts? Looks like the only other 'option' which might contain a sub-list is
sitekey, and it seems we want to leave that as a list?

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain'))
On 2019/03/25 17:30:03, sporz wrote:
> Thank you Vasily for pointing this out - I wasn't aware that option order and
> domain order have an impact on execution!
> 
> For our current specific use case this is not an issue. For python-abp as a
> resource to be used by many teams, however, it might be.
> 
> When I suggested this, I had the following idea in mind: How can python-abp
> provide a consistent data structure to be used by end users. As tuple2dict()
was
> used on the main level (resulting in a named list in R), I focused on trying
to
> provide that same data structure in the nested tuples as well. 
> 
> Two ideas:
> 
> * How about we keep tuples, and don't use tuple2dict at all?
> * How about we use an ordered dict instead of normal ones?
>   * Hopefully it's converted correctly into R! (R has ordered named lists)

I was able to implement this using OrderedDict and briefly test it in R, and it
seems to be working - properly parsing the domain options and preserving the
order.

As far as just using tuples, I'm not sure what the desired input/output would
look like there. rpy does have access to the `parse_line` function, maybe
something closer to that? Or what did you have in mind? (Some example
input/output in R would be helpful for me.)

> abp <- import("abp.filters.rpy")
> abp$parse_line("example.com")
Filter(text=u'example.com', selector={u'type': u'url-pattern', u'value':
u'example.com'}, action=u'block', options=[])

https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#...
abp/filters/rpy.py:64: result = {}
On 2019/03/22 12:17:04, Vasily Kuznetsov wrote:
> Maybe we just add `result['options'] = parse_options(result['options'])` after
> old version of this line. What do you think?

Yeah, good idea. We also need to check that 'options' is a key in the result.
Lines like Header, Comment, EmptyLine, and others don't have options, and would
produce a KeyError.

https://codereview.adblockplus.org/30031558/diff/30032561/tests/test_rpy.py
File tests/test_rpy.py (right):

https://codereview.adblockplus.org/30031558/diff/30032561/tests/test_rpy.py#n...
tests/test_rpy.py:100: b'options': {b'domain': {b'bar.foo.com': False,
b'foo.com': True},
I'm thinking about making this test stricter. Even though the domains are in the
wrong order here, the test still passes because comparing a dict with an
OrderedDict will pass, even if they're not in the same order. I should probably
make this an OrderedDict.

Vasily Kuznetsov

Hi Rosie! I'm pretty happy with this implementation and just have a small readability nit. ...

April 2, 2019, 5:36 p.m. (2019-04-02 17:36:22 UTC) #5

rhowell

Hi Vasily! I believe Stephan is using Python 3, but I'm not sure about the ...

April 9, 2019, 12:47 a.m. (2019-04-09 00:47:39 UTC) #6

Vasily Kuznetsov

Hi Rosie! > Should the OrderedDict also be removed from the test? It seems we'd ...

April 9, 2019, 8:10 p.m. (2019-04-09 20:10:30 UTC) #7

Hi Rosie!

> Should the OrderedDict also be removed from the test? It seems we'd still want
> to make sure the options are in order. I tested this (using `dict` in rpy.py
and
> `OrderedDict` in test_rpy.py), expecting it to fail, but it doesn't, and I'm
not
> sure why. Do you know why this test would be passing? Is it just me? Should I
> skip the test in Python 2?
> 
> This is the test I was expecting to fail in Python 2:
> tests/test_rpy.py::test_line2dict_format[filter_with_sitekey_list] PASSED

It's quite likely that when you compare dicts to other dicts (ordered or not)
the order of the keys is ignored, so the test would not care about it.

I have actually realized today that the order of the domains doesn't matter
after all (see this comment for more details:
https://issues.adblockplus.org/ticket/7443#comment:1). However, it still does
matter for content types (because if the first content-type option is inverted,
all default content types are assumed to be in, otherwise only the ones that are
explicitly included; see
https://github.com/adblockplus/adblockpluscore/blob/cb37f3bf799c47b92b4f1814e...
for reference).

Given that this is the case, I would say let's keep the OrderedDicts as long as
we're supporting Python 2. If you would like, feel free to add a check that the
returned values are OrderedDicts and the keys have the right order when the
expected result has OrderedDicts, to test_line2dict_format, but TBH, I wouldn't
bother at this point -- the time would be better spent migrating all of this off
Python 2.

> https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py
> File abp/filters/rpy.py (right):
> 
>
https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py#...
> abp/filters/rpy.py:90: return OrderedDict((strings2utf8(k), strings2utf8(v))
for
> k, v in
> On 2019/04/02 17:36:22, Vasily Kuznetsov wrote:
> > Nit: it would be better to break the string before `for` -- it's more
readable
> > this way.
> 
> Looks like we might not need these lines anyway. :)

yay!

rhowell

On 2019/04/09 20:10:30, Vasily Kuznetsov wrote: > Hi Rosie! > > > Should the OrderedDict ...

April 12, 2019, 6:30 p.m. (2019-04-12 18:30:39 UTC) #8

On 2019/04/09 20:10:30, Vasily Kuznetsov wrote:
> Hi Rosie!
> 
> > Should the OrderedDict also be removed from the test? It seems we'd still
want
> > to make sure the options are in order. I tested this (using `dict` in rpy.py
> and
> > `OrderedDict` in test_rpy.py), expecting it to fail, but it doesn't, and I'm
> not
> > sure why. Do you know why this test would be passing? Is it just me? Should
I
> > skip the test in Python 2?
> > 
> > This is the test I was expecting to fail in Python 2:
> > tests/test_rpy.py::test_line2dict_format[filter_with_sitekey_list] PASSED
> 
> It's quite likely that when you compare dicts to other dicts (ordered or not)
> the order of the keys is ignored, so the test would not care about it.
> 
> I have actually realized today that the order of the domains doesn't matter
> after all (see this comment for more details:
> https://issues.adblockplus.org/ticket/7443#comment:1). However, it still does
> matter for content types (because if the first content-type option is
inverted,
> all default content types are assumed to be in, otherwise only the ones that
are
> explicitly included; see
>
https://github.com/adblockplus/adblockpluscore/blob/cb37f3bf799c47b92b4f1814e...
> for reference).
> 
> Given that this is the case, I would say let's keep the OrderedDicts as long
as
> we're supporting Python 2. If you would like, feel free to add a check that
the
> returned values are OrderedDicts and the keys have the right order when the
> expected result has OrderedDicts, to test_line2dict_format, but TBH, I
wouldn't
> bother at this point -- the time would be better spent migrating all of this
off
> Python 2.
> 
> > https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py
> > File abp/filters/rpy.py (right):
> > 
> >
>
https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py#...
> > abp/filters/rpy.py:90: return OrderedDict((strings2utf8(k), strings2utf8(v))
> for
> > k, v in
> > On 2019/04/02 17:36:22, Vasily Kuznetsov wrote:
> > > Nit: it would be better to break the string before `for` -- it's more
> readable
> > > this way.
> > 
> > Looks like we might not need these lines anyway. :)
> 
> yay!

Hi Vasily!

Yeah, you might be right that we should rather migrate off Python 2. I've gotten
rid of the OrderedDicts, and I can make a new ticket for the migration. For this
ticket, let me know if you see anything else?

Hi Rosie!

LGTM.

Cheers,
Vasily

Expand All Messages | Collapse All Messages