|
|
Created:
March 21, 2019, 11:12 p.m. by rhowell Modified:
April 16, 2019, 1:34 a.m. Base URL:
https://hg.adblockplus.org/python-abp Visibility:
Public. |
DescriptionIssue 7391 - Let rpy recursively parse filter options to dicts
Repository: https://hg.adblockplus.org/python-abp
Base revision: 43e9e9e7ecdd
Patch Set 1 #
Total comments: 17
Patch Set 2 : Address comments on Patch Set 1 and bug fixes #
Total comments: 2
Patch Set 3 : Removed OrderedDict #Patch Set 4 : Remove all OrderedDicts #MessagesTotal messages: 9
Hi Rosie, The general direction looks good. I have some comments on the details (see below). Cheers, Vasily https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (left): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:65: if isinstance(data, list): Unfortunately we can't remove the list branch here because sitekey option also has a list value (and that one probably shouldn't be converted to a dict). https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:29: def parse_options(options): Maybe we should change this name to something like `option_list_to_dict`? Because we're not really doing parsing here. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:32: Parameters Nit: The indentation of this line seems to be off. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:34: options: A list of tuples or namedtuples We're treating them as just tuples so we probably don't need to mention nametuples at all. Or do we? https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) This conversion will lose the order of the domain options, which has significance to how the filter is matched. Also the order of content type related options is significant. We need to check with Stephan whether losing this information here is ok. If not, we can probably still represent it somehow in the dict form because the only important part is whether the first domain is True or False (same for the content type options). @sporz: please comment on this. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) Nit: `result.get('domain')` could be just `result['domain']` https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:64: result = {} Maybe we just add `result['options'] = parse_options(result['options'])` after old version of this line. What do you think?
Feedback about tuple2dict. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) Thank you Vasily for pointing this out - I wasn't aware that option order and domain order have an impact on execution! For our current specific use case this is not an issue. For python-abp as a resource to be used by many teams, however, it might be. When I suggested this, I had the following idea in mind: How can python-abp provide a consistent data structure to be used by end users. As tuple2dict() was used on the main level (resulting in a named list in R), I focused on trying to provide that same data structure in the nested tuples as well. Two ideas: * How about we keep tuples, and don't use tuple2dict at all? * How about we use an ordered dict instead of normal ones? * Hopefully it's converted correctly into R! (R has ordered named lists) https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) Interesting choice to put this as a function on it's own. Can you elaborate? I'd have incorporated it into the tuple2dict function (using a check for list of tuples as value instead of (domain, options) as key as deciding factor for recursion.
Hi Vasily and Stephan! Thanks for the feedback! Hope this looks a bit better now. I've got a few more comments/questions below. Hope all is well! :) Cheers, Rosie https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (left): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:65: if isinstance(data, list): On 2019/03/22 12:17:04, Vasily Kuznetsov wrote: > Unfortunately we can't remove the list branch here because sitekey option also > has a list value (and that one probably shouldn't be converted to a dict). Ah, gotcha. Since we didn't have a test using the sitekey option, these lines weren't being tested, so tox was failing. I added a test for sitekey, so test coverage is back at 100%. But, sitekey options are not turned into a dict because line 44/46 only converts 'domain' options into a nested dict. So that should be ok? https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:29: def parse_options(options): On 2019/03/22 12:17:04, Vasily Kuznetsov wrote: > Maybe we should change this name to something like `option_list_to_dict`? > Because we're not really doing parsing here. Done. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:32: Parameters On 2019/03/22 12:17:04, Vasily Kuznetsov wrote: > Nit: The indentation of this line seems to be off. Done. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:34: options: A list of tuples or namedtuples On 2019/03/22 12:17:04, Vasily Kuznetsov wrote: > We're treating them as just tuples so we probably don't need to mention > nametuples at all. Or do we? Done. https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) On 2019/03/25 17:30:03, sporz wrote: > Interesting choice to put this as a function on it's own. Can you elaborate? > > I'd have incorporated it into the tuple2dict function (using a check for list of > tuples as value instead of (domain, options) as key as deciding factor for > recursion. I ran into a few issues trying to call tuple2dict recursively. 1) A list of tuples doesn't have the function `_asdict` (it seems only namedtuples have this function built in): *** AttributeError: 'list' object has no attribute '_asdict' 2) If we try to convert the list of tuples into a namedtuple, we run into issues with character encoding, since some of the keys are unicode: *** ValueError: Type names and field names can only contain alphanumeric characters and underscores: "(u'a', True)" 3) If we try to get really hacky and switch from Unicode to Str, we might lose some unicode characters, and also tuples are immutable: TypeError: "'tuple' object does not support item assignment" It was just getting really hacky and messy, so I figured it would be better to make a helper function. But I'd like to clarify - do we want only the 'domains' in the 'options' to be sub-dicts? Looks like the only other 'option' which might contain a sub-list is sitekey, and it seems we want to leave that as a list? https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:45: result['domain'] = parse_options(result.get('domain')) On 2019/03/25 17:30:03, sporz wrote: > Thank you Vasily for pointing this out - I wasn't aware that option order and > domain order have an impact on execution! > > For our current specific use case this is not an issue. For python-abp as a > resource to be used by many teams, however, it might be. > > When I suggested this, I had the following idea in mind: How can python-abp > provide a consistent data structure to be used by end users. As tuple2dict() was > used on the main level (resulting in a named list in R), I focused on trying to > provide that same data structure in the nested tuples as well. > > Two ideas: > > * How about we keep tuples, and don't use tuple2dict at all? > * How about we use an ordered dict instead of normal ones? > * Hopefully it's converted correctly into R! (R has ordered named lists) I was able to implement this using OrderedDict and briefly test it in R, and it seems to be working - properly parsing the domain options and preserving the order. As far as just using tuples, I'm not sure what the desired input/output would look like there. rpy does have access to the `parse_line` function, maybe something closer to that? Or what did you have in mind? (Some example input/output in R would be helpful for me.) > abp <- import("abp.filters.rpy") > abp$parse_line("example.com") Filter(text=u'example.com', selector={u'type': u'url-pattern', u'value': u'example.com'}, action=u'block', options=[]) https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:64: result = {} On 2019/03/22 12:17:04, Vasily Kuznetsov wrote: > Maybe we just add `result['options'] = parse_options(result['options'])` after > old version of this line. What do you think? Yeah, good idea. We also need to check that 'options' is a key in the result. Lines like Header, Comment, EmptyLine, and others don't have options, and would produce a KeyError. https://codereview.adblockplus.org/30031558/diff/30032561/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30032561/tests/test_rpy.py#n... tests/test_rpy.py:100: b'options': {b'domain': {b'bar.foo.com': False, b'foo.com': True}, I'm thinking about making this test stricter. Even though the domains are in the wrong order here, the test still passes because comparing a dict with an OrderedDict will pass, even if they're not in the same order. I should probably make this an OrderedDict.
Hi Rosie! I'm pretty happy with this implementation and just have a small readability nit. Stephan, what do you think? BTW, what version of Python are you guys using with the R bridge? If it's Python 3, the dictionaries are guaranteed to preserve the order of keys there so we could then skip all this OrderedDict stuff. Cheers, Vasily https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py File abp/filters/rpy.py (left): https://codereview.adblockplus.org/30031558/diff/30031559/abp/filters/rpy.py#... abp/filters/rpy.py:65: if isinstance(data, list): On 2019/03/27 19:14:06, rhowell wrote: > On 2019/03/22 12:17:04, Vasily Kuznetsov wrote: > > Unfortunately we can't remove the list branch here because sitekey option also > > has a list value (and that one probably shouldn't be converted to a dict). > > Ah, gotcha. Since we didn't have a test using the sitekey option, these lines > weren't being tested, so tox was failing. I added a test for sitekey, so test > coverage is back at 100%. > > But, sitekey options are not turned into a dict because line 44/46 only converts > 'domain' options into a nested dict. So that should be ok? Yeah, seems reasonable to keep it as a list, unless Stephan has other ideas. https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py#... abp/filters/rpy.py:90: return OrderedDict((strings2utf8(k), strings2utf8(v)) for k, v in Nit: it would be better to break the string before `for` -- it's more readable this way.
Hi Vasily! I believe Stephan is using Python 3, but I'm not sure about the rest of the data team. Should the OrderedDict also be removed from the test? It seems we'd still want to make sure the options are in order. I tested this (using `dict` in rpy.py and `OrderedDict` in test_rpy.py), expecting it to fail, but it doesn't, and I'm not sure why. Do you know why this test would be passing? Is it just me? Should I skip the test in Python 2? This is the test I was expecting to fail in Python 2: tests/test_rpy.py::test_line2dict_format[filter_with_sitekey_list] PASSED Best, Rosie https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py#... abp/filters/rpy.py:90: return OrderedDict((strings2utf8(k), strings2utf8(v)) for k, v in On 2019/04/02 17:36:22, Vasily Kuznetsov wrote: > Nit: it would be better to break the string before `for` -- it's more readable > this way. Looks like we might not need these lines anyway. :)
Hi Rosie! > Should the OrderedDict also be removed from the test? It seems we'd still want > to make sure the options are in order. I tested this (using `dict` in rpy.py and > `OrderedDict` in test_rpy.py), expecting it to fail, but it doesn't, and I'm not > sure why. Do you know why this test would be passing? Is it just me? Should I > skip the test in Python 2? > > This is the test I was expecting to fail in Python 2: > tests/test_rpy.py::test_line2dict_format[filter_with_sitekey_list] PASSED It's quite likely that when you compare dicts to other dicts (ordered or not) the order of the keys is ignored, so the test would not care about it. I have actually realized today that the order of the domains doesn't matter after all (see this comment for more details: https://issues.adblockplus.org/ticket/7443#comment:1). However, it still does matter for content types (because if the first content-type option is inverted, all default content types are assumed to be in, otherwise only the ones that are explicitly included; see https://github.com/adblockplus/adblockpluscore/blob/cb37f3bf799c47b92b4f1814e... for reference). Given that this is the case, I would say let's keep the OrderedDicts as long as we're supporting Python 2. If you would like, feel free to add a check that the returned values are OrderedDicts and the keys have the right order when the expected result has OrderedDicts, to test_line2dict_format, but TBH, I wouldn't bother at this point -- the time would be better spent migrating all of this off Python 2. > https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py > File abp/filters/rpy.py (right): > > https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py#... > abp/filters/rpy.py:90: return OrderedDict((strings2utf8(k), strings2utf8(v)) for > k, v in > On 2019/04/02 17:36:22, Vasily Kuznetsov wrote: > > Nit: it would be better to break the string before `for` -- it's more readable > > this way. > > Looks like we might not need these lines anyway. :) yay!
On 2019/04/09 20:10:30, Vasily Kuznetsov wrote: > Hi Rosie! > > > Should the OrderedDict also be removed from the test? It seems we'd still want > > to make sure the options are in order. I tested this (using `dict` in rpy.py > and > > `OrderedDict` in test_rpy.py), expecting it to fail, but it doesn't, and I'm > not > > sure why. Do you know why this test would be passing? Is it just me? Should I > > skip the test in Python 2? > > > > This is the test I was expecting to fail in Python 2: > > tests/test_rpy.py::test_line2dict_format[filter_with_sitekey_list] PASSED > > It's quite likely that when you compare dicts to other dicts (ordered or not) > the order of the keys is ignored, so the test would not care about it. > > I have actually realized today that the order of the domains doesn't matter > after all (see this comment for more details: > https://issues.adblockplus.org/ticket/7443#comment:1). However, it still does > matter for content types (because if the first content-type option is inverted, > all default content types are assumed to be in, otherwise only the ones that are > explicitly included; see > https://github.com/adblockplus/adblockpluscore/blob/cb37f3bf799c47b92b4f1814e... > for reference). > > Given that this is the case, I would say let's keep the OrderedDicts as long as > we're supporting Python 2. If you would like, feel free to add a check that the > returned values are OrderedDicts and the keys have the right order when the > expected result has OrderedDicts, to test_line2dict_format, but TBH, I wouldn't > bother at this point -- the time would be better spent migrating all of this off > Python 2. > > > https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py > > File abp/filters/rpy.py (right): > > > > > https://codereview.adblockplus.org/30031558/diff/30032564/abp/filters/rpy.py#... > > abp/filters/rpy.py:90: return OrderedDict((strings2utf8(k), strings2utf8(v)) > for > > k, v in > > On 2019/04/02 17:36:22, Vasily Kuznetsov wrote: > > > Nit: it would be better to break the string before `for` -- it's more > readable > > > this way. > > > > Looks like we might not need these lines anyway. :) > > yay! Hi Vasily! Yeah, you might be right that we should rather migrate off Python 2. I've gotten rid of the OrderedDicts, and I can make a new ticket for the migration. For this ticket, let me know if you see anything else?
Hi Rosie! LGTM. Cheers, Vasily |