|
|
Created:
Jan. 24, 2019, 3:37 a.m. by rhowell Modified:
Jan. 29, 2019, 8:49 p.m. Reviewers:
Vasily Kuznetsov Base URL:
https://hg.adblockplus.org/python-abp/ Visibility:
Public. |
DescriptionIssue 7230 - python-abp line2dict vectorization
Patch Set 1 #
Total comments: 4
Patch Set 2 : Switch from dict to list #
Total comments: 12
Patch Set 3 : Address comments on PS2 #
Total comments: 2
Patch Set 4 : Separate tests #
MessagesTotal messages: 8
Hi Rosie! If I understand the ticket right, the function should be more like a "vectorized" version of list2dict. See inline comments below for details. The README changes look unrelated. Maybe they should be in a NoIssue commit instead? Cheers, Vasily https://codereview.adblockplus.org/29988581/diff/29988582/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29988582/abp/filters/rpy.py#... abp/filters/rpy.py:110: Parsing mode. Note: The default is 'start' in case any of the filters I think the default should be 'body'. The data team are doing their own parsing of header and metadata, so in their use case this function will likely only see filter lines. Also this would be consistent with the behavior of `line2dict`, which we want to "vectorize" here. https://codereview.adblockplus.org/29988581/diff/29988582/abp/filters/rpy.py#... abp/filters/rpy.py:116: dict If I understand it right, the output should be a list of dicts, not dict of dicts. The idea of this function was to move the loop (line 122) from R to Python to avoid going through the bridge at each iteration. The final product that Stephan needs is a list again.
Hey Vasily! Oops, yeah Stephan also clarified that on the ticket. Switched it to return a list now. Also, I submitted the README changes in a Noissue. Let me know if you see anything else? Thanks! :) Rosie https://codereview.adblockplus.org/29988581/diff/29988582/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29988582/abp/filters/rpy.py#... abp/filters/rpy.py:110: Parsing mode. Note: The default is 'start' in case any of the filters On 2019/01/24 13:27:59, Vasily Kuznetsov wrote: > I think the default should be 'body'. The data team are doing their own parsing > of header and metadata, so in their use case this function will likely only see > filter lines. Also this would be consistent with the behavior of `line2dict`, > which we want to "vectorize" here. Done. https://codereview.adblockplus.org/29988581/diff/29988582/abp/filters/rpy.py#... abp/filters/rpy.py:116: dict On 2019/01/24 13:27:59, Vasily Kuznetsov wrote: > If I understand it right, the output should be a list of dicts, not dict of > dicts. The idea of this function was to move the loop (line 122) from R to > Python to avoid going through the bridge at each iteration. The final product > that Stephan needs is a list again. Done.
Hi Rosie! The function looks good. I have some suggestions regarding the docstring and the tests in the comments below. Cheers, Vasily https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#... abp/filters/rpy.py:108: filter. Headers and metadata will not parse properly unless the mode The connection between the first and the second sentence seems a bit confusing. How about something like "If the mode is 'start', headers and metadata can also be parsed" to make it more clear what we mean? https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#... abp/filters/rpy.py:115: dict Nit: it should be list of dict. https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:161: for key in _TEST_EXAMPLES.keys(): This is equivalent to `for key in _TEST_EXAMPLES:`. I would recommend to either use that and keep the rest of the code as is, or use the idiom: for key, value in _TEST_EXAMPLES.items(): This is a bit more "pythonic". https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:162: if (_TEST_EXAMPLES[key]['out'][b'type'] != b'Header' You can use `... not in {b'Header', b'Metadata'}`. https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out'] So in the end we compare the output of line2dict(_TEST_EXAMPLES[line_type]['in']) to _TEST_EXAMPLES[line_type]['out'] -- this seems like a test of line2dict with an innovative way of looping over some part of _TEST_EXAMPLES. But we are not doing much checking of the output of lines2dicts(). A better approach would be to first produce the list of all inputs and outputs: tests = [t for t in _TEST_EXAMPLES.values() if t[b'type'] not in {b'Header', b'Metadata'}] ins = [t['in'] for ex in tests] outs = [t['out'] for ex in tests] and then we could just compare the output of `lines2dicts(ins)` to `outs`. assert lines2dicts(string_list, 'start')
Hi Vasily! I think I've gotten all your suggestions implemented, just had one question, see below. :) https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py File abp/filters/rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#... abp/filters/rpy.py:108: filter. Headers and metadata will not parse properly unless the mode On 2019/01/25 15:00:58, Vasily Kuznetsov wrote: > The connection between the first and the second sentence seems a bit confusing. > How about something like "If the mode is 'start', headers and metadata can also > be parsed" to make it more clear what we mean? Done. https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#... abp/filters/rpy.py:115: dict On 2019/01/25 15:00:58, Vasily Kuznetsov wrote: > Nit: it should be list of dict. Done. https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:161: for key in _TEST_EXAMPLES.keys(): On 2019/01/25 15:00:58, Vasily Kuznetsov wrote: > This is equivalent to `for key in _TEST_EXAMPLES:`. I would recommend to either > use that and keep the rest of the code as is, or use the idiom: > > for key, value in _TEST_EXAMPLES.items(): > > This is a bit more "pythonic". Done. https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:162: if (_TEST_EXAMPLES[key]['out'][b'type'] != b'Header' On 2019/01/25 15:00:58, Vasily Kuznetsov wrote: > You can use `... not in {b'Header', b'Metadata'}`. Done. https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out'] On 2019/01/25 15:00:58, Vasily Kuznetsov wrote: > So in the end we compare the output of > line2dict(_TEST_EXAMPLES[line_type]['in']) to _TEST_EXAMPLES[line_type]['out'] > -- this seems like a test of line2dict with an innovative way of looping over > some part of _TEST_EXAMPLES. But we are not doing much checking of the output of > lines2dicts(). > > A better approach would be to first produce the list of all inputs and outputs: > > tests = [t for t in _TEST_EXAMPLES.values() > if t[b'type'] not in {b'Header', b'Metadata'}] > ins = [t['in'] for ex in tests] > outs = [t['out'] for ex in tests] > > and then we could just compare the output of `lines2dicts(ins)` to `outs`. > > assert lines2dicts(string_list, 'start') Good idea, this test makes much more sense. I was trying to avoid something like this because Python2 doesn't preserve the order of dicts, so the resulting lists have the dicts in different orders, but the tests are passing every time on Python2.7 so now I'm confused. Normally, lists that are in different order are not considered equal. E.g. --> 179 assert lines2dicts(ins) == outs ipdb> lines2dicts(ins) [{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector': {'type': 'css', 'value': 'div#ad1'}, 'action': 'hide', 'options': [('domain', [('foo.com', True), ('bar.com', True)])], 'text': 'foo.com,bar.com##div#ad1', 'type': 'Filter'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text': 'Comment'}, {'selector': {'type': 'css', 'value': 'div#ad1'}, 'action': 'hide', 'options': [('domain', [('foo.com', True)])], 'text': 'foo.com##div#ad1', 'type': 'Filter'}] ipdb> outs [{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector': {'value': 'div#ad1', 'type': 'css'}, 'action': 'hide', 'type': 'Filter', 'options': [('domain', [('foo.com', True), ('bar.com', True)])], 'text': 'foo.com,bar.com##div#ad1'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text': 'Comment'}, {'selector': {'value': 'div#ad1', 'type': 'css'}, 'action': 'hide', 'type': 'Filter', 'options': [('domain', [('foo.com', True)])], 'text': 'foo.com##div#ad1'}] ipdb> lines2dicts(ins) == outs True But: $ python2 Python 2.7.12 (default, Nov 12 2018, 14:36:49) [GCC 5.4.0 20160609] on linux2 >>> a = [{'a':1}, {'b':2}] >>> b = [{'b':2}, {'a':1}] >>> a == b False
Hi Rosie! Almost done! I have replied to your question and have one more minor suggestion about the tests. Cheers, Vasily https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out'] On 2019/01/26 01:08:22, rhowell wrote: > On 2019/01/25 15:00:58, Vasily Kuznetsov wrote: > > So in the end we compare the output of > > line2dict(_TEST_EXAMPLES[line_type]['in']) to _TEST_EXAMPLES[line_type]['out'] > > -- this seems like a test of line2dict with an innovative way of looping over > > some part of _TEST_EXAMPLES. But we are not doing much checking of the output > of > > lines2dicts(). > > > > A better approach would be to first produce the list of all inputs and > outputs: > > > > tests = [t for t in _TEST_EXAMPLES.values() > > if t[b'type'] not in {b'Header', b'Metadata'}] > > ins = [t['in'] for ex in tests] > > outs = [t['out'] for ex in tests] > > > > and then we could just compare the output of `lines2dicts(ins)` to `outs`. > > > > assert lines2dicts(string_list, 'start') > > Good idea, this test makes much more sense. I was trying to avoid something like > this because Python2 doesn't preserve the order of dicts, so the resulting lists > have the dicts in different orders, but the tests are passing every time on > Python2.7 so now I'm confused. Normally, lists that are in different order are > not considered equal. > > E.g. > > --> 179 assert lines2dicts(ins) == outs > > ipdb> lines2dicts(ins) > [{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector': > {'type': 'css', 'value': 'div#ad1'}, 'action': 'hide', 'options': [('domain', > [('foo.com', True), ('bar.com', True)])], 'text': 'foo.com,bar.com##div#ad1', > 'type': 'Filter'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text': > 'Comment'}, {'selector': {'type': 'css', 'value': 'div#ad1'}, 'action': 'hide', > 'options': [('domain', [('foo.com', True)])], 'text': 'foo.com##div#ad1', > 'type': 'Filter'}] > > ipdb> outs > [{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector': > {'value': 'div#ad1', 'type': 'css'}, 'action': 'hide', 'type': 'Filter', > 'options': [('domain', [('foo.com', True), ('bar.com', True)])], 'text': > 'foo.com,bar.com##div#ad1'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text': > 'Comment'}, {'selector': {'value': 'div#ad1', 'type': 'css'}, 'action': 'hide', > 'type': 'Filter', 'options': [('domain', [('foo.com', True)])], 'text': > 'foo.com##div#ad1'}] > > ipdb> lines2dicts(ins) == outs > True > > But: > > $ python2 > Python 2.7.12 (default, Nov 12 2018, 14:36:49) > [GCC 5.4.0 20160609] on linux2 > >>> a = [{'a':1}, {'b':2}] > >>> b = [{'b':2}, {'a':1}] > >>> a == b > False This happens because the proposed code doesn't depend on the order of iteration over a dict. The first line picks the tests that we are going to run and saves them in a python list (`tests`). This way the order stays the same when we use it to produce the lists of inputs and outputs. https://codereview.adblockplus.org/29988581/diff/29990578/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29990578/tests/test_rpy.py#n... tests/test_rpy.py:163: test_body = [t for t in _TEST_EXAMPLES.values() What do you think about separating this part into a separate test? Would be useful to have more granularity of test run results and it seems quite independent from everything above.
Hi Vasily, Thanks for the feedback. Let me know if you see anything else? https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n... tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out'] On 2019/01/28 17:47:36, Vasily Kuznetsov wrote: > > This happens because the proposed code doesn't depend on the order of iteration > over a dict. The first line picks the tests that we are going to run and saves > them in a python list (`tests`). This way the order stays the same when we use > it to produce the lists of inputs and outputs. Ah, right, makes sense. I ran into this issue when I thought I read `lines2dict`, and it was returning a dict. But now that it returns a list, the order it preserved. Got it, thanks for the clarification! https://codereview.adblockplus.org/29988581/diff/29990578/tests/test_rpy.py File tests/test_rpy.py (right): https://codereview.adblockplus.org/29988581/diff/29990578/tests/test_rpy.py#n... tests/test_rpy.py:163: test_body = [t for t in _TEST_EXAMPLES.values() On 2019/01/28 17:47:36, Vasily Kuznetsov wrote: > What do you think about separating this part into a separate test? Would be > useful to have more granularity of test run results and it seems quite > independent from everything above. Done.
LGTM |