Issue 29988581: Issue 7230 - python-abp line2dict vectorization

Issue 29988581: Issue 7230 - python-abp line2dict vectorization (Closed)

Created:
Jan. 24, 2019, 3:37 a.m. by rhowell

Modified:
Jan. 29, 2019, 8:49 p.m.

Reviewers:
Vasily Kuznetsov

Base URL:
https://hg.adblockplus.org/python-abp/

Visibility:
Public.

Description

Issue 7230 - python-abp line2dict vectorization

Patch Set 1 #

Total comments: 4

Patch Set 2 : Switch from dict to list #

Total comments: 12

Patch Set 3 : Address comments on PS2 #

Total comments: 2

Patch Set 4 : Separate tests #

Created: Jan. 29, 2019, 1:43 a.m.

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+55 lines, -1 line)			Patch
M	.hgignore	View		1 chunk	+1 line, -0 lines	0 comments	Download
M	abp/filters/rpy.py	View	1 2	1 chunk	+28 lines, -0 lines	0 comments	Download
M	tests/test_rpy.py	View	1 2 3	2 chunks	+26 lines, -1 line	0 comments	Download

Messages

Total messages: 8

Expand All Messages | Collapse All Messages

Vasily Kuznetsov

Hi Rosie! If I understand the ticket right, the function should be more like a ...

Jan. 24, 2019, 1:27 p.m. (2019-01-24 13:27:59 UTC) #2

rhowell

Hey Vasily! Oops, yeah Stephan also clarified that on the ticket. Switched it to return ...

Jan. 25, 2019, 12:30 a.m. (2019-01-25 00:30:07 UTC) #3

Vasily Kuznetsov

Hi Rosie! The function looks good. I have some suggestions regarding the docstring and the ...

Jan. 25, 2019, 3 p.m. (2019-01-25 15:00:59 UTC) #4

Hi Rosie!

The function looks good. I have some suggestions regarding the docstring and the
tests in the comments below.

Cheers,
Vasily

https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py
File abp/filters/rpy.py (right):

https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#...
abp/filters/rpy.py:108: filter. Headers and metadata will not parse properly
unless the mode
The connection between the first and the second sentence seems a bit confusing.
How about something like "If the mode is 'start', headers and metadata can also
be parsed" to make it more clear what we mean?

https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#...
abp/filters/rpy.py:115: dict
Nit: it should be list of dict.

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py
File tests/test_rpy.py (right):

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:161: for key in _TEST_EXAMPLES.keys():
This is equivalent to `for key in _TEST_EXAMPLES:`. I would recommend to either
use that and keep the rest of the code as is, or use the idiom:

    for key, value in _TEST_EXAMPLES.items():

This is a bit more "pythonic".

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:162: if (_TEST_EXAMPLES[key]['out'][b'type'] != b'Header'
You can use `... not in {b'Header', b'Metadata'}`.

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out']
So in the end we compare the output of
line2dict(_TEST_EXAMPLES[line_type]['in']) to _TEST_EXAMPLES[line_type]['out']
-- this seems like a test of line2dict with an innovative way of looping over
some part of _TEST_EXAMPLES. But we are not doing much checking of the output of
lines2dicts().

A better approach would be to first produce the list of all inputs and outputs:

    tests = [t for t in _TEST_EXAMPLES.values()
             if t[b'type'] not in {b'Header', b'Metadata'}]
    ins = [t['in'] for ex in tests]
    outs = [t['out'] for ex in tests]

and then we could just compare the output of `lines2dicts(ins)` to `outs`.

    assert lines2dicts(string_list, 'start')

rhowell

Hi Vasily! I think I've gotten all your suggestions implemented, just had one question, see ...

Jan. 26, 2019, 1:08 a.m. (2019-01-26 01:08:22 UTC) #5

Hi Vasily!

I think I've gotten all your suggestions implemented, just had one question, see
below.  :)

https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py
File abp/filters/rpy.py (right):

https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#...
abp/filters/rpy.py:108: filter. Headers and metadata will not parse properly
unless the mode
On 2019/01/25 15:00:58, Vasily Kuznetsov wrote:
> The connection between the first and the second sentence seems a bit
confusing.
> How about something like "If the mode is 'start', headers and metadata can
also
> be parsed" to make it more clear what we mean?

Done.

https://codereview.adblockplus.org/29988581/diff/29989584/abp/filters/rpy.py#...
abp/filters/rpy.py:115: dict
On 2019/01/25 15:00:58, Vasily Kuznetsov wrote:
> Nit: it should be list of dict.

Done.

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py
File tests/test_rpy.py (right):

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:161: for key in _TEST_EXAMPLES.keys():
On 2019/01/25 15:00:58, Vasily Kuznetsov wrote:
> This is equivalent to `for key in _TEST_EXAMPLES:`. I would recommend to
either
> use that and keep the rest of the code as is, or use the idiom:
> 
>     for key, value in _TEST_EXAMPLES.items():
> 
> This is a bit more "pythonic".

Done.

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:162: if (_TEST_EXAMPLES[key]['out'][b'type'] != b'Header'
On 2019/01/25 15:00:58, Vasily Kuznetsov wrote:
> You can use `... not in {b'Header', b'Metadata'}`.

Done.

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out']
On 2019/01/25 15:00:58, Vasily Kuznetsov wrote:
> So in the end we compare the output of
> line2dict(_TEST_EXAMPLES[line_type]['in']) to _TEST_EXAMPLES[line_type]['out']
> -- this seems like a test of line2dict with an innovative way of looping over
> some part of _TEST_EXAMPLES. But we are not doing much checking of the output
of
> lines2dicts().
> 
> A better approach would be to first produce the list of all inputs and
outputs:
> 
>     tests = [t for t in _TEST_EXAMPLES.values()
>              if t[b'type'] not in {b'Header', b'Metadata'}]
>     ins = [t['in'] for ex in tests]
>     outs = [t['out'] for ex in tests]
> 
> and then we could just compare the output of `lines2dicts(ins)` to `outs`.
> 
>     assert lines2dicts(string_list, 'start') 

Good idea, this test makes much more sense. I was trying to avoid something like
this because Python2 doesn't preserve the order of dicts, so the resulting lists
have the dicts in different orders, but the tests are passing every time on
Python2.7 so now I'm confused. Normally, lists that are in different order are
not considered equal.

E.g.

--> 179     assert lines2dicts(ins) == outs

ipdb> lines2dicts(ins)
[{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector':
{'type': 'css', 'value': 'div#ad1'}, 'action': 'hide', 'options': [('domain',
[('foo.com', True), ('bar.com', True)])], 'text': 'foo.com,bar.com##div#ad1',
'type': 'Filter'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text':
'Comment'}, {'selector': {'type': 'css', 'value': 'div#ad1'}, 'action': 'hide',
'options': [('domain', [('foo.com', True)])], 'text': 'foo.com##div#ad1',
'type': 'Filter'}]

ipdb> outs
[{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector':
{'value': 'div#ad1', 'type': 'css'}, 'action': 'hide', 'type': 'Filter',
'options': [('domain', [('foo.com', True), ('bar.com', True)])], 'text':
'foo.com,bar.com##div#ad1'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text':
'Comment'}, {'selector': {'value': 'div#ad1', 'type': 'css'}, 'action': 'hide',
'type': 'Filter', 'options': [('domain', [('foo.com', True)])], 'text':
'foo.com##div#ad1'}]

ipdb> lines2dicts(ins) == outs
True

But:

$ python2
Python 2.7.12 (default, Nov 12 2018, 14:36:49) 
[GCC 5.4.0 20160609] on linux2
>>> a = [{'a':1}, {'b':2}]
>>> b = [{'b':2}, {'a':1}]
>>> a == b
False

Vasily Kuznetsov

Hi Rosie! Almost done! I have replied to your question and have one more minor ...

Jan. 28, 2019, 5:47 p.m. (2019-01-28 17:47:36 UTC) #6

Hi Rosie!

Almost done! I have replied to your question and have one more minor suggestion
about the tests.

Cheers,
Vasily

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py
File tests/test_rpy.py (right):

https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py#n...
tests/test_rpy.py:172: assert data == _TEST_EXAMPLES[line_type]['out']
On 2019/01/26 01:08:22, rhowell wrote:
> On 2019/01/25 15:00:58, Vasily Kuznetsov wrote:
> > So in the end we compare the output of
> > line2dict(_TEST_EXAMPLES[line_type]['in']) to
_TEST_EXAMPLES[line_type]['out']
> > -- this seems like a test of line2dict with an innovative way of looping
over
> > some part of _TEST_EXAMPLES. But we are not doing much checking of the
output
> of
> > lines2dicts().
> > 
> > A better approach would be to first produce the list of all inputs and
> outputs:
> > 
> >     tests = [t for t in _TEST_EXAMPLES.values()
> >              if t[b'type'] not in {b'Header', b'Metadata'}]
> >     ins = [t['in'] for ex in tests]
> >     outs = [t['out'] for ex in tests]
> > 
> > and then we could just compare the output of `lines2dicts(ins)` to `outs`.
> > 
> >     assert lines2dicts(string_list, 'start') 
> 
> Good idea, this test makes much more sense. I was trying to avoid something
like
> this because Python2 doesn't preserve the order of dicts, so the resulting
lists
> have the dicts in different orders, but the tests are passing every time on
> Python2.7 so now I'm confused. Normally, lists that are in different order are
> not considered equal.
> 
> E.g.
> 
> --> 179     assert lines2dicts(ins) == outs
> 
> ipdb> lines2dicts(ins)
> [{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector':
> {'type': 'css', 'value': 'div#ad1'}, 'action': 'hide', 'options': [('domain',
> [('foo.com', True), ('bar.com', True)])], 'text': 'foo.com,bar.com##div#ad1',
> 'type': 'Filter'}, {'type': 'EmptyLine'}, {'type': 'Comment', 'text':
> 'Comment'}, {'selector': {'type': 'css', 'value': 'div#ad1'}, 'action':
'hide',
> 'options': [('domain', [('foo.com', True)])], 'text': 'foo.com##div#ad1',
> 'type': 'Filter'}]
> 
> ipdb> outs
> [{'target': 'www.test.py/filtelist.txt', 'type': 'Include'}, {'selector':
> {'value': 'div#ad1', 'type': 'css'}, 'action': 'hide', 'type': 'Filter',
> 'options': [('domain', [('foo.com', True), ('bar.com', True)])], 'text':
> 'foo.com,bar.com##div#ad1'}, {'type': 'EmptyLine'}, {'type': 'Comment',
'text':
> 'Comment'}, {'selector': {'value': 'div#ad1', 'type': 'css'}, 'action':
'hide',
> 'type': 'Filter', 'options': [('domain', [('foo.com', True)])], 'text':
> 'foo.com##div#ad1'}]
> 
> ipdb> lines2dicts(ins) == outs
> True
> 
> But:
> 
> $ python2
> Python 2.7.12 (default, Nov 12 2018, 14:36:49) 
> [GCC 5.4.0 20160609] on linux2
> >>> a = [{'a':1}, {'b':2}]
> >>> b = [{'b':2}, {'a':1}]
> >>> a == b
> False

This happens because the proposed code doesn't depend on the order of iteration
over a dict. The first line picks the tests that we are going to run and saves
them in a python list (`tests`). This way the order stays the same when we use
it to produce the lists of inputs and outputs.

https://codereview.adblockplus.org/29988581/diff/29990578/tests/test_rpy.py
File tests/test_rpy.py (right):

https://codereview.adblockplus.org/29988581/diff/29990578/tests/test_rpy.py#n...
tests/test_rpy.py:163: test_body = [t for t in _TEST_EXAMPLES.values()
What do you think about separating this part into a separate test? Would be
useful to have more granularity of test run results and it seems quite
independent from everything above.

rhowell

Hi Vasily, Thanks for the feedback. Let me know if you see anything else? https://codereview.adblockplus.org/29988581/diff/29989584/tests/test_rpy.py ...

Jan. 29, 2019, 1:44 a.m. (2019-01-29 01:44:48 UTC) #7

LGTM

Expand All Messages | Collapse All Messages