Issue 29845767: Issue 6685 - Offer incremental filter list downloads

rhowell

Aug. 3, 2018, 1:58 a.m. (2018-08-03 01:58:27 UTC) #1

rhowell

On 2018/08/03 01:58:27, rhowell wrote: Hi Vasily! Wanted to check in and get some feedback ...

Aug. 3, 2018, 2:03 a.m. (2018-08-03 02:03:27 UTC) #2

Vasily Kuznetsov

On 2018/08/03 02:03:27, rhowell wrote: > On 2018/08/03 01:58:27, rhowell wrote: > > Hi Vasily! ...

Aug. 3, 2018, 5:39 p.m. (2018-08-03 17:39:27 UTC) #3

On 2018/08/03 02:03:27, rhowell wrote:
> On 2018/08/03 01:58:27, rhowell wrote:
> 
> Hi Vasily!
> Wanted to check in and get some feedback so far. I'll work on writing the
script
> to call these functions next. I'm thinking it'll go something like:
> 
> Make the latest-diff (blank)
> For each file in /diffs/:
>   If it's still within the time window:
>     generate the diff
>     publish the diff
>   else:
>     delete the file (will the server automatically serve a 404 then?)

Hi Rosie!

The logic that you've outlined looks right. However, I think it makes more sense
to split it into a script that generates the diff between two filter lists (this
will go in python-abp) and some additional logic that determines which diffs are
published and where they go (this should probably be part of the
infrastructure). Let's start with the diff generation first (the script could be
called something like "fldiff" to keep with the existing naming scheme).

As for the script, you're using difflib, which would make sense for normal
diffs, but this approach might produce extra lines in the diff for the filters
that were rearranged or moved. Since we don't care about the order of the
filters, we could just do the following:
1. For each filter list take its metadata comments and filters and put them into
two sets. Discard everything else.
2. Put into the output only the metadata comments that are in the new filter
list but not in the old (treat metadata comments with same name but different
values as different ones) -- just use set subtraction for this.
3. Take the sets of filters, subtract new from old and put those into the output
with a "-". Subtract old from new and put those into the output with a "+".

You can use existing functionality of python-abp to read and write filter lists
and you will just need to add a different kind of header, that's used for diffs.

Does this make sense? Feel free to also ask me on IRC for speed.

Cheers,
Vasily

rhowell

It wasn't possible to use the namedtuple filter list objects in sets, because they contain ...

Aug. 9, 2018, 7:39 p.m. (2018-08-09 19:39:02 UTC) #4

Vasily Kuznetsov

Hi Rosie! Sorry for the late reply -- my vacation has started just when you ...

Aug. 17, 2018, 10:30 a.m. (2018-08-17 10:30:54 UTC) #5

Hi Rosie!

Sorry for the late reply -- my vacation has started just when you submitted the
patch.

This looks pretty good. I have a few comments here and there but it's mostly
some suggestions about structuring the code.

Regarding hashing of line objects, it's a bummer that there are dicts inside of
them. I was thinking that perhaps we could replace those dicts with persistent
dicts that would be hashable, however this would present us with another
problem: the added and removed lines would still need new data types to
represent them and it doesn't seem like there's much use to all of this stuff,
at least not at this point. I'm thus leaning towards keeping your simple
string-based logic and string-based API -- we can always change it later if we
decide that we do need the added-filter and removed-filter objects. TLDR: don't
worry about this paragraph :D

Cheers,
Vasily

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:29: __all__ = ['IncludeError', 'MissingHeader',
'render_filterlist']
render_diff() should probably become a public export here.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:204: def render_diff(base, latest):
I think it makes more sense to make `base`, `latest` and the return value
iterables of strings. This is what parse_filterlist() expects and this is what
you get when you open a file. This is also the API that other functions dealing
with filterlists take and return. It would also reduce the amount of code here
eliminating the calls to splitlines() and some others.

It would also be good to document the API of this function in the same way as
the other functions are documented.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in
line.to_string():
Note: If this lands after the checksum patch, the 'Checksum' exception should be
removed. If this lands first, it should be removed by that patch.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:219: for line in parse_filterlist(base.splitlines()):
It would be good to move this repeated piece of code into a small internal
function. This will also make render_diff() more compact and easy to follow.
Maybe something like:

    base_fl, base_md, base_keys = _split_list_for_diff(base)

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:230: diff.append('{}\n'.format(item))
If we return an iterable of strings from this function we can just yield in this
and the following loops.

rhowell

Hey Vasily! Thanks for the feedback; the structure seems a lot cleaner like this. Just ...

Aug. 20, 2018, 6:21 p.m. (2018-08-20 18:21:27 UTC) #6

Hey Vasily!

Thanks for the feedback; the structure seems a lot cleaner like this. Just had a
quick questions about the checksum thing. Otherwise, let me know if you see any
other issues?

Thanks!  :)
Rosie

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:29: __all__ = ['IncludeError', 'MissingHeader',
'render_filterlist']
On 2018/08/17 10:30:54, Vasily Kuznetsov wrote:
> render_diff() should probably become a public export here.

Done.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:204: def render_diff(base, latest):
On 2018/08/17 10:30:54, Vasily Kuznetsov wrote:
> I think it makes more sense to make `base`, `latest` and the return value
> iterables of strings. This is what parse_filterlist() expects and this is what
> you get when you open a file. This is also the API that other functions
dealing
> with filterlists take and return. It would also reduce the amount of code here
> eliminating the calls to splitlines() and some others.
> 
> It would also be good to document the API of this function in the same way as
> the other functions are documented.

Done.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in
line.to_string():
On 2018/08/17 10:30:53, Vasily Kuznetsov wrote:
> Note: If this lands after the checksum patch, the 'Checksum' exception should
be
> removed. If this lands first, it should be removed by that patch.

Are we guaranteed that no filter lists will be encountered that have a checksum?
This line and line 220 make sure that, if we ever see a checksum in either
`base` or `latest`, it will not make it into the diff. But, if we will never see
a checksum, then this check is unnecessary.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:219: for line in parse_filterlist(base.splitlines()):
On 2018/08/17 10:30:53, Vasily Kuznetsov wrote:
> It would be good to move this repeated piece of code into a small internal
> function. This will also make render_diff() more compact and easy to follow.
> Maybe something like:
> 
>     base_fl, base_md, base_keys = _split_list_for_diff(base)

Done.

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:230: diff.append('{}\n'.format(item))
On 2018/08/17 10:30:53, Vasily Kuznetsov wrote:
> If we return an iterable of strings from this function we can just yield in
this
> and the following loops.

Done.

Vasily Kuznetsov

LGTM https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/renderer.py#newcode212 abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in ...

Aug. 21, 2018, 3 p.m. (2018-08-21 15:00:00 UTC) #7

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/renderer.py#newcode212 abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): ...

Aug. 21, 2018, 7:42 p.m. (2018-08-21 19:42:46 UTC) #8

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in
line.to_string():
On 2018/08/21 14:59:59, Vasily Kuznetsov wrote:
> On 2018/08/20 18:21:27, rhowell wrote:
> > On 2018/08/17 10:30:53, Vasily Kuznetsov wrote:
> > > Note: If this lands after the checksum patch, the 'Checksum' exception
> should
> > be
> > > removed. If this lands first, it should be removed by that patch.
> > 
> > Are we guaranteed that no filter lists will be encountered that have a
> checksum?
> > This line and line 220 make sure that, if we ever see a checksum in either
> > `base` or `latest`, it will not make it into the diff. But, if we will never
> see
> > a checksum, then this check is unnecessary.
> 
> Hm. No, for now we're not guaranteed that there's no Checksum, so you're
right,
> we should keep this.

Isn't python-abp stripping checksums when generating the filter lists now, and
therefore isn't it in fact guaranteed that filter lists delivered by our servers
at least won't include any checksum? Even if not, any reason we'd have to strip
them here? Clients that support diffs, will ignore checksums anyway.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'}
I would prefer if python-abp would be agnostic of the supported metadata keys.
Otherwise, we have to keep this set in sync with what is implemented in Adblock
Plus, which seems unpractical.

However, note that the regexp above would have to be changed like following so
that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)'

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS:
Note that metadata keys are case-insensitive. If we go with my above suggestion,
this won't matter here anymore. But once non-capitalized keys are recognized
this needs to be accounted for in _remove_duplicates() in renderer.py as well.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:292: elif content.startswith('[') and
content.endswith(']'):
Somewhat unrelated of these changes, but this is inaccurate too. Only the first
line (if encapsulated in brackets) is to be recognized as header. Otherwise it
would be considered a filter.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:187: for line in parse_filterlist(list_in):
When parsing generated filter lists in order to generate diffs, we might want to
make the parser ignore includes/instructions.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:230: yield '- {}'.format(item)
In the specification, we demand the client to process removed filters first. But
given this implementation, it seems simpler if we'd rather make the diff list
the removed filters first, so that the client can process the diff in one pass.

https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.py
File tests/test_differ.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.p...
tests/test_differ.py:61: test
Perhaps add a leading whitespace here, so that we also test that filter texts
are stripped before being compared.

Vasily Kuznetsov

Thanks for the comments, Sebastian! Here are a few comments on the comments :) https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py ...

Aug. 22, 2018, 1:10 p.m. (2018-08-22 13:10:50 UTC) #9

Thanks for the comments, Sebastian! Here are a few comments on the comments :)

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'}
On 2018/08/21 19:42:45, Sebastian Noack wrote:
> I would prefer if python-abp would be agnostic of the supported metadata keys.
> Otherwise, we have to keep this set in sync with what is implemented in
Adblock
> Plus, which seems unpractical.
> 
> However, note that the regexp above would have to be changed like following so
> that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)'

Makes sense and I support it. I actually thought it was already implemented this
way before I started reviewing this change :)

BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we
can still have dashes in metadata field names.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS:
On 2018/08/21 19:42:46, Sebastian Noack wrote:
> Note that metadata keys are case-insensitive. If we go with my above
suggestion,
> this won't matter here anymore. But once non-capitalized keys are recognized
> this needs to be accounted for in _remove_duplicates() in renderer.py as well.

Should we then maybe adopt some canonical capitalization scheme (for example
Like-This) and then convert them here to follow it? Otherwise we will have to
remember this in any code that works with metadata keys.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:292: elif content.startswith('[') and
content.endswith(']'):
On 2018/08/21 19:42:45, Sebastian Noack wrote:
> Somewhat unrelated of these changes, but this is inaccurate too. Only the
first
> line (if encapsulated in brackets) is to be recognized as header. Otherwise it
> would be considered a filter.

Makes sense. I created a separate ticket for it:
https://issues.adblockplus.org/ticket/6877

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:187: for line in parse_filterlist(list_in):
On 2018/08/21 19:42:46, Sebastian Noack wrote:
> When parsing generated filter lists in order to generate diffs, we might want
to
> make the parser ignore includes/instructions.

Theoretically there should be no includes/instructions in the final lists (and
we're not planning to produce diffs of fragments of filter lists), so this is
not really necessary. And even if there were includes, the code below ignores
them, which achieves the same outcome as the parsing ignoring them, so it seems
that things are ok as is.

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py#newcode145 abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'} On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: ...

Aug. 22, 2018, 2:31 p.m. (2018-08-22 14:31:06 UTC) #10

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'}
On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> On 2018/08/21 19:42:45, Sebastian Noack wrote:
> > I would prefer if python-abp would be agnostic of the supported metadata
keys.
> > Otherwise, we have to keep this set in sync with what is implemented in
> Adblock
> > Plus, which seems unpractical.
> > 
> > However, note that the regexp above would have to be changed like following
so
> > that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)'
> 
> Makes sense and I support it. I actually thought it was already implemented
this
> way before I started reviewing this change :)
> 
> BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we
> can still have dashes in metadata field names.

You are right, well spotted.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS:
On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > Note that metadata keys are case-insensitive. If we go with my above
> suggestion,
> > this won't matter here anymore. But once non-capitalized keys are recognized
> > this needs to be accounted for in _remove_duplicates() in renderer.py as
well.
> 
> Should we then maybe adopt some canonical capitalization scheme (for example
> Like-This) and then convert them here to follow it? Otherwise we will have to
> remember this in any code that works with metadata keys.

Initially, I had a similar though, but on the other hand I'd prefer to not
unnecessarily have the output diverge from the input where (easily) possible,
and it seems the only place where we specifically check for a particular
metadata key is where we strip checksums, besides we'd have to account for
case-insensitivity as well when comparing metadata for the diff.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:187: for line in parse_filterlist(list_in):
On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > When parsing generated filter lists in order to generate diffs, we might
want
> to
> > make the parser ignore includes/instructions.
> 
> Theoretically there should be no includes/instructions in the final lists (and
> we're not planning to produce diffs of fragments of filter lists), so this is
> not really necessary. And even if there were includes, the code below ignores
> them, which achieves the same outcome as the parsing ignoring them, so it
seems
> that things are ok as is.

Granted, this is quite an edge case. But in theory if we'll find  "%include" in
the input the diff is generated from, then it seems somewhat more accurate to
just treat it as a filter (like Adblock Plus would) rather than stripping this
line (or perhaps a warning would be in order). But I don't feel too strong.

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/renderer.py#newcode212 abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): ...

Aug. 22, 2018, 4:03 p.m. (2018-08-22 16:03:30 UTC) #11

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere...
abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in
line.to_string():
On 2018/08/21 19:42:45, Sebastian Noack wrote:
> On 2018/08/21 14:59:59, Vasily Kuznetsov wrote:
> > On 2018/08/20 18:21:27, rhowell wrote:
> > > On 2018/08/17 10:30:53, Vasily Kuznetsov wrote:
> > > > Note: If this lands after the checksum patch, the 'Checksum' exception
> > should
> > > be
> > > > removed. If this lands first, it should be removed by that patch.
> > > 
> > > Are we guaranteed that no filter lists will be encountered that have a
> > checksum?
> > > This line and line 220 make sure that, if we ever see a checksum in either
> > > `base` or `latest`, it will not make it into the diff. But, if we will
never
> > see
> > > a checksum, then this check is unnecessary.
> > 
> > Hm. No, for now we're not guaranteed that there's no Checksum, so you're
> right,
> > we should keep this.
> 
> Isn't python-abp stripping checksums when generating the filter lists now, and
> therefore isn't it in fact guaranteed that filter lists delivered by our
servers
> at least won't include any checksum? Even if not, any reason we'd have to
strip
> them here? Clients that support diffs, will ignore checksums anyway.

Moving the discussion from IRC over here:

16:34:47 <•vasily> snoack: the second one I agree with you. For the first one,
at the moment, I think we have other code producing filter lists in production,
so they do have checksums. But we might convert all of this to python-abp
(perhaps together with deployment of the incremental updates) so you would also
be right then... however, we haven't yet done it.
16:36:03 <•vasily> snoack: in general seems like the checksums will be ignored
by diff-aware clients, so yeah, it can be removed.  
17:24:01 <snoack> vasily: How about ignoring checksum metadata in the parser
(rather than when rendering the full list and/or diff)? That way checksums would
be stripped no matter what, without any additional complexity.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:230: yield '- {}'.format(item)
On 2018/08/21 19:42:46, Sebastian Noack wrote:
> In the specification, we demand the client to process removed filters first.
But
> given this implementation, it seems simpler if we'd rather make the diff list
> the removed filters first, so that the client can process the diff in one
pass.

After discussing this on IRC, Vasily agrees. I updated the spec accordingly.

Vasily Kuznetsov

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py#newcode156 abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS: On 2018/08/22 14:31:06, ...

Aug. 24, 2018, 1:03 p.m. (2018-08-24 13:03:19 UTC) #12

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS:
On 2018/08/22 14:31:06, Sebastian Noack wrote:
> On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> > On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > > Note that metadata keys are case-insensitive. If we go with my above
> > suggestion,
> > > this won't matter here anymore. But once non-capitalized keys are
recognized
> > > this needs to be accounted for in _remove_duplicates() in renderer.py as
> well.
> > 
> > Should we then maybe adopt some canonical capitalization scheme (for example
> > Like-This) and then convert them here to follow it? Otherwise we will have
to
> > remember this in any code that works with metadata keys.
> 
> Initially, I had a similar though, but on the other hand I'd prefer to not
> unnecessarily have the output diverge from the input where (easily) possible,
> and it seems the only place where we specifically check for a particular
> metadata key is where we strip checksums, besides we'd have to account for
> case-insensitivity as well when comparing metadata for the diff.

The output that doesn't diverge from the input unnecessarily seems like a useful
feature. In principle, if we normalize the headers, eventually all "our" filter
lists will have normalized headers and they won't diverge anyway, but this might
not be the case just yet. Anyway, for the time being it's not a big deal and
it's unlikely to result in problems in the future so I don't feel too strongly
about this.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:187: for line in parse_filterlist(list_in):
On 2018/08/22 14:31:06, Sebastian Noack wrote:
> On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> > On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > > When parsing generated filter lists in order to generate diffs, we might
> want
> > to
> > > make the parser ignore includes/instructions.
> > 
> > Theoretically there should be no includes/instructions in the final lists
(and
> > we're not planning to produce diffs of fragments of filter lists), so this
is
> > not really necessary. And even if there were includes, the code below
ignores
> > them, which achieves the same outcome as the parsing ignoring them, so it
> seems
> > that things are ok as is.
> 
> Granted, this is quite an edge case. But in theory if we'll find  "%include"
in
> the input the diff is generated from, then it seems somewhat more accurate to
> just treat it as a filter (like Adblock Plus would) rather than stripping this
> line (or perhaps a warning would be in order). But I don't feel too strong.

To do this we would need to start parsing fragments of filter lists differently
from rendered filter lists. It does make sense but it also makes things more
complicated and the cases where it's relevant seem to be pretty unrealistic and
edge-casey.

I lean towards keeping the code unified and parsing "%include"s as instructions
everywhere until we have evidence that there are cases where it would be useful
to make the distinction.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:230: yield '- {}'.format(item)
On 2018/08/22 16:03:29, Sebastian Noack wrote:
> On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > In the specification, we demand the client to process removed filters first.
> But
> > given this implementation, it seems simpler if we'd rather make the diff
list
> > the removed filters first, so that the client can process the diff in one
> pass.
> 
> After discussing this on IRC, Vasily agrees. I updated the spec accordingly.

👍

rhowell

Thanks for the input! I made the suggested changes. Let me know if you see ...

Aug. 27, 2018, 10:06 p.m. (2018-08-27 22:06:27 UTC) #13

Thanks for the input! I made the suggested changes. Let me know if you see any
other issues?

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py
File abp/filters/parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'}
On 2018/08/22 14:31:06, Sebastian Noack wrote:
> On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> > On 2018/08/21 19:42:45, Sebastian Noack wrote:
> > > I would prefer if python-abp would be agnostic of the supported metadata
> keys.
> > > Otherwise, we have to keep this set in sync with what is implemented in
> > Adblock
> > > Plus, which seems unpractical.
> > > 
> > > However, note that the regexp above would have to be changed like
following
> so
> > > that it won't match lines with arbitrary URLs:
r'!\s*(\w+)\s*:(?!//)\s*(.*)'
> > 
> > Makes sense and I support it. I actually thought it was already implemented
> this
> > way before I started reviewing this change :)
> > 
> > BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that
we
> > can still have dashes in metadata field names.
> 
> You are right, well spotted.

It appears this was done to prevent mistaking a comment with a colon for
metadata, like in this test:

def test_parse_nonmeta():
    line = parse_line('! WrongHeader: something')
    assert line.type == 'comment'

It probably doesn't matter too much, since the parser wouldn't be able to parse
invalid metadata anyway. I'll update the test so it doesn't use a colon.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS:
On 2018/08/24 13:03:19, Vasily Kuznetsov wrote:
> On 2018/08/22 14:31:06, Sebastian Noack wrote:
> > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> > > On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > > > Note that metadata keys are case-insensitive. If we go with my above
> > > suggestion,
> > > > this won't matter here anymore. But once non-capitalized keys are
> recognized
> > > > this needs to be accounted for in _remove_duplicates() in renderer.py as
> > well.
> > > 
> > > Should we then maybe adopt some canonical capitalization scheme (for
example
> > > Like-This) and then convert them here to follow it? Otherwise we will have
> to
> > > remember this in any code that works with metadata keys.
> > 
> > Initially, I had a similar though, but on the other hand I'd prefer to not
> > unnecessarily have the output diverge from the input where (easily)
possible,
> > and it seems the only place where we specifically check for a particular
> > metadata key is where we strip checksums, besides we'd have to account for
> > case-insensitivity as well when comparing metadata for the diff.
> 
> The output that doesn't diverge from the input unnecessarily seems like a
useful
> feature. In principle, if we normalize the headers, eventually all "our"
filter
> lists will have normalized headers and they won't diverge anyway, but this
might
> not be the case just yet. Anyway, for the time being it's not a big deal and
> it's unlikely to result in problems in the future so I don't feel too strongly
> about this.

Acknowledged.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser....
abp/filters/parser.py:292: elif content.startswith('[') and
content.endswith(']'):
On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> On 2018/08/21 19:42:45, Sebastian Noack wrote:
> > Somewhat unrelated of these changes, but this is inaccurate too. Only the
> first
> > line (if encapsulated in brackets) is to be recognized as header. Otherwise
it
> > would be considered a filter.
> 
> Makes sense. I created a separate ticket for it:
> https://issues.adblockplus.org/ticket/6877

Acknowledged.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:186: filterlist, metadata, keys = (set() for i in
range(3))
On 2018/08/21 14:59:59, Vasily Kuznetsov wrote:
> I'm not sure if maybe a slightly more readable version of this line would be:
> 
>     filterlist, metadata, keys = set(), set(), set()
> 
> or even:
> 
>     filterlist = set()
>     metadata = set()
>     keys = set()
> 
> I don't feel too strongly about it so feel free to keep your original version
or
> choose one of the proposals above as you like.

Done.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:187: for line in parse_filterlist(list_in):
On 2018/08/24 13:03:19, Vasily Kuznetsov wrote:
> On 2018/08/22 14:31:06, Sebastian Noack wrote:
> > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote:
> > > On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > > > When parsing generated filter lists in order to generate diffs, we might
> > want
> > > to
> > > > make the parser ignore includes/instructions.
> > > 
> > > Theoretically there should be no includes/instructions in the final lists
> (and
> > > we're not planning to produce diffs of fragments of filter lists), so this
> is
> > > not really necessary. And even if there were includes, the code below
> ignores
> > > them, which achieves the same outcome as the parsing ignoring them, so it
> > seems
> > > that things are ok as is.
> > 
> > Granted, this is quite an edge case. But in theory if we'll find  "%include"
> in
> > the input the diff is generated from, then it seems somewhat more accurate
to
> > just treat it as a filter (like Adblock Plus would) rather than stripping
this
> > line (or perhaps a warning would be in order). But I don't feel too strong.
> 
> To do this we would need to start parsing fragments of filter lists
differently
> from rendered filter lists. It does make sense but it also makes things more
> complicated and the cases where it's relevant seem to be pretty unrealistic
and
> edge-casey.
> 
> I lean towards keeping the code unified and parsing "%include"s as
instructions
> everywhere until we have evidence that there are cases where it would be
useful
> to make the distinction.

Yeah, I was working under the assumption that the full lists wouldn't have
%includes in them, and `%include` appears to be the only instruction the parser
knows. Are there any other instructions that might appear, that we might need to
handle? If not, I would agree with Vasily, that it's not necessary to handle
this case.

https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere...
abp/filters/renderer.py:230: yield '- {}'.format(item)
On 2018/08/24 13:03:19, Vasily Kuznetsov wrote:
> On 2018/08/22 16:03:29, Sebastian Noack wrote:
> > On 2018/08/21 19:42:46, Sebastian Noack wrote:
> > > In the specification, we demand the client to process removed filters
first.
> > But
> > > given this implementation, it seems simpler if we'd rather make the diff
> list
> > > the removed filters first, so that the client can process the diff in one
> > pass.
> > 
> > After discussing this on IRC, Vasily agrees. I updated the spec accordingly.
> 
> 👍

Done.

https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.py
File tests/test_differ.py (right):

https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.p...
tests/test_differ.py:61: test
On 2018/08/21 19:42:46, Sebastian Noack wrote:
> Perhaps add a leading whitespace here, so that we also test that filter texts
> are stripped before being compared.

Done.

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py#newcode127 abp/filters/renderer.py:127: seen = {'Checksum'} Since we parse metadata with arbitrary ...

Aug. 28, 2018, 7:52 p.m. (2018-08-28 19:52:19 UTC) #14

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
abp/filters/renderer.py:127: seen = {'Checksum'}
Since we parse metadata with arbitrary keys now, we have to account for keys
with inconsistent capitalization, here:

  seen = {'checksum'}
  for i, line in enumerate(lines):
      if line.type == 'metadata':
          key = line.key.lower()
          if key not in seen:
              seen.add(key)
              yield line

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
abp/filters/renderer.py:188: if line.type == 'metadata' and 'Checksum' not in
line.to_string():
I think it has been agreed on that this isn't necessary, as any diff-aware
client ignores checksums anyway (besides most filter list if not all wouldn't
have a checksum anymore anyway).

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
abp/filters/renderer.py:190: keys.add(line.key)
This code has to be changed as well, to handle case-insensitive keys:

  def _split_list_for_diff(list_in):
      metadata = {}
      for line in parse_filterlist(list_in):
          if line.type == 'metadata':
              metadata[line.key.lower()] = line
      ...

  def render_diff(base, latest):
      ...
      for key, line in latest_md.items():
          other = base_md.get(key)
          if not other or other.value != line.value:
              yield line.to_string()
      for key in set(base_md) - set(latest_md):
          yield '! {}:'.format(base_md[key].key)
      ...

https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.py
File tests/test_parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.p...
tests/test_parser.py:158: line = parse_line('! WrongHeader; something')
To me it seems, this test specifically existed to make sure headers are checked
against the whitelist (which you removed). The modified version of this test,
seems kinda useless to me.

rhowell

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py#newcode127 abp/filters/renderer.py:127: seen = {'Checksum'} On 2018/08/28 19:52:18, Sebastian Noack wrote: ...

Aug. 29, 2018, 9:43 p.m. (2018-08-29 21:43:35 UTC) #15

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
abp/filters/renderer.py:127: seen = {'Checksum'}
On 2018/08/28 19:52:18, Sebastian Noack wrote:
> Since we parse metadata with arbitrary keys now, we have to account for keys
> with inconsistent capitalization, here:
> 
>   seen = {'checksum'}
>   for i, line in enumerate(lines):
>       if line.type == 'metadata':
>           key = line.key.lower()
>           if key not in seen:
>               seen.add(key)
>               yield line

Done.

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
abp/filters/renderer.py:188: if line.type == 'metadata' and 'Checksum' not in
line.to_string():
On 2018/08/28 19:52:18, Sebastian Noack wrote:
> I think it has been agreed on that this isn't necessary, as any diff-aware
> client ignores checksums anyway (besides most filter list if not all wouldn't
> have a checksum anymore anyway).

Done.

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere...
abp/filters/renderer.py:190: keys.add(line.key)
On 2018/08/28 19:52:18, Sebastian Noack wrote:
> This code has to be changed as well, to handle case-insensitive keys:
> 
>   def _split_list_for_diff(list_in):
>       metadata = {}
>       for line in parse_filterlist(list_in):
>           if line.type == 'metadata':
>               metadata[line.key.lower()] = line
>       ...
> 
>   def render_diff(base, latest):
>       ...
>       for key, line in latest_md.items():
>           other = base_md.get(key)
>           if not other or other.value != line.value:
>               yield line.to_string()
>       for key in set(base_md) - set(latest_md):
>           yield '! {}:'.format(base_md[key].key)
>       ...

Any reason to use a dict instead of a set? I guess one advantage would be: if a
metadata item is removed in the new list, we would convert it to lower-case and
give it a blank value. If that's a concern (converting it to lower case), then I
can use dicts instead of sets, as you suggested.

https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.py
File tests/test_parser.py (right):

https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.p...
tests/test_parser.py:158: line = parse_line('! WrongHeader; something')
On 2018/08/28 19:52:18, Sebastian Noack wrote:
> To me it seems, this test specifically existed to make sure headers are
checked
> against the whitelist (which you removed). The modified version of this test,
> seems kinda useless to me.

Done.

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py#newcode190 abp/filters/renderer.py:190: keys.add(line.key) On 2018/08/29 21:43:34, rhowell wrote: > On 2018/08/28 ...

Aug. 29, 2018, 9:56 p.m. (2018-08-29 21:56:37 UTC) #16

rhowell

https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/renderer.py#newcode190 abp/filters/renderer.py:190: keys.add(line.key) On 2018/08/29 21:56:37, Sebastian Noack wrote: > On ...

Aug. 29, 2018, 10:52 p.m. (2018-08-29 22:52:24 UTC) #17

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py#newcode21 tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt Perhaps turn one of the keys ...

Aug. 29, 2018, 10:58 p.m. (2018-08-29 22:58:10 UTC) #18

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py#newcode21 tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt On 2018/08/29 22:58:10, Sebastian Noack wrote: ...

Aug. 29, 2018, 11:03 p.m. (2018-08-29 23:03:24 UTC) #19

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/renderer.py#newcode215 abp/filters/renderer.py:215: base_md, base_rules = _split_list_for_diff(base) Nit: Is it necessary to ...

Aug. 29, 2018, 11:46 p.m. (2018-08-29 23:46:26 UTC) #20

rhowell

https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/renderer.py#newcode215 abp/filters/renderer.py:215: base_md, base_rules = _split_list_for_diff(base) On 2018/08/29 23:46:26, Sebastian Noack ...

Aug. 30, 2018, 4:23 p.m. (2018-08-30 16:23:31 UTC) #21

Sebastian Noack

https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py#newcode21 tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt On 2018/08/29 23:03:24, Sebastian Noack wrote: ...

Aug. 30, 2018, 4:38 p.m. (2018-08-30 16:38:33 UTC) #22

rhowell

https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py#newcode21 tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt On 2018/08/30 16:38:33, Sebastian Noack wrote: ...

Aug. 30, 2018, 5:38 p.m. (2018-08-30 17:38:20 UTC) #23

Vasily Kuznetsov

LGTM again! https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/renderer.py File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/renderer.py#newcode187 abp/filters/renderer.py:187: """Split a filter list into metadata, keys, ...

Aug. 31, 2018, 12:11 p.m. (2018-08-31 12:11:03 UTC) #25

rhowell

Aug. 31, 2018, 9:21 p.m. (2018-08-31 21:21:03 UTC) #26

https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/rendere...
File abp/filters/renderer.py (right):

https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/rendere...
abp/filters/renderer.py:187: """Split a filter list into metadata, keys, and
rules."""
On 2018/08/31 12:11:02, Vasily Kuznetsov wrote:
> Nit: strictly speaking this function now returns metadata and rules (the keys
> are just part of metadata). The docstring could be updated.

Ah, good point! I'll update that.

Issue 29845767: Issue 6685 - Offer incremental filter list downloads (Closed)

Description

Patch Set 1 #

Patch Set 2 : Use sets instead of ndiff, account for removal of special comments #

Patch Set 3 : Use namedtuple filter list objects instead of strings #

Patch Set 4 : Use iterables instead of str, stop repeating code #

Patch Set 5 : Remove metadata_keys, yield deletions first #

Patch Set 6 : Address comments on PS5 #

Patch Set 7 : Store metadata in a dict #

Patch Set 8 : Address comments on PS7 #

Patch Set 9 : Address comments on PS8 #

Messages