|
|
Created:
Aug. 3, 2018, 1:58 a.m. by rhowell Modified:
Sept. 25, 2018, 1:04 p.m. Base URL:
https://hg.adblockplus.org/python-abp/ Visibility:
Public. |
DescriptionIssue 6685 - Offer incremental filter list downloads
Patch Set 1 #Patch Set 2 : Use sets instead of ndiff, account for removal of special comments #Patch Set 3 : Use namedtuple filter list objects instead of strings #
Total comments: 13
Patch Set 4 : Use iterables instead of str, stop repeating code #
Total comments: 25
Patch Set 5 : Remove metadata_keys, yield deletions first #
Total comments: 10
Patch Set 6 : Address comments on PS5 #Patch Set 7 : Store metadata in a dict #
Total comments: 8
Patch Set 8 : Address comments on PS7 #Patch Set 9 : Address comments on PS8 #
Total comments: 2
MessagesTotal messages: 26
On 2018/08/03 01:58:27, rhowell wrote: Hi Vasily! Wanted to check in and get some feedback so far. I'll work on writing the script to call these functions next. I'm thinking it'll go something like: Make the latest-diff (blank) For each file in /diffs/: If it's still within the time window: generate the diff publish the diff else: delete the file (will the server automatically serve a 404 then?)
On 2018/08/03 02:03:27, rhowell wrote: > On 2018/08/03 01:58:27, rhowell wrote: > > Hi Vasily! > Wanted to check in and get some feedback so far. I'll work on writing the script > to call these functions next. I'm thinking it'll go something like: > > Make the latest-diff (blank) > For each file in /diffs/: > If it's still within the time window: > generate the diff > publish the diff > else: > delete the file (will the server automatically serve a 404 then?) Hi Rosie! The logic that you've outlined looks right. However, I think it makes more sense to split it into a script that generates the diff between two filter lists (this will go in python-abp) and some additional logic that determines which diffs are published and where they go (this should probably be part of the infrastructure). Let's start with the diff generation first (the script could be called something like "fldiff" to keep with the existing naming scheme). As for the script, you're using difflib, which would make sense for normal diffs, but this approach might produce extra lines in the diff for the filters that were rearranged or moved. Since we don't care about the order of the filters, we could just do the following: 1. For each filter list take its metadata comments and filters and put them into two sets. Discard everything else. 2. Put into the output only the metadata comments that are in the new filter list but not in the old (treat metadata comments with same name but different values as different ones) -- just use set subtraction for this. 3. Take the sets of filters, subtract new from old and put those into the output with a "-". Subtract old from new and put those into the output with a "+". You can use existing functionality of python-abp to read and write filter lists and you will just need to add a different kind of header, that's used for diffs. Does this make sense? Feel free to also ask me on IRC for speed. Cheers, Vasily
It wasn't possible to use the namedtuple filter list objects in sets, because they contain nested dicts that aren't hashable, so I'm storing strings into the sets instead. I also removed the checksum logic. Is this more what you were thinking? Unless I'm missing something, I don't see much opportunity for using the other parsing and rendering functions.
Hi Rosie! Sorry for the late reply -- my vacation has started just when you submitted the patch. This looks pretty good. I have a few comments here and there but it's mostly some suggestions about structuring the code. Regarding hashing of line objects, it's a bummer that there are dicts inside of them. I was thinking that perhaps we could replace those dicts with persistent dicts that would be hashable, however this would present us with another problem: the added and removed lines would still need new data types to represent them and it doesn't seem like there's much use to all of this stuff, at least not at this point. I'm thus leaning towards keeping your simple string-based logic and string-based API -- we can always change it later if we decide that we do need the added-filter and removed-filter objects. TLDR: don't worry about this paragraph :D Cheers, Vasily https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:29: __all__ = ['IncludeError', 'MissingHeader', 'render_filterlist'] render_diff() should probably become a public export here. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:204: def render_diff(base, latest): I think it makes more sense to make `base`, `latest` and the return value iterables of strings. This is what parse_filterlist() expects and this is what you get when you open a file. This is also the API that other functions dealing with filterlists take and return. It would also reduce the amount of code here eliminating the calls to splitlines() and some others. It would also be good to document the API of this function in the same way as the other functions are documented. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): Note: If this lands after the checksum patch, the 'Checksum' exception should be removed. If this lands first, it should be removed by that patch. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:219: for line in parse_filterlist(base.splitlines()): It would be good to move this repeated piece of code into a small internal function. This will also make render_diff() more compact and easy to follow. Maybe something like: base_fl, base_md, base_keys = _split_list_for_diff(base) https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:230: diff.append('{}\n'.format(item)) If we return an iterable of strings from this function we can just yield in this and the following loops.
Hey Vasily! Thanks for the feedback; the structure seems a lot cleaner like this. Just had a quick questions about the checksum thing. Otherwise, let me know if you see any other issues? Thanks! :) Rosie https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:29: __all__ = ['IncludeError', 'MissingHeader', 'render_filterlist'] On 2018/08/17 10:30:54, Vasily Kuznetsov wrote: > render_diff() should probably become a public export here. Done. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:204: def render_diff(base, latest): On 2018/08/17 10:30:54, Vasily Kuznetsov wrote: > I think it makes more sense to make `base`, `latest` and the return value > iterables of strings. This is what parse_filterlist() expects and this is what > you get when you open a file. This is also the API that other functions dealing > with filterlists take and return. It would also reduce the amount of code here > eliminating the calls to splitlines() and some others. > > It would also be good to document the API of this function in the same way as > the other functions are documented. Done. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): On 2018/08/17 10:30:53, Vasily Kuznetsov wrote: > Note: If this lands after the checksum patch, the 'Checksum' exception should be > removed. If this lands first, it should be removed by that patch. Are we guaranteed that no filter lists will be encountered that have a checksum? This line and line 220 make sure that, if we ever see a checksum in either `base` or `latest`, it will not make it into the diff. But, if we will never see a checksum, then this check is unnecessary. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:219: for line in parse_filterlist(base.splitlines()): On 2018/08/17 10:30:53, Vasily Kuznetsov wrote: > It would be good to move this repeated piece of code into a small internal > function. This will also make render_diff() more compact and easy to follow. > Maybe something like: > > base_fl, base_md, base_keys = _split_list_for_diff(base) Done. https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:230: diff.append('{}\n'.format(item)) On 2018/08/17 10:30:53, Vasily Kuznetsov wrote: > If we return an iterable of strings from this function we can just yield in this > and the following loops. Done.
LGTM https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): On 2018/08/20 18:21:27, rhowell wrote: > On 2018/08/17 10:30:53, Vasily Kuznetsov wrote: > > Note: If this lands after the checksum patch, the 'Checksum' exception should > be > > removed. If this lands first, it should be removed by that patch. > > Are we guaranteed that no filter lists will be encountered that have a checksum? > This line and line 220 make sure that, if we ever see a checksum in either > `base` or `latest`, it will not make it into the diff. But, if we will never see > a checksum, then this check is unnecessary. Hm. No, for now we're not guaranteed that there's no Checksum, so you're right, we should keep this. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:186: filterlist, metadata, keys = (set() for i in range(3)) I'm not sure if maybe a slightly more readable version of this line would be: filterlist, metadata, keys = set(), set(), set() or even: filterlist = set() metadata = set() keys = set() I don't feel too strongly about it so feel free to keep your original version or choose one of the proposals above as you like.
https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): On 2018/08/21 14:59:59, Vasily Kuznetsov wrote: > On 2018/08/20 18:21:27, rhowell wrote: > > On 2018/08/17 10:30:53, Vasily Kuznetsov wrote: > > > Note: If this lands after the checksum patch, the 'Checksum' exception > should > > be > > > removed. If this lands first, it should be removed by that patch. > > > > Are we guaranteed that no filter lists will be encountered that have a > checksum? > > This line and line 220 make sure that, if we ever see a checksum in either > > `base` or `latest`, it will not make it into the diff. But, if we will never > see > > a checksum, then this check is unnecessary. > > Hm. No, for now we're not guaranteed that there's no Checksum, so you're right, > we should keep this. Isn't python-abp stripping checksums when generating the filter lists now, and therefore isn't it in fact guaranteed that filter lists delivered by our servers at least won't include any checksum? Even if not, any reason we'd have to strip them here? Clients that support diffs, will ignore checksums anyway. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'} I would prefer if python-abp would be agnostic of the supported metadata keys. Otherwise, we have to keep this set in sync with what is implemented in Adblock Plus, which seems unpractical. However, note that the regexp above would have to be changed like following so that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)' https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS: Note that metadata keys are case-insensitive. If we go with my above suggestion, this won't matter here anymore. But once non-capitalized keys are recognized this needs to be accounted for in _remove_duplicates() in renderer.py as well. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:292: elif content.startswith('[') and content.endswith(']'): Somewhat unrelated of these changes, but this is inaccurate too. Only the first line (if encapsulated in brackets) is to be recognized as header. Otherwise it would be considered a filter. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:187: for line in parse_filterlist(list_in): When parsing generated filter lists in order to generate diffs, we might want to make the parser ignore includes/instructions. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:230: yield '- {}'.format(item) In the specification, we demand the client to process removed filters first. But given this implementation, it seems simpler if we'd rather make the diff list the removed filters first, so that the client can process the diff in one pass. https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.p... tests/test_differ.py:61: test Perhaps add a leading whitespace here, so that we also test that filter texts are stripped before being compared.
Thanks for the comments, Sebastian! Here are a few comments on the comments :) https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'} On 2018/08/21 19:42:45, Sebastian Noack wrote: > I would prefer if python-abp would be agnostic of the supported metadata keys. > Otherwise, we have to keep this set in sync with what is implemented in Adblock > Plus, which seems unpractical. > > However, note that the regexp above would have to be changed like following so > that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)' Makes sense and I support it. I actually thought it was already implemented this way before I started reviewing this change :) BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we can still have dashes in metadata field names. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS: On 2018/08/21 19:42:46, Sebastian Noack wrote: > Note that metadata keys are case-insensitive. If we go with my above suggestion, > this won't matter here anymore. But once non-capitalized keys are recognized > this needs to be accounted for in _remove_duplicates() in renderer.py as well. Should we then maybe adopt some canonical capitalization scheme (for example Like-This) and then convert them here to follow it? Otherwise we will have to remember this in any code that works with metadata keys. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:292: elif content.startswith('[') and content.endswith(']'): On 2018/08/21 19:42:45, Sebastian Noack wrote: > Somewhat unrelated of these changes, but this is inaccurate too. Only the first > line (if encapsulated in brackets) is to be recognized as header. Otherwise it > would be considered a filter. Makes sense. I created a separate ticket for it: https://issues.adblockplus.org/ticket/6877 https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:187: for line in parse_filterlist(list_in): On 2018/08/21 19:42:46, Sebastian Noack wrote: > When parsing generated filter lists in order to generate diffs, we might want to > make the parser ignore includes/instructions. Theoretically there should be no includes/instructions in the final lists (and we're not planning to produce diffs of fragments of filter lists), so this is not really necessary. And even if there were includes, the code below ignores them, which achieves the same outcome as the parsing ignoring them, so it seems that things are ok as is.
https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'} On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > On 2018/08/21 19:42:45, Sebastian Noack wrote: > > I would prefer if python-abp would be agnostic of the supported metadata keys. > > Otherwise, we have to keep this set in sync with what is implemented in > Adblock > > Plus, which seems unpractical. > > > > However, note that the regexp above would have to be changed like following so > > that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)' > > Makes sense and I support it. I actually thought it was already implemented this > way before I started reviewing this change :) > > BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we > can still have dashes in metadata field names. You are right, well spotted. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS: On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > Note that metadata keys are case-insensitive. If we go with my above > suggestion, > > this won't matter here anymore. But once non-capitalized keys are recognized > > this needs to be accounted for in _remove_duplicates() in renderer.py as well. > > Should we then maybe adopt some canonical capitalization scheme (for example > Like-This) and then convert them here to follow it? Otherwise we will have to > remember this in any code that works with metadata keys. Initially, I had a similar though, but on the other hand I'd prefer to not unnecessarily have the output diverge from the input where (easily) possible, and it seems the only place where we specifically check for a particular metadata key is where we strip checksums, besides we'd have to account for case-insensitivity as well when comparing metadata for the diff. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:187: for line in parse_filterlist(list_in): On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > When parsing generated filter lists in order to generate diffs, we might want > to > > make the parser ignore includes/instructions. > > Theoretically there should be no includes/instructions in the final lists (and > we're not planning to produce diffs of fragments of filter lists), so this is > not really necessary. And even if there were includes, the code below ignores > them, which achieves the same outcome as the parsing ignoring them, so it seems > that things are ok as is. Granted, this is quite an edge case. But in theory if we'll find "%include" in the input the diff is generated from, then it seems somewhat more accurate to just treat it as a filter (like Adblock Plus would) rather than stripping this line (or perhaps a warning would be in order). But I don't feel too strong.
https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29851647/abp/filters/rendere... abp/filters/renderer.py:212: if line.type == 'metadata' and 'Checksum' not in line.to_string(): On 2018/08/21 19:42:45, Sebastian Noack wrote: > On 2018/08/21 14:59:59, Vasily Kuznetsov wrote: > > On 2018/08/20 18:21:27, rhowell wrote: > > > On 2018/08/17 10:30:53, Vasily Kuznetsov wrote: > > > > Note: If this lands after the checksum patch, the 'Checksum' exception > > should > > > be > > > > removed. If this lands first, it should be removed by that patch. > > > > > > Are we guaranteed that no filter lists will be encountered that have a > > checksum? > > > This line and line 220 make sure that, if we ever see a checksum in either > > > `base` or `latest`, it will not make it into the diff. But, if we will never > > see > > > a checksum, then this check is unnecessary. > > > > Hm. No, for now we're not guaranteed that there's no Checksum, so you're > right, > > we should keep this. > > Isn't python-abp stripping checksums when generating the filter lists now, and > therefore isn't it in fact guaranteed that filter lists delivered by our servers > at least won't include any checksum? Even if not, any reason we'd have to strip > them here? Clients that support diffs, will ignore checksums anyway. Moving the discussion from IRC over here: 16:34:47 <•vasily> snoack: the second one I agree with you. For the first one, at the moment, I think we have other code producing filter lists in production, so they do have checksums. But we might convert all of this to python-abp (perhaps together with deployment of the incremental updates) so you would also be right then... however, we haven't yet done it. 16:36:03 <•vasily> snoack: in general seems like the checksums will be ignored by diff-aware clients, so yeah, it can be removed. 17:24:01 <snoack> vasily: How about ignoring checksum metadata in the parser (rather than when rendering the full list and/or diff)? That way checksums would be stripped no matter what, without any additional complexity. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:230: yield '- {}'.format(item) On 2018/08/21 19:42:46, Sebastian Noack wrote: > In the specification, we demand the client to process removed filters first. But > given this implementation, it seems simpler if we'd rather make the diff list > the removed filters first, so that the client can process the diff in one pass. After discussing this on IRC, Vasily agrees. I updated the spec accordingly.
https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS: On 2018/08/22 14:31:06, Sebastian Noack wrote: > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > > Note that metadata keys are case-insensitive. If we go with my above > > suggestion, > > > this won't matter here anymore. But once non-capitalized keys are recognized > > > this needs to be accounted for in _remove_duplicates() in renderer.py as > well. > > > > Should we then maybe adopt some canonical capitalization scheme (for example > > Like-This) and then convert them here to follow it? Otherwise we will have to > > remember this in any code that works with metadata keys. > > Initially, I had a similar though, but on the other hand I'd prefer to not > unnecessarily have the output diverge from the input where (easily) possible, > and it seems the only place where we specifically check for a particular > metadata key is where we strip checksums, besides we'd have to account for > case-insensitivity as well when comparing metadata for the diff. The output that doesn't diverge from the input unnecessarily seems like a useful feature. In principle, if we normalize the headers, eventually all "our" filter lists will have normalized headers and they won't diverge anyway, but this might not be the case just yet. Anyway, for the time being it's not a big deal and it's unlikely to result in problems in the future so I don't feel too strongly about this. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:187: for line in parse_filterlist(list_in): On 2018/08/22 14:31:06, Sebastian Noack wrote: > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > > When parsing generated filter lists in order to generate diffs, we might > want > > to > > > make the parser ignore includes/instructions. > > > > Theoretically there should be no includes/instructions in the final lists (and > > we're not planning to produce diffs of fragments of filter lists), so this is > > not really necessary. And even if there were includes, the code below ignores > > them, which achieves the same outcome as the parsing ignoring them, so it > seems > > that things are ok as is. > > Granted, this is quite an edge case. But in theory if we'll find "%include" in > the input the diff is generated from, then it seems somewhat more accurate to > just treat it as a filter (like Adblock Plus would) rather than stripping this > line (or perhaps a warning would be in order). But I don't feel too strong. To do this we would need to start parsing fragments of filter lists differently from rendered filter lists. It does make sense but it also makes things more complicated and the cases where it's relevant seem to be pretty unrealistic and edge-casey. I lean towards keeping the code unified and parsing "%include"s as instructions everywhere until we have evidence that there are cases where it would be useful to make the distinction. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:230: yield '- {}'.format(item) On 2018/08/22 16:03:29, Sebastian Noack wrote: > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > In the specification, we demand the client to process removed filters first. > But > > given this implementation, it seems simpler if we'd rather make the diff list > > the removed filters first, so that the client can process the diff in one > pass. > > After discussing this on IRC, Vasily agrees. I updated the spec accordingly. 👍
Thanks for the input! I made the suggested changes. Let me know if you see any other issues? https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.py File abp/filters/parser.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:145: 'Version', 'Diff-URL', 'Diff-Expires'} On 2018/08/22 14:31:06, Sebastian Noack wrote: > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > > On 2018/08/21 19:42:45, Sebastian Noack wrote: > > > I would prefer if python-abp would be agnostic of the supported metadata > keys. > > > Otherwise, we have to keep this set in sync with what is implemented in > > Adblock > > > Plus, which seems unpractical. > > > > > > However, note that the regexp above would have to be changed like following > so > > > that it won't match lines with arbitrary URLs: r'!\s*(\w+)\s*:(?!//)\s*(.*)' > > > > Makes sense and I support it. I actually thought it was already implemented > this > > way before I started reviewing this change :) > > > > BTW, the regexp should probably be r'!\s*([\w-]+)\s*:(?!//)\s*(.*)' so that we > > can still have dashes in metadata field names. > > You are right, well spotted. It appears this was done to prevent mistaking a comment with a colon for metadata, like in this test: def test_parse_nonmeta(): line = parse_line('! WrongHeader: something') assert line.type == 'comment' It probably doesn't matter too much, since the parser wouldn't be able to parse invalid metadata anyway. I'll update the test so it doesn't use a colon. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:156: if match and match.group(1) in METADATA_KEYS: On 2018/08/24 13:03:19, Vasily Kuznetsov wrote: > On 2018/08/22 14:31:06, Sebastian Noack wrote: > > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > > > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > > > Note that metadata keys are case-insensitive. If we go with my above > > > suggestion, > > > > this won't matter here anymore. But once non-capitalized keys are > recognized > > > > this needs to be accounted for in _remove_duplicates() in renderer.py as > > well. > > > > > > Should we then maybe adopt some canonical capitalization scheme (for example > > > Like-This) and then convert them here to follow it? Otherwise we will have > to > > > remember this in any code that works with metadata keys. > > > > Initially, I had a similar though, but on the other hand I'd prefer to not > > unnecessarily have the output diverge from the input where (easily) possible, > > and it seems the only place where we specifically check for a particular > > metadata key is where we strip checksums, besides we'd have to account for > > case-insensitivity as well when comparing metadata for the diff. > > The output that doesn't diverge from the input unnecessarily seems like a useful > feature. In principle, if we normalize the headers, eventually all "our" filter > lists will have normalized headers and they won't diverge anyway, but this might > not be the case just yet. Anyway, for the time being it's not a big deal and > it's unlikely to result in problems in the future so I don't feel too strongly > about this. Acknowledged. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/parser.... abp/filters/parser.py:292: elif content.startswith('[') and content.endswith(']'): On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > On 2018/08/21 19:42:45, Sebastian Noack wrote: > > Somewhat unrelated of these changes, but this is inaccurate too. Only the > first > > line (if encapsulated in brackets) is to be recognized as header. Otherwise it > > would be considered a filter. > > Makes sense. I created a separate ticket for it: > https://issues.adblockplus.org/ticket/6877 Acknowledged. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:186: filterlist, metadata, keys = (set() for i in range(3)) On 2018/08/21 14:59:59, Vasily Kuznetsov wrote: > I'm not sure if maybe a slightly more readable version of this line would be: > > filterlist, metadata, keys = set(), set(), set() > > or even: > > filterlist = set() > metadata = set() > keys = set() > > I don't feel too strongly about it so feel free to keep your original version or > choose one of the proposals above as you like. Done. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:187: for line in parse_filterlist(list_in): On 2018/08/24 13:03:19, Vasily Kuznetsov wrote: > On 2018/08/22 14:31:06, Sebastian Noack wrote: > > On 2018/08/22 13:10:50, Vasily Kuznetsov wrote: > > > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > > > When parsing generated filter lists in order to generate diffs, we might > > want > > > to > > > > make the parser ignore includes/instructions. > > > > > > Theoretically there should be no includes/instructions in the final lists > (and > > > we're not planning to produce diffs of fragments of filter lists), so this > is > > > not really necessary. And even if there were includes, the code below > ignores > > > them, which achieves the same outcome as the parsing ignoring them, so it > > seems > > > that things are ok as is. > > > > Granted, this is quite an edge case. But in theory if we'll find "%include" > in > > the input the diff is generated from, then it seems somewhat more accurate to > > just treat it as a filter (like Adblock Plus would) rather than stripping this > > line (or perhaps a warning would be in order). But I don't feel too strong. > > To do this we would need to start parsing fragments of filter lists differently > from rendered filter lists. It does make sense but it also makes things more > complicated and the cases where it's relevant seem to be pretty unrealistic and > edge-casey. > > I lean towards keeping the code unified and parsing "%include"s as instructions > everywhere until we have evidence that there are cases where it would be useful > to make the distinction. Yeah, I was working under the assumption that the full lists wouldn't have %includes in them, and `%include` appears to be the only instruction the parser knows. Are there any other instructions that might appear, that we might need to handle? If not, I would agree with Vasily, that it's not necessary to handle this case. https://codereview.adblockplus.org/29845767/diff/29859561/abp/filters/rendere... abp/filters/renderer.py:230: yield '- {}'.format(item) On 2018/08/24 13:03:19, Vasily Kuznetsov wrote: > On 2018/08/22 16:03:29, Sebastian Noack wrote: > > On 2018/08/21 19:42:46, Sebastian Noack wrote: > > > In the specification, we demand the client to process removed filters first. > > But > > > given this implementation, it seems simpler if we'd rather make the diff > list > > > the removed filters first, so that the client can process the diff in one > > pass. > > > > After discussing this on IRC, Vasily agrees. I updated the spec accordingly. > > 👍 Done. https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29859561/tests/test_differ.p... tests/test_differ.py:61: test On 2018/08/21 19:42:46, Sebastian Noack wrote: > Perhaps add a leading whitespace here, so that we also test that filter texts > are stripped before being compared. Done.
https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:127: seen = {'Checksum'} Since we parse metadata with arbitrary keys now, we have to account for keys with inconsistent capitalization, here: seen = {'checksum'} for i, line in enumerate(lines): if line.type == 'metadata': key = line.key.lower() if key not in seen: seen.add(key) yield line https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:188: if line.type == 'metadata' and 'Checksum' not in line.to_string(): I think it has been agreed on that this isn't necessary, as any diff-aware client ignores checksums anyway (besides most filter list if not all wouldn't have a checksum anymore anyway). https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:190: keys.add(line.key) This code has to be changed as well, to handle case-insensitive keys: def _split_list_for_diff(list_in): metadata = {} for line in parse_filterlist(list_in): if line.type == 'metadata': metadata[line.key.lower()] = line ... def render_diff(base, latest): ... for key, line in latest_md.items(): other = base_md.get(key) if not other or other.value != line.value: yield line.to_string() for key in set(base_md) - set(latest_md): yield '! {}:'.format(base_md[key].key) ... https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.p... tests/test_parser.py:158: line = parse_line('! WrongHeader; something') To me it seems, this test specifically existed to make sure headers are checked against the whitelist (which you removed). The modified version of this test, seems kinda useless to me.
https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:127: seen = {'Checksum'} On 2018/08/28 19:52:18, Sebastian Noack wrote: > Since we parse metadata with arbitrary keys now, we have to account for keys > with inconsistent capitalization, here: > > seen = {'checksum'} > for i, line in enumerate(lines): > if line.type == 'metadata': > key = line.key.lower() > if key not in seen: > seen.add(key) > yield line Done. https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:188: if line.type == 'metadata' and 'Checksum' not in line.to_string(): On 2018/08/28 19:52:18, Sebastian Noack wrote: > I think it has been agreed on that this isn't necessary, as any diff-aware > client ignores checksums anyway (besides most filter list if not all wouldn't > have a checksum anymore anyway). Done. https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:190: keys.add(line.key) On 2018/08/28 19:52:18, Sebastian Noack wrote: > This code has to be changed as well, to handle case-insensitive keys: > > def _split_list_for_diff(list_in): > metadata = {} > for line in parse_filterlist(list_in): > if line.type == 'metadata': > metadata[line.key.lower()] = line > ... > > def render_diff(base, latest): > ... > for key, line in latest_md.items(): > other = base_md.get(key) > if not other or other.value != line.value: > yield line.to_string() > for key in set(base_md) - set(latest_md): > yield '! {}:'.format(base_md[key].key) > ... Any reason to use a dict instead of a set? I guess one advantage would be: if a metadata item is removed in the new list, we would convert it to lower-case and give it a blank value. If that's a concern (converting it to lower case), then I can use dicts instead of sets, as you suggested. https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.py File tests/test_parser.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/tests/test_parser.p... tests/test_parser.py:158: line = parse_line('! WrongHeader; something') On 2018/08/28 19:52:18, Sebastian Noack wrote: > To me it seems, this test specifically existed to make sure headers are checked > against the whitelist (which you removed). The modified version of this test, > seems kinda useless to me. Done.
https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:190: keys.add(line.key) On 2018/08/29 21:43:34, rhowell wrote: > On 2018/08/28 19:52:18, Sebastian Noack wrote: > > This code has to be changed as well, to handle case-insensitive keys: > > > > def _split_list_for_diff(list_in): > > metadata = {} > > for line in parse_filterlist(list_in): > > if line.type == 'metadata': > > metadata[line.key.lower()] = line > > ... > > > > def render_diff(base, latest): > > ... > > for key, line in latest_md.items(): > > other = base_md.get(key) > > if not other or other.value != line.value: > > yield line.to_string() > > for key in set(base_md) - set(latest_md): > > yield '! {}:'.format(base_md[key].key) > > ... > > Any reason to use a dict instead of a set? I guess one advantage would be: if a > metadata item is removed in the new list, we would convert it to lower-case and > give it a blank value. If that's a concern (converting it to lower case), then I > can use dicts instead of sets, as you suggested. The algorithm I suggest here, is the (probably) simplest way to treat keys case-insensitive while preserving the original capitalization. With a set alone you cannot associate the normalized keys with the original.
https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29866656/abp/filters/rendere... abp/filters/renderer.py:190: keys.add(line.key) On 2018/08/29 21:56:37, Sebastian Noack wrote: > On 2018/08/29 21:43:34, rhowell wrote: > > On 2018/08/28 19:52:18, Sebastian Noack wrote: > > > This code has to be changed as well, to handle case-insensitive keys: > > > > > > def _split_list_for_diff(list_in): > > > metadata = {} > > > for line in parse_filterlist(list_in): > > > if line.type == 'metadata': > > > metadata[line.key.lower()] = line > > > ... > > > > > > def render_diff(base, latest): > > > ... > > > for key, line in latest_md.items(): > > > other = base_md.get(key) > > > if not other or other.value != line.value: > > > yield line.to_string() > > > for key in set(base_md) - set(latest_md): > > > yield '! {}:'.format(base_md[key].key) > > > ... > > > > Any reason to use a dict instead of a set? I guess one advantage would be: if > a > > metadata item is removed in the new list, we would convert it to lower-case > and > > give it a blank value. If that's a concern (converting it to lower case), then > I > > can use dicts instead of sets, as you suggested. > > The algorithm I suggest here, is the (probably) simplest way to treat keys > case-insensitive while preserving the original capitalization. With a set alone > you cannot associate the normalized keys with the original. Done.
https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.p... tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt Perhaps turn one of the keys here to lowercase in order to test that they are compared case-insensitively?
https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.p... tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt On 2018/08/29 22:58:10, Sebastian Noack wrote: > Perhaps turn one of the keys here to lowercase in order to test that they are > compared case-insensitively? Ideally, there should be one key with changed and one key with unchanged value using inconsistent spelling (capitalized vs all lowercase).
https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/rendere... abp/filters/renderer.py:215: base_md, base_rules = _split_list_for_diff(base) Nit: Is it necessary to abbreviate metadata to md in order to avoid wrapping? If so that is fine, but otherwise, I think it reads better spelled out. https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/rendere... abp/filters/renderer.py:216: add_fl = latest_rules - base_rules Nit: These expressions could be inlined below. This would also be consistent with the second loop iterating over removed metadata keys.
https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/rendere... abp/filters/renderer.py:215: base_md, base_rules = _split_list_for_diff(base) On 2018/08/29 23:46:26, Sebastian Noack wrote: > Nit: Is it necessary to abbreviate metadata to md in order to avoid wrapping? If > so that is fine, but otherwise, I think it reads better spelled out. Done. https://codereview.adblockplus.org/29845767/diff/29868603/abp/filters/rendere... abp/filters/renderer.py:216: add_fl = latest_rules - base_rules On 2018/08/29 23:46:25, Sebastian Noack wrote: > Nit: These expressions could be inlined below. This would also be consistent > with the second loop iterating over removed metadata keys. Done.
https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.p... tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt On 2018/08/29 23:03:24, Sebastian Noack wrote: > On 2018/08/29 22:58:10, Sebastian Noack wrote: > > Perhaps turn one of the keys here to lowercase in order to test that they are > > compared case-insensitively? > > Ideally, there should be one key with changed and one key with unchanged value > using inconsistent spelling (capitalized vs all lowercase). It seems you missed this comment.
https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.py File tests/test_differ.py (right): https://codereview.adblockplus.org/29845767/diff/29868603/tests/test_differ.p... tests/test_differ.py:21: ! Diff-URL: https://easylist-downloads.adblockplus.org/easylist/diffs/111.txt On 2018/08/30 16:38:33, Sebastian Noack wrote: > On 2018/08/29 23:03:24, Sebastian Noack wrote: > > On 2018/08/29 22:58:10, Sebastian Noack wrote: > > > Perhaps turn one of the keys here to lowercase in order to test that they > are > > > compared case-insensitively? > > > > Ideally, there should be one key with changed and one key with unchanged value > > using inconsistent spelling (capitalized vs all lowercase). > > It seems you missed this comment. Oops! Done.
LGTM
LGTM again! https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/rendere... abp/filters/renderer.py:187: """Split a filter list into metadata, keys, and rules.""" Nit: strictly speaking this function now returns metadata and rules (the keys are just part of metadata). The docstring could be updated.
https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/rendere... File abp/filters/renderer.py (right): https://codereview.adblockplus.org/29845767/diff/29869585/abp/filters/rendere... abp/filters/renderer.py:187: """Split a filter list into metadata, keys, and rules.""" On 2018/08/31 12:11:02, Vasily Kuznetsov wrote: > Nit: strictly speaking this function now returns metadata and rules (the keys > are just part of metadata). The docstring could be updated. Ah, good point! I'll update that. |