|
|
Created:
Oct. 30, 2015, 7:11 a.m. by Felix Dahlke Modified:
Nov. 20, 2015, 9:07 a.m. CC:
Wladimir Palant Visibility:
Public. |
DescriptionIssue 3168 - Add a script for generating content blocker lists
Patch Set 1 : #Patch Set 2 : Keep the lists in memory, don't print anything to stdout, and more #
Total comments: 8
Patch Set 3 : Remove _get_config_value, combine lists in _convert_filter_lists #
Total comments: 3
Patch Set 4 : Rebased, check retcode outside the finally block #
Total comments: 3
Patch Set 5 : Remove retcode variable, move return code check out of the with block #
MessagesTotal messages: 19
https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:1: #!/usr/bin/env python This script is almost the same as was reviewed in https://codereview.adblockplus.org/29328894/ - the only change is that it's now reading abp2blocklist_path etc. from sitescripts.ini. https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:84: "easylist_content_blocker_path") I've violated the 80 column rule a bit here. It pains me, but I didn't see a better way to wrap this to make it comply, and I think extracting the long strings into shorter variables would hurt readability. "A foolish consistency" and all that. Same below.
This seems like quite a hack, wouldn't we be better off with some small Bash script to automate this process instead? It seems to me that all we are doing here is downloading a couple of filter lists, concatenating them and piping them through the NodeJS script. I mean we might not even need a Bash script here, wouldn't this command do the same thing? `curl https://easylist-downloads.adblockplus.org/{easylist_noadult.txt,exceptionrul... | node abp2blocklist.js > easylist+exceptionrules.json` In the future if we wanted to add more of this helpful functionality to the NodeJS script we could, or alternatively we could have the whole thing rewritten in Python and under sitescripts. But for now I guess I just don't understand why we're doing it this way. https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:36: abp2blocklist_path]) So we're manually calling Mercurial here instead of using our dependency management system?! (What if the user uses Git for example?) https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:43: easylist_file = tempfile.NamedTemporaryFile(mode="wb+") Why not use "with" to avoid having to manually close easylist_response? easylist_file = tempfile.NamedTemporaryFile(mode="wb+") easylist_url = get_config().get("content_blocker_lists", "easylist_url") with urllib2.urlopen(easylist_url) as easylist_response: shutil.copyfileobj(easylist_response, easylist_file) https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:48: try: These two blocks of logic seem basically identical, perhaps you should avoid duplicating them? It seems like the only thing that changes is the URL? How about having the URL as a parameter to `_download_filter_lists()` and then just calling the function twice? (Of course the return value would be a single file instead, and the name of the function might want to be renamed.) https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:64: subprocess.check_call(["node", "abp2blocklist.js"], I don't really understand why we have a Python script to grab the filter lists and then call a NodeJS script from it to process them. Shouldn't we either add this helpful functionality to the NodeJS script, or rewrite the whole thing in Python as a part of sitescripts? https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:80: easylist_file, exceptionrules_file = _download_filter_lists() You should probably use with here so that you don't have to manually close the files with finally.
Replying to some comments already, I'll look into the other ones in a bit. On 2015/10/30 12:56:15, kzar wrote: > This seems like quite a hack, wouldn't we be better off with some small Bash > script to automate this process instead? It seems to me that all we are doing > here is downloading a couple of filter lists, concatenating them and piping them > through the NodeJS script. I mean we might not even need a Bash script here, > wouldn't this command do the same thing? `curl > https://easylist-downloads.adblockplus.org/%7Beasylist_noadult.txt,exceptionr... > | node abp2blocklist.js > easylist+exceptionrules.json` I had the exact same discussion with Sebastian, but not in the review apparently. The script has more logic than that one liner, but it would indeed be a few lines shorter as a bash script. However, we're trying hard to avoid Bash scripts even in such situations: They tend to get more logic over time and turn into a mess. Plus, we have very near term plans for extending what this script does: https://issues.adblockplus.org/ticket/3176 > In the future if we wanted to add more of this helpful functionality to the > NodeJS script we could, or alternatively we could have the whole thing rewritten > in Python and under sitescripts. But for now I guess I just don't understand why > we're doing it this way. I really don't think this logic belongs into abp2blocklist, that code is complicated enough and should IMHO focus on converting valid ABP filter lists to valid content blocker lists. Any additional logic we have around should rather go into sitescripts. In fact, this logic is pretty much temporary: We intend to generate the lists on the fly eventually. https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:36: abp2blocklist_path]) On 2015/10/30 12:56:14, kzar wrote: > So we're manually calling Mercurial here instead of using our dependency > management system?! (What if the user uses Git for example?) That's another discussion I already had with Sebastian, this time in the review: https://codereview.adblockplus.org/29328894/ Before we start running this script on an actual server, we would probably move to syncing the repository via Puppet. But for now it's a lot more convenient this way. https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:64: subprocess.check_call(["node", "abp2blocklist.js"], On 2015/10/30 12:56:14, kzar wrote: > I don't really understand why we have a Python script to grab the filter lists > and then call a NodeJS script from it to process them. Shouldn't we either add > this helpful functionality to the NodeJS script, or rewrite the whole thing in > Python as a part of sitescripts? See my non-inline reply on this.
On 2015/10/30 13:22:15, Felix Dahlke wrote: > I had the exact same discussion with Sebastian, but not in the review > apparently. The script has more logic than that one liner, but it would indeed > be a few lines shorter as a bash script. However, we're trying hard to avoid > Bash scripts even in such situations: They tend to get more logic over time and > turn into a mess. > > Plus, we have very near term plans for extending what this script does: > https://issues.adblockplus.org/ticket/3176 I remain unconvinced, does the Python script here do anything more than this 2/3 line Bash script? https://gist.github.com/kzar/58a54e3462180df9520d If you're against using a Bash script fair enough, call those same two commands from a Python script. We're calling a bunch of shell commands from this Python script anyway, I just don't understand what the current approach affords us. It seems so much more complicated. As for adding the metadata, surely it makes way more sense to do that in the ab2blocklist code? (We could even add a switch so that adding the metadata is optional in case someone else is using the script and doesn't want meta data to be added.)
On 2015/10/30 16:24:22, kzar wrote: > I remain unconvinced, does the Python script here do anything more than this 2/3 > line Bash script? https://gist.github.com/kzar/58a54e3462180df9520d > If you're against using a Bash script fair enough, call those same two commands > from a Python script. We're calling a bunch of shell commands from this Python > script anyway, I just don't understand what the current approach affords us. It > seems so much more complicated. Yes it does: The URLs and paths are configurable and it syncs the abp2blocklist repository. Like I said, a bash script would be a bit shorter, but not THAT much, at the end of the day. The next step is to run this on our servers, that's why it's in Sitescripts, and in Python. I don't like how complicated the file merging and abp2blocklist invocation is in Python either. I have the rough equivalent of your script here locally to generate the lists, but that's really not the point of this issue. This is the first step to move generation to our servers, and add more logic to it. > As for adding the metadata, surely it makes way more sense to do that in the > ab2blocklist code? (We could even add a switch so that adding the metadata is > optional in case someone else is using the script and doesn't want meta data to > be added.) No it doesn't, that's my point. abp2blocklist should focus on converting valid ABP filter lists to valid content blocker lists. That's where the real logic is, that's the code that will have to move to Core at some point, that purpose is complicated enough. Whatever we do on top of that to host the lists on our servers and transfer them to the app is a separate issue specific to how our app/infrastructure works. And as I said, a temporary one: The lists will be converted on the fly at some point.
I just read through the previous code review that you mentioned Felix in IRC. Although I don't really agree, I see that Sebastian also thinks this belongs in Sitescripts, so let's just carry on and do it that way. Want to address my other feedback for the code?
I know it's a little late, and I don't insist of having that one addressed, but it seems we could eliminate some complexity by using in-memory strings rather than temp files. Sure, this will increase the memory footprint of this script. But that should be fine. After all the browser extension has the filter lists in memory as well, and memory consumption is certainly more an issue there than for a script running on our servers. Regarding where this code belongs, I don't have a strong opinion. I'm fine with the current implementation. But if somebody shows me a simpler implementation inside abp2blocklist or whatever, I might vote for that. But no, a bash script that does the very same thing, isn't too trivial either, plus it's not portable. And anyway this is only a mid-term solution anyway. In the long term we want to have the the conversion performed on the client side.
New patch set is up. Apart from the things mentioned below, I made two more changes: - Use tuples instead of lists where possible (remembering receiving that advice from Sebastian a while ago) - Don't print anything to stdout - now that it's in sitescripts, we should stick to "no news is good news". On 2015/10/30 12:56:15, kzar wrote: > https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... > sitescripts/content_blocker_lists/bin/generate_lists.py:43: easylist_file = > tempfile.NamedTemporaryFile(mode="wb+") > Why not use "with" to avoid having to manually close easylist_response? > > easylist_file = tempfile.NamedTemporaryFile(mode="wb+") > easylist_url = get_config().get("content_blocker_lists", "easylist_url") > with urllib2.urlopen(easylist_url) as easylist_response: > shutil.copyfileobj(easylist_response, easylist_file) > The return value of urllib2.urlopen is a file-like object, but no context manager, unfortunately. > https://codereview.adblockplus.org/29329537/diff/29329538/sitescripts/content... > sitescripts/content_blocker_lists/bin/generate_lists.py:48: try: > These two blocks of logic seem basically identical, perhaps you should avoid > duplicating them? It seems like the only thing that changes is the URL? > > How about having the URL as a parameter to `_download_filter_lists()` and then > just calling the function twice? (Of course the return value would be a single > file instead, and the name of the function might want to be renamed.) Yeah good point, done. On 2015/11/12 03:09:53, Sebastian Noack wrote: > I know it's a little late, and I don't insist of having that one addressed, but > it seems we could eliminate some complexity by using in-memory strings rather > than temp files. Sure, this will increase the memory footprint of this script. > But that should be fine. After all the browser extension has the filter lists in > memory as well, and memory consumption is certainly more an issue there than for > a script running on our servers. Yeah you're right, in fact, memory usage is much less of an issue here since we're just dealing with the filter source, the additional memory usage is just about 4 MB.
This version looks a lot better, a lot simpler. One nit below... (It was a little harder than necessary to review because it appears you've deleted the earlier patch set which I had commented on.) https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:68: _convert_filter_list(combined, Nit: Maybe avoid the combined variable? _convert_filter_list(easylist + "\n" + exceptionrules, ...
https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:26: return get_config().get("content_blocker_lists", key) We probably shouldn't call get_config every for every option again. https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:61: easylist = _download(_get_config_value("easylist_url")) As _download is only used in these two lines, and both call _get_config_value() before, how about moving that call into _download()? https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:68: _convert_filter_list(combined, On 2015/11/18 17:35:38, kzar wrote: > Nit: Maybe avoid the combined variable? > > _convert_filter_list(easylist + "\n" + exceptionrules, > ... Even better: _convert_filter_list("%s\n%s" % (easylist, exceptionrules))
https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:68: _convert_filter_list(combined, On 2015/11/18 20:10:37, Sebastian Noack wrote: > On 2015/11/18 17:35:38, kzar wrote: > > Nit: Maybe avoid the combined variable? > > > > _convert_filter_list(easylist + "\n" + exceptionrules, > > ... > > Even better: > > _convert_filter_list("%s\n%s" % (easylist, exceptionrules)) Or maybe even even better: Make _convert_filter_list() take a variable number of sources, passes them separately to the subprocess. This would be way more efficient, and doesn't seem to add much complexity: def _convert_filter_list(sources, destination_path): with open(destination_path, "wb") as destination_file: process = subprocess.Popen(...) try: for source in sources: print >>process.stdin, source finally: process.stdin.close() retcode = process.wait() if retcode: raise Exception("abp2blocklist returned %s" % process.returncode)
New patch set is up. https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:26: return get_config().get("content_blocker_lists", key) On 2015/11/18 20:10:37, Sebastian Noack wrote: > We probably shouldn't call get_config every for every option again. After implementing your suggestion below, I could get rid of this thing without hurting readability. https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:61: easylist = _download(_get_config_value("easylist_url")) On 2015/11/18 20:10:37, Sebastian Noack wrote: > As _download is only used in these two lines, and both call _get_config_value() > before, how about moving that call into _download()? Done, did the same for _convert_filter_list. https://codereview.adblockplus.org/29329537/diff/29330181/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:68: _convert_filter_list(combined, On 2015/11/18 23:24:34, Sebastian Noack wrote: > On 2015/11/18 20:10:37, Sebastian Noack wrote: > > On 2015/11/18 17:35:38, kzar wrote: > > > Nit: Maybe avoid the combined variable? > > > > > > _convert_filter_list(easylist + "\n" + exceptionrules, > > > ... > > > > Even better: > > > > _convert_filter_list("%s\n%s" % (easylist, exceptionrules)) > > Or maybe even even better: Make _convert_filter_list() take a variable number of > sources, passes them separately to the subprocess. This would be way more > efficient, and doesn't seem to add much complexity: > > > def _convert_filter_list(sources, destination_path): > with open(destination_path, "wb") as destination_file: > process = subprocess.Popen(...) > try: > for source in sources: > print >>process.stdin, source > finally: > process.stdin.close() > retcode = process.wait() > > if retcode: > raise Exception("abp2blocklist returned %s" % process.returncode) > Yeah I actually like that one better too, done. https://codereview.adblockplus.org/29329537/diff/29330418/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330418/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:63: if process.wait(): Do you think this is too obscure?
LGTM
https://codereview.adblockplus.org/29329537/diff/29330418/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330418/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:63: if process.wait(): On 2015/11/19 11:15:20, Felix Dahlke wrote: > Do you think this is too obscure? This ... 1. Introduces an unnecessary level of indentation. 2. Even worse, will swallow exceptions raised in the try block So please just call process.wait() here and check for the return code at the bottom outside the try and with block.
New patch set is up. https://codereview.adblockplus.org/29329537/diff/29330418/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330418/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:63: if process.wait(): On 2015/11/19 20:08:34, Sebastian Noack wrote: > On 2015/11/19 11:15:20, Felix Dahlke wrote: > > Do you think this is too obscure? > > This ... > > 1. Introduces an unnecessary level of indentation. > 2. Even worse, will swallow exceptions raised in the try block > > So please just call process.wait() here and check for the return code at the > bottom outside the try and with block. Done. https://codereview.adblockplus.org/29329537/diff/29330514/.sitescripts.example File .sitescripts.example (right): https://codereview.adblockplus.org/29329537/diff/29330514/.sitescripts.exampl... .sitescripts.example:115: downloadsDirectory=%(root)s/www/downloads This got added after rebasing, please ignore.
https://codereview.adblockplus.org/29329537/diff/29330514/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330514/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:65: if retcode: You can move that out of the with block as well. Also you actually don't need a variable, you can just check for process.retcode after calling wait() before.
New patch set is up. https://codereview.adblockplus.org/29329537/diff/29330514/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29329537/diff/29330514/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:65: if retcode: On 2015/11/20 00:03:47, Sebastian Noack wrote: > You can move that out of the with block as well. Also you actually don't need a > variable, you can just check for process.retcode after calling wait() before. Done.
LGTM
LGTM |