|
|
Created:
Oct. 6, 2015, 6:17 p.m. by Felix Dahlke Modified:
Oct. 30, 2015, 7:17 a.m. Visibility:
Public. |
DescriptionIssue 3168 - Add a script for generating new content blocker lists
Patch Set 1 : #
Total comments: 13
Patch Set 2 : Address comments #Patch Set 3 : Flip _concatenate_files and _convert_filter_list #
Total comments: 4
Patch Set 4 : Close files explicitly, open files in binary mode, add whitespace #
Total comments: 2
Patch Set 5 : Use the with statement #MessagesTotal messages: 15
So, here's a variant that doesn't use abp2blocklist via ensure_dependencies.py, but just fetches the latest revision. I kinda like it. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:39: subprocess.check_call(["npm", "install", "tldjs"], cwd=ABP2BLOCKLIST_PATH) Well, this is a bit ugly. Once my patch that adds package.json lands in abp2blocklist, I can change this to just call "npm install" after every clone/pull.
I'm still undecided whether we should host abp2blocklist (at its current stage) on our official infrastructure. Note that with this approach, there isn't any need to use mercurial anymore. So we could as well clone it from https://github.com/snoack/abp2blocklist and remove the clone from hg.adblockplus.org.
On 2015/10/09 13:08:05, Sebastian Noack wrote: > I'm still undecided whether we should host abp2blocklist (at its current stage) > on our official infrastructure. Note that with this approach, there isn't any > need to use mercurial anymore. So we could as well clone it from > https://github.com/snoack/abp2blocklist and remove the clone from > http://hg.adblockplus.org. Well, we host everything we use in production hg.adblockplus.org, we should continue to do that. Whether we should use abp2blocklist in its current form in production is a different question - but currently we do.
On 2015/10/12 08:14:15, Felix Dahlke wrote: > Well, we host everything we use in production http://hg.adblockplus.org, we should > continue to do that. Whether we should use abp2blocklist in its current form in > production is a different question - but currently we do. Fair enough. For reference, the abp2blocklist repo has now properly be moved from GitHub account to https://github.com/adblockplus/abp2blocklist. On another note, where should we put that script? We discussed that before, but didn't get to a conclusion. However, I'm sure that creating separate repository for that simple script would go to far. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:39: subprocess.check_call(["npm", "install", "tldjs"], cwd=ABP2BLOCKLIST_PATH) On 2015/10/06 18:23:58, Felix Dahlke wrote: > Well, this is a bit ugly. Once my patch that adds package.json lands in > abp2blocklist, I can change this to just call "npm install" after every > clone/pull. This can be addressed now. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:42: easylist_response = urllib2.urlopen(EASYLIST_URL) Please close the file like object returned by urlopen() in a finally-block. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:43: with tempfile.NamedTemporaryFile(mode="w", delete=False) as easylist_file: Always add "b" to the mode when dealing with binary files, for same behavior on all platforms. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:43: with tempfile.NamedTemporaryFile(mode="w", delete=False) as easylist_file: How about, using mode NamedTemporaryFile("wb+"), passing around the file object instead the filename? That way you don't have to unnecessarily open the file multiple times, and the cleanup is done implicitly when it gets closed. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:44: easylist_file.write(easylist_response.read()) Please use shutil.copyfileobj consistently.
Adding Dave as a reviewer since he should get more in the loop here. Still have to address Sebastian's comments first however.
New patch set is up, all comments addressed. On 2015/10/12 13:08:32, Sebastian Noack wrote: > On another note, where should we put that script? We discussed that before, but > didn't get to a conclusion. However, I'm sure that creating separate repository > for that simple script would go to far. You mean generate_lists.py? I didn't create the contentblockerlists repository specifically for that, the main purpose of that repository is to host the lists we want to deliver. Still, since we're not using ensure_dependencies.py to fetch abp2blocklist anymore, IMHO it would make more sense in sitescripts: 1. We will run this on our infrastructure eventually. 2. The stuff we're currently hard coding in this script should eventually be configured. 3. The equivalent script for the ABP filter lists (combineSubscriptions.py) is in sitescripts as well. If you agree, I'll integrate it there.
https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:39: subprocess.check_call(["npm", "install", "tldjs"], cwd=ABP2BLOCKLIST_PATH) On 2015/10/12 13:08:31, Sebastian Noack wrote: > On 2015/10/06 18:23:58, Felix Dahlke wrote: > > Well, this is a bit ugly. Once my patch that adds package.json lands in > > abp2blocklist, I can change this to just call "npm install" after every > > clone/pull. > > This can be addressed now. Done. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:42: easylist_response = urllib2.urlopen(EASYLIST_URL) On 2015/10/12 13:08:31, Sebastian Noack wrote: > Please close the file like object returned by urlopen() in a finally-block. Done. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:43: with tempfile.NamedTemporaryFile(mode="w", delete=False) as easylist_file: On 2015/10/12 13:08:31, Sebastian Noack wrote: > Always add "b" to the mode when dealing with binary files, for same behavior on > all platforms. Those are ASCII files, however, should I still add the "b"? https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:43: with tempfile.NamedTemporaryFile(mode="w", delete=False) as easylist_file: On 2015/10/12 13:08:31, Sebastian Noack wrote: > How about, using mode NamedTemporaryFile("wb+"), passing around the file object > instead the filename? That way you don't have to unnecessarily open the file > multiple times, and the cleanup is done implicitly when it gets closed. Passing around RW file objects seemed pretty hacky to me. I tried it now, and it does feel hacky, but I like how it gets rid of the manual cleanup and shortens things a bit. So I'm fine with this if you prefer it, it's in the latest patch set. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:44: easylist_file.write(easylist_response.read()) On 2015/10/12 13:08:31, Sebastian Noack wrote: > Please use shutil.copyfileobj consistently. Done.
https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:43: with tempfile.NamedTemporaryFile(mode="w", delete=False) as easylist_file: On 2015/10/21 20:36:22, Felix Dahlke wrote: > On 2015/10/12 13:08:31, Sebastian Noack wrote: > > Always add "b" to the mode when dealing with binary files, for same behavior > on > > all platforms. > > Those are ASCII files, however, should I still add the "b"? Either you treat it as a binary file. Then you should add "b" to prevent Windows from messing with line endings. Or you can treat if as text files using io.open(.., encoding=..). https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py#n... generate_lists.py:47: easylist_response.close() Nit: An empty line after each finally block would improve the readability.
https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py#n... generate_lists.py:75: easylist_file, exceptionrules_file = _download_filter_lists() Mind closing the temporary files again?
New patch set up. BTW, have you noticed this comment? https://codereview.adblockplus.org/29328894/#msg6 I hit reply instead of publish there, so it went out as two messages. https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29328899/generate_lists.py#n... generate_lists.py:43: with tempfile.NamedTemporaryFile(mode="w", delete=False) as easylist_file: On 2015/10/21 20:56:23, Sebastian Noack wrote: > On 2015/10/21 20:36:22, Felix Dahlke wrote: > > On 2015/10/12 13:08:31, Sebastian Noack wrote: > > > Always add "b" to the mode when dealing with binary files, for same behavior > > on > > > all platforms. > > > > Those are ASCII files, however, should I still add the "b"? > > Either you treat it as a binary file. Then you should add "b" to prevent Windows > from messing with line endings. Or you can treat if as text files using > io.open(.., encoding=..). I see, done. https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py#n... generate_lists.py:47: easylist_response.close() On 2015/10/21 20:56:23, Sebastian Noack wrote: > Nit: An empty line after each finally block would improve the readability. Done. https://codereview.adblockplus.org/29328894/diff/29329327/generate_lists.py#n... generate_lists.py:75: easylist_file, exceptionrules_file = _download_filter_lists() On 2015/10/21 21:14:35, Sebastian Noack wrote: > Mind closing the temporary files again? The OS is closing the files when the runtime terminates anyway, so I thought that wasn't necessary. However, I've learned that the files aren't necessarily flushed - that has to happen in the runtime, and while CPython seems to do it, I guess not all do. I'm closing them explicitly now, but I'm not really convinced, now it looks a lot like the old variant where I was explicitly removing the temporary files again. Anyway, I'm fine with this, I don't think it's worth more gold plating.
I agree that this should go into sitrscripts. https://codereview.adblockplus.org/29328894/diff/29329329/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29329329/generate_lists.py#n... generate_lists.py:84: combined_file = _concatenate_files(easylist_file, exceptionrules_file) With statement?
On 2015/10/22 02:55:06, Sebastian Noack wrote: > I agree that this should go into sitrscripts. Alright, I'll tackle that once all the comments here have been addressed. Hopefully that's the case now, new patch set is up. https://codereview.adblockplus.org/29328894/diff/29329329/generate_lists.py File generate_lists.py (right): https://codereview.adblockplus.org/29328894/diff/29329329/generate_lists.py#n... generate_lists.py:84: combined_file = _concatenate_files(easylist_file, exceptionrules_file) On 2015/10/22 02:55:06, Sebastian Noack wrote: > With statement? Good catch, couldn't use it above (not without introducing more complexity than saving three lines of code is worth, anyway), but here I can. Had to violate the 80 column rule now since I don't want to change the naming, but I guess that's the best option.
LGTM
On 2015/10/22 03:28:13, Sebastian Noack wrote: > LGTM Alright, but I'm not landing this as-is, I'll move it to sitescripts first and put up another patch set (it will end up looking slightly different in sitescripts).
Created a new review for moving this to sitescripts: https://codereview.adblockplus.org/29329537/ Closing this one. |