|
|
DescriptionIssue 3176 - Add metadata to content blocker lists
Patch Set 1 #
Total comments: 15
Patch Set 2 : Addressed feedback #Patch Set 3 : Remove unused StringIO import #
Total comments: 4
Patch Set 4 : Addressed further feedback #
Total comments: 23
Patch Set 5 : Addressed more feedback from Felix and Sebastian #
Total comments: 8
Patch Set 6 : Addressed some of Sebastian's feedback #Patch Set 7 : Just grab version number instead of parsing header #
Total comments: 3
Patch Set 8 : Improved regexp #MessagesTotal messages: 25
Patch Set 1 https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:65: def generate_metadata(filter_lists, expires="4 days"): It is unclear where the expires value for content blocking lists is supposed to come from. I've started a discussion about that in the issue, but in the mean time I've just hard coded it to "4 days". https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:77: metadata = generate_metadata(filter_lists) I'm doing it this way to avoid having to load the block list into memory, parse it as JSON again and then mutate it to add the metadata. I realise it's kind of ugly though, I'm open to either approach.
https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: Never mind closing a StringIO. It doesn't do anything that wouldn't happen anyway. There are no resources that needs to be released associated with a StringIO. The close method merely exist for compatibility. Wait.., why do you use a StringIO in the first place? How about simply using a multi-line regexp? But wait again.., why do you parse the filter list headers in the first place? We don't need any of these data in the generated block list. https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:61: else: Nit: If you negate the logic you don't need an else-block. if not match: break header[match.group(1)] = match.group(2) https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:65: def generate_metadata(filter_lists, expires="4 days"): On 2015/11/27 16:28:11, kzar wrote: > It is unclear where the expires value for content blocking lists is supposed to > come from. I've started a discussion about that in the issue, but in the mean > time I've just hard coded it to "4 days". The expiration interval should be configured in sitescripts.ini. https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:77: metadata = generate_metadata(filter_lists) On 2015/11/27 16:28:11, kzar wrote: > I'm doing it this way to avoid having to load the block list into memory, parse > it as JSON again and then mutate it to add the metadata. I realise it's kind of > ugly though, I'm open to either approach. We don't have to care too much about memory consumption here. I would be interested in how much the difference is though, to get an idea whether it's even worth considering a more complex solution. https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:82: destination_file.flush() Any particular reason you flush the file here?
Patch Set 2 : Addressed feedback https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/11/30 13:55:49, Sebastian Noack wrote: > Never mind closing a StringIO. It doesn't do anything that wouldn't happen > anyway. There are no resources that needs to be released associated with a > StringIO. The close method merely exist for compatibility. > > Wait.., why do you use a StringIO in the first place? How about simply using a > multi-line regexp? > > But wait again.., why do you parse the filter list headers in the first place? > We don't need any of these data in the generated block list. We need the Version field, but otherwise Done. https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:65: def generate_metadata(filter_lists, expires="4 days"): On 2015/11/30 13:55:49, Sebastian Noack wrote: > On 2015/11/27 16:28:11, kzar wrote: > > It is unclear where the expires value for content blocking lists is supposed > to > > come from. I've started a discussion about that in the issue, but in the mean > > time I've just hard coded it to "4 days". > > The expiration interval should be configured in sitescripts.ini. Done. https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:77: metadata = generate_metadata(filter_lists) On 2015/11/30 13:55:49, Sebastian Noack wrote: > On 2015/11/27 16:28:11, kzar wrote: > > I'm doing it this way to avoid having to load the block list into memory, > parse > > it as JSON again and then mutate it to add the metadata. I realise it's kind > of > > ugly though, I'm open to either approach. > > We don't have to care too much about memory consumption here. I would be > interested in how much the difference is though, to get an idea whether it's > even worth considering a more complex solution. Done.
Patch Set 3 : Remove unused StringIO import
https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/11/30 15:13:11, kzar wrote: > On 2015/11/30 13:55:49, Sebastian Noack wrote: > > Never mind closing a StringIO. It doesn't do anything that wouldn't happen > > anyway. There are no resources that needs to be released associated with a > > StringIO. The close method merely exist for compatibility. > > > > Wait.., why do you use a StringIO in the first place? How about simply using a > > multi-line regexp? > > > > But wait again.., why do you parse the filter list headers in the first place? > > We don't need any of these data in the generated block list. > > We need the Version field, but otherwise Done. Well, you set the version field based on the current timestamp. So there is no need to parse the source filter lists. Or do I miss something? https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:76: block_list["rules"] = json.loads(process.communicate()[0]) Yes, this works as abp2blocklist (currently) reads all input before generating any output. However, that assumption is unsafe. And if that ever changes the print call above will hang forever while waiting for the subprocess's stdin's write buffer becoming empty again. The solution: Use a thread to write to process.stdin. And while on it use json.load to lazily read from process.stdout.
https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:58: metadata = { Nit: Use an OrderedDict to make sure that the metadata are at the top, to make the file easier to read for humans.
Patch Set 4 : Addressed further feedback https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/11/30 15:49:19, Sebastian Noack wrote: > On 2015/11/30 15:13:11, kzar wrote: > > On 2015/11/30 13:55:49, Sebastian Noack wrote: > > > Never mind closing a StringIO. It doesn't do anything that wouldn't happen > > > anyway. There are no resources that needs to be released associated with a > > > StringIO. The close method merely exist for compatibility. > > > > > > Wait.., why do you use a StringIO in the first place? How about simply using > a > > > multi-line regexp? > > > > > > But wait again.., why do you parse the filter list headers in the first > place? > > > We don't need any of these data in the generated block list. > > > > We need the Version field, but otherwise Done. > > Well, you set the version field based on the current timestamp. So there is no > need to parse the source filter lists. Or do I miss something? That's the version for the block list, in the sources section we keep a record of the version of each filter list. https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:58: metadata = { On 2015/11/30 16:00:07, Sebastian Noack wrote: > Nit: Use an OrderedDict to make sure that the metadata are at the top, to make > the file easier to read for humans. Done. https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:76: block_list["rules"] = json.loads(process.communicate()[0]) On 2015/11/30 15:49:19, Sebastian Noack wrote: > Yes, this works as abp2blocklist (currently) reads all input before generating > any output. However, that assumption is unsafe. And if that ever changes the > print call above will hang forever while waiting for the subprocess's stdin's > write buffer becoming empty again. > > The solution: Use a thread to write to process.stdin. And while on it use > json.load to lazily read from process.stdout. I've not done much threaded programming in Python before, this look OK?
https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/11/30 17:06:00, kzar wrote: > On 2015/11/30 15:49:19, Sebastian Noack wrote: > > On 2015/11/30 15:13:11, kzar wrote: > > > On 2015/11/30 13:55:49, Sebastian Noack wrote: > > > > Never mind closing a StringIO. It doesn't do anything that wouldn't happen > > > > anyway. There are no resources that needs to be released associated with a > > > > StringIO. The close method merely exist for compatibility. > > > > > > > > Wait.., why do you use a StringIO in the first place? How about simply > using > > a > > > > multi-line regexp? > > > > > > > > But wait again.., why do you parse the filter list headers in the first > > place? > > > > We don't need any of these data in the generated block list. > > > > > > We need the Version field, but otherwise Done. > > > > Well, you set the version field based on the current timestamp. So there is no > > need to parse the source filter lists. Or do I miss something? > > That's the version for the block list, in the sources section we keep a record > of the version of each filter list. You still won't have to parse the header of the files, you can read the version from the "ABP-Notification-Version" HTTP header in the response.
Might as well have a look at the whole thing. Looks good, just some minor remarks. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): These functions were prefixed with an underscore for a reason, see: https://adblockplus.org/coding-style#python https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:61: ("version", datetime.utcnow().strftime("%Y%m%d%H%M")), FWIW, we're using `time.strftime("%Y%m%d%H%M", time.gmtime())` for this in the notification handler. But it seems to me those are equivalent. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) Nit: Sebastian convinced me a while ago that tuples are better than lists for immutable lists like this. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:98: write_block_list([easylist], Nit: "block list" is highly ambiguous, we often use that term for ABP style filter lists. "content blocker list" is the correct term, but I can see why you'd want to keep it short here, fair enough.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/01 08:43:09, Felix Dahlke wrote: > These functions were prefixed with an underscore for a reason, see: > https://adblockplus.org/coding-style#python Well, one could argue that this is less a module (to be imported somewhere else), but more like a stand-alone script providing a CLI as high-level interface. Therefore these methods wouldn't necessarily have to be considered private. I probably wouldn't have prefixed them myself in this case. However, removing the prefixes is a completely unrelated change. So I agree, that it should be removed from this patch. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:61: ("version", datetime.utcnow().strftime("%Y%m%d%H%M")), On 2015/12/01 08:43:08, Felix Dahlke wrote: > FWIW, we're using `time.strftime("%Y%m%d%H%M", time.gmtime())` for this in the > notification handler. But it seems to me those are equivalent. I tend to agree, creating a datetime object is unnecessary here. Moreover, I always have to wrap my head around how datetime handles timezones. I wouldn't insist though. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) On 2015/12/01 08:43:08, Felix Dahlke wrote: > Nit: Sebastian convinced me a while ago that tuples are better than lists for > immutable lists like this. Well, frankly, I don't think that it matters in the case here. However, when you assign a sequence to a variable, but changing it isn't intended or supported by your code, using an immutable type is more appropriate and prevents you from certain bugs, later. But if the sequence isn't accessible after defining it anyway, I don't really care. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:98: write_block_list([easylist], On 2015/12/01 08:43:08, Felix Dahlke wrote: > Nit: "block list" is highly ambiguous, we often use that term for ABP style > filter lists. "content blocker list" is the correct term, but I can see why > you'd want to keep it short here, fair enough. Well, after all, the program called here is also called "abp2blocklist". However, if you want to have a more distinct term, I guess we have to invent one, because there isn't a name for the format we are generating here (container holding WebKit content blocker rules plus some metadata) yet.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) On 2015/12/01 10:18:39, Sebastian Noack wrote: > On 2015/12/01 08:43:08, Felix Dahlke wrote: > > Nit: Sebastian convinced me a while ago that tuples are better than lists for > > immutable lists like this. > > Well, frankly, I don't think that it matters in the case here. However, when you > assign a sequence to a variable, but changing it isn't intended or supported by > your code, using an immutable type is more appropriate and prevents you from > certain bugs, later. But if the sequence isn't accessible after defining it > anyway, I don't really care. Wouldn't insist either. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:98: write_block_list([easylist], On 2015/12/01 10:18:39, Sebastian Noack wrote: > On 2015/12/01 08:43:08, Felix Dahlke wrote: > > Nit: "block list" is highly ambiguous, we often use that term for ABP style > > filter lists. "content blocker list" is the correct term, but I can see why > > you'd want to keep it short here, fair enough. > > Well, after all, the program called here is also called "abp2blocklist". > However, if you want to have a more distinct term, I guess we have to invent > one, because there isn't a name for the format we are generating here (container > holding WebKit content blocker rules plus some metadata) yet. Good point, let's really leave it alone.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) On 2015/12/01 10:26:29, Felix Dahlke wrote: > On 2015/12/01 10:18:39, Sebastian Noack wrote: > > On 2015/12/01 08:43:08, Felix Dahlke wrote: > > > Nit: Sebastian convinced me a while ago that tuples are better than lists > for > > > immutable lists like this. > > > > Well, frankly, I don't think that it matters in the case here. However, when > you > > assign a sequence to a variable, but changing it isn't intended or supported > by > > your code, using an immutable type is more appropriate and prevents you from > > certain bugs, later. But if the sequence isn't accessible after defining it > > anyway, I don't really care. > > Wouldn't insist either. For reference, I found a quite interesting answer on SO, on the difference between tuples and lists, from a more semantic point of view: https://stackoverflow.com/questions/626759/whats-the-difference-between-list-... I didn't consider that aspect yet, but I tend to agree. So in our scenario here, as structure doesn't matter (i.e. adding/removing/changing an item in the sequence of fields here wouldn't invalidate the whole data structure), a list would actually be more appropriate than a tuple, in their conclusion.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:75: def pipe_in(process): Nit: This is inconsistent. You pass in the process but access the filter_lists via the closure. If you keep it as an inner function I would access all variables via closure. However, I'd tend to move that function to the top-level. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:83: thread.start_new_thread(pipe_in, (process,)) Please use the high-level threading module instead. Using the thread module directly can be considered deprecated as it got prefixed in Python 3.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:51: return filter_list I think it would be simpler if you just return a tuple like (body, url, version) here, unpacking it while iterating over the filter lists. In particular as these really seem to be the only data we need.
Patch Set 5 : Addressed more feedback from Felix and Sebastian https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/12/01 08:32:56, Felix Dahlke wrote: > On 2015/11/30 17:06:00, kzar wrote: > > On 2015/11/30 15:49:19, Sebastian Noack wrote: > > > On 2015/11/30 15:13:11, kzar wrote: > > > > On 2015/11/30 13:55:49, Sebastian Noack wrote: > > > > > Never mind closing a StringIO. It doesn't do anything that wouldn't > happen > > > > > anyway. There are no resources that needs to be released associated with > a > > > > > StringIO. The close method merely exist for compatibility. > > > > > > > > > > Wait.., why do you use a StringIO in the first place? How about simply > > using > > > a > > > > > multi-line regexp? > > > > > > > > > > But wait again.., why do you parse the filter list headers in the first > > > place? > > > > > We don't need any of these data in the generated block list. > > > > > > > > We need the Version field, but otherwise Done. > > > > > > Well, you set the version field based on the current timestamp. So there is > no > > > need to parse the source filter lists. Or do I miss something? > > > > That's the version for the block list, in the sources section we keep a record > > of the version of each filter list. > > You still won't have to parse the header of the files, you can read the version > from the "ABP-Notification-Version" HTTP header in the response. I don't see that header present? curl -I https://easylist-downloads.adblockplus.org/easylist_noadult.txt HTTP/1.1 200 OK Server: nginx Date: Tue, 01 Dec 2015 11:24:23 GMT Content-Type: text/plain; charset=utf-8 Content-Length: 1532799 Last-Modified: Tue, 01 Dec 2015 11:20:46 GMT Connection: close ETag: "565d828e-17637f" Strict-Transport-Security: max-age=31536000 Accept-Ranges: bytes https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/01 10:18:39, Sebastian Noack wrote: > On 2015/12/01 08:43:09, Felix Dahlke wrote: > > These functions were prefixed with an underscore for a reason, see: > > https://adblockplus.org/coding-style#python > > Well, one could argue that this is less a module (to be imported somewhere > else), but more like a stand-alone script providing a CLI as high-level > interface. Therefore these methods wouldn't necessarily have to be considered > private. I probably wouldn't have prefixed them myself in this case. However, > removing the prefixes is a completely unrelated change. So I agree, that it > should be removed from this patch. I would prefer to leave them off, they don't really add anything in this context and they are kind of ugly. (It's true that it's an unrelated change, but as I'm changing / creating nearly all the functional definitions in this patch set I thought it made sense to change this one too.) https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:51: return filter_list On 2015/12/01 11:01:48, Sebastian Noack wrote: > I think it would be simpler if you just return a tuple like (body, url, version) > here, unpacking it while iterating over the filter lists. In particular as these > really seem to be the only data we need. Done. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:61: ("version", datetime.utcnow().strftime("%Y%m%d%H%M")), On 2015/12/01 10:18:39, Sebastian Noack wrote: > On 2015/12/01 08:43:08, Felix Dahlke wrote: > > FWIW, we're using `time.strftime("%Y%m%d%H%M", time.gmtime())` for this in the > > notification handler. But it seems to me those are equivalent. > > I tend to agree, creating a datetime object is unnecessary here. Moreover, I > always have to wrap my head around how datetime handles timezones. I wouldn't > insist though. Done. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) On 2015/12/01 10:38:09, Sebastian Noack wrote: > On 2015/12/01 10:26:29, Felix Dahlke wrote: > > On 2015/12/01 10:18:39, Sebastian Noack wrote: > > > On 2015/12/01 08:43:08, Felix Dahlke wrote: > > > > Nit: Sebastian convinced me a while ago that tuples are better than lists > > for > > > > immutable lists like this. > > > > > > Well, frankly, I don't think that it matters in the case here. However, when > > you > > > assign a sequence to a variable, but changing it isn't intended or supported > > by > > > your code, using an immutable type is more appropriate and prevents you from > > > certain bugs, later. But if the sequence isn't accessible after defining it > > > anyway, I don't really care. > > > > Wouldn't insist either. > > For reference, I found a quite interesting answer on SO, on the difference > between tuples and lists, from a more semantic point of view: > https://stackoverflow.com/questions/626759/whats-the-difference-between-list-... > > I didn't consider that aspect yet, but I tend to agree. So in our scenario here, > as structure doesn't matter (i.e. adding/removing/changing an item in the > sequence of fields here wouldn't invalidate the whole data structure), a list > would actually be more appropriate than a tuple, in their conclusion. Acknowledged. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:75: def pipe_in(process): On 2015/12/01 10:47:14, Sebastian Noack wrote: > Nit: This is inconsistent. You pass in the process but access the filter_lists > via the closure. If you keep it as an inner function I would access all > variables via closure. However, I'd tend to move that function to the top-level. Done. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:83: thread.start_new_thread(pipe_in, (process,)) On 2015/12/01 10:47:14, Sebastian Noack wrote: > Please use the high-level threading module instead. Using the thread module > directly can be considered deprecated as it got prefixed in Python 3. Done.
https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/12/01 12:13:38, kzar wrote: > On 2015/12/01 08:32:56, Felix Dahlke wrote: > > On 2015/11/30 17:06:00, kzar wrote: > > > On 2015/11/30 15:49:19, Sebastian Noack wrote: > > > > On 2015/11/30 15:13:11, kzar wrote: > > > > > On 2015/11/30 13:55:49, Sebastian Noack wrote: > > > > > > Never mind closing a StringIO. It doesn't do anything that wouldn't > > happen > > > > > > anyway. There are no resources that needs to be released associated > with > > a > > > > > > StringIO. The close method merely exist for compatibility. > > > > > > > > > > > > Wait.., why do you use a StringIO in the first place? How about simply > > > using > > > > a > > > > > > multi-line regexp? > > > > > > > > > > > > But wait again.., why do you parse the filter list headers in the > first > > > > place? > > > > > > We don't need any of these data in the generated block list. > > > > > > > > > > We need the Version field, but otherwise Done. > > > > > > > > Well, you set the version field based on the current timestamp. So there > is > > no > > > > need to parse the source filter lists. Or do I miss something? > > > > > > That's the version for the block list, in the sources section we keep a > record > > > of the version of each filter list. > > > > You still won't have to parse the header of the files, you can read the > version > > from the "ABP-Notification-Version" HTTP header in the response. > > I don't see that header present? > > curl -I https://easylist-downloads.adblockplus.org/easylist_noadult.txt > > HTTP/1.1 200 OK > Server: nginx > Date: Tue, 01 Dec 2015 11:24:23 GMT > Content-Type: text/plain; charset=utf-8 > Content-Length: 1532799 > Last-Modified: Tue, 01 Dec 2015 11:20:46 GMT > Connection: close > ETag: "565d828e-17637f" > Strict-Transport-Security: max-age=31536000 > Accept-Ranges: bytes Ouch, big mixup on my end, we only have this for notifications at this point... Yes, I'm afraid there's no other way than to parse the file header then. https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/01 12:13:38, kzar wrote: > On 2015/12/01 10:18:39, Sebastian Noack wrote: > > On 2015/12/01 08:43:09, Felix Dahlke wrote: > > > These functions were prefixed with an underscore for a reason, see: > > > https://adblockplus.org/coding-style#python > > > > Well, one could argue that this is less a module (to be imported somewhere > > else), but more like a stand-alone script providing a CLI as high-level > > interface. Therefore these methods wouldn't necessarily have to be considered > > private. I probably wouldn't have prefixed them myself in this case. However, > > removing the prefixes is a completely unrelated change. So I agree, that it > > should be removed from this patch. > > I would prefer to leave them off, they don't really add anything in this context > and they are kind of ugly. (It's true that it's an unrelated change, but as I'm > changing / creating nearly all the functional definitions in this patch set I > thought it made sense to change this one too.) I would argue that this is still a module - this script can only be run with `python -m`. It isn't a loadable library, that's true, but neither are our web modules. Also, It's not unheard of to move code between modules under bin and reusable modules. I really think we should be consistent here, but it's up to Sebastian. Let's however please clarify this in the coding style, it's obviously ambiguous right now.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/01 14:04:19, Felix Dahlke wrote: > On 2015/12/01 12:13:38, kzar wrote: > > On 2015/12/01 10:18:39, Sebastian Noack wrote: > > > On 2015/12/01 08:43:09, Felix Dahlke wrote: > > > > These functions were prefixed with an underscore for a reason, see: > > > > https://adblockplus.org/coding-style#python > > > > > > Well, one could argue that this is less a module (to be imported somewhere > > > else), but more like a stand-alone script providing a CLI as high-level > > > interface. Therefore these methods wouldn't necessarily have to be > considered > > > private. I probably wouldn't have prefixed them myself in this case. > However, > > > removing the prefixes is a completely unrelated change. So I agree, that it > > > should be removed from this patch. > > > > I would prefer to leave them off, they don't really add anything in this > context > > and they are kind of ugly. (It's true that it's an unrelated change, but as > I'm > > changing / creating nearly all the functional definitions in this patch set I > > thought it made sense to change this one too.) > > I would argue that this is still a module - this script can only be run with > `python -m`. It isn't a loadable library, that's true, but neither are our web > modules. Also, It's not unheard of to move code between modules under bin and > reusable modules. I really think we should be consistent here, but it's up to > Sebastian. Let's however please clarify this in the coding style, it's obviously > ambiguous right now. Well, technically every piece of code is part of a module in Python. But the point is that all code that sees these functions is allowed/supposed to call them. Note that since the actual interface here is the CLI, the only code this applies to, is this script itself. So there is no context in which these function would/should be considered private. But if for some reason somebody imports this module rather than running it as a script, which they shouldn't do however, they would expect a more low-level API than provided by the CLI. As for consistency, it seems that it's only code you wrote that overuses prefixed/private functions in scenarios like this. For example have a look at ensure_dependencies.py or buildtools in general. https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:47: field_re = re.compile(r"^!\s*([^:\s]+):\s*(.+)$", re.MULTILINE) This logic can be simplified, in particular since we are only looking for the version: for match in re.finditer(r"^(?:([^![])|!\s*Version:\s*(.+)$)", filter_list, re.MULTILINE): end_of_headers, version = match.groups() if end_of_headers: break if version: return version https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:64: metadata["sources"].append({ "url": url, "version": version }) See https://www.python.org/dev/peps/pep-0008/#pet-peeves https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:82: if process.returncode: Note that returncode is set by the wait() method. So we should rather call it here rather than in the thread to avoid race conditions.
https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/07 12:38:24, Sebastian Noack wrote: > On 2015/12/01 14:04:19, Felix Dahlke wrote: > > On 2015/12/01 12:13:38, kzar wrote: > > > On 2015/12/01 10:18:39, Sebastian Noack wrote: > > > > On 2015/12/01 08:43:09, Felix Dahlke wrote: > > > > > These functions were prefixed with an underscore for a reason, see: > > > > > https://adblockplus.org/coding-style#python > > > > > > > > Well, one could argue that this is less a module (to be imported somewhere > > > > else), but more like a stand-alone script providing a CLI as high-level > > > > interface. Therefore these methods wouldn't necessarily have to be > > considered > > > > private. I probably wouldn't have prefixed them myself in this case. > > However, > > > > removing the prefixes is a completely unrelated change. So I agree, that > it > > > > should be removed from this patch. > > > > > > I would prefer to leave them off, they don't really add anything in this > > context > > > and they are kind of ugly. (It's true that it's an unrelated change, but as > > I'm > > > changing / creating nearly all the functional definitions in this patch set > I > > > thought it made sense to change this one too.) > > > > I would argue that this is still a module - this script can only be run with > > `python -m`. It isn't a loadable library, that's true, but neither are our web > > modules. Also, It's not unheard of to move code between modules under bin and > > reusable modules. I really think we should be consistent here, but it's up to > > Sebastian. Let's however please clarify this in the coding style, it's > obviously > > ambiguous right now. > > Well, technically every piece of code is part of a module in Python. But the > point is that all code that sees these functions is allowed/supposed to call > them. Note that since the actual interface here is the CLI, the only code this > applies to, is this script itself. So there is no context in which these > function would/should be considered private. But if for some reason somebody > imports this module rather than running it as a script, which they shouldn't do > however, they would expect a more low-level API than provided by the CLI. > > As for consistency, it seems that it's only code you wrote that overuses > prefixed/private functions in scenarios like this. For example have a look at > ensure_dependencies.py or buildtools in general. > Wladimir's newer code in Sitescripts also uses those prefixes for stuff under bin. Anyway, your reasoning makes sense, so I'm fine with never using those private prefixes for executable scripts. But please do clarify it in the coding style.
Patch Set 6 : Addressed some of Sebastian's feedback https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:47: field_re = re.compile(r"^!\s*([^:\s]+):\s*(.+)$", re.MULTILINE) On 2015/12/07 12:38:24, Sebastian Noack wrote: > This logic can be simplified, in particular since we are only looking for the > version: > > for match in re.finditer(r"^(?:([^![])|!\s*Version:\s*(.+)$)", filter_list, > re.MULTILINE): > end_of_headers, version = match.groups() > if end_of_headers: > break > if version: > return version IMHO that doesn't look easier to read. Also it seems to me that we should just parse all the fields while at it. https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:64: metadata["sources"].append({ "url": url, "version": version }) On 2015/12/07 12:38:24, Sebastian Noack wrote: > See https://www.python.org/dev/peps/pep-0008/#pet-peeves Done. https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:82: if process.returncode: On 2015/12/07 12:38:24, Sebastian Noack wrote: > Note that returncode is set by the wait() method. So we should rather call it > here rather than in the thread to avoid race conditions. Done.
https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:47: field_re = re.compile(r"^!\s*([^:\s]+):\s*(.+)$", re.MULTILINE) On 2015/12/08 12:52:21, kzar wrote: > On 2015/12/07 12:38:24, Sebastian Noack wrote: > > This logic can be simplified, in particular since we are only looking for the > > version: > > > > for match in re.finditer(r"^(?:([^![])|!\s*Version:\s*(.+)$)", filter_list, > > re.MULTILINE): > > end_of_headers, version = match.groups() > > if end_of_headers: > > break > > if version: > > return version > > IMHO that doesn't look easier to read. Also it seems to me that we should just > parse all the fields while at it. Reading both pieces of code for the first time, I consider Sebastian's variant to be a bit easier to get, but the difference isn't huge. Considering how we really don't need anything but the version, I'd lean towards that one.
Patch Set 7 : Just grab version number instead of parsing header https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:47: field_re = re.compile(r"^!\s*([^:\s]+):\s*(.+)$", re.MULTILINE) On 2015/12/08 13:39:51, Felix Dahlke wrote: > On 2015/12/08 12:52:21, kzar wrote: > > On 2015/12/07 12:38:24, Sebastian Noack wrote: > > > This logic can be simplified, in particular since we are only looking for > the > > > version: > > > > > > for match in re.finditer(r"^(?:([^![])|!\s*Version:\s*(.+)$)", filter_list, > > > re.MULTILINE): > > > end_of_headers, version = match.groups() > > > if end_of_headers: > > > break > > > if version: > > > return version > > > > IMHO that doesn't look easier to read. Also it seems to me that we should just > > parse all the fields while at it. > > Reading both pieces of code for the first time, I consider Sebastian's variant > to be a bit easier to get, but the difference isn't huge. Considering how we > really don't need anything but the version, I'd lean towards that one. OK I've gone with a slightly simplified version of your approach Sebastian.
https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:48: version = re.search(r"^(?:[^[!])|!\s*Version:\s*(.+)$", Indeed, we can simply use re.search instead iterating over all matches. However, note that with the regexp you use here begin of string is only matched for the pattern before the alternation. So for example you would also match "! foo ! Version: foo". Also note that re.search() will return None, and therefore the .group() causes an exception, if there is neither a Version header nor some filters in the list. So this should rather be: match = re.search(r"^(?:(?:[^[!])|!\s*Version:\s*(.+)$)", body, re.MULTILINE) version = match and match.group(1)
Patch Set 8 : Improved regexp https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:48: version = re.search(r"^(?:[^[!])|!\s*Version:\s*(.+)$", On 2015/12/08 15:08:38, Sebastian Noack wrote: > Indeed, we can simply use re.search instead iterating over all matches. > > However, note that with the regexp you use here begin of string is only matched > for the pattern before the alternation. So for example you would also match "! > foo ! Version: foo". > > Also note that re.search() will return None, and therefore the .group() causes > an exception, if there is neither a Version header nor some filters in the list. > > So this should rather be: > > match = re.search(r"^(?:(?:[^[!])|!\s*Version:\s*(.+)$)", body, re.MULTILINE) > version = match and match.group(1) Well it's true about the "! foo ! Version: foo" match, we're not expecting that to happen but I guess checking for it can't harm. Done. As for the other point, I don't think it's true that we're expecting the match to be None. re.search() is returning a match for the version, or a match for a line not starting with [ or !. In the latter case there is still a match object, but .group(1) simply returns None as no capturing groups matched. So the end result is we either get the version number or None, and the filter list after the header is not searched.
https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content... File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content... sitescripts/content_blocker_lists/bin/generate_lists.py:48: version = re.search(r"^(?:[^[!])|!\s*Version:\s*(.+)$", On 2015/12/08 15:37:43, kzar wrote: > On 2015/12/08 15:08:38, Sebastian Noack wrote: > > Indeed, we can simply use re.search instead iterating over all matches. > > > > However, note that with the regexp you use here begin of string is only > matched > > for the pattern before the alternation. So for example you would also match "! > > foo ! Version: foo". > > > > Also note that re.search() will return None, and therefore the .group() causes > > an exception, if there is neither a Version header nor some filters in the > list. > > > > So this should rather be: > > > > match = re.search(r"^(?:(?:[^[!])|!\s*Version:\s*(.+)$)", body, > re.MULTILINE) > > version = match and match.group(1) > > Well it's true about the "! foo ! Version: foo" match, we're not expecting that > to happen but I guess checking for it can't harm. Done. > > As for the other point, I don't think it's true that we're expecting the match > to be None. re.search() is returning a match for the version, or a match for a > line not starting with [ or !. In the latter case there is still a match object, > but .group(1) simply returns None as no capturing groups matched. > > So the end result is we either get the version number or None, and the filter > list after the header is not searched. Well, theoretically the filter list can be empty. But it's probably not important. LGTM.
LGTM |