Issue 29331148: Issue 3176 - Add metadata to content blocker lists

kzar

Patch Set 1 https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode65 sitescripts/content_blocker_lists/bin/generate_lists.py:65: def generate_metadata(filter_lists, expires="4 days"): It is ...

Nov. 27, 2015, 4:28 p.m. (2015-11-27 16:28:11 UTC) #1

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode54 sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: Never mind closing a StringIO. ...

Nov. 30, 2015, 1:55 p.m. (2015-11-30 13:55:49 UTC) #2

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:54: with
closing(StringIO(filter_list)) as stream:
Never mind closing a StringIO. It doesn't do anything that wouldn't happen
anyway. There are no resources that needs to be released associated with a
StringIO. The close method merely exist for compatibility.

Wait.., why do you use a StringIO in the first place? How about simply using a
multi-line regexp?

But wait again.., why do you parse the filter list headers in the first place?
We don't need any of these data in the generated block list.

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:61: else:
Nit: If you negate the logic you don't need an else-block.

if not match:
  break
header[match.group(1)] = match.group(2)

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:65: def
generate_metadata(filter_lists, expires="4 days"):
On 2015/11/27 16:28:11, kzar wrote:
> It is unclear where the expires value for content blocking lists is supposed
to
> come from. I've started a discussion about that in the issue, but in the mean
> time I've just hard coded it to "4 days".

The expiration interval should be configured in sitescripts.ini.

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:77: metadata =
generate_metadata(filter_lists)
On 2015/11/27 16:28:11, kzar wrote:
> I'm doing it this way to avoid having to load the block list into memory,
parse
> it as JSON again and then mutate it to add the metadata. I realise it's kind
of
> ugly though, I'm open to either approach.

We don't have to care too much about memory consumption here. I would be
interested in how much the difference is though, to get an idea whether it's
even worth considering a more complex solution.

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:82:
destination_file.flush()
Any particular reason you flush the file here?

kzar

Patch Set 2 : Addressed feedback https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode54 sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as ...

Nov. 30, 2015, 3:13 p.m. (2015-11-30 15:13:12 UTC) #3

Patch Set 2 : Addressed feedback

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:54: with
closing(StringIO(filter_list)) as stream:
On 2015/11/30 13:55:49, Sebastian Noack wrote:
> Never mind closing a StringIO. It doesn't do anything that wouldn't happen
> anyway. There are no resources that needs to be released associated with a
> StringIO. The close method merely exist for compatibility.
> 
> Wait.., why do you use a StringIO in the first place? How about simply using a
> multi-line regexp?
> 
> But wait again.., why do you parse the filter list headers in the first place?
> We don't need any of these data in the generated block list.

We need the Version field, but otherwise Done.

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:65: def
generate_metadata(filter_lists, expires="4 days"):
On 2015/11/30 13:55:49, Sebastian Noack wrote:
> On 2015/11/27 16:28:11, kzar wrote:
> > It is unclear where the expires value for content blocking lists is supposed
> to
> > come from. I've started a discussion about that in the issue, but in the
mean
> > time I've just hard coded it to "4 days".
> 
> The expiration interval should be configured in sitescripts.ini.

Done.

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:77: metadata =
generate_metadata(filter_lists)
On 2015/11/30 13:55:49, Sebastian Noack wrote:
> On 2015/11/27 16:28:11, kzar wrote:
> > I'm doing it this way to avoid having to load the block list into memory,
> parse
> > it as JSON again and then mutate it to add the metadata. I realise it's kind
> of
> > ugly though, I'm open to either approach.
> 
> We don't have to care too much about memory consumption here. I would be
> interested in how much the difference is though, to get an idea whether it's
> even worth considering a more complex solution.

Done.

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode54 sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/11/30 15:13:11, kzar wrote: ...

Nov. 30, 2015, 3:49 p.m. (2015-11-30 15:49:19 UTC) #5

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode58 sitescripts/content_blocker_lists/bin/generate_lists.py:58: metadata = { Nit: Use an OrderedDict to make ...

Nov. 30, 2015, 4 p.m. (2015-11-30 16:00:07 UTC) #6

kzar

Patch Set 4 : Addressed further feedback https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode54 sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) ...

Nov. 30, 2015, 5:06 p.m. (2015-11-30 17:06:00 UTC) #7

Patch Set 4 : Addressed further feedback

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:54: with
closing(StringIO(filter_list)) as stream:
On 2015/11/30 15:49:19, Sebastian Noack wrote:
> On 2015/11/30 15:13:11, kzar wrote:
> > On 2015/11/30 13:55:49, Sebastian Noack wrote:
> > > Never mind closing a StringIO. It doesn't do anything that wouldn't happen
> > > anyway. There are no resources that needs to be released associated with a
> > > StringIO. The close method merely exist for compatibility.
> > > 
> > > Wait.., why do you use a StringIO in the first place? How about simply
using
> a
> > > multi-line regexp?
> > > 
> > > But wait again.., why do you parse the filter list headers in the first
> place?
> > > We don't need any of these data in the generated block list.
> > 
> > We need the Version field, but otherwise Done.
> 
> Well, you set the version field based on the current timestamp. So there is no
> need to parse the source filter lists. Or do I miss something?

That's the version for the block list, in the sources section we keep a record
of the version of each filter list.

https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:58: metadata = {
On 2015/11/30 16:00:07, Sebastian Noack wrote:
> Nit: Use an OrderedDict to make sure that the metadata are at the top, to make
> the file easier to read for humans.

Done.

https://codereview.adblockplus.org/29331148/diff/29331611/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:76: block_list["rules"]
= json.loads(process.communicate()[0])
On 2015/11/30 15:49:19, Sebastian Noack wrote:
> Yes, this works as abp2blocklist (currently) reads all input before generating
> any output. However, that assumption is unsafe. And if that ever changes the
> print call above will hang forever while waiting for the subprocess's stdin's
> write buffer becoming empty again.
> 
> The solution: Use a thread to write to process.stdin. And while on it use
> json.load to lazily read from process.stdout.

I've not done much threaded programming in Python before, this look OK?

Felix Dahlke

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode54 sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/11/30 17:06:00, kzar wrote: ...

Dec. 1, 2015, 8:32 a.m. (2015-12-01 08:32:56 UTC) #8

Felix Dahlke

Might as well have a look at the whole thing. Looks good, just some minor ...

Dec. 1, 2015, 8:43 a.m. (2015-12-01 08:43:09 UTC) #9

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode33 sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/01 08:43:09, Felix Dahlke wrote: > ...

Dec. 1, 2015, 10:18 a.m. (2015-12-01 10:18:40 UTC) #10

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:33: def
update_abp2blocklist():
On 2015/12/01 08:43:09, Felix Dahlke wrote:
> These functions were prefixed with an underscore for a reason, see:
> https://adblockplus.org/coding-style#python

Well, one could argue that this is less a module (to be imported somewhere
else), but more like a stand-alone script providing a CLI as high-level
interface. Therefore these methods wouldn't necessarily have to be considered
private. I probably wouldn't have prefixed them myself in this case. However,
removing the prefixes is a completely unrelated change. So I agree, that it
should be removed from this patch.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:61: ("version",
datetime.utcnow().strftime("%Y%m%d%H%M")),
On 2015/12/01 08:43:08, Felix Dahlke wrote:
> FWIW, we're using `time.strftime("%Y%m%d%H%M", time.gmtime())` for this in the
> notification handler. But it seems to me those are equivalent.

I tend to agree, creating a datetime object is unnecessary here. Moreover, I
always have to wrap my head around how datetime handles timezones. I wouldn't
insist though.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url",
"Version"]})
On 2015/12/01 08:43:08, Felix Dahlke wrote:
> Nit: Sebastian convinced me a while ago that tuples are better than lists for
> immutable lists like this.

Well, frankly, I don't think that it matters in the case here. However, when you
assign a sequence to a variable, but changing it isn't intended or supported by
your code, using an immutable type is more appropriate and prevents you from
certain bugs, later. But if the sequence isn't accessible after defining it
anyway, I don't really care.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:98:
write_block_list([easylist],
On 2015/12/01 08:43:08, Felix Dahlke wrote:
> Nit: "block list" is highly ambiguous, we often use that term for ABP style
> filter lists. "content blocker list" is the correct term, but I can see why
> you'd want to keep it short here, fair enough.

Well, after all, the program called here is also called "abp2blocklist".
However, if you want to have a more distinct term, I guess we have to invent
one, because there isn't a name for the format we are generating here (container
holding WebKit content blocker rules plus some metadata) yet.

Felix Dahlke

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode67 sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) On 2015/12/01 10:18:39, Sebastian ...

Dec. 1, 2015, 10:26 a.m. (2015-12-01 10:26:29 UTC) #11

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode67 sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url", "Version"]}) On 2015/12/01 10:26:29, Felix ...

Dec. 1, 2015, 10:38 a.m. (2015-12-01 10:38:10 UTC) #12

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode75 sitescripts/content_blocker_lists/bin/generate_lists.py:75: def pipe_in(process): Nit: This is inconsistent. You pass in ...

Dec. 1, 2015, 10:47 a.m. (2015-12-01 10:47:14 UTC) #13

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode51 sitescripts/content_blocker_lists/bin/generate_lists.py:51: return filter_list I think it would be simpler if ...

Dec. 1, 2015, 11:01 a.m. (2015-12-01 11:01:49 UTC) #14

kzar

Patch Set 5 : Addressed more feedback from Felix and Sebastian https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): ...

Dec. 1, 2015, 12:13 p.m. (2015-12-01 12:13:39 UTC) #15

Patch Set 5 : Addressed more feedback from Felix and Sebastian

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:54: with
closing(StringIO(filter_list)) as stream:
On 2015/12/01 08:32:56, Felix Dahlke wrote:
> On 2015/11/30 17:06:00, kzar wrote:
> > On 2015/11/30 15:49:19, Sebastian Noack wrote:
> > > On 2015/11/30 15:13:11, kzar wrote:
> > > > On 2015/11/30 13:55:49, Sebastian Noack wrote:
> > > > > Never mind closing a StringIO. It doesn't do anything that wouldn't
> happen
> > > > > anyway. There are no resources that needs to be released associated
with
> a
> > > > > StringIO. The close method merely exist for compatibility.
> > > > > 
> > > > > Wait.., why do you use a StringIO in the first place? How about simply
> > using
> > > a
> > > > > multi-line regexp?
> > > > > 
> > > > > But wait again.., why do you parse the filter list headers in the
first
> > > place?
> > > > > We don't need any of these data in the generated block list.
> > > > 
> > > > We need the Version field, but otherwise Done.
> > > 
> > > Well, you set the version field based on the current timestamp. So there
is
> no
> > > need to parse the source filter lists. Or do I miss something?
> > 
> > That's the version for the block list, in the sources section we keep a
record
> > of the version of each filter list.
> 
> You still won't have to parse the header of the files, you can read the
version
> from the "ABP-Notification-Version" HTTP header in the response.

I don't see that header present? 

curl -I https://easylist-downloads.adblockplus.org/easylist_noadult.txt

HTTP/1.1 200 OK
Server: nginx
Date: Tue, 01 Dec 2015 11:24:23 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 1532799
Last-Modified: Tue, 01 Dec 2015 11:20:46 GMT
Connection: close
ETag: "565d828e-17637f"
Strict-Transport-Security: max-age=31536000
Accept-Ranges: bytes

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:33: def
update_abp2blocklist():
On 2015/12/01 10:18:39, Sebastian Noack wrote:
> On 2015/12/01 08:43:09, Felix Dahlke wrote:
> > These functions were prefixed with an underscore for a reason, see:
> > https://adblockplus.org/coding-style#python
> 
> Well, one could argue that this is less a module (to be imported somewhere
> else), but more like a stand-alone script providing a CLI as high-level
> interface. Therefore these methods wouldn't necessarily have to be considered
> private. I probably wouldn't have prefixed them myself in this case. However,
> removing the prefixes is a completely unrelated change. So I agree, that it
> should be removed from this patch.

I would prefer to leave them off, they don't really add anything in this context
and they are kind of ugly. (It's true that it's an unrelated change, but as I'm
changing / creating nearly all the functional definitions in this patch set I
thought it made sense to change this one too.)

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:51: return filter_list
On 2015/12/01 11:01:48, Sebastian Noack wrote:
> I think it would be simpler if you just return a tuple like (body, url,
version)
> here, unpacking it while iterating over the filter lists. In particular as
these
> really seem to be the only data we need.

Done.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:61: ("version",
datetime.utcnow().strftime("%Y%m%d%H%M")),
On 2015/12/01 10:18:39, Sebastian Noack wrote:
> On 2015/12/01 08:43:08, Felix Dahlke wrote:
> > FWIW, we're using `time.strftime("%Y%m%d%H%M", time.gmtime())` for this in
the
> > notification handler. But it seems to me those are equivalent.
> 
> I tend to agree, creating a datetime object is unnecessary here. Moreover, I
> always have to wrap my head around how datetime handles timezones. I wouldn't
> insist though.

Done.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:67: for k in ["url",
"Version"]})
On 2015/12/01 10:38:09, Sebastian Noack wrote:
> On 2015/12/01 10:26:29, Felix Dahlke wrote:
> > On 2015/12/01 10:18:39, Sebastian Noack wrote:
> > > On 2015/12/01 08:43:08, Felix Dahlke wrote:
> > > > Nit: Sebastian convinced me a while ago that tuples are better than
lists
> > for
> > > > immutable lists like this.
> > > 
> > > Well, frankly, I don't think that it matters in the case here. However,
when
> > you
> > > assign a sequence to a variable, but changing it isn't intended or
supported
> > by
> > > your code, using an immutable type is more appropriate and prevents you
from
> > > certain bugs, later. But if the sequence isn't accessible after defining
it
> > > anyway, I don't really care.
> > 
> > Wouldn't insist either.
> 
> For reference, I found a quite interesting answer on SO, on the difference
> between tuples and lists, from a more semantic point of view:
>
https://stackoverflow.com/questions/626759/whats-the-difference-between-list-...
> 
> I didn't consider that aspect yet, but I tend to agree. So in our scenario
here,
> as structure doesn't matter (i.e. adding/removing/changing an item in the
> sequence of fields here wouldn't invalidate the whole data structure), a list
> would actually be more appropriate than a tuple, in their conclusion.

Acknowledged.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:75: def
pipe_in(process):
On 2015/12/01 10:47:14, Sebastian Noack wrote:
> Nit: This is inconsistent. You pass in the process but access the filter_lists
> via the closure. If you keep it as an inner function I would access all
> variables via closure. However, I'd tend to move that function to the
top-level.

Done.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:83:
thread.start_new_thread(pipe_in, (process,))
On 2015/12/01 10:47:14, Sebastian Noack wrote:
> Please use the high-level threading module instead. Using the thread module
> directly can be considered deprecated as it got prefixed in Python 3.

Done.

Felix Dahlke

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode54 sitescripts/content_blocker_lists/bin/generate_lists.py:54: with closing(StringIO(filter_list)) as stream: On 2015/12/01 12:13:38, kzar wrote: ...

Dec. 1, 2015, 2:04 p.m. (2015-12-01 14:04:20 UTC) #16

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331149/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:54: with
closing(StringIO(filter_list)) as stream:
On 2015/12/01 12:13:38, kzar wrote:
> On 2015/12/01 08:32:56, Felix Dahlke wrote:
> > On 2015/11/30 17:06:00, kzar wrote:
> > > On 2015/11/30 15:49:19, Sebastian Noack wrote:
> > > > On 2015/11/30 15:13:11, kzar wrote:
> > > > > On 2015/11/30 13:55:49, Sebastian Noack wrote:
> > > > > > Never mind closing a StringIO. It doesn't do anything that wouldn't
> > happen
> > > > > > anyway. There are no resources that needs to be released associated
> with
> > a
> > > > > > StringIO. The close method merely exist for compatibility.
> > > > > > 
> > > > > > Wait.., why do you use a StringIO in the first place? How about
simply
> > > using
> > > > a
> > > > > > multi-line regexp?
> > > > > > 
> > > > > > But wait again.., why do you parse the filter list headers in the
> first
> > > > place?
> > > > > > We don't need any of these data in the generated block list.
> > > > > 
> > > > > We need the Version field, but otherwise Done.
> > > > 
> > > > Well, you set the version field based on the current timestamp. So there
> is
> > no
> > > > need to parse the source filter lists. Or do I miss something?
> > > 
> > > That's the version for the block list, in the sources section we keep a
> record
> > > of the version of each filter list.
> > 
> > You still won't have to parse the header of the files, you can read the
> version
> > from the "ABP-Notification-Version" HTTP header in the response.
> 
> I don't see that header present? 
> 
> curl -I https://easylist-downloads.adblockplus.org/easylist_noadult.txt
> 
> HTTP/1.1 200 OK
> Server: nginx
> Date: Tue, 01 Dec 2015 11:24:23 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: 1532799
> Last-Modified: Tue, 01 Dec 2015 11:20:46 GMT
> Connection: close
> ETag: "565d828e-17637f"
> Strict-Transport-Security: max-age=31536000
> Accept-Ranges: bytes

Ouch, big mixup on my end, we only have this for notifications at this point...
Yes, I'm afraid there's no other way than to parse the file header then.

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:33: def
update_abp2blocklist():
On 2015/12/01 12:13:38, kzar wrote:
> On 2015/12/01 10:18:39, Sebastian Noack wrote:
> > On 2015/12/01 08:43:09, Felix Dahlke wrote:
> > > These functions were prefixed with an underscore for a reason, see:
> > > https://adblockplus.org/coding-style#python
> > 
> > Well, one could argue that this is less a module (to be imported somewhere
> > else), but more like a stand-alone script providing a CLI as high-level
> > interface. Therefore these methods wouldn't necessarily have to be
considered
> > private. I probably wouldn't have prefixed them myself in this case.
However,
> > removing the prefixes is a completely unrelated change. So I agree, that it
> > should be removed from this patch.
> 
> I would prefer to leave them off, they don't really add anything in this
context
> and they are kind of ugly. (It's true that it's an unrelated change, but as
I'm
> changing / creating nearly all the functional definitions in this patch set I
> thought it made sense to change this one too.)

I would argue that this is still a module - this script can only be run with
`python -m`. It isn't a loadable library, that's true, but neither are our web
modules. Also, It's not unheard of to move code between modules under bin and
reusable modules. I really think we should be consistent here, but it's up to
Sebastian. Let's however please clarify this in the coding style, it's obviously
ambiguous right now.

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode33 sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/01 14:04:19, Felix Dahlke wrote: > ...

Dec. 7, 2015, 12:38 p.m. (2015-12-07 12:38:25 UTC) #17

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:33: def
update_abp2blocklist():
On 2015/12/01 14:04:19, Felix Dahlke wrote:
> On 2015/12/01 12:13:38, kzar wrote:
> > On 2015/12/01 10:18:39, Sebastian Noack wrote:
> > > On 2015/12/01 08:43:09, Felix Dahlke wrote:
> > > > These functions were prefixed with an underscore for a reason, see:
> > > > https://adblockplus.org/coding-style#python
> > > 
> > > Well, one could argue that this is less a module (to be imported somewhere
> > > else), but more like a stand-alone script providing a CLI as high-level
> > > interface. Therefore these methods wouldn't necessarily have to be
> considered
> > > private. I probably wouldn't have prefixed them myself in this case.
> However,
> > > removing the prefixes is a completely unrelated change. So I agree, that
it
> > > should be removed from this patch.
> > 
> > I would prefer to leave them off, they don't really add anything in this
> context
> > and they are kind of ugly. (It's true that it's an unrelated change, but as
> I'm
> > changing / creating nearly all the functional definitions in this patch set
I
> > thought it made sense to change this one too.)
> 
> I would argue that this is still a module - this script can only be run with
> `python -m`. It isn't a loadable library, that's true, but neither are our web
> modules. Also, It's not unheard of to move code between modules under bin and
> reusable modules. I really think we should be consistent here, but it's up to
> Sebastian. Let's however please clarify this in the coding style, it's
obviously
> ambiguous right now.

Well, technically every piece of code is part of a module in Python. But the
point is that all code that sees these functions is allowed/supposed to call
them. Note that since the actual interface here is the CLI, the only code this
applies to, is this script itself. So there is no context in which these
function would/should be considered private. But if for some reason somebody
imports this module rather than running it as a script, which they shouldn't do
however, they would expect a more low-level API than provided by the CLI.

As for consistency, it seems that it's only code you wrote that overuses
prefixed/private functions in scenarios like this. For example have a look at
ensure_dependencies.py or buildtools in general.

https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:47: field_re =
re.compile(r"^!\s*([^:\s]+):\s*(.+)$", re.MULTILINE)
This logic can be simplified, in particular since we are only looking for the
version:

for match in re.finditer(r"^(?:([^![])|!\s*Version:\s*(.+)$)", filter_list,
re.MULTILINE):
  end_of_headers, version = match.groups()
  if end_of_headers:
    break
  if version:
    return version

https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:64:
metadata["sources"].append({ "url": url, "version": version })
See https://www.python.org/dev/peps/pep-0008/#pet-peeves

https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:82: if
process.returncode:
Note that returncode is set by the wait() method. So we should rather call it
here rather than in the thread to avoid race conditions.

Felix Dahlke

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode33 sitescripts/content_blocker_lists/bin/generate_lists.py:33: def update_abp2blocklist(): On 2015/12/07 12:38:24, Sebastian Noack wrote: > ...

Dec. 8, 2015, 6:51 a.m. (2015-12-08 06:51:24 UTC) #18

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
File sitescripts/content_blocker_lists/bin/generate_lists.py (right):

https://codereview.adblockplus.org/29331148/diff/29331636/sitescripts/content...
sitescripts/content_blocker_lists/bin/generate_lists.py:33: def
update_abp2blocklist():
On 2015/12/07 12:38:24, Sebastian Noack wrote:
> On 2015/12/01 14:04:19, Felix Dahlke wrote:
> > On 2015/12/01 12:13:38, kzar wrote:
> > > On 2015/12/01 10:18:39, Sebastian Noack wrote:
> > > > On 2015/12/01 08:43:09, Felix Dahlke wrote:
> > > > > These functions were prefixed with an underscore for a reason, see:
> > > > > https://adblockplus.org/coding-style#python
> > > > 
> > > > Well, one could argue that this is less a module (to be imported
somewhere
> > > > else), but more like a stand-alone script providing a CLI as high-level
> > > > interface. Therefore these methods wouldn't necessarily have to be
> > considered
> > > > private. I probably wouldn't have prefixed them myself in this case.
> > However,
> > > > removing the prefixes is a completely unrelated change. So I agree, that
> it
> > > > should be removed from this patch.
> > > 
> > > I would prefer to leave them off, they don't really add anything in this
> > context
> > > and they are kind of ugly. (It's true that it's an unrelated change, but
as
> > I'm
> > > changing / creating nearly all the functional definitions in this patch
set
> I
> > > thought it made sense to change this one too.)
> > 
> > I would argue that this is still a module - this script can only be run with
> > `python -m`. It isn't a loadable library, that's true, but neither are our
web
> > modules. Also, It's not unheard of to move code between modules under bin
and
> > reusable modules. I really think we should be consistent here, but it's up
to
> > Sebastian. Let's however please clarify this in the coding style, it's
> obviously
> > ambiguous right now.
> 
> Well, technically every piece of code is part of a module in Python. But the
> point is that all code that sees these functions is allowed/supposed to call
> them. Note that since the actual interface here is the CLI, the only code this
> applies to, is this script itself. So there is no context in which these
> function would/should be considered private. But if for some reason somebody
> imports this module rather than running it as a script, which they shouldn't
do
> however, they would expect a more low-level API than provided by the CLI.
> 
> As for consistency, it seems that it's only code you wrote that overuses
> prefixed/private functions in scenarios like this. For example have a look at
> ensure_dependencies.py or buildtools in general. 
> 

Wladimir's newer code in Sitescripts also uses those prefixes for stuff under
bin. Anyway, your reasoning makes sense, so I'm fine with never using those
private prefixes for executable scripts. But please do clarify it in the coding
style.

kzar

Patch Set 6 : Addressed some of Sebastian's feedback https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode47 sitescripts/content_blocker_lists/bin/generate_lists.py:47: ...

Dec. 8, 2015, 12:52 p.m. (2015-12-08 12:52:21 UTC) #19

Felix Dahlke

https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode47 sitescripts/content_blocker_lists/bin/generate_lists.py:47: field_re = re.compile(r"^!\s*([^:\s]+):\s*(.+)$", re.MULTILINE) On 2015/12/08 12:52:21, kzar wrote: ...

Dec. 8, 2015, 1:39 p.m. (2015-12-08 13:39:51 UTC) #20

kzar

Patch Set 7 : Just grab version number instead of parsing header https://codereview.adblockplus.org/29331148/diff/29331684/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py ...

Dec. 8, 2015, 2:34 p.m. (2015-12-08 14:34:24 UTC) #21

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode48 sitescripts/content_blocker_lists/bin/generate_lists.py:48: version = re.search(r"^(?:[^[!])|!\s*Version:\s*(.+)$", Indeed, we can simply use re.search ...

Dec. 8, 2015, 3:08 p.m. (2015-12-08 15:08:39 UTC) #22

kzar

Patch Set 8 : Improved regexp https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode48 sitescripts/content_blocker_lists/bin/generate_lists.py:48: version = re.search(r"^(?:[^[!])|!\s*Version:\s*(.+)$", ...

Dec. 8, 2015, 3:37 p.m. (2015-12-08 15:37:44 UTC) #23

Sebastian Noack

https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content_blocker_lists/bin/generate_lists.py File sitescripts/content_blocker_lists/bin/generate_lists.py (right): https://codereview.adblockplus.org/29331148/diff/29332446/sitescripts/content_blocker_lists/bin/generate_lists.py#newcode48 sitescripts/content_blocker_lists/bin/generate_lists.py:48: version = re.search(r"^(?:[^[!])|!\s*Version:\s*(.+)$", On 2015/12/08 15:37:43, kzar wrote: > ...

Dec. 8, 2015, 3:45 p.m. (2015-12-08 15:45:52 UTC) #24

LGTM

Issue 29331148: Issue 3176 - Add metadata to content blocker lists (Closed)

Description

Patch Set 1 #

Patch Set 2 : Addressed feedback #

Patch Set 3 : Remove unused StringIO import #

Patch Set 4 : Addressed further feedback #

Patch Set 5 : Addressed more feedback from Felix and Sebastian #

Patch Set 6 : Addressed some of Sebastian's feedback #

Patch Set 7 : Just grab version number instead of parsing header #

Patch Set 8 : Improved regexp #

Messages