abp/filters/renderer.py - Issue 29845767: Issue 6685 - Offer incremental filter list downloads

Delta Between Two Patch Sets: abp/filters/renderer.py

Issue 29845767: Issue 6685 - Offer incremental filter list downloads (Closed) Base URL: https://hg.adblockplus.org/python-abp/

Left Patch Set: Use iterables instead of str, stop repeating code Created Aug. 20, 2018, 6:18 p.m.

Right Patch Set: Address comments on PS8 Created Aug. 30, 2018, 5:37 p.m.

LEFT	RIGHT
1 # This file is part of Adblock Plus <https://adblockplus.org/>,	1 # This file is part of Adblock Plus <https://adblockplus.org/>,

2 # Copyright (C) 2006-present eyeo GmbH	2 # Copyright (C) 2006-present eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

(...skipping 106 matching lines...)
117 """Insert metadata comment with version (a.k.a. date)."""	117 """Insert metadata comment with version (a.k.a. date)."""

118 first_line, rest = _first_and_rest(lines)	118 first_line, rest = _first_and_rest(lines)

119 version = Metadata('Version', time.strftime('%Y%m%d%H%M', time.gmtime()))	119 version = Metadata('Version', time.strftime('%Y%m%d%H%M', time.gmtime()))

120 return itertools.chain([first_line, version], rest)	120 return itertools.chain([first_line, version], rest)

121	121

122	122

123 def _remove_duplicates(lines):	123 def _remove_duplicates(lines):

124 """Remove duplicate metadata and headers."""	124 """Remove duplicate metadata and headers."""

125 # Always remove checksum -- a checksum coming from a fragment	125 # Always remove checksum -- a checksum coming from a fragment

126 # will not match for the rendered list.	126 # will not match for the rendered list.

127 seen = {'Checksum'}	127 seen = {'checksum'}

128 for i, line in enumerate(lines):	128 for i, line in enumerate(lines):

129 if line.type == 'metadata':	129 if line.type == 'metadata':

130 if line.key not in seen:	130 key = line.key.lower()

131 seen.add(line.key)	131 if key not in seen:

	132 seen.add(key)

132 yield line	133 yield line

133 elif line.type == 'header':	134 elif line.type == 'header':

134 if i == 0:	135 if i == 0:

135 yield line	136 yield line

136 else:	137 else:

137 yield line	138 yield line
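
The case-insensitive key handling on the right-hand side can be illustrated in isolation. This is a minimal sketch, not the module's code: `Meta` is a hypothetical stand-in for the parser's metadata line type, and header handling is left out.

```python
from collections import namedtuple

# Hypothetical stand-in for the parser's metadata line objects.
Meta = namedtuple('Meta', 'type key value')


def remove_duplicate_metadata(lines):
    """Yield lines, keeping only the first metadata line per key.

    Keys are compared in lower case, so 'Title' and 'title' count as
    duplicates. 'checksum' is pre-seeded so checksums are always dropped.
    """
    seen = {'checksum'}
    for line in lines:
        if line.type == 'metadata':
            key = line.key.lower()
            if key in seen:
                continue
            seen.add(key)
        yield line


lines = [
    Meta('metadata', 'Title', 'A'),
    Meta('metadata', 'title', 'B'),
    Meta('metadata', 'Checksum', 'abc'),
]
kept = list(remove_duplicate_metadata(lines))
```

With the left-hand version (`seen = {'Checksum'}` and no lower-casing), the second `title` line would have been kept because its key differs only in case.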

138	139

139	140

140 def _validate(lines):	141 def _validate(lines):

141 """Validate the final list."""	142 """Validate the final list."""

(...skipping 34 matching lines...)
176 _logger.info('Rendering: %s', name)	177 _logger.info('Rendering: %s', name)

177 lines, default_source = _get_and_parse_fragment(name, sources, top_source)	178 lines, default_source = _get_and_parse_fragment(name, sources, top_source)

178 lines = _process_includes(sources, default_source, [name], lines)	179 lines = _process_includes(sources, default_source, [name], lines)

179 for proc in [_process_timestamps, _insert_version, _remove_duplicates,	180 for proc in [_process_timestamps, _insert_version, _remove_duplicates,

180 _validate]:	181 _validate]:

181 lines = proc(lines)	182 lines = proc(lines)

182 return lines	183 return lines

183	184

184	185

185 def _split_list_for_diff(list_in):	186 def _split_list_for_diff(list_in):

186 filterlist, metadata, keys = (set() for i in range(3))	187 """Split a filter list into metadata, keys, and rules."""
[Left]
Vasily Kuznetsov 2018/08/21 14:59:59
I'm not sure if maybe a slightly more readable version of this line would be:

    filterlist, metadata, keys = set(), set(), set()

or even:

    filterlist = set()
    metadata = set()
    keys = set()

I don't feel too strongly about it so feel free to keep your original version or choose one of the proposals above as you like.

rhowell 2018/08/27 22:06:27
Done.

[Right]
Vasily Kuznetsov 2018/08/31 12:11:02
Nit: strictly speaking this function now returns metadata and rules (the keys are just part of metadata). The docstring could be updated.

rhowell 2018/08/31 21:21:02
Ah, good point! I'll update that.
	188 metadata = {}

	189 rules = set()

187 for line in parse_filterlist(list_in):	190 for line in parse_filterlist(list_in):
[Left]
Sebastian Noack 2018/08/21 19:42:46
When parsing generated filter lists in order to generate diffs, we might want to make the parser ignore includes/instructions.

Vasily Kuznetsov 2018/08/22 13:10:50
Theoretically there should be no includes/instructions in the final lists (and we're not planning to produce diffs of fragments of filter lists), so this is not really necessary. And even if there were includes, the code below ignores them, which achieves the same outcome as the parsing ignoring them, so it seems that things are ok as is.

Sebastian Noack 2018/08/22 14:31:06
Granted, this is quite an edge case. But in theory if we'll find "%include" in the input the diff is generated from, then it seems somewhat more accurate to just treat it as a filter (like Adblock Plus would) rather than stripping this line (or perhaps a warning would be in order). But I don't feel too strong.

Vasily Kuznetsov 2018/08/24 13:03:19
To do this we would need to start parsing fragments of filter lists differently from rendered filter lists. It does make sense but it also makes things more complicated and the cases where it's relevant seem to be pretty unrealistic and edge-casey.

I lean towards keeping the code unified and parsing "%include"s as instructions everywhere until we have evidence that there are cases where it would be useful to make the distinction.

rhowell 2018/08/27 22:06:27
Yeah, I was working under the assumption that the full lists wouldn't have %includes in them, and `%include` appears to be the only instruction the parser knows. Are there any other instructions that might appear, that we might need to handle? If not, I would agree with Vasily, that it's not necessary to handle this case.
188 if line.type == 'metadata' and 'Checksum' not in line.to_string():	191 if line.type == 'metadata':

189 metadata.add(line.to_string())	192 metadata[line.key.lower()] = line

190 keys.add(line.key)

191 elif line.type == 'filter':	193 elif line.type == 'filter':

192 filterlist.add(line.to_string())	194 rules.add(line.to_string())

193 return filterlist, metadata, keys	195 return metadata, rules
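
The shape of the right-hand return value (a dict of metadata keyed by lower-cased key, plus a set of rules) can be sketched without the real parser. The following is a simplified illustration only: `split_list` is a hypothetical name, and it treats any `! Key: value` comment as metadata, which the actual `parse_filterlist` may not do.

```python
def split_list(lines):
    """Split raw filter list lines into (metadata, rules).

    metadata maps a lower-cased key to (original key, value);
    rules is a set of filter strings.
    """
    metadata = {}
    rules = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith('['):
            continue  # blank line or header such as "[Adblock Plus 2.0]"
        if line.startswith('!'):
            key, sep, value = line[1:].partition(':')
            if sep and key.strip():
                # Later duplicates overwrite earlier ones, as a dict does.
                metadata[key.strip().lower()] = (key.strip(), value.strip())
            continue  # comments without "key: value" shape are ignored
        rules.add(line)
    return metadata, rules


metadata, rules = split_list([
    '[Adblock Plus 2.0]',
    '! Title: X',
    '! TITLE: Y',
    '||example.com^',
    '! just a comment',
    '||example.com^',
])
```

Keeping the original key alongside the value mirrors how the right-hand `render_diff` later emits `base_metadata[key].key` for removed metadata, so the diff shows the key in its original case.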

194	196

195	197

196 def render_diff(base, latest):	198 def render_diff(base, latest):

197 """Return a diff between two filter lists.	199 """Return a diff between two filter lists.

198	200

199 Parameters	201 Parameters

200 ----------	202 ----------

201 base : iterator of str	203 base : iterator of str

202 The base (old) list that we want to update to latest.	204 The base (old) list that we want to update to latest.

203 latest : iterator of str	205 latest : iterator of str

204 The latest (most recent) list that we want to update to.	206 The latest (most recent) list that we want to update to.

205	207

206 Returns	208 Returns

207 -------	209 -------

208 iterable of str	210 iterable of str

209 A diff between two lists (https://issues.adblockplus.org/ticket/6685)	211 A diff between two lists (https://issues.adblockplus.org/ticket/6685)

210	212

211 """	213 """

212 latest_fl, latest_md, latest_keys = _split_list_for_diff(latest)	214 latest_metadata, latest_rules = _split_list_for_diff(latest)

213 base_fl, base_md, base_keys = _split_list_for_diff(base)	215 base_metadata, base_rules = _split_list_for_diff(base)

214

215 new_md = latest_md - base_md

216 removed_keys = base_keys - latest_keys

217 add_fl = latest_fl - base_fl

218 remove_fl = base_fl - latest_fl

219	216

220 yield '[Adblock Plus Diff]'	217 yield '[Adblock Plus Diff]'

221 for item in new_md:	218 for key, latest in latest_metadata.items():

222 yield item	219 base = base_metadata.get(key)

223 for key in removed_keys:	220 if not base or base.value != latest.value:

224 # If a special comment has been removed, enter it with a blank value	221 yield latest.to_string()

225 # so the client will set it back to the default value	222 for key in set(base_metadata) - set(latest_metadata):

226 yield '! {}:'.format(key)	223 yield '! {}:'.format(base_metadata[key].key)

227 for item in add_fl:	224 for rule in base_rules - latest_rules:

228 yield '+ {}'.format(item)	225 yield '- {}'.format(rule)

229 for item in remove_fl:	226 for rule in latest_rules - base_rules:

230 yield '- {}'.format(item)	227 yield '+ {}'.format(rule)
[Left]
Sebastian Noack 2018/08/21 19:42:46
In the specification, we demand the client to process removed filters first. But given this implementation, it seems simpler if we'd rather make the diff list the removed filters first, so that the client can process the diff in one pass.

Sebastian Noack 2018/08/22 16:03:29
After discussing this on IRC, Vasily agrees. I updated the spec accordingly.

Vasily Kuznetsov 2018/08/24 13:03:19
👍

rhowell 2018/08/27 22:06:27
Done.
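
To illustrate the "one pass" point from the thread above, here is a rough sketch of how a client might consume the diff in the order it is emitted. It is an assumption-laden illustration, not the specified client behaviour: `apply_diff` is a hypothetical name, metadata is modelled as a plain dict, and a blank metadata value is taken to mean "reset to default", as the code comment in the left-hand patch set describes.

```python
def apply_diff(metadata, rules, diff_lines):
    """Apply diff lines to (metadata, rules) in a single pass.

    metadata: dict mapping lower-cased key -> value
    rules: set of filter strings
    Returns new (metadata, rules); the inputs are not modified.
    """
    metadata = dict(metadata)
    rules = set(rules)
    for line in diff_lines:
        if line.startswith('- '):
            rules.discard(line[2:])
        elif line.startswith('+ '):
            rules.add(line[2:])
        elif line.startswith('! ') and ':' in line:
            key, _, value = line[2:].partition(':')
            key, value = key.strip().lower(), value.strip()
            if value:
                metadata[key] = value
            else:
                # Blank value: the key was removed, so drop it and let
                # the client fall back to its default.
                metadata.pop(key, None)
        # Anything else (e.g. the "[Adblock Plus Diff]" header) is skipped.
    return metadata, rules


new_metadata, new_rules = apply_diff(
    {'version': '1', 'expires': '1 day'},
    {'a', 'b'},
    ['[Adblock Plus Diff]', '! Version: 2', '! Expires:', '- a', '+ c'],
)
```

Because the right-hand `render_diff` yields `- rule` lines before `+ rule` lines, a loop like this handles removals before additions simply by reading the diff top to bottom.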