|
|
Created:
Sept. 3, 2018, 6:43 p.m. by Sebastian Noack Modified:
Sept. 11, 2018, 5:23 p.m. Reviewers:
Manish Jethani Visibility:
Public. |
DescriptionIssue 6923 - Only parse metadata from special comments at the top of the file
Patch Set 1 #
Total comments: 9
Patch Set 2 : Change end-of-metadata semantics #
Total comments: 8
Patch Set 3 : Use more obscure regular expression #Patch Set 4 : Reverted regexp #
Total comments: 2
Patch Set 5 : Rebased #Patch Set 6 : Addressed comment about tests #MessagesTotal messages: 15
https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(\S.*?)\s*:\s*(.*)/.exec(lines[i]); You might wonder why I changed this regular expression. This is necessary because many filter lists contain special/metadata comments with a malformed key (most notably "Last modified"). While such a key has no effect, without having this regular expression match them, we'd ignore any metadata that might follow.
https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(\S.*?)\s*:\s*(.*)/.exec(lines[i]); On 2018/09/03 18:47:13, Sebastian Noack wrote: > You might wonder why I changed this regular expression. This is necessary > because many filter lists contain special/metadata comments with a malformed key > (most notably "Last modified"). While such a key has no effect, without having > this regular expression match them, we'd ignore any metadata that might follow. Acknowledged. https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:170: if (keyword in params) BTW I find in my tests that `params.hasOwnProperty(keyword)` is much faster than `keyword in params`. Like about 30% faster here. https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:173: lines.splice(i--, 1); Since we're considering malformed keywords as well, shouldn't we splice out those lines too? So we could start with `paramIndex = 1` and then after the loop just do `lines.splice(1, paramIndex - 1)`.
https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:170: if (keyword in params) On 2018/09/04 07:14:41, Manish Jethani wrote: > BTW I find in my tests that `params.hasOwnProperty(keyword)` is much faster than > `keyword in params`. Like about 30% faster here. IMO the in-operator reads much nicer, and this isn't a hotspot anymore (now this code is only hit a few times for each filter list parsed). Also if we want to further optimize this code, we might want to benchmark it in context, rather than relying on micro-benchmarks. It's possible that hasOwnProperty() isn't faster after all in this context, since the prototype and properties of params might be known to the JIT. Either way, this seems to be an unrelated change, since the only thing I changed in this line is the indentation. https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:173: lines.splice(i--, 1); On 2018/09/04 07:14:41, Manish Jethani wrote: > Since we're considering malformed keywords as well, shouldn't we splice out > those lines too? > > So we could start with `paramIndex = 1` and then after the loop just do > `lines.splice(1, paramIndex - 1)`. Historically, if it's not recognized as supported metadata, we treat it as CommentFilter. Moreover, splice() is expensive, so if we want to optimize here, we might rather not remove any items at all here.
LGTM https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:170: if (keyword in params) On 2018/09/04 14:22:06, Sebastian Noack wrote: > On 2018/09/04 07:14:41, Manish Jethani wrote: > > BTW I find in my tests that `params.hasOwnProperty(keyword)` is much faster > than > > `keyword in params`. Like about 30% faster here. > > IMO the in-operator reads much nicer, and this isn't a hotspot anymore (now this > code is only hit a few times for each filter list parsed). Also if we want to > further optimize this code, we might want to benchmark it in context, rather > than relying on micro-benchmarks. It's possible that hasOwnProperty() isn't > faster after all in this context, since the prototype and properties of params > might be known to the JIT. Either way, this seems to be an unrelated change, > since the only thing I changed in this line is the indentation. Acknowledged. (Aside from performance, it's also about correctness. `'toString' in params` would also be `true`, which is wrong here. Also I'm pretty sure `hasOwnProperty` performs better when the keys being looked up are not statically embedded in the source code, which is the case here. But I agree this would be a separate issue, unrelated to what we're doing here.) https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:173: lines.splice(i--, 1); On 2018/09/04 14:22:06, Sebastian Noack wrote: > On 2018/09/04 07:14:41, Manish Jethani wrote: > > Since we're considering malformed keywords as well, shouldn't we splice out > > those lines too? > > > > So we could start with `paramIndex = 1` and then after the loop just do > > `lines.splice(1, paramIndex - 1)`. > > Historically, if it's not recognized as supported metadata, we treat it as > CommentFilter. Moreover, splice() is expensive, so if we want to optimize here, > we might rather not remove any items at all here. OK, fair enough. I'd like to consider not using `splice` here but let's make that a separate change.
https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29873571/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(\S.*?)\s*:\s*(.*)/.exec(lines[i]); On 2018/09/04 07:14:41, Manish Jethani wrote: > On 2018/09/03 18:47:13, Sebastian Noack wrote: > > You might wonder why I changed this regular expression. This is necessary > > because many filter lists contain special/metadata comments with a malformed > key > > (most notably "Last modified"). While such a key has no effect, without having > > this regular expression match them, we'd ignore any metadata that might > follow. > > Acknowledged. You have a point there. But in this regard, IMO we should rather use a Map object here. But I leave that up to another change (by someone else).
Vasily and I agreed to simplify the the condition after which metadata aren't parsed anymore. Now, we stop looking for metadata after a lines that isn't a comment or an empty comment, rather than accounting specifically for metadata with malformed keys. What do you think?
On 2018/09/04 20:17:52, Sebastian Noack wrote: > Vasily and I agreed to simplify the the condition after which metadata aren't > parsed anymore. Now, we stop looking for metadata after a lines that isn't a > comment or an empty comment, rather than accounting specifically for metadata > with malformed keys. What do you think? I wonder why the exception for empty comments. Isn't it just easier to think about (and program for) if we just do it for the first set of comments and stop at the first non-comment?
https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(?:(\w+)\s*:\s*(.*)|\S)/.exec(lines[i]); How about this one: `/^\s*!\s*([^\s:]*)\s*(:)?\s*(.*)/` It seems simpler to me but maybe it performs worse, I'd have to check. Anyway, so with this we can check for `match[1]` being truthy. https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:169: if (match[1] != undefined) We never compare with `undefined` in core. `undefined` is a global variable (unlike `null`, which is a language keyword) and must be looked up in each scope. It is slow. Normally we do `typeof foo != "undefined"` all over the place.
On 2018/09/05 17:39:36, Manish Jethani wrote: > On 2018/09/04 20:17:52, Sebastian Noack wrote: > > Vasily and I agreed to simplify the the condition after which metadata aren't > > parsed anymore. Now, we stop looking for metadata after a lines that isn't a > > comment or an empty comment, rather than accounting specifically for metadata > > with malformed keys. What do you think? > > I wonder why the exception for empty comments. Isn't it just easier to think > about (and program for) if we just do it for the first set of comments and stop > at the first non-comment? In practice a lot of filter lists (including EasyList) put an empty comment (not empty line) between their metadata and some actual comments. https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(?:(\w+)\s*:\s*(.*)|\S)/.exec(lines[i]); On 2018/09/05 18:10:08, Manish Jethani wrote: > How about this one: `/^\s*!\s*([^\s:]*)\s*(:)?\s*(.*)/` > > It seems simpler to me but maybe it performs worse, I'd have to check. Done. Not sure if it's any easier to read through. > Anyway, so with this we can check for `match[1]` being truthy. Nope, we have to check for match[2] instead. https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:169: if (match[1] != undefined) On 2018/09/05 18:10:08, Manish Jethani wrote: > We never compare with `undefined` in core. `undefined` is a global variable > (unlike `null`, which is a language keyword) and must be looked up in each > scope. It is slow. Normally we do `typeof foo != "undefined"` all over the > place. Actually, I cannot redefine undefined, neither through assignment nor through Object.defineProperty(). But still it's different from null (trying to assign to it is a syntax error), while undefined seems to be a read-only property on the global object. Technically both could be optimized similarly, but not idea if they are. Just in case, also note that null == undefined, which might come handy, if this should be a concern, and you don't want to treat actual null (if the value can be) differently.
https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(?:(\w+)\s*:\s*(.*)|\S)/.exec(lines[i]); On 2018/09/06 00:56:03, Sebastian Noack wrote: > On 2018/09/05 18:10:08, Manish Jethani wrote: > > How about this one: `/^\s*!\s*([^\s:]*)\s*(:)?\s*(.*)/` > > > > It seems simpler to me but maybe it performs worse, I'd have to check. > > Done. Not sure if it's any easier to read through. > > > Anyway, so with this we can check for `match[1]` being truthy. > > Nope, we have to check for match[2] instead. I stand corrected, I meant `match[2]` A couple of things to note with this regular expression: 1. Empty keys are valid, but of course they would get ignored in the code that follows 2. Keys can now contain non-ASCII characters; if we stick with this, perhaps we should document it; values could already contain non-ASCII characters, there should be no reason to disallow them from the keys 3. This will continue even after empty comments; if you disagree, we need to update it (I think it's fine to continue parsing after empty comments but you're aware of the full context of this issue) https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:169: if (match[1] != undefined) On 2018/09/06 00:56:03, Sebastian Noack wrote: > On 2018/09/05 18:10:08, Manish Jethani wrote: > > We never compare with `undefined` in core. `undefined` is a global variable > > (unlike `null`, which is a language keyword) and must be looked up in each > > scope. It is slow. Normally we do `typeof foo != "undefined"` all over the > > place. > > Actually, I cannot redefine undefined, neither through assignment nor through > Object.defineProperty(). But still it's different from null (trying to assign to > it is a syntax error), while undefined seems to be a read-only property on the > global object. Technically both could be optimized similarly, but not idea if > they are. You can just shadow `undefined` in the local scope, which you cannot do with `null`. > Just in case, also note that null == undefined, which might come handy, if this > should be a concern, and you don't want to treat actual null (if the value can > be) differently. Acknowledged.
https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(?:(\w+)\s*:\s*(.*)|\S)/.exec(lines[i]); On 2018/09/06 05:53:53, Manish Jethani wrote: > 2. Keys can now contain non-ASCII characters; if we stick with this, perhaps > we should document it; values could already contain non-ASCII characters, there > should be no reason to disallow them from the keys This is just a stop gap, what is eventually considered a valid key is defined by the contents of the params object. So whether the regexp allows non-ascii or not doesn't matter, as long as we keep looking for more metadata after an invalid key (regardless of whether it contained non-ascii or whitespaces, or simply isn't given in the params object). > 3. This will continue even after empty comments; if you disagree, we need to > update it (I think it's fine to continue parsing after empty comments but you're > aware of the full context of this issue) Opps, I missed that. Went back to my regular expression now. As I said, look at EasyList for example, the metadata follow some non-metadata comments separated by an empty comment. That is why I want to stop looking for more metadata after an empty comment.
LGTM, except the one comment about the tests https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29873570/diff/29874655/lib/synchronizer.js... lib/synchronizer.js:165: let match = /^\s*!\s*(?:(\w+)\s*:\s*(.*)|\S)/.exec(lines[i]); On 2018/09/06 14:31:36, Sebastian Noack wrote: > On 2018/09/06 05:53:53, Manish Jethani wrote: > > 2. Keys can now contain non-ASCII characters; if we stick with this, perhaps > > we should document it; values could already contain non-ASCII characters, > there > > should be no reason to disallow them from the keys > > This is just a stop gap, what is eventually considered a valid key is defined by > the contents of the params object. So whether the regexp allows non-ascii or not > doesn't matter, as long as we keep looking for more metadata after an invalid > key (regardless of whether it contained non-ascii or whitespaces, or simply > isn't given in the params object). > > > 3. This will continue even after empty comments; if you disagree, we need to > > update it (I think it's fine to continue parsing after empty comments but > you're > > aware of the full context of this issue) > > Opps, I missed that. Went back to my regular expression now. > > As I said, look at EasyList for example, the metadata follow some non-metadata > comments separated by an empty comment. That is why I want to stop looking for > more metadata after an empty comment. Acknowledged. https://codereview.adblockplus.org/29873570/diff/29876565/test/synchronizer.js File test/synchronizer.js (left): https://codereview.adblockplus.org/29873570/diff/29876565/test/synchronizer.j... test/synchronizer.js:262: return [Cr.NS_OK, 200, "[Adblock]\nfoo\n!Expires: " + currentTest.expiration + "\nbar"]; We should probably just add a "!" in front of foo here (and all the other places with this change)
https://codereview.adblockplus.org/29873570/diff/29876565/test/synchronizer.js File test/synchronizer.js (left): https://codereview.adblockplus.org/29873570/diff/29876565/test/synchronizer.j... test/synchronizer.js:262: return [Cr.NS_OK, 200, "[Adblock]\nfoo\n!Expires: " + currentTest.expiration + "\nbar"]; On 2018/09/10 07:24:49, Manish Jethani wrote: > We should probably just add a "!" in front of foo here (and all the other places > with this change) Done.
LGTM
On 2018/09/10 18:52:11, Manish Jethani wrote: > LGTM You can close this now. |