|
|
Created:
Jan. 26, 2018, 5:32 p.m. by kzar Modified:
March 7, 2018, 5:59 a.m. CC:
sergei Visibility:
Public. |
DescriptionIssue 6329 - Allow whitespace in filter option values
This change is necessary for the addition of the $csp filter option since whitespace in Content Security Policies is significant. I've skipped (most of) the normalisation step, instead doing it at the point the filter is parsed. My key insight there was that we always seem to call Filter.fromText directly afterward the normalise function anyway. This makes the code a little harder to follow perhaps, but has the benefit that we don't wastefully parse element hiding filters during normalisation, only to repeat much of that work again in Filters.fromText.
Note that normalisation is not identical as before for RegExpFilters, previously something like "example $ domain=foo.com" might have worked, but no longer.
Patch Set 1 #
Total comments: 22
Patch Set 2 : Addressed some of Manish's initial feedback #
MessagesTotal messages: 9
Patch Set 1
https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:104: * basic normalisation and parsing where possible before deferring to the right I noticed that the documentation is now using the British spelling of "normalization". I don't have a preference either way other than maybe for consistency it should use the American spelling. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); This means that two otherwise identical filters (post normalization) could be treated as different filters after this change. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:711: let origText; Why not initialize origText to null for consistency? https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:721: let options = match[1].replace(/\s/g, ""); I don't get this part. Doesn't this mean that "$csp=script-src 'none'" will become "$csp=script-src'none'"? https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:731: option = option.substring(0, separatorIndex); Any particular reason this is now using String.substring? https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:779: return new InvalidFilter(origText, "filter_invalid_regexp"); origText is no longer the original text so it has basically lost its meaning. On the other hand it's not very useful either. Maybe we can get rid of the extra variable? https://codereview.adblockplus.org/29680684/diff/29680685/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/synchronizer.js... lib/synchronizer.js:283: line = Filter.stripJunk(line); Since most of the code has moved out anyway, why not just move the rest of it into Filter.fromText and get rid of Filter.stripJunk? On the other hand, it seems like there might have been a good reason for separating out the normalization from the rest of the parsing, since Filter.normalize is called from only two places whereas Filter.fromText is called from multiple places. Anyway I'm in favor of simplifying this further and just keep everything in Filter.fromText unless it's a performance concern.
https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:721: let options = match[1].replace(/\s/g, ""); On 2018/02/05 16:09:58, Manish Jethani wrote: > I don't get this part. Doesn't this mean that "$csp=script-src 'none'" will > become "$csp=script-src'none'"? I see, so the other patch builds on this. Got it.
Patch Set 2 : Addressed some of Manish's initial feedback https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:104: * basic normalisation and parsing where possible before deferring to the right On 2018/02/05 16:09:57, Manish Jethani wrote: > I noticed that the documentation is now using the British spelling of > "normalization". I don't have a preference either way other than maybe for > consistency it should use the American spelling. Kind of ironic heh, Done. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/02/05 16:09:57, Manish Jethani wrote: > This means that two otherwise identical filters (post normalization) could be > treated as different filters after this change. Any ideas? This whole change was a pain, perhaps there is a better approach which I missed. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:711: let origText; On 2018/02/05 16:09:58, Manish Jethani wrote: > Why not initialize origText to null for consistency? Done. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:721: let options = match[1].replace(/\s/g, ""); On 2018/02/06 05:14:39, Manish Jethani wrote: > On 2018/02/05 16:09:58, Manish Jethani wrote: > > I don't get this part. Doesn't this mean that "$csp=script-src 'none'" will > > become "$csp=script-src'none'"? > > I see, so the other patch builds on this. Got it. Sorry I should have left a comment on the review here to explain, but yep you got it. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:731: option = option.substring(0, separatorIndex); On 2018/02/05 16:09:58, Manish Jethani wrote: > Any particular reason this is now using String.substring? Well since I noticed that `substr` was being used mistakenly all this time, but happened to work since the offset was always 0. Quite misleading. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:779: return new InvalidFilter(origText, "filter_invalid_regexp"); On 2018/02/05 16:09:58, Manish Jethani wrote: > origText is no longer the original text so it has basically lost its meaning. On > the other hand it's not very useful either. Maybe we can get rid of the extra > variable? The point of `origText` is to show a more useful warning to the user when they have added an invalid filter. It's still useful for that purpose, though I agree the name sucks. I'm open to suggestions there. https://codereview.adblockplus.org/29680684/diff/29680685/lib/synchronizer.js File lib/synchronizer.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/synchronizer.js... lib/synchronizer.js:283: line = Filter.stripJunk(line); On 2018/02/05 16:09:58, Manish Jethani wrote: > Since most of the code has moved out anyway, why not just move the rest of it > into Filter.fromText and get rid of Filter.stripJunk? > > On the other hand, it seems like there might have been a good reason for > separating out the normalization from the rest of the parsing, since > Filter.normalize is called from only two places whereas Filter.fromText is > called from multiple places. > > Anyway I'm in favor of simplifying this further and just keep everything in > Filter.fromText unless it's a performance concern. Yea, perhaps we should do that if we continue with this approach. I wonder if there's a better way, I quite like normalisation happening in separate step, but it's a pain to allow the whitespace we need that way.
Sorry about the delay. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/02/13 11:56:46, kzar wrote: > On 2018/02/05 16:09:57, Manish Jethani wrote: > > This means that two otherwise identical filters (post normalization) could be > > treated as different filters after this change. > > Any ideas? This whole change was a pain, perhaps there is a better approach > which I missed. How about leaving the functions as they are, but changing the implementations a little bit? For example, Filter.normalize could mostly work as it does, but we could add an additional condition for "$csp=". Pseudocode: Filter.normalize = function(text) { ... else if (textContainsCSPOption) { stripAllSpacesExceptSpacesInCSPOption; } return text.replace(/\s/g, ""); } So if there's a "$csp=" option in the text then it'll preserve the spaces within that option's value, otherwise it will strip all spaces. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:731: option = option.substring(0, separatorIndex); On 2018/02/13 11:56:47, kzar wrote: > On 2018/02/05 16:09:58, Manish Jethani wrote: > > Any particular reason this is now using String.substring? > > Well since I noticed that `substr` was being used mistakenly all this time, but > happened to work since the offset was always 0. Quite misleading. Why is it a mistake to use substr? Anyway, the change is unrelated so I think it should be part of a separate patch (and then the change should be made all over the code). https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:779: return new InvalidFilter(origText, "filter_invalid_regexp"); On 2018/02/13 11:56:46, kzar wrote: > On 2018/02/05 16:09:58, Manish Jethani wrote: > > origText is no longer the original text so it has basically lost its meaning. > On > > the other hand it's not very useful either. Maybe we can get rid of the extra > > variable? > > The point of `origText` is to show a more useful warning to the user when they > have added an invalid filter. It's still useful for that purpose, though I agree > the name sucks. I'm open to suggestions there. What if we keep origText pointing to text as it was passed in (with no modifications)?
https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/02/20 16:00:57, Manish Jethani wrote: > ... > else if (textContainsCSPOption) > { > stripAllSpacesExceptSpacesInCSPOption; > } > return text.replace(/\s/g, ""); > } I meant of course: ... else if (textContainsCSPOption) { return stripAllSpacesExceptSpacesInCSPOption; /* return statement */ } return text.replace(/\s/g, ""); }
https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/02/21 14:49:18, Manish Jethani wrote: > ... > else if (textContainsCSPOption) > { > return stripAllSpacesExceptSpacesInCSPOption; /* return statement */ > } > return text.replace(/\s/g, ""); > } Well while I agree that doing filter normalisation as a separate step is probably the right option (filter caching etc) I don't think your suggestion is practical - I mean forget all this, how about we implement the pseudocode of `blockAllAdverts(website); /* block adverts */` instead? So far we strip all whitespace from the filters, so let's take an example filter: whatev er$ c s P =csp, do main= adblockplus.org or another one whatev er$ c s p =csp,, We don't know if the part following the '$' is a valid options string until we have stripped the spaces and checked if the regexp matches. Nor do we know if any of those options are content security policies until the whitespace is stripped. We also don't want to do any more work than necessary in the (by far) common case that a filter doesn't contain a $csp option. So to do what you suggest I figure we'd have to do something like: 1. Create a copy of the filter with all whitespace stripped. 2. If /csp=/i doesn't match simply return that stripped copy, otherwise... 3. Go back and figure out the positions where one or more whitespaces were stripped relative to the stripped copy. 4. See if the options regexp matches the stripped string, if so... 5. Figure out the positions of all the $csp option values in the stripped copy. 6. For each position that falls within a $csp option value check if some whitespace was stripped there, if so insert a space back in. 7. Finally return the stripped copy with the $csp spaces restored! I don't think that's practical, even a rough implementation which took us to number 5 was becoming unwieldy. Therefore I propose that instead we trim any surrounding whitespace from the filter and replace whitespace with single spaces. That's not as good sure, but its a whole lot simpler and I figure it's good enough since; 1.) The `Filter.normalize` function does not fully normalise filters anyway (e.g. `foo$domain=FOO.COM` vs `foo$domain=foo.com`). 2.) We can still trim / strip all whitespace from filter values and the regexp part at the time of parsing, the same as how options are upper cased. 3.) I think that in practice it's hard to do things like `foo$ c s p= bar` by mistake. Trailing whitespace like `foo$csp=bar ` sure but we'd still handle that. (In my testing it works in practice too, at least with EasyList and AA.) So anyway this comment is already far too long, sorry about that. I will close this review and update the next one[1] to also include my suggested tweak to `Filter.normalize`. [1] - https://codereview.adblockplus.org/29680689/
Message was sent while issue was closed.
https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/03/06 14:30:38, kzar wrote: > On 2018/02/21 14:49:18, Manish Jethani wrote: > > > ... > > else if (textContainsCSPOption) > > { > > return stripAllSpacesExceptSpacesInCSPOption; /* return statement */ > > } > > return text.replace(/\s/g, ""); > > } > > Well while I agree that doing filter normalisation as a separate step is > probably the right option (filter caching etc) I don't think your suggestion is > practical - I mean forget all this, how about we implement the pseudocode of > `blockAllAdverts(website); /* block adverts */` instead? You know my suggestion was way more specific than "blockAllAdverts" :) How about adding these lines at the end of Filter.normalize: let {index, 1: options} = optionsWithSpacesRegExp.exec(text) || {}; if (options) { return text.substring(0, index).replace(/ /g, "") + "$" + options.split(",").map(normalizeOption).join(","); } return text.replace(/ /g, ""); optionsWithSpacesRegExp and normalizeOption are defined as follows: const optionsWithSpacesRegExp = /\$( *~? ?[ \w-]+(?:=[^,]+)?(?:, *~? ?[ \w-]+(?:=[^,]+)?)*)$/; const normalizeOption = option => { let [, name, value] = /(.*?)(=.*)/.exec(option) || [, option, ""]; name = name.replace(/ /g, ""); if (name.toUpperCase() != "CSP") value = value.replace(/ /g, ""); return name + value; }; We could also keep the uppercasing here (rather change it to lowercasing) so we don't have to do it in other places. What do you think? I do like the idea of keeping this patch so that the other one remains easier to review. I really don't think we have to change the behavior of Filter.normalize at all with respect to non-CSP filters (i.e. all filters that are currently supported). We could easily just test for "csp=" first as an optimization. i.e. let {index, 1: options} = /c *s *p *=/i.test(text) && optionsWithSpacesRegExp.exec(text) || {};
Message was sent while issue was closed.
https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/03/07 00:13:55, Manish Jethani wrote: > How about adding these lines at the end of Filter.normalize: > > [snip] Another way to do it is by splitting on "csp=": let index = text.indexOf("$"); if (index >= 0) { let fragments = text.substring(index + 1) .split(/((?:^|,) *c *s *p *=)([^,]+)/i); let csp = false; for (let i = 0; i < fragments.length; i++) { if (csp) { csp = false; } else { fragments[i] = fragments[i].replace(/ +/g, ""); csp = /^,?csp=$/i.test(fragments[i]); } } return text.substring(0, index + 1).replace(/ +/g, "") + fragments.join(""); } |