|
|
Created:
Jan. 26, 2018, 5:41 p.m. by kzar Modified:
March 22, 2018, 3:58 p.m. CC:
sergei Visibility:
Public. |
DescriptionIssue 6329 - Add the CSP filter type
See also the (now closed) review[1] which proceeded this.
[1] - https://codereview.adblockplus.org/29680684/
Patch Set 1 #
Total comments: 6
Patch Set 2 : Rebased, brought back Filter.normalize, other changes #Patch Set 3 : Check for report-to and report-uri directives more carefully #
Total comments: 4
Patch Set 4 : Fully normalise whitespace, avoid strict equality #
Total comments: 20
Patch Set 5 : New approach, added more unit tests #Patch Set 6 : Use space literal instead of \s where possible #
Total comments: 10
Patch Set 7 : Strip whitespace in batches where possible #Patch Set 8 : Addressed some of Manish's feedback #
Total comments: 25
Patch Set 9 : Addressed some more nits #
Total comments: 21
Patch Set 10 : Prevent the referrer directive #
Total comments: 2
Patch Set 11 : Added test case for case insensitive forbidden directives #Patch Set 12 : Improve comments #
Total comments: 14
Patch Set 13 : Added more tests, blacklisted directives and added a tiny optimisation #
Total comments: 2
Patch Set 14 : Go with Manish's do ... while ... suggestion #
Total comments: 2
Patch Set 15 : Removed duplicate base-uri directive from blacklist #
Total comments: 7
Patch Set 16 : Addressed Sebastian's feedback #
MessagesTotal messages: 58
Patch Set 1
https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.j... lib/filterClasses.js:758: csp = value.toLowerCase(); We could avoid uppercasing and then lowercasing the value here by first only uppercasing the name part of the option and here deciding if the value should be uppercase or lowercase. https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.j... lib/filterClasses.js:760: if (csp.includes("report-")) Couldn't the value contain "report-" in a different place than the directive name? e.g. "default-src https://report-generator.example.com". I think this check should be more specific. https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.j... lib/filterClasses.js:819: CSP: 512, Perhaps a good idea to place CSP after WEBRTC?
Patch Set 2 : Rebased, brought back Filter.normalize, other changes Patch Set 3 : Check for report-to and report-uri directives more carefully For more context see this discussion in the other review https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j... https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.j... lib/filterClasses.js:758: csp = value.toLowerCase(); On 2018/02/06 05:54:20, Manish Jethani wrote: > We could avoid uppercasing and then lowercasing the value here by first only > uppercasing the name part of the option and here deciding if the value should be > uppercase or lowercase. Done. https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.j... lib/filterClasses.js:760: if (csp.includes("report-")) On 2018/02/06 05:54:20, Manish Jethani wrote: > Couldn't the value contain "report-" in a different place than the directive > name? e.g. "default-src https://report-generator.example.com%22. I think this > check should be more specific. Done. https://codereview.adblockplus.org/29680689/diff/29680690/lib/filterClasses.j... lib/filterClasses.js:819: CSP: 512, On 2018/02/06 05:54:20, Manish Jethani wrote: > Perhaps a good idea to place CSP after WEBRTC? Done.
https://codereview.adblockplus.org/29680689/diff/29715744/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29715744/lib/filterClasses.j... lib/filterClasses.js:746: if (option == "CSP") I've left a comment on the other patch, I think we should still do this as part of normalization. https://codereview.adblockplus.org/29680689/diff/29715744/lib/filterClasses.j... lib/filterClasses.js:764: if (csp === null) As far as I can tell the strict equality is unnecessary here, it can just be `if (!csp) { ... }`. Also same below.
Patch Set 4 : Fully normalise whitespace, avoid strict equality https://codereview.adblockplus.org/29680689/diff/29715744/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29715744/lib/filterClasses.j... lib/filterClasses.js:746: if (option == "CSP") On 2018/03/07 00:22:13, Manish Jethani wrote: > I've left a comment on the other patch, I think we should still do this as part > of normalization. Done. https://codereview.adblockplus.org/29680689/diff/29715744/lib/filterClasses.j... lib/filterClasses.js:764: if (csp === null) On 2018/03/07 00:22:13, Manish Jethani wrote: > As far as I can tell the strict equality is unnecessary here, it can just be `if > (!csp) { ... }`. Also same below. Done.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:192: let strippedText = text.replace(/\s/g, ""); Since we have already stripped out all non-space whitespace, we can just use " " (space) instead of \s here? let strippedText = text.replace(/ /g, ""); https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); We can just look for "$" here and if none is present then there are no options, right? We don't have to match the regular expression. Combined with the above then we could do: if (strippedText.indexOf("$") == -1 || !/[$,]csp=/i.test(strippedText)) return strippedText; https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:203: let optionsPosition = 0; This part is not being used at all. https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:219: parts.push( We've already stripped whitespace once, now we're stripping it again. This only affects filters with a $csp option, but I wonder if we shouldn't just check for $csp before stripping whitespace the first time. i.e. above: if (text.indexOf("$") == -1 || !/[$,] *c *s *p *=/i.test(text)) return text.replace(/ /g, "");
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp So this doesn't work for the following cases: 1. foo?csp=b a r$csp=script-src 'self' 2. foo$bar=c s p = ba z,cs p = script-src 'self' (In the second case, assume "bar" is a future option type that can take a value that contains the substring "csp=") I'm just wondering now, why not just use the following instead of all this: let index = text.indexOf("$"); if (index >= 0) { let options = text.substring(index + 1).split(","); for (let i = 0; i < options.length; i++) { let csp = /^ *c *s *p *=/.exec(options[i]); if (csp) { options[i] = csp[0].replace(/ +/g, "") + options[i].substring(csp[0].length).replace(/ +/g, " ") .trim(); } else { options[i] = options[i].replace(/ +/g, ""); } } return text.substring(0, index + 1).replace(/ +/g, "") + options.join(","); } return text.replace(/ +/g, ""); This is easier to understand, technically more correct (will not get confused with "csp=" in the option value or in the URL part of the filter), and performs just as well.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/12 18:53:59, Manish Jethani wrote: > So this doesn't work for the following cases: > > 1. foo?csp=b a r$csp=script-src 'self' > 2. foo$bar=c s p = ba z,cs p = script-src 'self' > > (In the second case, assume "bar" is a future option type that can take a value > that contains the substring "csp=") > > I'm just wondering now, why not just use the following instead of all this: > > let index = text.indexOf("$"); > if (index >= 0) > { > let options = text.substring(index + 1).split(","); > for (let i = 0; i < options.length; i++) > { > let csp = /^ *c *s *p *=/.exec(options[i]); > if (csp) > { > options[i] = csp[0].replace(/ +/g, "") + > options[i].substring(csp[0].length).replace(/ +/g, " ") > .trim(); > } > else > { > options[i] = options[i].replace(/ +/g, ""); > } > } > > return text.substring(0, index + 1).replace(/ +/g, "") + options.join(","); > } > > return text.replace(/ +/g, ""); > > This is easier to understand, technically more correct (will not get confused > with "csp=" in the option value or in the URL part of the filter), and performs > just as well. Splitting the options list into an array (for every filter that has options), likely has measurable performance implications in the common case.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/12 23:19:01, Sebastian Noack wrote: > On 2018/03/12 18:53:59, Manish Jethani wrote: > > I'm just wondering now, why not just use the following instead of all this: > > > > [...] > > Splitting the options list into an array (for every filter that has options), > likely has measurable performance implications in the common case. You mean in terms of memory consumption? I ran a test with both implementations and the one I suggested above wasn't any slower than the original. Since strings are immutable in JS, I would be surprised if String.split didn't just return strings that simply pointed back to the same memory locations in the original string. If we want to avoid creating that array we could try with generators: function* splitOptions(string) { let startIndex = 0; let index = -1; while ((index = string.indexOf(",", startIndex = index + 1)) != -1) yield string.substring(startIndex, index + 1); yield string.substring(startIndex); } ... let index = text.indexOf("$"); if (index >= 0) { let options = ""; for (let fragment of splitOptions(text.substring(index))) { let csp = /^ *c *s *p *=/.exec(fragment); if (csp) { options += csp[0].replace(/ +/g, "") + fragment.substring(csp[0].length).replace(/ +/g, " ").trim(); } else { options += fragment.replace(/ +/g, ""); } } return text.substring(0, index).replace(/ +/g, "") + options; } This seems to be a bit slower but not too much.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/13 06:59:30, Manish Jethani wrote: > > If we want to avoid creating that array we could try with generators: > > [...] OK, so I inlined that bit, now this seems to perform well: let index = text.indexOf("$"); if (index >= 0) { let options = text.substring(index + 1); text = text.substring(0, index + 1).replace(/ +/g, ""); for (let start = 0, end; end != options.length; start = end + 1) { end = options.indexOf(",", start); if (end == -1) end = options.length; let fragment = options.substring(start, end); let csp = /^ *c *s *p *=/.exec(fragment); if (csp) { text += csp[0].replace(/ +/g, "") + fragment.substring(csp[0].length).replace(/ +/g, " ").trim(); } else { text += fragment.replace(/ +/g, ""); } } return text; }
Patch Set 5 : New approach, added more unit tests Patch Set 6 : Use space literal instead of \s where possible https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/12 18:53:59, Manish Jethani wrote: > So this doesn't work for the following cases: > > 1. foo?csp=b a r$csp=script-src 'self' > 2. foo$bar=c s p = ba z,cs p = script-src 'self' Well spotted, I've added some unit tests for those cases. https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:192: let strippedText = text.replace(/\s/g, ""); On 2018/03/12 16:34:53, Manish Jethani wrote: > Since we have already stripped out all non-space whitespace, we can just use " " > (space) instead of \s here? > > let strippedText = text.replace(/ /g, ""); I think you're right, Done. https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); On 2018/03/12 16:34:53, Manish Jethani wrote: > We can just look for "$" here and if none is present then there are no options, right? No, that's not good enough unfortunately. There can be a $ (or multiple) with no options part. I added a test case for that as well FWIW. https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:203: let optionsPosition = 0; On 2018/03/12 16:34:53, Manish Jethani wrote: > This part is not being used at all. Whoops, you're right. I've fixed that now. https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:219: parts.push( On 2018/03/12 16:34:53, Manish Jethani wrote: > We've already stripped whitespace once, now we're stripping it again. This only > affects filters with a $csp option, but I wonder if we shouldn't just check for > $csp before stripping whitespace the first time. > > i.e. above: > > if (text.indexOf("$") == -1 || !/[$,] *c *s *p *=/i.test(text)) > return text.replace(/ /g, ""); No because we're expecting thousands of non-csp filters and maybe 5 csp filters. We should optimise for the common case.
https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:187: return domain.replace(/ /g, "") + separator + selector.trim(); By the way, I tried to find out if / +/g is any better than / /g. Here's a test I copied and pasted into the Chrome and Firefox consoles: (function() { let data = new Array(1000000); for (let i = 0; i < data.length; i++) { data[i] = Math.random().toString(36).slice(2).split("").join( new Array(Math.ceil(Math.random() * 10)).fill(" ").join("") ); } console.log(data.slice(0, 10)); let startTime = Date.now(); for (let i = 0; i < data.length; i++) data[i] = data[i].replace(/ /g, ""); console.log("time:", Date.now() - startTime); console.log(data.slice(0, 10)); })(); So this prints "time: 2128" or something in that order (slightly faster on Firefox) for me. If I replace the regular expression with / +/g, it prints "time: 812" or something in that order (again slightly faster on Firefox). I think it cuts the execution time in half at least. I repeated the test several times of course to account for the VM optimizing that function and variance in general. So we should go ahead and use / +/g in my opinion. Note that we can also do the same for /[^\S ]/g above since that too is just a single character match.
Patch Set 7 : Strip whitespace in batches where possible https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:187: return domain.replace(/ /g, "") + separator + selector.trim(); On 2018/03/14 15:27:30, Manish Jethani wrote: > By the way, I tried to find out if / +/g is any better than / /g. > > Here's a test I copied and pasted into the Chrome and Firefox consoles: > > (function() > { > let data = new Array(1000000); > for (let i = 0; i < data.length; i++) > { > data[i] = Math.random().toString(36).slice(2).split("").join( > new Array(Math.ceil(Math.random() * 10)).fill(" ").join("") > ); > } > > console.log(data.slice(0, 10)); > > let startTime = Date.now(); > for (let i = 0; i < data.length; i++) > data[i] = data[i].replace(/ /g, ""); > > console.log("time:", Date.now() - startTime); > > console.log(data.slice(0, 10)); > })(); > > So this prints "time: 2128" or something in that order (slightly faster on > Firefox) for me. If I replace the regular expression with / +/g, it prints > "time: 812" or something in that order (again slightly faster on Firefox). I > think it cuts the execution time in half at least. > > I repeated the test several times of course to account for the VM optimizing > that function and variance in general. > > So we should go ahead and use / +/g in my opinion. > > Note that we can also do the same for /[^\S ]/g above since that too is just a > single character match. Done.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); On 2018/03/14 13:54:38, kzar wrote: > On 2018/03/12 16:34:53, Manish Jethani wrote: > > We can just look for "$" here and if none is present then there are no > options, right? > > No, that's not good enough unfortunately. There can be a $ (or multiple) with no > options part. I added a test case for that as well FWIW. OK, but I wonder if we shouldn't just look for a "$" first anyway before running the regular expression. I see the same thing being done in RegExpFilter.fromText: let match = (text.indexOf("$") >= 0 ? Filter.optionsRegExp.exec(text) : null); if (match) { ... } https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:192: if (!/csp=/i.test(strippedText)) So how about adding another check here: if (strippedText.indexOf("$") == -1 || !/csp=/i.test(strippedText)) return strippedText; Because EasyList starts off with a ton of filters with no "$" Even better: let dollarIndex = strippedText.lastIndexOf("$"); if (dollarIndex == -1 || !/csp=/i.test(strippedText.substring(dollarIndex + 1))) return strippedText; We could probably reuse dollarIndex below (see my other comments). Also I would change /csp=/i to /\bcsp=/i https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:203: for (let i = beforeOptions.split("$").length; i > 0; i--) Could you explain this part? I'm not sure what this is supposed to do. If the idea is to get the part after the last "$" in the original text, we could just use String.lastIndexOf here. let optionsText = text.substring(text.lastIndexOf("$") + 1); For what it's worth neither the original code nor my suggestion works with the following case: /foo$/$ csp = script-src http://example.com/?$1=1&$2=2&$3=3 "$" is a valid character in the query string part of the URL: https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid Thankfully there's no case for having a query string or indeed even the path part of the URL in the $csp filter. Not yet anyway, but things may change in the future with the CSP standard. Further the filter author can always encode the "$" (recommended anyway) so it should be OK. https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:206: let options = []; Why create a separate array for the stripped version? We could just do this: let options = optionsText.split(","); for (let i = 0; i < options.length; i++) options[i] = options[i].replace(...); https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:209: let cspMatch = /^( *c *s *p *=)([^,]+)/i.exec(option); The regular expression here could be /^( *c *s *p *=)(.+)/i because we already know that there are no commas there. And why do we need the capturing groups? let cspMatch = /^ *c *s *p *=/.exec(option); if (cspMatch) { options.push( cspMatch[0].replace(/ /g, "") + option.substring(cspMatch[0].length).trim().replace(/ +/g, " ") ); } I have a feeling that this might actually be faster, though I haven't tested this.
Patch Set 8 : Addressed some of Manish's feedback https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:192: if (!/csp=/i.test(strippedText)) On 2018/03/14 18:10:50, Manish Jethani wrote: > So how about adding another check here: > > if (strippedText.indexOf("$") == -1 || !/csp=/i.test(strippedText)) > return strippedText; > > Because EasyList starts off with a ton of filters with no "$" Seems you're right, of 65k filters EasyList has 47k with no options part. Done. https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:203: for (let i = beforeOptions.split("$").length; i > 0; i--) On 2018/03/14 18:10:50, Manish Jethani wrote: > Could you explain this part? I'm not sure what this is supposed to do. This code uses the index of the options part of the stripped filter text to figure out what the index of the options part is of the original filter text. It does that by counting the number of '$'s before the options start in the stripped string and skipping past that many in the non stripped string. > For what it's worth neither the original code nor my suggestion works with the > following case: > > /foo$/$ csp = script-src http://example.com/?$1=1&$2=2&$3=3 I've added a test case for that and it worked OK, have I missed something? https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:206: let options = []; On 2018/03/14 18:10:50, Manish Jethani wrote: > Why create a separate array for the stripped version? We could just do this: > > let options = optionsText.split(","); > for (let i = 0; i < options.length; i++) > options[i] = options[i].replace(...); Done. https://codereview.adblockplus.org/29680689/diff/29722732/lib/filterClasses.j... lib/filterClasses.js:209: let cspMatch = /^( *c *s *p *=)([^,]+)/i.exec(option); On 2018/03/14 18:10:50, Manish Jethani wrote: > The regular expression here could be /^( *c *s *p *=)(.+)/i because we already > know that there are no commas there. > > And why do we need the capturing groups? > > let cspMatch = /^ *c *s *p *=/.exec(option); > if (cspMatch) > { > options.push( > cspMatch[0].replace(/ /g, "") + > option.substring(cspMatch[0].length).trim().replace(/ +/g, " ") > ); > } > > I have a feeling that this might actually be faster, though I haven't tested > this. Done.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/13 07:35:00, Manish Jethani wrote: > On 2018/03/13 06:59:30, Manish Jethani wrote: > > > > If we want to avoid creating that array we could try with generators: > > > > [...] > > OK, so I inlined that bit, now this seems to perform well: > > let index = text.indexOf("$"); > if (index >= 0) > { > let options = text.substring(index + 1); > text = text.substring(0, index + 1).replace(/ +/g, ""); > > for (let start = 0, end; end != options.length; start = end + 1) > { > end = options.indexOf(",", start); > if (end == -1) > end = options.length; > > let fragment = options.substring(start, end); > let csp = /^ *c *s *p *=/.exec(fragment); > if (csp) > { > text += csp[0].replace(/ +/g, "") + > fragment.substring(csp[0].length).replace(/ +/g, " ").trim(); > } > else > { > text += fragment.replace(/ +/g, ""); > } > } > > return text; > } Any reason, why this suggestion was ignored, without a comment? https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:179: if (/^ *!/.test(text)) It seems previously, comments were allowed do be preceded by any kind of withe space (\s), now you are explicitly checking for the space character. Why? https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:187: return domain.replace(/ +/g, "") + separator + selector.trim(); Similarly, here. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); It seems the same can be achieved with less code and without creating an array: for (let i; (i = optionsText.indexOf("$")) != -1;) optionsText = optionsText.substr(i + 1);
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:179: if (/^ *!/.test(text)) On 2018/03/14 20:59:20, Sebastian Noack wrote: > It seems previously, comments were allowed do be preceded by any kind of withe > space (\s), now you are explicitly checking for the space character. Why? In the preceding code we're stripping out all non-space whitespace characters (e.g. "\n") using /[^\S ]/ so we end up with only spaces and non-whitespace characters. Now we only need to strip out spaces. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/14 20:59:20, Sebastian Noack wrote: > It seems the same can be achieved with less code and without creating an array: > > for (let i; (i = optionsText.indexOf("$")) != -1;) > optionsText = optionsText.substr(i + 1); This only looks for the last "$" in the string, which can be achieved using String.lastIndexOf, but as Dave points out it's not correct. There could be legit "$" within the options text as well. This is not actually true right now, but may be in the future; however, since we already don't support commas in the value of the $csp option, we should also not support "$" in the value. If the filter author needs to use a "$" they can encode it as "%24" [1]. Am I right? I can't think of an actual reason why we'd want to support "$" in the options part of the filter. The most complex option is in fact $csp, but even here there's no instance of needing to support "$" in the value [2] [1]: https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid [2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Po...
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:192: if (!strippedText.includes("$") || !/csp=/i.test(strippedText)) Can we make this /\bcsp=/i instead? https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:213: options[i] = option.substring(0, cspMatch[0].length).replace(/ +/g, "") + Well option.substring(0, cspMatch[0].length) is the same as cspMatch[0] so can we skip the unnecessary substring here? We already have the string we need. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:214: option.substr(cspMatch[0].length).trim().replace(/ +/g, " "); Any reason why this uses substr instead of substring? We need to pick a favorite and stick with it unless necessary otherwise, and I think substring is the best choice. substr wasn't even standardized until much later. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Obje... https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Obje...
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 08:09:59, Manish Jethani wrote: > On 2018/03/14 20:59:20, Sebastian Noack wrote: > > It seems the same can be achieved with less code and without creating an > array: > > > > for (let i; (i = optionsText.indexOf("$")) != -1;) > > optionsText = optionsText.substr(i + 1); > > This only looks for the last "$" in the string, which can be achieved using > String.lastIndexOf, but as Dave points out it's not correct. There could be > legit "$" within the options text as well. This is not actually true right now, > but may be in the future; however, since we already don't support commas in the > value of the $csp option, we should also not support "$" in the value. If the > filter author needs to use a "$" they can encode it as "%24" [1]. Am I right? > > I can't think of an actual reason why we'd want to support "$" in the options > part of the filter. The most complex option is in fact $csp, but even here > there's no instance of needing to support "$" in the value [2] > > [1]: > https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid > [2]: > https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Po... Actually I still don't get why we have to do this. We should run the options regular expression on the unstripped text. if (!text.includes("$") || !/\bc *s *p *=/i.test(text)) return text.replace(/ +/g, ""); let optionsMatch = Filter.optionsRegExp.exec(text); if (!optionsMatch) return text.replace(/ +/g, ""); ... return beforeOptions.replace(/ +/g, "") + "$" + options.join(); optionsMatch already points to the correct location, plus we have skipped one unnecessary cycle of stripping spaces out of the options text.
Patch Set 9 : Addressed some more nits https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/14 20:59:20, Sebastian Noack wrote: > On 2018/03/13 07:35:00, Manish Jethani wrote: > > On 2018/03/13 06:59:30, Manish Jethani wrote: > > > > > > If we want to avoid creating that array we could try with generators: > > > > > > [...] > > > > OK, so I inlined that bit, now this seems to perform well: > > > > let index = text.indexOf("$"); > > if (index >= 0) > > { > > let options = text.substring(index + 1); > > text = text.substring(0, index + 1).replace(/ +/g, ""); > > > > for (let start = 0, end; end != options.length; start = end + 1) > > { > > end = options.indexOf(",", start); > > if (end == -1) > > end = options.length; > > > > let fragment = options.substring(start, end); > > let csp = /^ *c *s *p *=/.exec(fragment); > > if (csp) > > { > > text += csp[0].replace(/ +/g, "") + > > fragment.substring(csp[0].length).replace(/ +/g, " ").trim(); > > } > > else > > { > > text += fragment.replace(/ +/g, ""); > > } > > } > > > > return text; > > } > > Any reason, why this suggestion was ignored, without a comment? I ignored that since it assumed the first '$' was the start of the options string, which I pointed out elsewhere is not necessarily the case. Sorry I thought that was obvious but I guess not. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:179: if (/^ *!/.test(text)) On 2018/03/15 08:09:59, Manish Jethani wrote: > On 2018/03/14 20:59:20, Sebastian Noack wrote: > > It seems previously, comments were allowed do be preceded by any kind of withe > > space (\s), now you are explicitly checking for the space character. Why? > > In the preceding code we're stripping out all non-space whitespace characters > (e.g. "\n") using /[^\S ]/ so we end up with only spaces and non-whitespace > characters. Now we only need to strip out spaces. (See this comment as well https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... ) https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:187: return domain.replace(/ +/g, "") + separator + selector.trim(); On 2018/03/14 20:59:20, Sebastian Noack wrote: > Similarly, here. See our responses above. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:192: if (!strippedText.includes("$") || !/csp=/i.test(strippedText)) On 2018/03/15 08:23:17, Manish Jethani wrote: > Can we make this /\bcsp=/i instead? Done. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 08:09:59, Manish Jethani wrote: > ... however, since we already don't support commas in the > value of the $csp option, we should also not support "$" in the value. I can't tell if you're trolling or not at this point. If we don't care about slightly changing how existing filters are parsed why didn't we go with the original suggestion of doing something like `text.trim().replace(/ +/, " ")`? https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 09:11:06, Manish Jethani wrote: > Actually I still don't get why we have to do this. We should run the > options regular expression on the unstripped text. Then the options regexp might not match depending on the whitespace. Yes we could change the optionsRegexp to add an optional whitespace inbetween every character, but I don't want to do that since it's already huge and also it's used later in the common case for parsing regexp filters. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:213: options[i] = option.substring(0, cspMatch[0].length).replace(/ +/g, "") + On 2018/03/15 08:23:17, Manish Jethani wrote: > Well option.substring(0, cspMatch[0].length) is the same as cspMatch[0] so can > we skip the unnecessary substring here? We already have the string we need. Done. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:214: option.substr(cspMatch[0].length).trim().replace(/ +/g, " "); On 2018/03/15 08:23:17, Manish Jethani wrote: > Any reason why this uses substr instead of substring? We need to pick a favorite > and stick with it unless necessary otherwise, and I think substring is the best > choice. substr wasn't even standardized until much later. substr and substring are the same unless the second parameter is passed, since the second parameter wasn't passed here I went with the shorter option, which seems consistent with the rest of this file. I'm more worried about this use of substr[1] since it passes the second parameter as an index, it just happens to work since the offset is always 0. But you already disagreed that I should change that... [1] - https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/15 10:26:24, kzar wrote: > On 2018/03/14 20:59:20, Sebastian Noack wrote: > > On 2018/03/13 07:35:00, Manish Jethani wrote: > > > On 2018/03/13 06:59:30, Manish Jethani wrote: > > > > > > > > If we want to avoid creating that array we could try with generators: > > > > > > > > [...] > > > > > > OK, so I inlined that bit, now this seems to perform well: > > > > > > let index = text.indexOf("$"); > > > if (index >= 0) > > > { > > > let options = text.substring(index + 1); > > > text = text.substring(0, index + 1).replace(/ +/g, ""); > > > > > > for (let start = 0, end; end != options.length; start = end + 1) > > > { > > > end = options.indexOf(",", start); > > > if (end == -1) > > > end = options.length; > > > > > > let fragment = options.substring(start, end); > > > let csp = /^ *c *s *p *=/.exec(fragment); > > > if (csp) > > > { > > > text += csp[0].replace(/ +/g, "") + > > > fragment.substring(csp[0].length).replace(/ +/g, " > ").trim(); > > > } > > > else > > > { > > > text += fragment.replace(/ +/g, ""); > > > } > > > } > > > > > > return text; > > > } > > > > Any reason, why this suggestion was ignored, without a comment? > > I ignored that since it assumed the first '$' was the start of the options > string, which I pointed out elsewhere is not necessarily the case. Sorry I > thought that was obvious but I guess not. I think Sebastian is referring to the rest of the code, i.e. the part where we already have the correct index of the options part. So basically starting with this: let options = text.substring(index + 1); text = text.substring(0, index + 1).replace(/ +/g, ""); for (let start = 0, end; end != options.length; start = end + 1) ... In my tests I found that this was much faster than using String.split, but arguably it makes the code harder to read. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 10:26:24, kzar wrote: > On 2018/03/15 08:09:59, Manish Jethani wrote: > > ... however, since we already don't support commas in the > > value of the $csp option, we should also not support "$" in the value. > > I can't tell if you're trolling or not at this point. If we don't care about > slightly changing how existing filters are parsed why didn't we go with the > original suggestion of doing something like `text.trim().replace(/ +/, " ")`? Sorry, then I must have misunderstood. Are "$" supported in option values right now? If they are not already supported, why do we care about supporting them now? We already know that the value of the $csp option doesn't need to have "$" in it. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 10:26:24, kzar wrote: > On 2018/03/15 09:11:06, Manish Jethani wrote: > > Actually I still don't get why we have to do this. We should run the > > options regular expression on the unstripped text. > > Then the options regexp might not match depending on the whitespace. Yes we > could change the optionsRegexp to add an optional whitespace inbetween every > character, but I don't want to do that since it's already huge and also it's > used later in the common case for parsing regexp filters. Sorry, yeah, I see what you mean. It wouldn't work.
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; By the way, I was wondering if we could avoid creating the array here. I tried this one: let strippedDollarIndex = -1; let dollarIndex = -1; do { strippedDollarIndex = beforeOptions.indexOf("$", strippedDollarIndex + 1); dollarIndex = text.indexOf("$", dollarIndex + 1); } while (strippedDollarIndex != -1); let optionsText = text.substring(dollarIndex + 1); This is consistently faster (~1.2s vs. ~1.6s for 10 million iterations here) than the one that creates an array using String.split.
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 10:42:42, Manish Jethani wrote: > On 2018/03/15 10:26:24, kzar wrote: > > On 2018/03/15 08:09:59, Manish Jethani wrote: > > > ... however, since we already don't support commas in the > > > value of the $csp option, we should also not support "$" in the value. > > > > I can't tell if you're trolling or not at this point. If we don't care about > > slightly changing how existing filters are parsed why didn't we go with the > > original suggestion of doing something like `text.trim().replace(/ +/, " ")`? > > Sorry, then I must have misunderstood. Are "$" supported in option values right > now? If they are not already supported, why do we care about supporting them > now? We already know that the value of the $csp option doesn't need to have "$" > in it. OK, I think I understand what you mean. normalize("foo$domain=bar.com$") This currently works, and it will continue to work because there's no "csp=" in there. normalize("foo$csp=script-src 'self',domain=bar.com$") This currently "works" (the $csp option isn't normalized correctly), and it'll break if we only take the last "$" to be the starting point of the options text. Well, so it's a "breaking" change, I agree, but what does it actually break? There are no actual filters that would break because of this. It's only a breaking change in theory. We can "officially" not support "$" in option values from now on, and nobody would be affected by it. So if we are to be practical I'd say let's just take the index of the last "$". If in the future there's a case where an option value needs to have an actual "$", then we can switch to more complicated code. Let me know what you think.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/15 10:42:41, Manish Jethani wrote: > I think Sebastian is referring to the rest of the code, i.e. the part where we > already have the correct index of the options part. > ... So instead of splitting the options string by "," instead using substring and indexOf(",") to pick out each option? > In my tests I found that this was much faster than using > String.split, but arguably it makes the code harder to read. Well fair enough, but I feel like optimising this rarely used code path at the cost of code readability isn't worth it. I would see the utility of moving the logic out into a generator function *parseOptions which yields each option part however, since that could also be used in the parsing code which is worth optimising. However if I got it right using a generator here would make performance even worse than simply splitting the string. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 10:42:42, Manish Jethani wrote: > Sorry, then I must have misunderstood. Are "$" supported in option values right > now? If they are not already supported, why do we care about supporting them > now? We already know that the value of the $csp option doesn't need to have "$" > in it. Well the part of the filter following a "$" is considered to be the option string if it matches the options regexp, otherwise it's not and that part is considered to be a part of the filter's regexp. Take this example: somead$domain=valid.com|$invalid.com So far it blocks requests matching /somead/ for the domain valid.com. With your suggestion it would instead block requests matching /somead\$domain=valid\.com\|\$invalid\.com/ for all domains. FWIW I agree that this probably doesn't matter much in practice, but I have had questions about these types of edge cases from users (vai Mapx) before. Take this filter /$file/analytics. IIRC they were confused why it worked, but one like /$file didn't. Well that's since "$file" is considered a valid options string and not used for the regexp where as $file/analytics" isn't.
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 11:25:47, Manish Jethani wrote: > We can "officially" not support "$" in option values > from now on... So if we are to be practical > I'd say let's just take the index of the last "$". Sure or we could "officially" not support filters with random spaces inside just the same. So we just do `.trim().replace(/ /+, " ")`, if the user gets lucky their filter still works, but if the space was in an option value name (or similar) things break. I think it's too late to talk about being practical in this codereview to be honest. > ... and nobody would be affected by it. Well that's speculation, hopefully few users would be anyway. But that also applies to my above point.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/15 11:38:50, kzar wrote: > On 2018/03/15 10:42:41, Manish Jethani wrote: > > I think Sebastian is referring to the rest of the code, i.e. the part where we > > already have the correct index of the options part. > > ... > > So instead of splitting the options string by "," instead using substring and > indexOf(",") to pick out each option? Yes, it just turned out to be faster. > > In my tests I found that this was much faster than using > > String.split, but arguably it makes the code harder to read. > > Well fair enough, but I feel like optimising this rarely used code path at the > cost of code readability isn't worth it. I'm cool with using String.split here in the interested of readability/maintainability. > I would see the utility of moving the logic out into a generator function > *parseOptions which yields each option part however, since that could also be > used in the parsing code which is worth optimising. However if I got it right > using a generator here would make performance even worse than simply splitting > the string. Yeah, I tested with the generator, it performs worse. String.split is faster. https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 11:43:52, kzar wrote: > On 2018/03/15 11:25:47, Manish Jethani wrote: > > We can "officially" not support "$" in option values > > from now on... So if we are to be practical > > I'd say let's just take the index of the last "$". > > Sure or we could "officially" not support filters with random spaces inside just > the same. So we just do `.trim().replace(/ /+, " ")`, if the user gets lucky But the two are not comparable, are they? If we don't support spaces, we break filters that used to work; if we don't support "$" in a filter that _also_ contains "csp=" we break nothing. A filter can contain spaces because the person who wrote it thought they could include a space after the "$" or after each ",", I mean how do they know better? That's not the case with "$", there's no reason to include a "$" in an option value (can you think of any?). You would have to very deliberately add a "$" for who-knows-why. (Note: a domain name can't contain "$") > their filter still works, but if the space was in an option value name (or > similar) things break. I think it's too late to talk about being practical in > this codereview to be honest. I'm not aware of the history of this normalize function, but I wouldn't expect option names or value to contain spaces in practice. I suspect the code strips out _all_ spaces instead of just spaces between option names and value is because it's easier and faster to do text.replace(/ /g, "") than to split the string up and remove spaces around each name and each value. If that's the case, by all means let's just not care about spaces within option names and non-CSP option values. It's totally legit though to write a filter like "foo $ domain = bar.com " and expect it to work. Anyway, so if you feel strongly about this then I don't mind reverting to the original text.replace(/ +/g, " ") though you'd be in a better position to say if it would break anything (literally we'd be changing the output of the function for potentially lots of actual filters out there). > > ... and nobody would be affected by it. > > Well that's speculation, hopefully few users would be anyway. But that also > applies to my above point.
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 12:00:11, Manish Jethani wrote: > Anyway, so if you feel strongly about this then I don't mind reverting to the > original text.replace(/ +/g, " ") though you'd be in a better position to say if > [...] To sum this up, the normalize function looks good to me as it is, though we could optimize further, but that's your call. I'm looking at the other parts of this patch now.
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 12:14:21, Manish Jethani wrote: > To sum this up, the normalize function looks good to me as it is, though we > could optimize further, but that's your call. > > I'm looking at the other parts of this patch now. Thanks OK. (If you can see how to optimise the common code path further I'm interested, but not for the csp code path unless it doesn't hurt readability further.)
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) Since we allow single spaces in the CSP, we should strip them here before testing the regular expression: if (/(;|^)report-(to|uri)\b/.test(csp.replace(/ /g, ""))) Reason: even though the browsers that we know today might ignore something like "r e p o r t-uri=http://example.com", we should cover the case where either by design or by accident the browser starts accepting spaces here (browsers do tend to be liberal in the kinds of inputs they'll accept and try to interpret correctly). It would be a shame if because of a browser change somebody was able to violate our users' privacy. From the security and privacy point of view I have the following recommendations: 1. It seems there's an outdated "referrer" directive [1]. This is bad for privacy if exploited, similar to report-uri/report-to, in particular if it is set to "unsafe-url". We should strip it out as well. 2. Since we don't know which directives in the future that would be introduced in browser would be bad for privacy and security, this should be a whitelist rather than a blacklist. We should allow the following directives only: connect-src, default-src, font-src, frame-src, img-src, media-src, object-src, script-src, style-src, base-uri, plugin-types, sandbox, form-action, frame-ancestors, navigation-to, block-all-mixed-content, require-sri-for, and upgrade-insecure-requests (note: no manifest-src, no worker-src). 3. 'unsafe-eval' should be stripped out from the policy. I can't think of a single reason why a filter author should be able to add 'unsafe-eval' to the policy. Further, none of the directives that are actually injected into the headers should make the policy more liberal than what it would be otherwise. i.e. if the response says "Content-Security-Policy: sandbox;" and we are about to inject "Content-Security-Policy: sandbox allow-scripts;", we should not inject it. This should be taken care of on the adblockpluschrome side. [1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Po...
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 13:24:44, Manish Jethani wrote: > Further, none of the directives that are actually injected into the headers > should make the policy more liberal than what it would be otherwise. i.e. if the > response says "Content-Security-Policy: sandbox;" and we are about to inject > "Content-Security-Policy: sandbox allow-scripts;", we should not inject it. This > should be taken care of on the adblockpluschrome side. I don't think that's a problem, Sebastian noticed previously[1] that CSPs are combined exclusively[2] and so if one policy blocks something that trumps the other that allows the thing. (Will reply to the other points later.) [1] - https://codereview.adblockplus.org/29356231/#msg24 [2] - https://www.w3.org/TR/CSP2/#enforcing-multiple-policies
https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29722910/lib/filterClasses.j... lib/filterClasses.js:204: optionsText = optionsText.substr(optionsText.indexOf("$") + 1); On 2018/03/15 13:13:33, kzar wrote: > On 2018/03/15 12:14:21, Manish Jethani wrote: > > To sum this up, the normalize function looks good to me as it is, though we > > could optimize further, but that's your call. > > > > I'm looking at the other parts of this patch now. > > Thanks OK. (If you can see how to optimise the common code path further I'm > interested, but not for the csp code path unless it doesn't hurt readability > further.) Acknowledged. https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 13:44:45, kzar wrote: > On 2018/03/15 13:24:44, Manish Jethani wrote: > > Further, none of the directives that are actually injected into the headers > > should make the policy more liberal than what it would be otherwise. i.e. if > the > > response says "Content-Security-Policy: sandbox;" and we are about to inject > > "Content-Security-Policy: sandbox allow-scripts;", we should not inject it. > This > > should be taken care of on the adblockpluschrome side. > > I don't think that's a problem, Sebastian noticed previously[1] that CSPs are > combined exclusively[2] and so if one policy blocks something that trumps the > other that allows the thing. Thanks, I was just wondering about this. In that case we're OK. By the way I wonder if whitelisting the directives (1) should be part of the normalization step, or (2) should be here, or (3) should be one level up in the stack at the time of header injection (to allow more flexibility).
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 13:47:33, Manish Jethani wrote: > By the way I wonder if whitelisting the directives (1) should be part of the > normalization step, or (2) should be here, or (3) should be one level up in the > stack at the time of header injection (to allow more flexibility). Also, in the interest of landing this (so we can start playing with it), we could do the whitelisting/blacklisting part as a separate patch. Then we can take our time to decide where, what, and how. We'll just have to be careful about not accidentally shipping it as it is.
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 13:47:33, Manish Jethani wrote: > By the way I wonder if whitelisting the directives (1) should be part of the > normalization step, or (2) should be here, or (3) should be one level up in the > stack at the time of header injection (to allow more flexibility). It should be here at parse time IMO. https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 13:24:44, Manish Jethani wrote: > 1. It seems there's an outdated "referrer" directive [1]. This is bad for > privacy if exploited, similar to report-uri/report-to, in particular if it is > set to "unsafe-url". We should strip it out as well. Well spotted, there's a non-outdated version called Referrer-Policy too[1]. I think we should only allow "no-referrer" for both, since while omitting the referrer information might be useful for adblocking anything else could be a privacy concern. > Since we allow single spaces in the CSP, we should strip them here before > testing the regular expression: > > if (/(;|^)report-(to|uri)\b/.test(csp.replace(/ /g, ""))) > > Reason: even though the browsers that we know today might ignore something like > "r e p o r t-uri=http://example.com", we should cover the case where either by > design or by accident the browser starts accepting spaces here (browsers do tend > to be liberal in the kinds of inputs they'll accept and try to interpret > correctly). It would be a shame if because of a browser change somebody was able > to violate our users' privacy. > > 2. Since we don't know which directives in the future that would be introduced > in browser would be bad for privacy and security, this should be a whitelist > rather than a blacklist. We should allow the following directives only: > connect-src, default-src, font-src, frame-src, img-src, media-src, object-src, > script-src, style-src, base-uri, plugin-types, sandbox, form-action, > frame-ancestors, navigation-to, block-all-mixed-content, require-sri-for, and > upgrade-insecure-requests I disagree about stripping spaces in the off-chance that a browser introduces a bug in the future. As for the directive whitelist I see your point, but I can see arguments against. Do you have an opinion here Sebastian? > (note: no manifest-src, no worker-src). I'm curious, why did you omit those? worker-src in particular could be useful for preventing circumvention. > 3. 'unsafe-eval' should be stripped out from the policy. I can't think of a > single reason why a filter author should be able to add 'unsafe-eval' to the > policy. True there's not much point, but also there's not much harm. If our policy can't allow anything that's already forbidden then who cares right? [1] - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referrer-Policy https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 13:50:48, Manish Jethani wrote: > On 2018/03/15 13:47:33, Manish Jethani wrote: > > By the way I wonder if whitelisting the directives (1) should be part of the > > normalization step, or (2) should be here, or (3) should be one level up in > the > > stack at the time of header injection (to allow more flexibility). > > Also, in the interest of landing this (so we can start playing with it), we > could do the whitelisting/blacklisting part as a separate patch. Then we can > take our time to decide where, what, and how. We'll just have to be careful > about not accidentally shipping it as it is. I'd prefer not to land this without the security / privacy concerns addressed. If you want to play with it all I try to keep my feature branches up to date in GitHub[1][2][3]. [1] - https://github.com/kzar/adblockpluschrome/tree/5241-csp-option [2] - https://github.com/kzar/adblockpluscore/tree/6329-csp-option [3] - https://github.com/kzar/adblockplusui/tree/6451-csp-filters
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 14:13:38, kzar wrote: > On 2018/03/15 13:24:44, Manish Jethani wrote: > > > [...] > > (note: no manifest-src, no worker-src). > > I'm curious, why did you omit those? worker-src in particular could be useful > for preventing circumvention. My only concern was to not allow filter authors to make a document's CSP more permissive than it already is. Since that's apparently not possible, I'm OK with these. > > 3. 'unsafe-eval' should be stripped out from the policy. I can't think of a > > single reason why a filter author should be able to add 'unsafe-eval' to the > > policy. > > True there's not much point, but also there's not much harm. If our policy can't > allow anything that's already forbidden then who cares right? Same as above. It's not an issue, we can allow it. https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 14:13:38, kzar wrote: > On 2018/03/15 13:24:44, Manish Jethani wrote: > > 1. It seems there's an outdated "referrer" directive [1]. This is bad for > > privacy if exploited, similar to report-uri/report-to, in particular if it is > > set to "unsafe-url". We should strip it out as well. > > Well spotted, there's a non-outdated version called Referrer-Policy too[1]. I > think we should only allow "no-referrer" for both, since while omitting the > referrer information might be useful for adblocking anything else could be a > privacy concern. Referrer-Policy doesn't really concern us since we're not setting that header, right? If I understand correctly, the $csp filter option is only supposed to affect the Content-Security-Policy header. if the site sets its own Referrer-Policy header then so be it, it's not on us.
On 2018/03/15 14:28:32, Manish Jethani wrote: > Referrer-Policy doesn't really concern us since we're not setting that header, > right? If I understand correctly, the $csp filter option is only supposed to > affect the Content-Security-Policy header. if the site sets its own > Referrer-Policy header then so be it, it's not on us. Oh yea, you're right. Sorry I had misread the documentation, I thought that was the new name for the CSP directive.
Patch Set 10 : Prevent the referrer directive https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:828: // [1] - https://github.com/gorhill/uBlock/blob/67e06f53b4d73df6179f6d320553a55da4ead4... Note: I no longer attribute uBlock here since we pretty much completely rewrote the regexp by this point.
Looks good so far provided we decide to blacklist the three directives rather than whitelist all the others. I prefer the latter, but let's hear from Sebastian. Two more things I want to do: (1) review the test file; (2) take a more thorough look into the privacy/security implications (e.g. by trying to write an exploit). I'll try to finish this soon, thanks for your patience. https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:828: // [1] - https://github.com/gorhill/uBlock/blob/67e06f53b4d73df6179f6d320553a55da4ead4... On 2018/03/15 16:09:47, kzar wrote: > Note: I no longer attribute uBlock here since we pretty much completely rewrote > the regexp by this point. Acknowledged.
https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29720596/lib/filterClasses.j... lib/filterClasses.js:190: // For most regexp filters we strip all whitespace, but the values of $csp On 2018/03/15 12:00:11, Manish Jethani wrote: > On 2018/03/15 11:38:50, kzar wrote: > > On 2018/03/15 10:42:41, Manish Jethani wrote: > > > In my tests I found that this was much faster than using > > > String.split, but arguably it makes the code harder to read. > > > > Well fair enough, but I feel like optimising this rarely used code path at the > > cost of code readability isn't worth it. > > I'm cool with using String.split here in the interested of > readability/maintainability. I guess, if it's only in the code path hit for $csp filters, the performance penalty is acceptable.
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 14:13:38, kzar wrote: > I disagree about stripping spaces in the off-chance that a browser introduces a > bug in the future. Agreed. In the end, there is nothing we can effectively do to account for any possibly future change in browser's behavior. And this particular change seems especially obscure/unlikely. On the other hand, as more complexity we add, as more likely we introduce bugs effecting current browser behavior. > As for the directive whitelist I see your point, but I can > see arguments against. Do you have an opinion here Sebastian? The list of directives (relevant to ad blocking) will likely keep growing. So maintaining a whitelist will be a permanent effort. Also it will keep filter list authors from using the newest directives, as they have to wait for us to release a new version with the directive added to the whitelist, and for users to update their ad blocker. Directives that could leak information are quite a special case and (unless they change the semantics of report-uri/report-to again, which seems unlikely), I don't see a case to add yet another one in the future. Though its still possible. In the same way, it is possible that CSS selectors could leak information in the future, yet we don't consider whitelisting safe CSS syntax. Also it isn't that easy for an attacker to get a malicious filter to be used on a target machine. The filter lists used by default, come from trusted sources and we are in the control of the mirrors. So an attacker would have to host their own filter list and then trick the user in subscribing to this filter list. In the end, it might be more effective, while not much more difficult, to trick the user into installing malware instead. I don't think that this invalidates all security concerns about filter lists, but the attack vector seems smaller than when dealing with input from most other external sources. So I think a blacklist is appropriate and sufficient. > I'm curious, why did you omit those? worker-src in particular could be useful > for preventing circumvention. Absolutely, we need worker-src. Abuse of SharedWorker/ServiceWorker for circumvention (since they cannot be associated with the tab/frame they are related to), was the main motivation for the $csp filter in the first place. On 2018/03/15 13:24:44, Manish Jethani wrote: > 3. 'unsafe-eval' should be stripped out from the policy. I can't think of a > single reason why a filter author should be able to add 'unsafe-eval' to the > policy. There is a case for 'unsafe-eval'. If a website relies on dynamic code evaluation, but you want to block other script sources that cannot be blocked by other means, e.g. inline scripts, you would inject a policy like "script-src http: 'unsafe-eval'". On the other hand, as already concluded, there is no harm that can be caused with 'unsafe-eval' since the policy cannot make the document more permissive, anyway. On 2018/03/15 13:47:33, Manish Jethani wrote: > By the way I wonder if whitelisting the directives (1) should be part of the > normalization step, or (2) should be here, or (3) should be one level up in the > stack at the time of header injection (to allow more flexibility). I think validation is preferable over normalization here. If we'd just strip those directives silently, it will be more difficult to figure out why they don't work, or worse filter list authors won't notice and add them to the filter list anyway causing potential security issues with other ad blockers that are less careful. If we mark the filter as invalid, however, this will trigger a visible error message when adding it as a custom filter (which filter list authors generally do first to test their filters).
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/15 11:13:25, Manish Jethani wrote: > By the way, I was wondering if we could avoid creating the array here. I tried > this one: > > let strippedDollarIndex = -1; > let dollarIndex = -1; > > do > { > strippedDollarIndex = beforeOptions.indexOf("$", strippedDollarIndex + 1); > dollarIndex = text.indexOf("$", dollarIndex + 1); > } > while (strippedDollarIndex != -1); > > let optionsText = text.substring(dollarIndex + 1); > > This is consistently faster (~1.2s vs. ~1.6s for 10 million iterations here) > than the one that creates an array using String.split. Out of curiosity, what do you think about this? It performs better. There's no array created, there's no substring operation. I find it easier to grok as well, but I guess that's subjective. It could be a separate function: function findOptionsIndex(stripped, original) { let strippedIndex = -1; let originalIndex = -1; do { strippedIndex = stripped.indexOf("$", strippedIndex + 1); originalIndex = original.indexOf("$", originalIndex + 1); } while (strippedIndex != -1); return originalIndex; }
The tests look good to me. https://codereview.adblockplus.org/29680689/diff/29723635/test/filterClasses.js File test/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723635/test/filterClasses.... test/filterClasses.js:317: compareFilter(test, "bla$csp=report-uri", ["type=invalid", "text=bla$csp=report-uri", "reason=filter_invalid_csp"]); We should add mixed case tests here for these privacy-sensitive directives (i.e. "rePort-uRi"), what do you think?
Patch Set 11 : Added test case for case insensitive forbidden directives Patch Set 12 : Improve comments https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/16 06:47:09, Manish Jethani wrote: > Out of curiosity, what do you think about this? ... Well I think that what we have already is plenty good enough, this is a rarely used code-path. While this kind of thing is interesting, the effort would be better spent elsewhere (e.g. removing the split when we parse the options string later). I mean we're expecting a user to have maybe 5 or 10 of these filters, so with your numbers and being generous and saying 10 csp filters we're talking 1.6 microseconds vs 1.2 microseconds here, or in other words a saving of 400 nanoseconds. Since we're already playing code-golf I thought I'd test it[1], seems my numbers match yours. Your approach is nearly a quarter faster and the difference seems to be down to the `split().length` operation rather than the `substr` calls. [1] - https://jsperf.com/findoptionstext https://codereview.adblockplus.org/29680689/diff/29723635/test/filterClasses.js File test/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723635/test/filterClasses.... test/filterClasses.js:317: compareFilter(test, "bla$csp=report-uri", ["type=invalid", "text=bla$csp=report-uri", "reason=filter_invalid_csp"]); On 2018/03/16 08:54:40, Manish Jethani wrote: > We should add mixed case tests here for these privacy-sensitive directives (i.e. > "rePort-uRi"), what do you think? Done.
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/16 11:29:05, kzar wrote: > On 2018/03/16 06:47:09, Manish Jethani wrote: > > > Out of curiosity, what do you think about this? ... > > Well I think that what we have already is plenty good enough, this is a rarely > used code-path. While this kind of thing is interesting, the effort would be > better spent elsewhere (e.g. removing the split when we parse the options string But the effort has already been spent, the code has already been written, right? I mean there's nothing to do here but change a few lines of code. It's a rarely used code path now, what if there's a new option like $csp that's actually used very often? Then somebody may remove that /\bcsp=/i check in the future, but they won't think about optimizing the part that follows because they would assume it's already optimal (because why would the ABP team write suboptimal code?). > later). I mean we're expecting a user to have maybe 5 or 10 of these filters, so > with your numbers and being generous and saying 10 csp filters we're talking 1.6 > microseconds vs 1.2 microseconds here, or in other words a saving of 400 > nanoseconds. This assumption about only 10 $csp filters may not hold in the future. Honestly, I'm not seeing the downside of choosing code that's both more efficient and easier to understand, but I'll leave it here. > Since we're already playing code-golf I thought I'd test it[1], seems my numbers > match yours. Your approach is nearly a quarter faster and the difference seems > to be down to the `split().length` operation rather than the `substr` calls. > > [1] - https://jsperf.com/findoptionstext The current version is 60% slower here on Chrome 65 and 76% slower on Firefox 58 (both running on macOS Sierra).
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:829: if (/(;|^)\s*report-(to|uri)\b/.test(csp)) On 2018/03/15 19:03:05, Sebastian Noack wrote: > In the end, there is nothing we can effectively do to account for any > possibly future change in browser's behavior. And this particular change seems > especially obscure/unlikely. On the other hand, as more complexity we add, as > more likely we introduce bugs effecting current browser behavior. Ack. > The list of directives (relevant to ad blocking) will likely keep growing. So > maintaining a whitelist will be a permanent effort. Also it will keep filter > list authors from using the newest directives, as they have to wait for us to > release a new version with the directive added to the whitelist, and for users > to update their ad blocker. Maintaining a blacklist is also a permanent effort, isn't it? Unless we don't care as much about security/privacy, which we do (?). So we have two options: 1. Maintain a whitelist, let filter authors report missing directives ("It's not working for me!"), add the directives as and when, and do a new release after sufficient testing 2. Maintain a blacklist, let somebody report a security/privacy issue (hopefully found on a beta version of the browser, but possibly later when the browser is already stable and widely deployed and somebody has found a way to exploit ABP and made the news), add the directive to the blacklist, do an emergency patch release CSP not working for filter authors is far more acceptable to me than a vulnerability that's already being exploited in the wild (possibly without anybody's knowledge) because of ABP. > Directives that could leak information are quite a special case and (unless > they change the semantics of report-uri/report-to again, which seems unlikely), > I don't see a case to add yet another one in the future. Though its still > possible. In the same way, it is possible that CSS selectors could leak > information in the future, yet we don't consider whitelisting safe CSS syntax. The comparison with CSS is not fair, an HTTP header is far more powerful, especially one that literally has "security" in its name (Content-Security-Policy). > Also it isn't that easy for an attacker to get a malicious filter to be used on > a target machine. The filter lists used by default, come from trusted sources > and we are in the control of the mirrors. So an attacker would have to host > their own filter list and then trick the user in subscribing to this filter > list. In the end, it might be more effective, while not much more difficult, to > trick the user into installing malware instead. I don't think that this > invalidates all security concerns about filter lists, but the attack vector > seems smaller than when dealing with input from most other external sources. A filter doesn't have to be malicious for there to be a security issue, it can be (and usually is) a genuine oversight. But OK, it really depends on the tradeoffs, I see your point here. > There is a case for 'unsafe-eval'. If a website relies on dynamic code > evaluation, but you want to block other script sources that cannot be blocked by > other means, e.g. inline scripts, you would inject a policy like "script-src > http: 'unsafe-eval'". On the other hand, as already concluded, there is no harm > that can be caused with 'unsafe-eval' since the policy cannot make the document > more permissive, anyway. Ack. > I think validation is preferable over normalization here. If we'd just strip > those directives silently, it will be more difficult to figure out why they > don't work, or worse filter list authors won't notice and add them to the filter > list anyway causing potential security issues with other ad blockers that are > less careful. If we mark the filter as invalid, however, this will trigger a > visible error message when adding it as a custom filter (which filter list > authors generally do first to test their filters). Ack.
LGTM
On 2018/03/16 12:53:15, Manish Jethani wrote: > LGTM Sorry, I found another one to blacklist: base-uri. Consider the following HTML: <base href="http://example.com/"> <a href="foo.html">foo</a> Now clicking on the link will take you to http://example.com/foo.html no matter what the URL of the document. If the following header is set: Content-Security-Policy: base-uri 'none' Then the link will go to "foo.html" relative to the document's URL. This could lead to, for example, sensitive data being sent to the wrong server unintentionally (e.g. <form action="submit">). Do we really need to support this?
On 2018/03/16 15:14:10, Manish Jethani wrote: > On 2018/03/16 12:53:15, Manish Jethani wrote: > > LGTM > > Sorry, I found another one to blacklist: base-uri. > > [...] Another one: upgrade-insecure-requests. This one will treat all HTTP URLs in the document as HTTPS URLs, which means it'll send the requests to a different port number, which may be running a different server for all anyone knows (problem: data sent to wrong destination). This also means that the full referrer URL will be sent as opposed to just the host name by default (problem: more data sent than is intended). It also has an effect on the block-all-mixed-content directive (problem: overrides site's security policy). Do we need this?
https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:176: text = text.replace(/[^\S ]+/g, ""); This change (including the same changes below) seems unrelated and since it seems to be a good optimization, if it's not difficult I would recommend to make it in separate commit and it can be landed pretty quickly. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:179: if (/^ *!/.test(text)) This change seems unrelated too and to tell the truth it's not clear that it should be done. It can be a good optimization but IMO it should be confirmed that there are only spaces (0x20). https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); Since parsing time is very important could you please estimate how many times it reaches this line and goes below for most popular subscriptions? Perhaps it can be done as a part of hub#1020. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); I would like to carefully double check what is happening here and few lines below, but have to run. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:804: domains = value.toUpperCase(); Is this change related to the issue? When I was profiling I discovered that we are duplicating a lot of strings by changing their case at another stage, so this change can be just a good optimization. The changing the case here can result that instead of creating several string objects containing the same data we can create a string object with the data once and other string objects will be just a references to that original one. If it's not related could you please make it in a separate commit? https://codereview.adblockplus.org/29680689/diff/29724561/test/filterClasses.js File test/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29724561/test/filterClasses.... test/filterClasses.js:422: "ads"); Although perhaps it's correct, currently it's not supposed that a filter text is spread over several lines, though the can be some white spacing symbols like '\n' either before the actual filter text or after it. If taking it into account will result in more efficient code then I would recommend to do it because not fast enough parsing of filters is a pretty serious issue currently on mobile platforms.
Patch Set 13 : Added more tests, blacklisted directives and added a tiny optimisation > Sorry, I found another one to blacklist: base-uri. > Another one: upgrade-insecure-requests. Done. https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/16 12:05:06, Manish Jethani wrote: > It's a rarely used code path now, what if there's a new option like $csp that's > actually used very often? Then we have bigger problems, e.g. we ran the options regexp on the string twice for those filters. > This assumption about only 10 $csp filters may not hold in the future. Wasn't an assumption, I looked at the uBlock filter list which is where this filter option is already being used. OK I got it wrong and it's 24 not 10... but sure perhaps that number will change in the future. > Honestly, I'm not seeing the downside of choosing code that's both more > efficient and easier to understand, but I'll leave it here. You said yourself that it's subjective which is easier to understand and FWIW I don't agree with you here. Either way the downside is that time spent talking endlessly about something that *does not matter* is wasted, there's an opportunity cost associated - we could have spent that time talking about how to optimise a code path that does matter instead. > The current version is 60% slower here on Chrome 65 and 76% slower on Firefox 58 > (both running on macOS Sierra). I can't reproduce that, I see it as almost 25% slower. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:176: text = text.replace(/[^\S ]+/g, ""); On 2018/03/16 16:49:25, sergei wrote: > This change (including the same changes below) seems unrelated and since it > seems to be a good optimization, if it's not difficult I would recommend to make > it in separate commit and it can be landed pretty quickly. I would rather just land this all together (same below). https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:179: if (/^ *!/.test(text)) On 2018/03/16 16:49:24, sergei wrote: > ... it's not clear that it should be done. It can be a good > optimization but IMO it should be confirmed that there are only spaces > (0x20). It is the case since we've already replaced other whitespace above. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); On 2018/03/16 16:49:25, sergei wrote: > Since parsing time is very important could you please estimate how many times it > reaches this line and goes below for most popular subscriptions? Perhaps it can > be done as a part of hub#1020. None in EasyList, 24 in the uBlock filter list[1]. [1] - https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters... https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:804: domains = value.toUpperCase(); On 2018/03/16 16:49:24, sergei wrote: > Is this change related to the issue? Yes. https://codereview.adblockplus.org/29680689/diff/29725558/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29725558/lib/filterClasses.j... lib/filterClasses.js:220: for (let i = -2; i != -1; i = beforeOptions.indexOf("$", i + 1)) Here you go Manish, I still don't see the point personally. (There's another possible optimisation, the first optionsText.indexOf("$") could start at beforeOptions.indexOf("$") or beforeOptions.length if that's -1. I couldn't see a way to do that in practice that sped things up much however and it hurt readability further.)
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/17 14:35:06, kzar wrote: > On 2018/03/16 12:05:06, Manish Jethani wrote: > > It's a rarely used code path now, what if there's a new option like $csp > that's > > actually used very often? > > Then we have bigger problems, e.g. we ran the options regexp on the string twice > for those filters. > > > This assumption about only 10 $csp filters may not hold in the future. > > Wasn't an assumption, I looked at the uBlock filter list which is where this > filter option is already being used. OK I got it wrong and it's 24 not 10... but > sure perhaps that number will change in the future. > > > Honestly, I'm not seeing the downside of choosing code that's both more > > efficient and easier to understand, but I'll leave it here. > > You said yourself that it's subjective which is easier to understand and FWIW I > don't agree with you here. Either way the downside is that time spent talking > endlessly about something that *does not matter* is wasted, there's an > opportunity cost associated - we could have spent that time talking about how to > optimise a code path that does matter instead. > > > The current version is 60% slower here on Chrome 65 and 76% slower on Firefox > 58 > > (both running on macOS Sierra). > > I can't reproduce that, I see it as almost 25% slower. I actually agree with Manish here, in both the performance concerns and the readability. In particular the `split`ing and `substring`s are way less efficient than just indexes. One has to figure out that the trick with split is actually to merely count the number of occurrence of the dollar sign and it also requires to pay attention where the stripped string and the original one. Since right now it's not an issue on practice I would not insist on anything what blocks the change from landing. However could someone please create the issue with the concerns on https://issues.adblockplus.org/? Additionally I'm recalling hub#7471, perhaps we could measure the performance of emscripten parsing code as the part of this issue and if it makes sense to bring from the emscripten merely the parsing functionality instead of full filter classes and Co. JIC, I would like to remind that not fast enough parsing of the filters is currently a serious issue for us. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:176: text = text.replace(/[^\S ]+/g, ""); On 2018/03/17 14:35:06, kzar wrote: > On 2018/03/16 16:49:25, sergei wrote: > > This change (including the same changes below) seems unrelated and since it > > seems to be a good optimization, if it's not difficult I would recommend to > make > > it in separate commit and it can be landed pretty quickly. > > I would rather just land this all together (same below). OK, however if it gets a bit longer then could you please do it in a separate commit, additionally IMO this as well as perhaps some other changes are worth mentioning in the commit message. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:179: if (/^ *!/.test(text)) On 2018/03/17 14:35:07, kzar wrote: > On 2018/03/16 16:49:24, sergei wrote: > > ... it's not clear that it should be done. It can be a good > > optimization but IMO it should be confirmed that there are only spaces > > (0x20). > > It is the case since we've already replaced other whitespace above. Acknowledged. https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); On 2018/03/17 14:35:07, kzar wrote: > On 2018/03/16 16:49:25, sergei wrote: > > Since parsing time is very important could you please estimate how many times > it > > reaches this line and goes below for most popular subscriptions? Perhaps it > can > > be done as a part of hub#1020. > > None in EasyList, 24 in the uBlock filter list[1]. > > [1] - > https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters... Acknowledged. BTW, now it also additionally executes `!strippedText.includes("$")` and if it includes then` !/\bcsp=/i.test(strippedText)` for every filter. Fortunately `Filter.normalize` seems called not each time when stored filters are read, so it should not be crucial but still important to know because it can affect the first time run, in particular on mobile platforms.
Patch Set 14 : Go with Manish's do ... while ... suggestion https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/19 15:47:30, sergei wrote: > I actually agree with Manish here, in both the performance concerns > and the readability. Well I still disagree on the readability aspect, but it seems I'm now outnumbered there. Done. > In particular the `split`ing and `substring`s are way less > efficient than just indexes. That's half true. `.split("$").length` is slower than the `indexOf("$", ...)` approach. But I did not find using `substring` made much of a difference at all. As of Patch Set 13 (which still used substring) the performance was approx 2% slower than Manish's suggestion. > I would like to remind that not fast enough parsing of the filters > is currently a serious issue for us. Well IMO if this is a concern we'd be better off: 1. Making practical compromises with filter normalisation / parsing. For example just doing `.replace(/ +/, " ").trim()` at normalisation instead of all this $csp related stuff. See my point of view here[1]. We could also go with Manish's suggestion of remove support for option values containing "$"s if it helps. 2. Focusing on more critical parts of the code, e.g. how we do a .split of the options string in the fromText function. If you agree about point 1 let us know which parts are especially painful for performance, perhaps there are other small tweaks with what we allow that could improve performance further. [1] - https://codereview.adblockplus.org/29680684/#msg7 https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29724561/lib/filterClasses.j... lib/filterClasses.js:196: let optionsMatch = Filter.optionsRegExp.exec(strippedText); On 2018/03/19 15:47:31, sergei wrote: > BTW, now it also additionally execute `!strippedText.includes("$")` > and if it includes then` !/\bcsp=/i.test(strippedText)` for every > filter. Yes I know.
https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29723593/lib/filterClasses.j... lib/filterClasses.js:202: let optionsText = text; On 2018/03/17 14:35:06, kzar wrote: > On 2018/03/16 12:05:06, Manish Jethani wrote: > > It's a rarely used code path now, what if there's a new option like $csp > that's > > actually used very often? > > Then we have bigger problems, e.g. we ran the options regexp on the string twice > for those filters. > > > This assumption about only 10 $csp filters may not hold in the future. > > Wasn't an assumption, I looked at the uBlock filter list which is where this > filter option is already being used. OK I got it wrong and it's 24 not 10... but > sure perhaps that number will change in the future. > > > Honestly, I'm not seeing the downside of choosing code that's both more > > efficient and easier to understand, but I'll leave it here. > > You said yourself that it's subjective which is easier to understand and FWIW I > don't agree with you here. Either way the downside is that time spent talking > endlessly about something that *does not matter* is wasted, there's an > opportunity cost associated - we could have spent that time talking about how to > optimise a code path that does matter instead. > > > The current version is 60% slower here on Chrome 65 and 76% slower on Firefox > 58 > > (both running on macOS Sierra). > > I can't reproduce that, I see it as almost 25% slower. Acknowledged. https://codereview.adblockplus.org/29680689/diff/29725558/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29725558/lib/filterClasses.j... lib/filterClasses.js:220: for (let i = -2; i != -1; i = beforeOptions.indexOf("$", i + 1)) On 2018/03/17 14:35:08, kzar wrote: > Here you go Manish, I still don't see the point personally. > > (There's another possible optimisation, the first optionsText.indexOf("$") could > start at beforeOptions.indexOf("$") or beforeOptions.length if that's -1. I > couldn't see a way to do that in practice that sped things up much however and > it hurt readability further.) Acknowledged. https://codereview.adblockplus.org/29680689/diff/29727600/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29727600/lib/filterClasses.j... lib/filterClasses.js:107: "base-uri", "referrer", "report-to", "report-uri", "base-uri", base-uri is repeated here.
Patch Set 15 : Removed duplicate base-uri directive from blacklist https://codereview.adblockplus.org/29680689/diff/29727600/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29727600/lib/filterClasses.j... lib/filterClasses.js:107: "base-uri", "referrer", "report-to", "report-uri", "base-uri", On 2018/03/19 18:03:39, Manish Jethani wrote: > base-uri is repeated here. Done.
LGTM
https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:108: ]); I don't get why we create this Set here, just to turn it into an array, to turn it into a string to insert into regular expression, below. IMO, better: Just use an array here, reducing the boilerplate. IMO, even better: Just use a regular expression literal below. Note that ESLint won't complain about long lines due to regular expression literals. https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:809: csp.push(value); Does that mean that if multiple $csp options are given in the same filter that they are merged? Any reason for these semantics (assuming my assumption is correct)? This seems inconsistent with the $domain option (or not?). If you want to insert multiple policies, you can just separate them by semicolon within the same option value (or not?). https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:847: csp = csp.join("; ").toLowerCase(); If we convert the CSP to lower case, shouldn't this rather happen during normalization? But why do we convert it to lower case in the first place? Is it guaranteed that all parts of a CSP are case-insensitive?
Patch Set 16 : Addressed Sebastian's feedback https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:108: ]); On 2018/03/21 16:23:13, Sebastian Noack wrote: > I don't get why we create this Set here, just to turn it into an array, to turn > it into a string to insert into regular expression, below. > > IMO, better: Just use an array here, reducing the boilerplate. > > IMO, even better: Just use a regular expression literal below. Note that ESLint > won't complain about long lines due to regular expression literals. I created the Set since I figured being about to do `cspBlacklist.has(directive)` might be useful sometime, but you're right it's better to keep things simple. Done. https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:809: csp.push(value); On 2018/03/21 16:23:13, Sebastian Noack wrote: > Does that mean that if multiple $csp options are given in the same > filter that they are merged? Right. > Any reason for these semantics... This seems inconsistent with the > $domain option... you can just separate them by semicolon within the > same option value Well I listed the inability of specifying multiple $csp values for a filter and the inability to have two $csp filters both apply at the same time in the issue as known limitations. Manish asked me why we have those limitations, and while the latter was kind of a pain the former seemed easy enough and so I did it. But your comment here made me reconsider, I've removed this logic again now. https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:847: csp = csp.join("; ").toLowerCase(); On 2018/03/21 16:23:13, Sebastian Noack wrote: > If we convert the CSP to lower case, shouldn't this rather happen > during normalization? Yes and no. While I see your logic of course we currently do not normalise the case of the filter text at normalisation time at all. So if I was to normalise the case of the $csp option there I would argue we should otherwise normalise the case of filters there too. (Yes, it's true that Filter.normalize does not fully normalise filters, that was part of my argument[1] for not worrying as much about how we normalise whitespace.) [1] - https://codereview.adblockplus.org/29680684/#msg7 https://codereview.adblockplus.org/29680689/diff/29727620/lib/filterClasses.j... lib/filterClasses.js:847: csp = csp.join("; ").toLowerCase(); On 2018/03/21 16:23:13, Sebastian Noack wrote: > Is it guaranteed that all parts of a CSP are case-insensitive? Hmm good point, initially I converted the value to lower case here since we previously converted it to upper case which broke everything. But since I adjusted that logic we could instead leave the case alone I suppose. Done.
LGTM
LGTM |