|
|
Created:
Feb. 21, 2016, 1:48 p.m. by kzar Modified:
Feb. 26, 2016, 4:27 p.m. Reviewers:
Sebastian Noack Visibility:
Public. |
DescriptionIssue 3670 - Make rules case sensitive where possible
(Depends on this review: https://codereview.adblockplus.org/29336753/ )
Patch Set 1 #
Total comments: 8
Patch Set 2 : Rebased and addressed feedback #Patch Set 3 : Fix edge case for filters ending with "://" #
Total comments: 4
Patch Set 4 : Keep rules with non alpha characters after hostname ends case sensitive #Patch Set 5 : Handle "/" separately and the fall through #
Total comments: 8
Patch Set 6 : Fixed mistake with logic for * and ^, addressed other feedback #
Total comments: 9
Patch Set 7 : Addressed more feedback #
Total comments: 2
Patch Set 8 : Improved documentation of toRegexp function and switched to JSDoc syntax #MessagesTotal messages: 14
Patch Set 1
https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... File lib/contentBlockerLists.js (right): https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:175: function caseSensitive(filter) It probably makes sense to integrate the logic here into the state machine that converts the filter to a regular expression. Also note that we have to convert filters that doesn't explicitly use the $match_case option to lowercase. Otherwise, filters like ||EXAMPLE.COM would never match, while they previously did match requests like "https://example.com". https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:180: if (!(filter.text.startsWith("||") || filter.text.startsWith("://"))) Filters usually don't start with "://". But filters like http://adserver.com are supposed to be marked case sensitive. https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:184: let boundary = filter.text.substr(offset).search(/\/\?\*\^/); This regular expression would only match a seqeunce of "/?*^". However, we are looking for any of these charecters. It seems you forgot the brackets. But as I said in the comment above, this logic should be rather integrated in the state machine that we have in place, anyway. https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:385: trigger: {"url-filter": matchDomain, Same for element hiding filters; we have to make sure to convert the host to lowercase.
Patch Set 2 : Rebased and addressed feedback https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... File lib/contentBlockerLists.js (right): https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:175: function caseSensitive(filter) On 2016/02/21 21:42:30, Sebastian Noack wrote: > It probably makes sense to integrate the logic here into the state machine that > converts the filter to a regular expression. > > Also note that we have to convert filters that doesn't explicitly use the > $match_case option to lowercase. Otherwise, filters like ||EXAMPLE.COM would > never match, while they previously did match requests like > "https://example.com". Done. https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:180: if (!(filter.text.startsWith("||") || filter.text.startsWith("://"))) On 2016/02/21 21:42:30, Sebastian Noack wrote: > Filters usually don't start with "://". But filters like http://adserver.com are > supposed to be marked case sensitive. > Done. https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:184: let boundary = filter.text.substr(offset).search(/\/\?\*\^/); On 2016/02/21 21:42:30, Sebastian Noack wrote: > This regular expression would only match a seqeunce of "/?*^". However, we are > looking for any of these charecters. It seems you forgot the brackets. But as I > said in the comment above, this logic should be rather integrated in the state > machine that we have in place, anyway. (Yep, forgot the brackets.) https://codereview.adblockplus.org/29336787/diff/29336788/lib/contentBlockerL... lib/contentBlockerLists.js:385: trigger: {"url-filter": matchDomain, On 2016/02/21 21:42:30, Sebastian Noack wrote: > Same for element hiding filters; we have to make sure to convert the host to > lowercase. Done.
Patch Set 3 : Fix edge case for filters ending with "://"
https://codereview.adblockplus.org/29336787/diff/29337682/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337682/lib/contentBlockerL... lib/contentBlockerList.js:112: if (containsHostname && i < lastIndex) I think caseSensitive shouldn't be reset to false on any of these charecters, but on any alpha character that follows. For example ||example.com/42 can be case sensitive as well. https://codereview.adblockplus.org/29336787/diff/29337682/lib/contentBlockerL... lib/contentBlockerList.js:115: i >= 2 && text[i-2] == ":" && text[i-1] == "/" && c == "/") I think it would make more sense to handle "/" separately and the fall through to the code here: case "/": if (!containsHostname && text.charAt(i-2) == ":" && text.charAt(i-1) == "/") containsHostname = caseSensitive = true; case "?": case "*": case "^": ...
Patch Set 4 : Keep rules with non alpha characters after hostname ends case sensitive Patch Set 5 : Handle "/" separately and the fall through https://codereview.adblockplus.org/29336787/diff/29337682/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337682/lib/contentBlockerL... lib/contentBlockerList.js:112: if (containsHostname && i < lastIndex) On 2016/02/24 20:31:05, Sebastian Noack wrote: > I think caseSensitive shouldn't be reset to false on any of these charecters, > but on any alpha character that follows. For example ||example.com/42 can be > case sensitive as well. Done. https://codereview.adblockplus.org/29336787/diff/29337682/lib/contentBlockerL... lib/contentBlockerList.js:115: i >= 2 && text[i-2] == ":" && text[i-1] == "/" && c == "/") On 2016/02/24 20:31:05, Sebastian Noack wrote: > I think it would make more sense to handle "/" separately and the fall through > to the code here: > > case "/": > if (!containsHostname && text.charAt(i-2) == ":" > && text.charAt(i-1) == "/") > containsHostname = caseSensitive = true; > > case "?": > case "*": > case "^": > ... Done.
https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:113: if (!hostnameStarted && i >= 2 && text[i-2] == ":" && text[i-1] == "/") Note that when you use charAt() as in my initial suggestion you shouldn't have to bother about out of bound reads, making the i>=2 check redundant. https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:117: break; If I get the logic here straight, there is an inconsistency: If we encounter "://" the last slash wouldn't be escaped as we bail out here. However, for all other occurences of slashes we would fall through, and eventually escape it, right? But we shouldn't escape slashes no matter what. https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:119: case "?": case "*": case "^": "*" and "^" are already handled and bail out above. So this code is never reached for these characters. https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:133: if (hostnameFinished && (c >= "a" && c <= "z" || c >= "A" && c <= "Z")) Note that this check can be simlified with a regexp: /[a-z]/i.test(c). But I can understand if you don't want to use a regexp in a state machine. I don't have a strong opinion either, so I leave it up to you. However, if you stick to the current logic, IMO it would be slightly easier to read like that: if (hostnameFinished && (c >= "a" && c <= "z" || c >= "A" && c <= "Z"))
Patch Set 6 : Fixed mistake with logic for * and ^, addressed other feedback https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:113: if (!hostnameStarted && i >= 2 && text[i-2] == ":" && text[i-1] == "/") On 2016/02/24 21:47:06, Sebastian Noack wrote: > Note that when you use charAt() as in my initial suggestion you shouldn't have > to bother about out of bound reads, making the i>=2 check redundant. Done. https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:117: break; On 2016/02/24 21:47:07, Sebastian Noack wrote: > If I get the logic here straight, there is an inconsistency: If we encounter > "://" the last slash wouldn't be escaped as we bail out here. However, for all > other occurences of slashes we would fall through, and eventually escape it, > right? But we shouldn't escape slashes no matter what. No, for all other occurrences of slashes we fall through, but then push the unescaped version as `c != "?"` is true. I've done it this way because otherwise hostnameStarted will be set to true and then immediately hostnameFinished will be set to true as well. (I'm not entirely sure the change to use a fall through here helped, which is why I uploaded that change as a separate patch set.) In fact, after addressing your other feedback this logic makes no sense at all. I've reworked it. https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:119: case "?": case "*": case "^": On 2016/02/24 21:47:06, Sebastian Noack wrote: > "*" and "^" are already handled and bail out above. So this code is never > reached for these characters. Done. https://codereview.adblockplus.org/29336787/diff/29337694/lib/contentBlockerL... lib/contentBlockerList.js:133: if (hostnameFinished && (c >= "a" && c <= "z" || c >= "A" && c <= "Z")) On 2016/02/24 21:47:06, Sebastian Noack wrote: > Note that this check can be simlified with a regexp: /[a-z]/i.test(c). But I can > understand if you don't want to use a regexp in a state machine. I don't have a > strong opinion either, so I leave it up to you. However, if you stick to the > current logic, IMO it would be slightly easier to read like that: > > if (hostnameFinished && (c >= "a" && c <= "z" || > c >= "A" && c <= "Z")) Done.
https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:111: hostnameStarted = caseSensitive = true; Nit: Sometimes you have the regular expression conversion logic first, sometimes you have case sensitivity logic first. Mind doing it consistently? https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:121: result.push("/"); If we move this case just above the default case, we could simply pass through, and wouldn't need |result.push("/");break;| here. https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:126: case ".": case "+": case "$": case "{": Nit: It doesn't really matter, but you wrap after the 5th (instead 4th) case, we have an equal number of cases per line, and the parantheses/braces/etc are on the same line as their corresponding counter part. ;) https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:161: if (result.caseSensitive) I think the logic here would be a little more straight forward if you put it like that: if (result.caseSensitive) trigger["url-filter"] = trigger["url-filter"].toLowerCase(); if (result.caseSensitive || filter.matchCase) trigger["url-filter-is-case-sensitive"] = true; Note that this also simplifies the calling code. (Moreover, I'd personally use a variable for the regexp rather than duplicating the object lookup, but I leave that up to you.)
https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:161: if (result.caseSensitive) On 2016/02/24 22:53:49, Sebastian Noack wrote: > I think the logic here would be a little more straight forward if you put it > like that: > > if (result.caseSensitive) > trigger["url-filter"] = trigger["url-filter"].toLowerCase(); > > if (result.caseSensitive || filter.matchCase) > trigger["url-filter-is-case-sensitive"] = true; > > Note that this also simplifies the calling code. > > (Moreover, I'd personally use a variable for the regexp rather than duplicating > the object lookup, but I leave that up to you.) > Ah wait, the logic is incorrect anyway, with either approach of us. If $match_case is requested, we shouldn't convert the filter to lower case. It's more a theoretical case as a filter like ||EXAMPLE.COM^$match_case would never match anything, but we should behave consistent with Adblock Plus here. So the logic is supposed to be: if (result.caseSensitive && !filter.matchCase) trigger["url-filter"] = trigger["url-filter"].toLowerCase(); if (result.caseSensitive || filter.matchCase) trigger["url-filter-is-case-sensitive"] = true;
Patch Set 7 : Addressed more feedback https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:111: hostnameStarted = caseSensitive = true; On 2016/02/24 22:53:49, Sebastian Noack wrote: > Nit: Sometimes you have the regular expression conversion logic first, sometimes > you have case sensitivity logic first. Mind doing it consistently? Done. https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:121: result.push("/"); On 2016/02/24 22:53:49, Sebastian Noack wrote: > If we move this case just above the default case, we could simply pass through, > and wouldn't need |result.push("/");break;| here. Done. https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:126: case ".": case "+": case "$": case "{": On 2016/02/24 22:53:48, Sebastian Noack wrote: > Nit: It doesn't really matter, but you wrap after the 5th (instead 4th) case, we > have an equal number of cases per line, and the parantheses/braces/etc are on > the same line as their corresponding counter part. ;) Done. https://codereview.adblockplus.org/29336787/diff/29337696/lib/contentBlockerL... lib/contentBlockerList.js:161: if (result.caseSensitive) On 2016/02/24 22:53:49, Sebastian Noack wrote: > I think the logic here would be a little more straight forward if you put it > like that: > > if (result.caseSensitive) > trigger["url-filter"] = trigger["url-filter"].toLowerCase(); > > if (result.caseSensitive || filter.matchCase) > trigger["url-filter-is-case-sensitive"] = true; > > Note that this also simplifies the calling code. > > (Moreover, I'd personally use a variable for the regexp rather than duplicating > the object lookup, but I leave that up to you.) > Done.
LGTM!
https://codereview.adblockplus.org/29336787/diff/29337698/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337698/lib/contentBlockerL... lib/contentBlockerList.js:68: // Convert the "regexpSource" part of a filter's text to a regular expression, Sorry for commenting after LGTM. Just one more nit: While I don't consider documentation for internal APIs mandatory, if you comment private functions, please still use JSDoc syntax.
Patch Set 8 : Improved documentation of toRegexp function and switched to JSDoc syntax https://codereview.adblockplus.org/29336787/diff/29337698/lib/contentBlockerL... File lib/contentBlockerList.js (right): https://codereview.adblockplus.org/29336787/diff/29337698/lib/contentBlockerL... lib/contentBlockerList.js:68: // Convert the "regexpSource" part of a filter's text to a regular expression, On 2016/02/25 02:28:35, Sebastian Noack wrote: > Sorry for commenting after LGTM. Just one more nit: While I don't consider > documentation for internal APIs mandatory, if you comment private functions, > please still use JSDoc syntax. Done. |