|
|
Created:
Nov. 5, 2014, 2:19 p.m. by Sebastian Noack Modified:
Nov. 10, 2014, 2:26 p.m. CC:
Thomas Greiner Visibility:
Public. |
DescriptionIssue 1527 - Properly escape generated CSS selectors
Patch Set 1 : #
Total comments: 10
Patch Set 2 : Addressed comments #
Total comments: 3
Patch Set 3 : Renamed function and added test case #
Total comments: 14
Patch Set 4 : Addressed comments #
Total comments: 1
MessagesTotal messages: 14
Don't you think we should better overescape slightly but have simpler rules instead?
On 2014/11/05 14:29:18, Wladimir Palant wrote: > Don't you think we should better overescape slightly but have simpler rules > instead? I'd prefer to keep the generated filters as human-readable and compact as possible. Can you comment on the lines, which you think can be simplified?
Only commenting on the escaping code right now, not the rest of it. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:35: if ((code >= 0 && code <= 31) || (code >= 48 && code <= 57) || code == 127) How about: if (code <= 0x1F || /\d/.test(chr) || code == 0x7F) Why I prefer this: 1. The character code cannot be negative, and there is nothing around that this check should be consistent with. 2. Hexadecimal makes the connection to regular expressions below easier to understand. 3. Digits are better described with a regexp rather than ASCII codes. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:36: return "\\" + code.toString(16) + (pos + 1 < s.length ? " " : ""); Why create a special case for end of string? That extra space certainly won't hurt. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:44: return '"' + value.replace(/["\\\r\n\f\0]/g, escapeChar) + '"'; How about generally escaping all non-printable characters? return '"' + value.replace(/[\0x00-\0x1F"\\], escapeChar/) + '"'; This removes a bunch of assumptions, and non-printable characters are generally uncommon enough that the overescaping here shouldn't matter. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:48: return s.replace(/^\d|^-(?![^\d-])|[^\w-]/g, escapeChar); How about always escaping leading dashes? These are uncommon anyway and the rule will become much simpler: // Alphanumerical characters, dash and underscore generally don't need to be escaped, first character can only be a letter however. return s.replace(/^[\d-]|[^\w-]/, escapeChar);
http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:31: if (code >= 128) I just realized that I can easily check in the regex for non-ascii in order to get rid of that check. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:35: if ((code >= 0 && code <= 31) || (code >= 48 && code <= 57) || code == 127) On 2014/11/05 15:14:41, Wladimir Palant wrote: > How about: > > if (code <= 0x1F || /\d/.test(chr) || code == 0x7F) > > Why I prefer this: > > 1. The character code cannot be negative, and there is nothing around that this > check should be consistent with. > 2. Hexadecimal makes the connection to regular expressions below easier to > understand. > 3. Digits are better described with a regexp rather than ASCII codes. I already considered using that regex to check for digits. But I find it slightly inconsistent. But never mind. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:36: return "\\" + code.toString(16) + (pos + 1 < s.length ? " " : ""); On 2014/11/05 15:14:41, Wladimir Palant wrote: > Why create a special case for end of string? That extra space certainly won't > hurt. I find the extra space in the end confusing. It can be easily misread at the end of a quoted string, and it looks ugly when there are duplicated spaces between two tokens. But given this is a corner case, I guess we can remove that check to simplify the code. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:44: return '"' + value.replace(/["\\\r\n\f\0]/g, escapeChar) + '"'; On 2014/11/05 15:14:41, Wladimir Palant wrote: > How about generally escaping all non-printable characters? > > return '"' + value.replace(/[\0x00-\0x1F"\\], escapeChar/) + '"'; > > This removes a bunch of assumptions, and non-printable characters are generally > uncommon enough that the overescaping here shouldn't matter. Yeah, un-escaped control characters make quoted strings harder to read anyway. I have restructured the code, to have the code points for the control characters only given once. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:48: return s.replace(/^\d|^-(?![^\d-])|[^\w-]/g, escapeChar); On 2014/11/05 15:14:41, Wladimir Palant wrote: > How about always escaping leading dashes? These are uncommon anyway and the rule > will become much simpler: > > // Alphanumerical characters, dash and underscore generally don't need to be > escaped, first character can only be a letter however. > return s.replace(/^[\d-]|[^\w-]/, escapeChar); I'd prefer to keep the logic for escaping dashes. Dashes in IDs and classes are pretty common, and occasionally they occur in the first character. And not over-escaping those, doesn't add any complexity to the JS code, but just a simple branch to the regex, which is justified IMO.
Looks good except for that "id with leading dash" logic. Hoping to get some feedback from Dave on that. http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5105650963054592/incl... include.postload.js:48: return s.replace(/^\d|^-(?![^\d-])|[^\w-]/g, escapeChar); On 2014/11/05 18:09:04, Sebastian Noack wrote: > And not > over-escaping those, doesn't add any complexity to the JS code, but just a > simple branch to the regex, which is justified IMO. I'll leave the decision on whether that branch in the regexp can be described as simple up to Dave. As to IDs starting with a dash - I don't buy into this being anywhere near common. At least EasyList and EasyList Germany only have filters referring to #\2d for one website. And it is really the entire ID in that case, so escaping is unavoidable. Not sure whether that site indeed had some element with id="-" at some point, it doesn't seem to have it now. http://codereview.adblockplus.org/4935175632846848/diff/5685265389584384/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5685265389584384/incl... include.postload.js:50: /^\d|^-(?![^\d-])|[^\w-\u0080-\uFFFF]/g, Constructing character classes that deal with large ranges was highly inefficient last time I tried using them. However, http://jsperf.com/character-classes-performance seems to confirm that there is no significant performance difference in modern browsers (anything below factor 10 doesn't really matter here, performance of this code isn't critical).
http://codereview.adblockplus.org/4935175632846848/diff/5685265389584384/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5685265389584384/incl... include.postload.js:50: /^\d|^-(?![^\d-])|[^\w-\u0080-\uFFFF]/g, On 2014/11/05 19:56:52, Wladimir Palant wrote: > Constructing character classes that deal with large ranges was highly > inefficient last time I tried using them. However, > http://jsperf.com/character-classes-performance seems to confirm that there is > no significant performance difference in modern browsers (anything below factor > 10 doesn't really matter here, performance of this code isn't critical). You forgot to run the regex, but there still isn't a notable difference: http://jsperf.com/character-classes-performance/2
On 2014/11/05 19:56:52, Wladimir Palant wrote: > Looks good except for that "id with leading dash" logic. Hoping to get some > feedback from Dave on that. > Well after looking through everything it appears that leading hyphens are OK as long as they're not followed by a digit or another hyphen. The regexp Sebastian has written seems fine to match those cases. (I have tested the escapeToken function for those situations as well and it works as I expected.) I can't find any information to say if escaping the leading hyphen like "\-" is enough to then make the selector valid or not however. I tried to read the "Grammer of CSS" document from w3 but if I'm honest I found it a little hard to digest.
On 2014/11/06 12:27:48, kzar wrote: > On 2014/11/05 19:56:52, Wladimir Palant wrote: > > Looks good except for that "id with leading dash" logic. Hoping to get some > > feedback from Dave on that. > > > > Well after looking through everything it appears that leading hyphens are OK as > long as they're not followed by a digit or another hyphen. The regexp Sebastian > has written seems fine to match those cases. (I have tested the escapeToken > function for those situations as well and it works as I expected.) > > I can't find any information to say if escaping the leading hyphen like "\-" is > enough to then make the selector valid or not however. I tried to read the > "Grammer of CSS" document from w3 but if I'm honest I found it a little hard to > digest. Over-escaping is generally fine. Theoretically we could escape any single character of the selector (except for those that are supposed to have a special meaning, like the leading # when selecting by id), and it will work in every browser. However, I'd prefer not to over-escape here.
As discussed with Wladmir, I've added a test case and renamed the function to escapeCSS(). Also I realized that RegExp objects using the global flag have state. Hence control characters weren't always escaped correctly. This also has been fixed now.
http://codereview.adblockplus.org/4935175632846848/diff/5685265389584384/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5685265389584384/incl... include.postload.js:50: /^\d|^-(?![^\d-])|[^\w-\u0080-\uFFFF]/g, On 2014/11/06 08:25:14, Sebastian Noack wrote: > You forgot to run the regex, but there still isn't a notable difference: > http://jsperf.com/character-classes-performance/2 No, I didn't intend to run it - the slowdown I know happened during initialization.
http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/incl... include.postload.js:39: return '"' + value.replace(new RegExp('["\\\\]|' + ctrlChar.source, "g"), escapeChar) + '"'; Generating regular expressions dynamically for no good reason whatsoever is a clear step back as far as readability and maintainability goes :-( http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/incl... include.postload.js:44: return s.replace(/^\d|^-(?![^\d\-])|[^\w\-\u0080-\uFFFF]/g, escapeChar); Thinking more about this, I'm definitely opposed to a special case for dashes not followed by a digit or dash here. 1) This special case is irrelevant on real websites. 2) It makes the rules more complicated and much harder to verify (as opposed to: "always escape dashes if they are the first character"). 3) Mozilla implements this logic differently which shows that this rule isn't canonical. 4) This makes the otherwise trivial regular expression much harder to understand. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... File qunit/tests/cssEscaping.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:7: frame.addEventListener("load", function() Nit: This should be part of test setup (defined in module() call), with teardown that will remove the frame from the document again. Setup can also call stop() and start() to perform asynchronous actions. Note the setup function can also initialize escapeCSS and quote properties: this.escapeCSS = frame.contentWindow.escapeCSS; http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:14: document.body.removeChild(frame); Won't removing the frame immediately prevent escapeCSS from calling other functions from that frame? http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:16: function testSelector(opts) { Nit: opening bracket on next line please. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:92: testEscape("-" + chr); How about testing a letter as leading character as well? http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:96: testEscape("😻♥ä"); Better not have Unicode characters in a JavaScript file - browsers don't always recognize UTF-8 correctly. "\u1234\u4321\u009F" will be less error-prone.
http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/incl... File include.postload.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/incl... include.postload.js:39: return '"' + value.replace(new RegExp('["\\\\]|' + ctrlChar.source, "g"), escapeChar) + '"'; On 2014/11/06 21:26:06, Wladimir Palant wrote: > Generating regular expressions dynamically for no good reason whatsoever is a > clear step back as far as readability and maintainability goes :-( The idea was not to hard-code the char codes for control characters twice. But the way I previously reused the regex lead to buggy behavior, since RegExp objects using the global flag, persist the postion of the last match and resume at that position when calling .test() the next time. But frankly, I also dislike this construct, and removed it now. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/incl... include.postload.js:44: return s.replace(/^\d|^-(?![^\d\-])|[^\w\-\u0080-\uFFFF]/g, escapeChar); On 2014/11/06 21:26:06, Wladimir Palant wrote: > Thinking more about this, I'm definitely opposed to a special case for dashes > not followed by a digit or dash here. > > 1) This special case is irrelevant on real websites. > 2) It makes the rules more complicated and much harder to verify (as opposed to: > "always escape dashes if they are the first character"). > 3) Mozilla implements this logic differently which shows that this rule isn't > canonical. > 4) This makes the otherwise trivial regular expression much harder to > understand. It's probably not worth to argue here. But I still think that the rule is easier to read when the leading dash isn't escaped, and therefore should only be escaped when necessary, and that a look-ahead doesn't make the regex much harder to understand. Instead escaping the leading dash, you can also escape the digit or dash in the second position. That is what Gecko's CSS.escape() function does, when the string starts with a dash followed by a digit. However, Gecko doesn't escape sequences of leading dashes, since those need no escaping there. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... File qunit/tests/cssEscaping.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:7: frame.addEventListener("load", function() On 2014/11/06 21:26:06, Wladimir Palant wrote: > Nit: This should be part of test setup (defined in module() call), with teardown > that will remove the frame from the document again. Setup can also call stop() > and start() to perform asynchronous actions. > > Note the setup function can also initialize escapeCSS and quote properties: > > this.escapeCSS = frame.contentWindow.escapeCSS; I already wondered whether qunit has setup/teardown hooks. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:14: document.body.removeChild(frame); On 2014/11/06 21:26:06, Wladimir Palant wrote: > Won't removing the frame immediately prevent escapeCSS from calling other > functions from that frame? No, the test passes without any errors. I think it's because the cached functions, keep references to either their window object, or directly to the functions they call and variables they access. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:16: function testSelector(opts) { On 2014/11/06 21:26:06, Wladimir Palant wrote: > Nit: opening bracket on next line please. Done. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:92: testEscape("-" + chr); On 2014/11/06 21:26:06, Wladimir Palant wrote: > How about testing a letter as leading character as well? I considered that case redundant. But fair enough. http://codereview.adblockplus.org/4935175632846848/diff/5728116278296576/quni... qunit/tests/cssEscaping.js:96: testEscape("😻♥ä"); On 2014/11/06 21:26:06, Wladimir Palant wrote: > Better not have Unicode characters in a JavaScript file - browsers don't always > recognize UTF-8 correctly. "\u1234\u4321\u009F" will be less error-prone. I actually enjoyed seeing a cat with heart-shaped eyes, when reading this file. ;) But never mind.
LGTM with the last nit fixed. http://codereview.adblockplus.org/4935175632846848/diff/5704837555552256/quni... File qunit/tests/cssEscaping.js (right): http://codereview.adblockplus.org/4935175632846848/diff/5704837555552256/quni... qunit/tests/cssEscaping.js:7: frame.srcdoc='<script src="../include.postload.js"></script>'; Nit: Missing spaces around = |