|
|
Created:
May 31, 2017, 2:42 a.m. by Manish Jethani Modified:
July 18, 2017, 6:21 p.m. Reviewers:
Sebastian Noack CC:
kzar Base URL:
https://hg.adblockplus.org/abp2blocklist Visibility:
Public. |
DescriptionIssue 5283 - Add support for $websocket and $webrtc
Patch Set 1 #
Total comments: 9
Patch Set 2 : Generate additional rules if filter contains URL scheme #Patch Set 3 : Generate single rule for filter covering all raw types #Patch Set 4 : Add top-level exception after copying rule #Patch Set 5 : Rebase #
MessagesTotal messages: 15
Patch Set 1 Comments inline. https://codereview.adblockplus.org/29452289/diff/29452290/lib/abp2blocklist.js File lib/abp2blocklist.js (left): https://codereview.adblockplus.org/29452289/diff/29452290/lib/abp2blocklist.j... lib/abp2blocklist.js:257: // Limit rules to HTTP(S) URLs This has been moved into parseFilterRegexpSource. https://codereview.adblockplus.org/29452289/diff/29452290/node_modules/filter... File node_modules/filterClasses.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/node_modules/filter... node_modules/filterClasses.js:558: WEBSOCKET: 128, I have just manually added these two here instead of regenerating the file. If we do upgrade to the latest version of this file, I think it should be a separate commit. https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ We can talk about this. As I said earlier, I don't see why the string "http" at the beginning of the filter text should be treated as a special case. For example, "httpseverywhere.com" will also match "ftp://httpseverywhere.com" but it shouldn't.
https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ On 2017/05/31 02:51:43, Manish Jethani wrote: > We can talk about this. As I said earlier, I don't see why the string "http" at > the beginning of the filter text should be treated as a special case. For > example, "httpseverywhere.com" will also match "ftp://httpseverywhere.com" but > it shouldn't. A pattern of "http://example.com" should match any URL that contains this substring. This is what Adblock Plus (without Content Blockers) does, and some filter lists may rely on this behavior. However, "example.com" should not match "ftp://example.com" or "file:///home/user/example.com/foo.txt", as Adblock Plus is used to ignore these protocols, since anything potentially blocked in a local document or one downloaded via FTP is usually undesired. Currently this is achieved by prefixing such filters with "^https?://.*". But if the protocol is already given in the filter we have to omit the additional prefix because we'd otherwise translate filters like "http://example.com" to "Ä¥ttps?://.*https://example.com" which would no longer match http://example.com. I see how this is a problem now where we want to consider WebSockets, also because we cannot just use "^(http|ws)s?://.*" as there is no alternation in regular expressions used by Content Blockers. Also this workaround wasn't ultimately accurrate as in case of filters like "h*://example.com" or "s://example" we'd still add a redundant protocol part. So rather than always adding a protocol part to filters without implied beginning, we should never do so. But the question still is what is about ftp:// and file:// URLs. Does they go through the Content Blocker anyway? I actually never confirmed that, so perhaps this workaround was redundant from the beginning. Otherwise, there is also the option to just add an exception rule for ftp:// and/or file:// URLs to the generated lists.
https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ On 2017/05/31 07:09:26, Sebastian Noack wrote: > On 2017/05/31 02:51:43, Manish Jethani wrote: > > We can talk about this. As I said earlier, I don't see why the string "http" > at > > the beginning of the filter text should be treated as a special case. For > > example, "httpseverywhere.com" will also match "ftp://httpseverywhere.com" but > > it shouldn't. > > A pattern of "http://example.com" should match any URL that contains this > substring. This is what Adblock Plus (without Content Blockers) does, and some > filter lists may rely on this behavior. > > [...] I see what you mean. In that case we should check for "https?://" and not "http" to be more specific. Also I think we shouldn't block HTTP URLs for a filter that's supposed to specifically block a WebSocket URL, for example (this would be especially problematic for filters without a hostname part). I think it's a good idea to continue adding the URL scheme. We're almost there, I can handle the case where the scheme is already part of the URL pattern (i.e. the above case). Let me post another patch.
https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ On 2017/05/31 07:22:46, Manish Jethani wrote: > On 2017/05/31 07:09:26, Sebastian Noack wrote: > > On 2017/05/31 02:51:43, Manish Jethani wrote: > > > We can talk about this. As I said earlier, I don't see why the string "http" > > at > > > the beginning of the filter text should be treated as a special case. For > > > example, "httpseverywhere.com" will also match "ftp://httpseverywhere.com" > but > > > it shouldn't. > > > > A pattern of "http://example.com" should match any URL that contains this > > substring. This is what Adblock Plus (without Content Blockers) does, and some > > filter lists may rely on this behavior. > > > > [...] > > I see what you mean. In that case we should check for "https?://" and not "http" > to be more specific. > > Also I think we shouldn't block HTTP URLs for a filter that's supposed to > specifically block a WebSocket URL, for example (this would be especially > problematic for filters without a hostname part). I think it's a good idea to > continue adding the URL scheme. We're almost there, I can handle the case where > the scheme is already part of the URL pattern (i.e. the above case). > > Let me post another patch. How about this: * If the filter neither starts with | nor contains :// * If the typeMask includes neither WEBSOCKET not WEBRTC * Generate a single rule prefixed with ^https?://.* * If the typeMask includes only WEBSOCKET * Generate a single rule prefixed with ^wss?://.* * If the typeMask includes only WEBRTC * Generate 2 rules, prefixed with: ^stuns?:.*, ^turns?:.* * If the typeMask includes WEBSOCKET and any type other than WEBRTC * Generate 2 rules, prefixed with: ^wss?://.*, ^https?://.* * If the typeMask includes WEBRTC and any type other than WEBSOCKET * Generate 3 rules, prefixed with: ^stuns?:.*, ^turns?:.*, ^https?://.* * If the typeMask includes WEBSOCKET, WEBRTC and any other type * Generate a single rule without addional protocol handler * Else * Generate a single rule without addional protocol handler And in case file:// and/or ftp:// is an issue, we just add exception rules.
Patch Set 2: Generate additional rules if filter contains URL scheme With this latest patch, we're doing a more accurate translation of ABP filters to content blocker rules, including support for both WebRTC URL schemes (STUN and TURN). For the filter "http://example.com", now we generate two rules: "^http://example.com" and "^https?://.*http://example.com". This way we can match the URL scheme at the beginning as well as anywhere in the URL, while still restricting matches to HTTP(S) URLs. For "stun:foo$webrtc" this will generate "^stun:foo" as well as "^stuns?:.*stun:foo" and "^turns?:.*stun:foo". More comments inline. https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ On 2017/05/31 08:05:46, Sebastian Noack wrote: > How about this: These rules are both complicated and wrong. For example, the URL scheme for WebRTC does not include the two slashes. > * If the filter neither starts with | nor contains :// > * If the typeMask includes neither WEBSOCKET not WEBRTC > * Generate a single rule prefixed with ^https?://.* > * If the typeMask includes only WEBSOCKET > * Generate a single rule prefixed with ^wss?://.* > * If the typeMask includes only WEBRTC > * Generate 2 rules, prefixed with: ^stuns?:.*, ^turns?:.* > * If the typeMask includes WEBSOCKET and any type other than WEBRTC > * Generate 2 rules, prefixed with: ^wss?://.*, ^https?://.* We would have to split the resource types here. e.g. "raw" for the WebRTC rule and other types for the non-WebRTC rule. We can do this, I even did it at first, but do we really need this? It's unnecessarily complicated. Let's call this a limitation of this program, but we can simply not support this type of filter, at least for now. i.e. if you mix WebSocket or WebRTC with other types, we only generate a rule for the other types. I think this is OK, we can improve on this later if we need to. > * If the typeMask includes WEBRTC and any type other than WEBSOCKET > * Generate 3 rules, prefixed with: ^stuns?:.*, ^turns?:.*, ^https?://.* > * If the typeMask includes WEBSOCKET, WEBRTC and any other type > * Generate a single rule without addional protocol handler Then it would match any URL scheme, which may be fine in practice, it's just weird to do this only sometimes and not other times. I have a much better solution with my latest patch. > * Else > * Generate a single rule without addional protocol handler > > And in case file:// and/or ftp:// is an issue, we just add exception rules. I think we should try to do as accurate a translation of ABP filters to content blocker rules as we can. This may make the code more complicated but it would be worth it.
On 2017/06/02 08:02:36, Manish Jethani wrote: > For the filter "http://example.com", now we generate two rules: > "^http://example.com" and "^https?://.*http://example.com". This way we can > match the URL scheme at the beginning as well as anywhere in the URL, while > still restricting matches to HTTP(S) URLs. Would this mean we'd generate way more rules when converting EasyList?
On 2017/06/02 08:32:43, kzar wrote: > On 2017/06/02 08:02:36, Manish Jethani wrote: > > > For the filter "http://example.com", now we generate two rules: > > "^http://example.com" and "^https?://.*http://example.com". This way we can > > match the URL scheme at the beginning as well as anywhere in the URL, while > > still restricting matches to HTTP(S) URLs. > > Would this mean we'd generate way more rules when converting EasyList? This affects only one filter in EasyList+AA. Normally a filter like that is anchored (i.e. "|http://...").
https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ On 2017/06/02 08:02:36, Manish Jethani wrote: > On 2017/05/31 08:05:46, Sebastian Noack wrote: > > How about this: > > These rules are both complicated and wrong. For example, the URL scheme for > WebRTC does not include the two slashes. > > > * If the filter neither starts with | nor contains :// > > * If the typeMask includes neither WEBSOCKET not WEBRTC > > * Generate a single rule prefixed with ^https?://.* > > * If the typeMask includes only WEBSOCKET > > * Generate a single rule prefixed with ^wss?://.* > > * If the typeMask includes only WEBRTC > > * Generate 2 rules, prefixed with: ^stuns?:.*, ^turns?:.* > > * If the typeMask includes WEBSOCKET and any type other than WEBRTC > > * Generate 2 rules, prefixed with: ^wss?://.*, ^https?://.* > > We would have to split the resource types here. e.g. "raw" for the WebRTC rule > and other types for the non-WebRTC rule. We can do this, I even did it at first, > but do we really need this? It's unnecessarily complicated. Let's call this a > limitation of this program, but we can simply not support this type of filter, > at least for now. i.e. if you mix WebSocket or WebRTC with other types, we only > generate a rule for the other types. I think this is OK, we can improve on this > later if we need to. Note that "raw" is for all requests initiated by code, this includes XMLHTTPREQUEST, WEBSOCKET and WEBRTC. However, if we prefix the pattern with ^wss?://.*, ^stuns?:.* or ^turns?:.* we could potentially even omit the resource type because the URL pattern already implies a subset of "raw". Similarly when a filter doesn't use any type options, and therefore should match all resource types, we can just omit the resource type along the protocol prefix as well. So the only scenario in which we'd further have to split the rules are filters like ||example.com$websocket,image. These are rare enough that I wouldn't mind the addional rules produced, on the other hand, if this will get too complicated a limitation here might be acceptable too. But filters with no type option are quite common, and should be applied to WEBSOCKET and WEBRTC as well. > I think we should try to do as accurate a translation of ABP filters to content > blocker rules as we can. This may make the code more complicated but it would be > worth it. Well, Adblock Plus generally ignores the protocol when matching filters which don't explicitly include the protocol part. But non-HTTP/WS/RTC URLs are ignored before filter matching takes place. So IMO ignoring the protocol here as well, and then ignore these protocols by an exception rule which we add automatically, doesn't seem any less accurrate to me.
Patch Set 3: Generate single rule for filter covering all raw types For filters without an option, e.g. "foo", this will generate a single rule with a generic URL scheme pattern. This even works for filters with options covering all raw types, e.g. "foo$websocket,webrtc,xmlhttprequest,script". If one of the raw types is missing, it'll generate multiple rules. https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.js File test/abp2blocklist.js (right): https://codereview.adblockplus.org/29452289/diff/29452290/test/abp2blocklist.... test/abp2blocklist.js:87: testRules(test, ["/foo", "||test.com", "|http://example.com/foo"], [ On 2017/06/02 10:58:50, Sebastian Noack wrote: > On 2017/06/02 08:02:36, Manish Jethani wrote: > > On 2017/05/31 08:05:46, Sebastian Noack wrote: > > > How about this: > > > > These rules are both complicated and wrong. For example, the URL scheme for > > WebRTC does not include the two slashes. > > > > > * If the filter neither starts with | nor contains :// > > > * If the typeMask includes neither WEBSOCKET not WEBRTC > > > * Generate a single rule prefixed with ^https?://.* > > > * If the typeMask includes only WEBSOCKET > > > * Generate a single rule prefixed with ^wss?://.* > > > * If the typeMask includes only WEBRTC > > > * Generate 2 rules, prefixed with: ^stuns?:.*, ^turns?:.* > > > * If the typeMask includes WEBSOCKET and any type other than WEBRTC > > > * Generate 2 rules, prefixed with: ^wss?://.*, ^https?://.* > > > > We would have to split the resource types here. e.g. "raw" for the WebRTC rule > > and other types for the non-WebRTC rule. We can do this, I even did it at > first, > > but do we really need this? It's unnecessarily complicated. Let's call this a > > limitation of this program, but we can simply not support this type of filter, > > at least for now. i.e. if you mix WebSocket or WebRTC with other types, we > only > > generate a rule for the other types. I think this is OK, we can improve on > this > > later if we need to. > > Note that "raw" is for all requests initiated by code, this includes > XMLHTTPREQUEST, WEBSOCKET and WEBRTC. However, if we prefix the pattern with > ^wss?://.*, ^stuns?:.* or ^turns?:.* we could potentially even omit the resource > type because the URL pattern already implies a subset of "raw". Similarly when a > filter doesn't use any type options, and therefore should match all resource > types, we can just omit the resource type along the protocol prefix as well. So > the only scenario in which we'd further have to split the rules are filters like > ||example.com$websocket,image. These are rare enough that I wouldn't mind the > addional rules produced, on the other hand, if this will get too complicated a > limitation here might be acceptable too. But filters with no type option are > quite common, and should be applied to WEBSOCKET and WEBRTC as well. Makes sense, done. > > I think we should try to do as accurate a translation of ABP filters to > content > > blocker rules as we can. This may make the code more complicated but it would > be > > worth it. > > Well, Adblock Plus generally ignores the protocol when matching filters which > don't explicitly include the protocol part. But non-HTTP/WS/RTC URLs are ignored > before filter matching takes place. So IMO ignoring the protocol here as well, > and then ignore these protocols by an exception rule which we add automatically, > doesn't seem any less accurrate to me. Alright, I agree.
(Moving myself to Cc since I think Sebastian should review this one.)
LGTM
Patch Set 4: Add top-level exception after copying rule Rebased. Sorry I messed up the last patch set. It overwrote the latest commit to master. Now this one is properly rebased. We have to exclude the top-level URL from the trigger as per the last commit to master, and we have to do this for each rule after we make a copy. I had to update the unit tests accordingly.
Patch Set 5: Rebase There were quite a few conflicts, all resolved.
Still LGTM |