|
|
Created:
May 7, 2018, 3:38 p.m. by Manish Jethani Modified:
March 16, 2019, 3:54 p.m. Reviewers:
kzar CC:
sergei Base URL:
https://hg.adblockplus.org/adblockpluscore/ Visibility:
Public. |
DescriptionElemHide.getSelectorsForDomain does filter matching for every domain. This is not necessary; instead, it can do filter matching only for known domains (i.e. those that have filters that would apply to them). For any random domain like x4aabd7c0ddcd.com or manishjethani.com the lookups can be much, much faster, because there are no domain-specific filters that apply and no exceptions to check against (except for generic exceptions like "#@#foo", which would still apply to "##foo" on x4aabd7c0ddcd.com or manishjethani.com).
Since the list of generic selectors that apply on most domains is the same, ElemHide.getSelectorsForDomain can cache this list and, for most domains, return the cached list concatenated with the generated list of domain-specific selectors.
With these changes, the performance improvement for most domains is 10x for EastList+AA (as of 12 May 2018) and possibly significantly higher as the list of subscriptions grows.
There is no performance penalty for domains to which this optimization does not apply.
Patch Set 1 #
Total comments: 10
Patch Set 2 : Check generic exception set size before looking up #
Total comments: 10
Patch Set 3 : Look up filters before deciding path #Patch Set 4 : Merge getConditionalGenericSelectors into getConditionalSelectorsForDomain #
Total comments: 1
Patch Set 5 : Rebase #Patch Set 6 : Split out getConditionalGenericSelectors again and cache result #
Total comments: 2
Patch Set 7 : Used cached list for generic-friendly domains #
Total comments: 2
Patch Set 8 : Limit size of generic-friendly domains set #Patch Set 9 : Rebase on master #Patch Set 10 : Clean up #Patch Set 11 : Fix JSDoc #Patch Set 12 : Use longest subdomain as key in generic-friendly domains set #Patch Set 13 : Move selector matching into its own function #
Total comments: 19
Patch Set 14 : Rebase #Patch Set 15 : Inline doesFilterApply #Patch Set 16 : Reinclude test/elemHide.js #Patch Set 17 : Strip out optimization for "generic-friendly" domains #Patch Set 18 : Avoid unnecessary Array.concat #MessagesTotal messages: 34
Patch Set 1 I'll file an issue for this shortly, but see the description here. I'll also post details on how I tested the performance in my next comment. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:30: * @type {Map.<string,?Map.<Filter,boolean>>} The value here can now be null. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:95: * Returns a list of selectors that apply on any unknown domain There are generic selectors that are "conditional" (don't apply on all domains) that would still apply on any unknown domain. For example, "~foo.example.com##foo" would apply on all but foo.example.com, but if the domain we're looking up for is example.com ("unknown" domain) then it's as good as unconditional. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:108: if (!genericExceptionSelectors.has(selector)) Here we have to check for generic exceptions like "#@#foo", which would disable "~foo.example.com##foo" on all domains (known or unknown). There are 16 such generic exceptions in EasyList+AA. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:145: if (domain != "" && !filtersByDomain.has(domain)) We could have had a separate knownExceptionDomains set. I'm not sure it's worth it though, it's just easier to add a null entry here in filtersByDomain. Less maintenance. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:184: if (domains) For exceptions too we should remember the domains, otherwise an exception like "foo.example.com#@#foo" will not be taken into account and the generic filter "##foo" will be applied to "foo.example.com" (since it would be considered to be an unknown domain). In this example we have to make sure that "foo.example.com" is a known domain. I've also updated the tests to check for this. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:233: Note that we don't bother "unknowing" a domain once we know it. It's OK, this is an optimization anyway. It will still work for most domains out there. https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:332: if (isDomainKnown(currentDomain)) The call to isDomainKnown is the only additional cost now for known domains; otherwise, for a known domain, the code is the same. Based on my tests this call is not too expensive. We can also optimize it further, for example by adding the looked up domain once we've determined it to be known (e.g. when we look up "foo.example.com" and only "example.com" is in the filtersByDomain map, we can add "foo.example.com" so it's looked up faster next time). https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:369: selectors = selectors.concat(getConditionalGenericSelectors()); We might want to cache the value returned by getConditionalGenericSelectors, since it's the same each time (unless filters are added or removed), but I decided to avoid the additional bookkeeping for now. We can investigate this separately. I also have other ideas for efficiently caching results.
https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#new... lib/elemHide.js:98: function getConditionalGenericSelectors() To give you some numbers, there are 18,300 unconditional selectors, 1,095 conditional generic selectors, and 16 generic exceptions in in EasyList+AA. So when you call ElemHide.getSelectorsForDomain on x4aabd7c0ddcd (unknown domain), first the list is populated with 18,300 selectors right away, then 1,095 selectors are added to it, after filtering out 16 generic exceptions from the actual 1,111 selectors. https://codereview.adblockplus.org/29773570/diff/29773571/test/elemHide.js File test/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773571/test/elemHide.js#ne... test/elemHide.js:63: addFilter("##star"); These cases are important, otherwise we might break generic exceptions while making optimizations of this kind.
Patch Set 2: Check generic exception set size before looking up https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:108: if (genericExceptionSelectors.size == 0 || Right now some selectors are disabled on all domains if AA is enabled. There's no other way to do this. But in case in the future the size comes down to 0, it should help.
Now about how I tested this. First I generated a list of 3,000 filters using the following code: let f = d => console.log(d + "##i" + Math.random().toString(16).substring(2)); for (let i = 0; i < 1000; i++) { f("example.com"); r("foo.example.com"); r("~foo.example.com,example.com"); } I copied and pasted the result in the "My filter list" box on the options page. Then I went to https://adblockplus.org/subscriptions and subscribed to every single filter list over there. I added the following to the end of lib/elemHide.js: ElemHide.lookupTime = 0; if (typeof window != "undefined") window.ElemHide = ElemHide; And I added the following to the start and end of ElemHide.getSelectorsForDomain: getSelectorsForDomain(domain, criteria) { let start = performance.now(); ... ElemHide.lookupTime += performance.now() - start; return selectors; } Now I was ready to test. First I wanted to test the unknown domains. I ran the following in the background page console, both before and after applying the patch: let t = 0; let n = 0; (function test() { ElemHide.lookupTime = 0; let delay = (f, n) => new Promise(r => setTimeout(() => (f(), r()), n)); let jobs = []; for (let i = 0; i < 1000; i++) { jobs.push( delay(() => ElemHide.getSelectorsForDomain(`s${i}.x4aabd7c0ddcd.com`), i) ); } console.log(`${jobs.length} jobs`); Promise.all(jobs).then(() => { console.log(`Took ${0 | ElemHide.lookupTime}ms`); t += ElemHide.lookupTime; n += jobs.length; if (n < 10000) delay(test, 1000); else console.log("DONE!"); }); })(); There was no contest. While before applying the patch the code took the usual time to run (we'll see what "usual" shortly), after applying the patch it sped up 30x. What took ~30s to run took only ~1s after the patch. Then I tried known domains. Since I started by adding filters for example.com and foo.example.com, I changed x4aabd7c0ddcd.com in the above code snippet to foo.example.com and ran the code in the background page console again, both before and after the patch. Here are the results: Before: Took 27377ms Took 27338ms Took 27850ms Took 27142ms Took 28923ms Took 27422ms Took 27184ms Took 27173ms Took 27492ms Took 27788ms t / n = 27.5689 After: Took 28308ms Took 28586ms Took 29230ms Took 29138ms Took 29247ms Took 29607ms Took 29220ms Took 28474ms Took 28367ms Took 29474ms t / n = 28.9651 So clearly it slows down a bit (~5%) for known domains, but this is offset by the 3x to 30x performance improvement for unknown domains (which is every domain out there unless it is one of the subscribed to and enabled filter lists). You can try out the above too and let me know what kind of results you get.
I think it is worth an issue.
On 2018/05/07 17:23:33, sergei wrote: > I think it is worth an issue. Done. https://issues.adblockplus.org/ticket/6652
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:335: if (isDomainKnown(currentDomain)) I don't quite understand how your change speeds things up. Aren't we pretty much doing the same check in isDomainKnown as we were before (checking if the domain part is present in filtersByDomain)? If calling .has instead of .get or similar is helping couldn't we add that check into this loop instead of having a duplicate loop out in the isDomainKnown function? https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:372: selectors = selectors.concat(getConditionalGenericSelectors()); Won't selectors always be [] here? I wonder why we concat the two lists together instead of just returning getConditionalGenericSelectors().
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 13:46:47, kzar wrote: > I don't quite understand how your change speeds things up. Aren't we pretty much > doing the same check in isDomainKnown as we were before (checking if the domain > part is present in filtersByDomain)? > > If calling .has instead of .get or similar is helping couldn't we add that check > into this loop instead of having a duplicate loop out in the isDomainKnown > function? Let me come back with an answer after investigating this a little more. https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:372: selectors = selectors.concat(getConditionalGenericSelectors()); On 2018/05/08 13:46:47, kzar wrote: > Won't selectors always be [] here? I wonder why we concat the two lists together > instead of just returning getConditionalGenericSelectors(). No, the selectors array will contain the unconditional selectors, which is actually most selectors. I wrote previously: "To give you some numbers, there are 18,300 unconditional selectors, 1,095 conditional generic selectors, and 16 generic exceptions in in EasyList+AA."
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 14:20:43, Manish Jethani wrote: > On 2018/05/08 13:46:47, kzar wrote: > > I don't quite understand how your change speeds things up. Aren't we pretty > much > > doing the same check in isDomainKnown as we were before (checking if the > domain > > part is present in filtersByDomain)? > > > > If calling .has instead of .get or similar is helping couldn't we add that > check > > into this loop instead of having a duplicate loop out in the isDomainKnown > > function? > > Let me come back with an answer after investigating this a little more. Alright, I checked. Fifty percent of the speedup is owning to the fact that it's a separate function. This means we could start by separating out the existing while loop into its own function. I'll make a patch for that. The remaining is mainly owing to the fact that we don't call this.getException for unknown domains, because we already know that there are no domain-specific exceptions; instead we just do a lookup in the genericExceptionSelectors set for the filter's selector, which is much, much faster. Other minor things are that we don't create a set object and access its size property and "potentially" (but never in fact because it's an unknown domain, but the JS runtime doesn't know this) its has method. Another factor might be that we only get the keys of the filters map, whereas in the case of known domains we also get the values (isIncluded), which might be slower. I haven't checked specifically for this though. So we might end up with two functions, for known and unknown domains respectively: getConditionalSelectorsForDomain(domain, specificOnly) getConditionalGenericSelectors() We might be able to combine these into one function (if there's no domain specified then it's the same as the second case?), depending on how much cleaner it is and how much it affects performance.
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 15:49:00, Manish Jethani wrote: > On 2018/05/08 14:20:43, Manish Jethani wrote: > > On 2018/05/08 13:46:47, kzar wrote: > > > I don't quite understand how your change speeds things up. Aren't we pretty > > much > > > doing the same check in isDomainKnown as we were before (checking if the > > domain > > > part is present in filtersByDomain)? > > > > > > If calling .has instead of .get or similar is helping couldn't we add that > > check > > > into this loop instead of having a duplicate loop out in the isDomainKnown > > > function? > > > > Let me come back with an answer after investigating this a little more. > > Alright, I checked. > > Fifty percent of the speedup is owning to the fact that it's a separate > function. Interesting, I wonder why that is. Is something in this function causing it to deoptimise? > The remaining is mainly owing to the fact that we don't call > this.getException for unknown domains... I wonder why getException is slower than your new logic. In other words I wonder if we could make use of your tweak to filtersByDomain (storing null for exceptions) in the getException function instead. https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:372: selectors = selectors.concat(getConditionalGenericSelectors()); On 2018/05/08 14:20:43, Manish Jethani wrote: > On 2018/05/08 13:46:47, kzar wrote: > > Won't selectors always be [] here? > No, the selectors array will contain the unconditional selectors, which is > actually most selectors. Ah yea, gotya.
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 17:13:43, kzar wrote: > On 2018/05/08 15:49:00, Manish Jethani wrote: > > On 2018/05/08 14:20:43, Manish Jethani wrote: > > > On 2018/05/08 13:46:47, kzar wrote: > > > > I don't quite understand how your change speeds things up. Aren't we > pretty > > > much > > > > doing the same check in isDomainKnown as we were before (checking if the > > > domain > > > > part is present in filtersByDomain)? > > > > > > > > If calling .has instead of .get or similar is helping couldn't we add that > > > check > > > > into this loop instead of having a duplicate loop out in the isDomainKnown > > > > function? > > > > > > Let me come back with an answer after investigating this a little more. > > > > Alright, I checked. > > > > Fifty percent of the speedup is owning to the fact that it's a separate > > function. > > Interesting, I wonder why that is. Is something in this function causing it to > deoptimise? Yes, I found out: https://codereview.adblockplus.org/29774573/ See the first comment on that patch. It was that we were modifying the unconditional selectors array. My suggestion is to use concatenation of two arrays. > > The remaining is mainly owing to the fact that we don't call > > this.getException for unknown domains... > > I wonder why getException is slower than your new logic. In other words I wonder > if we could make use of your tweak to filtersByDomain (storing null for > exceptions) in the getException function instead. Let me look into that.
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#new... lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 17:37:13, Manish Jethani wrote: > On 2018/05/08 17:13:43, kzar wrote: > [...] > > > I wonder why getException is slower than your new logic. In other words I > wonder > > if we could make use of your tweak to filtersByDomain (storing null for > > exceptions) in the getException function instead. > > Let me look into that. getException should really be called hasException and return boolean. If the domain is unknown (has no entry in filtersByDomain), it should check genericExceptionSelectors instead of doing the usual stuff. Or the caller could take care of this, either way. Anyway I have something in mind here so I'll update the patch tomorrow after rebasing on #29774573. We should land that one first since the benefits are more obvious.
On 2018/05/08 17:49:30, Manish Jethani wrote: > Anyway I have something in mind here so I'll update the patch tomorrow after > rebasing on #29774573. We should land that one first since the benefits are more > obvious. I'd prefer that you renamed 6652 to something like "Make further optimisations to the getSelectorsForDomain code" and then list what we're going to change there. Then we can make both this getException optimisation and the other optimisation where we avoid pushing to that array come under that issue. Then finally your other unrelated changes (e.g. using default arguments) can be reviewied and committed separately as noissue(s).
On 2018/05/09 10:18:11, kzar wrote: > On 2018/05/08 17:49:30, Manish Jethani wrote: > > Anyway I have something in mind here so I'll update the patch tomorrow after > > rebasing on #29774573. We should land that one first since the benefits are > more > > obvious. > > I'd prefer that you renamed 6652 to something like "Make further optimisations > to the getSelectorsForDomain code" and then list what we're going to change > there. Then we can make both this getException optimisation and the other > optimisation where we avoid pushing to that array come under that issue. Then > finally your other unrelated changes (e.g. using default arguments) can be > reviewied and committed separately as noissue(s). OK, done.
Patch Set 3: Look up filters before deciding path Now rebased on patch #29774573. We look up the domain-specific filters before deciding which path to use. This removes the penalty for known domains, so now it performs just as before for that case. Now the question is whether we can merge getConditionalSelectorsForDomain and getConditionalGenericSelectors into one function.
Patch Set 4: Merge getConditionalGenericSelectors into getConditionalSelectorsForDomain I'm still not convinced merging these two is a good idea. We had two functions that performed optimally for their specific cases, now we have one that compromises performance for both cases. https://codereview.adblockplus.org/29773570/diff/29776578/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29776578/lib/elemHide.js#new... lib/elemHide.js:69: let genericExceptions = new Map(); This needs to be a map actually, because we want to delete individual filters from this and only remove the key when the last filter been removed.
Patch Set 5: Rebase Patch Set 6: Split out getConditionalGenericSelectors again and cache result Patch Set 7: Used cached list for generic-friendly domains Now we have the 30x optimization even for most known domains. Most known domains don't have any domain-specific exceptions for generic filters, so we can just reuse the cached list of conditional generic selectors. This cached list is really small and generating it each time is just wasting CPU, because nothing changes (as far as this list is concerned) from one page load to the next as the user navigates on a site. In my opinion caching is and returning the cached result in most cases is the sensible things to do. We are already doing this for the list of "unconditional" selectors. https://codereview.adblockplus.org/29773570/diff/29778560/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29778560/lib/elemHide.js#new... lib/elemHide.js:151: if (conditionalGenericSelectors) Return cached value. It's always the same each time (and only ~1,100 selectors in EasyList+AA), unless new filters are added or removed. https://codereview.adblockplus.org/29773570/diff/29778560/lib/elemHide.js#new... lib/elemHide.js:240: conditionalGenericSelectors = null; We could be more intelligent about when to clear this, but there's no need. https://codereview.adblockplus.org/29773570/diff/29778566/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29778566/lib/elemHide.js#new... lib/elemHide.js:175: genericFriendlyDomains.add(domain); Most domains will fall in this category. We might have to limit the size of this set, but then again this can only be a domain that occurs specifically in a subscription or a subdomain of such a domain.
Patch Set 8: Limit size of generic-friendly domains set https://codereview.adblockplus.org/29773570/diff/29778566/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29778566/lib/elemHide.js#new... lib/elemHide.js:175: genericFriendlyDomains.add(domain); On 2018/05/11 10:00:32, Manish Jethani wrote: > We might have to limit the size of this set, but then again this can only be a > domain that occurs specifically in a subscription or a subdomain of such a > domain. Done.
Patch Set 9: Rebase on master
Patch Set 10: Clean up Patch Set 11: Fix JSDoc A note about the "generic-friendly domains" optimization. Out of 10,899 domains in EasyList+AA, 9,419 are generic friendly, meaning we can return the cached conditionalGenericSelectors list for these domains. In other words, most domains don't need any work for generic filters, which is most of the work anyway (almost all selectors in the returned list are from generic filters). We've been doing a lot of work on every single frame load only because of 1,480 domains.
Patch Set 12: Use longest subdomain as key in generic-friendly domains set We use the longest subdomain of this domain found in our data structures as the key to check if the domain is "generic friendly." For example, given foo.example.com, there may be an entry for example.com in our data structures (e.g. "example.com###foo"), so we use that subdomain as the key. This way we make only one entry and it works for all subdomains of example.com, except those that have specific entries (e.g. "~bar.example.com##.no-bar").
Patch Set 13: Move selector matching into its own function Fewer lines of code. Later we might want to export this function so ElemHideEmulation can use it.
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#old... lib/elemHide.js:271: if (specificOnly && currentDomain == "") Why don't we instead keep track of if filtersByDomain.has(currentDomain) (for non "") is ever true here. If it wasn't we could cache the result of the selectors which applied for "", or reuse the cached value. It seems to me that would be a simpler way to use your insight about known/unknown domains without adding as much complexity. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:64: * @type {Map.<string,Filter[]>} I guess you need to rebase again. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:78: let conditionalGenericSelectors = null; I think this cache is probably a good idea. I figure it should have a name more like unknownDomainSelectors though. Also I wonder if we could avoid duplicating what's already cached in unconditionalSelectors? https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:102: filtersByDomain.set(domain, null); Am I missing something or do we never remove the domain from filtersByDomain again? I guess previously it wasn't significant if a domain was left with an empty Map entry, but now it means something if filtersByDomain.has(domain). Anyway, I like this part of your change, reusing filtersByDomain for the unknown/known domain check seems like a good idea. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:471: let selectors = getConditionalSelectorsForDomain(domain, specificOnly); I feel like some of these changes (like this one, moving the logic of getSelectorsForDomain out into a different function) are just made for the sake of it. I'd rather you only changed what was necessary.
Patch Set 14: Rebase Patch Set 15: Inline doesFilterApply https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#old... lib/elemHide.js:271: if (specificOnly && currentDomain == "") On 2018/05/15 13:26:30, kzar wrote: > Why don't we instead keep track of if filtersByDomain.has(currentDomain) (for > non "") is ever true here. If it wasn't we could cache the result of the > selectors which applied for "", or reuse the cached value. It seems to me that > would be a simpler way to use your insight about known/unknown domains without > adding as much complexity. filtersByDomain.has(domain) will be true for ~11,000 domains. The optimization in question still applies to ~9,500 out of those ~11,000 domains, but for that we need to know specifically if there are any domain-specific exceptions (for the domain in question) for any generic filters. What you're suggesting works for unknown domains, not for known domains. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:64: * @type {Map.<string,Filter[]>} On 2018/05/15 13:26:31, kzar wrote: > I guess you need to rebase again. Done. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:78: let conditionalGenericSelectors = null; On 2018/05/15 13:26:31, kzar wrote: > I think this cache is probably a good idea. I figure it should have a name more > like unknownDomainSelectors though. Also I wonder if we could avoid duplicating It also applies to known domains (see my other comment as well as the patch description); these are generic selectors that apply to most domains but not all domains. > what's already cached in unconditionalSelectors? This cache does not include those selectors, it only has "conditional generic" selectors. The criterion for a selector to be in this cache is that it should apply on most domains, even most "known" domains, but it would still have exceptions for some domains. e.g. ~example.com###foo", or "##foo" and then "example.com#@##foo" (separate filter list like AA). Also see this comment for numbers: https://codereview.adblockplus.org/29773570#msg3 https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:102: filtersByDomain.set(domain, null); On 2018/05/15 13:26:30, kzar wrote: > Am I missing something or do we never remove the domain from filtersByDomain > again? I guess previously it wasn't significant if a domain was left with an > empty Map entry, but now it means something if filtersByDomain.has(domain). You're right, we never remove a domain like that from filtersByDomain. This means that if a domain was previously known, it never becomes unknown. That's fine, it would still end up in the genericFriendlyDomains set because no domain-specific exceptions would apply, so the performance would be the same for generic filters. Part of the optimization would be disabled for such a domain because we would still go looking for domain-specific filters (and we might find none). I think that's OK. Otherwise we would have to do more bookkeeping and I don't think it's worth it. In a typical browsing sessions you're not going add and remove individual filter lists anyway. I'm guessing when filters lists are downloaded the clear method is called and everything is repopulated again anyway. Correct me if I'm wrong. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:471: let selectors = getConditionalSelectorsForDomain(domain, specificOnly); On 2018/05/15 13:26:31, kzar wrote: > I feel like some of these changes (like this one, moving the logic of > getSelectorsForDomain out into a different function) are just made for the sake > of it. I'd rather you only changed what was necessary. Yeah this patch has been updated a lot. Now the function is almost empty. Anyway, my rationale for doing it this way was that selectors is a combination of unconditional and conditional selectors, there's already a dedicate function for unconditional selectors. Let me inline getConditionalSelectorsForDomain in my next update.
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:471: let selectors = getConditionalSelectorsForDomain(domain, specificOnly); On 2018/05/15 16:00:08, Manish Jethani wrote: > On 2018/05/15 13:26:31, kzar wrote: > > I feel like some of these changes (like this one, moving the logic of > > getSelectorsForDomain out into a different function) are just made for the > sake > > of it. I'd rather you only changed what was necessary. > > Yeah this patch has been updated a lot. Now the function is almost empty. > Anyway, my rationale for doing it this way was that selectors is a combination > of unconditional and conditional selectors, there's already a dedicate function > for unconditional selectors. Let me inline getConditionalSelectorsForDomain in > my next update. Actually there are three return statements (out of four) in getConditionalGenericSelectors that would require a call to getUnconditionalSelectors, because we need to concatenate that list as well. Isn't that code duplication? This seems much cleaner to me. Otherwise we have to make sure there's only one return statement, but then we would be adding additional indentation and additional if-else logic just so we have only one call to getUnconditionalSelectors at the end of the function. What do you think?
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:139: function getSpecificFiltersForDomain(domain) Note: getSpecificFiltersForDomain could be reused by ElemHideEmulation with minor modifications here and elsewhere (mainly checking for the type property of the Filter object). https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:167: function matchSelectors(domain, filtersList, excludeSet) Note: matchSelectors could be reused by ElemHideEmulation. We could export these functions. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:471: let selectors = getConditionalSelectorsForDomain(domain, specificOnly); On 2018/05/15 16:16:15, Manish Jethani wrote: > On 2018/05/15 16:00:08, Manish Jethani wrote: > > On 2018/05/15 13:26:31, kzar wrote: > > > I feel like some of these changes (like this one, moving the logic of > > > getSelectorsForDomain out into a different function) are just made for the > > sake > > > of it. I'd rather you only changed what was necessary. > > > > Yeah this patch has been updated a lot. Now the function is almost empty. > > Anyway, my rationale for doing it this way was that selectors is a combination > > of unconditional and conditional selectors, there's already a dedicate > function > > for unconditional selectors. Let me inline getConditionalSelectorsForDomain in > > my next update. > > Actually there are three return statements (out of four) in > getConditionalGenericSelectors that would require a call to > [...] I meant getConditionalSelectorsForDomain of course.
Patch Set 16: Reinclude test/elemHide.js
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#old... lib/elemHide.js:271: if (specificOnly && currentDomain == "") On 2018/05/15 16:00:08, Manish Jethani wrote: > On 2018/05/15 13:26:30, kzar wrote: > > Why don't we instead keep track of if filtersByDomain.has(currentDomain) (for > > non "") is ever true here. If it wasn't we could cache the result of the > > selectors which applied for "", or reuse the cached value. It seems to me that > > would be a simpler way to use your insight about known/unknown domains without > > adding as much complexity. > > filtersByDomain.has(domain) will be true for ~11,000 domains. The optimization > in question still applies to ~9,500 out of those ~11,000 domains, but for that > we need to know specifically if there are any domain-specific exceptions (for > the domain in question) for any generic filters. > > What you're suggesting works for unknown domains, not for known domains. Alright, how about this, perhaps a dumb idea but hear me out... - We no longer store null for exception filters in filtersByDomain. - We keep a lookup of exception domains, whenever we add an exception we add the (non-generic) domains to that. - We keep a cache of the generic selectors that generally match for domains. Then - If the domain is found in the exceptions lookup we just do things the slow way. - Otherwise if the domain is found in filtersByDomain we do things the slow way for the domain, but concatinate the generic selectors cache to that. - Otherwise we just return the generic selectors cache. Basically I think your insight about generally being able to return the generically applied selectors is good, but I think your changes so far are quite sweeping and add a lot of complexity. I'd like to see if we could get most of the performance benefit from your insight without such a large change that adds so much complexity. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:102: filtersByDomain.set(domain, null); On 2018/05/15 16:00:08, Manish Jethani wrote: > On 2018/05/15 13:26:30, kzar wrote: > > Am I missing something or do we never remove the domain from filtersByDomain > > again? I guess previously it wasn't significant if a domain was left with an > > empty Map entry, but now it means something if filtersByDomain.has(domain). > > You're right, we never remove a domain like that from filtersByDomain. This > means that if a domain was previously known, it never becomes unknown. That's > fine, it would still end up in the genericFriendlyDomains set because no > domain-specific exceptions would apply, so the performance would be the same for > generic filters. Part of the optimization would be disabled for such a domain > because we would still go looking for domain-specific filters (and we might find > none). I think that's OK. Otherwise we would have to do more bookkeeping and I > don't think it's worth it. > > In a typical browsing sessions you're not going add and remove individual filter > lists anyway. I'm guessing when filters lists are downloaded the clear method is > called and everything is repopulated again anyway. Correct me if I'm wrong. Acknowledged. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:139: function getSpecificFiltersForDomain(domain) On 2018/05/15 16:20:59, Manish Jethani wrote: > Note: getSpecificFiltersForDomain could be reused by ElemHideEmulation with > minor modifications here and elsewhere (mainly checking for the type property of > the Filter object). Acknowledged. https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#new... lib/elemHide.js:205: return specificOnly ? [] : getConditionalGenericSelectors(); Nit: Seems a waste to create a new empty Array when we already know we have one.
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#old... lib/elemHide.js:271: if (specificOnly && currentDomain == "") On 2018/05/18 14:34:53, kzar wrote: > On 2018/05/15 16:00:08, Manish Jethani wrote: > > On 2018/05/15 13:26:30, kzar wrote: > > > Why don't we instead keep track of if filtersByDomain.has(currentDomain) > (for > > > non "") is ever true here. If it wasn't we could cache the result of the > > > selectors which applied for "", or reuse the cached value. It seems to me > that > > > would be a simpler way to use your insight about known/unknown domains > without > > > adding as much complexity. > > > > filtersByDomain.has(domain) will be true for ~11,000 domains. The optimization > > in question still applies to ~9,500 out of those ~11,000 domains, but for that > > we need to know specifically if there are any domain-specific exceptions (for > > the domain in question) for any generic filters. > > > > What you're suggesting works for unknown domains, not for known domains. > > Alright, how about this, perhaps a dumb idea but hear me out... > > - We no longer store null for exception filters in filtersByDomain. > - We keep a lookup of exception domains, whenever we add an exception we add the > (non-generic) domains to that. If I understanding the suggestion correctly, you're saying that instead of putting the domains with only exceptions in the filtersByDomain map but with the value set to null, we put them in a separate set. This would work, it would just mean two lookups instead of one (which is probably OK). But ... > - We keep a cache of the generic selectors that generally match for domains. In order to make sure that we're not talking past each other, let's use the terminology used in the patch. Selectors fall into three categories: (1) unconditional selectors: apply on all domains (e.g. "##.ad" with no domain-specific exceptions) (2) conditional generic selectors: generally apply on all domains, except for domains with domain-specific exceptions (or those excluded in the filter itself) for some generic selectors (e.g. "~no-foo.example.com###foo" does not apply on no-foo.example.com, and if there's an exception like "bar.com#@##foo" then it does not apply on bar.com either, but will still apply on all other domains) (3) domain-specific selectors: apply on specific domains (e.g. "bar.example.com##div[data-bar]" applies only on bar.example.com) If you're suggesting that we merge the caches of unconditional selectors and conditional generic selectors into a single cache, that would be OK for most domains, but it won't be good for the ~9,500 domains in EasyList+AA that have no domain-specific exceptions for generic selectors; for those cases then, we'd have to generate the list of unconditional selectors (~18,000 selectors in EasyList+AA) (while we're currently caching this list) as well as the list of conditional generic selectors. In the current patch, for such domains, we only generate the list of conditional generic selectors. So if we really want to do this in the best way then we need two caches. > Then > - If the domain is found in the exceptions lookup we just do things the slow > way. This is what the patch did initially (see Patch Set 1) before I realized that even most known domains could benefit from this optimization. Also, as you pointed out in the initial comments, if we were going to extract all the subdomains of a domain to check if any of the subdomains was in the set of known domains (for foo.example.com we need to look for foo.example.com, example.com, and com in the filtersByDomain map), why not get the filters for those subdomains as well, since we need the filters for specific selector matching anyway? This is why there's a getSpecificFiltersForDomain function now. If there are no specific filters for a domain or any of its subdomains, clearly then we can just return the cached list of generic selectors. > - Otherwise if the domain is found in filtersByDomain we do things the slow way > for the domain, but concatinate the generic selectors cache to that. This would not work for ~1,500 domains out of ~11,000 domains in EasyList+AA, it would yield incorrect results, because these domains have domain-specific exceptions for some generic selectors. > - Otherwise we just return the generic selectors cache. > > Basically I think your insight about generally being able to return the > generically applied selectors is good, but I think your changes so far are quite > sweeping and add a lot of complexity. I'd like to see if we could get most of > the performance benefit from your insight without such a large change that adds > so much complexity. So I think the problem is complex and the solution must necessarily reflect some of that complexity if we want to optimize maximally for every case. In general I'n not concerned about complexity as long as it yields correct results and does it efficiently, but of course the more complex the logic becomes the harder it becomes to maintain, so there's a good case for keeping things simple. We have a 100+ million user base though, if I understand correctly, and given that this code runs on every single HTML document (including on mobile), I'm still very much in favor of optimizing this as much as possible. If it becomes too complex, we can address that with abstractions (i.e. separate things into into functions that do certain well-defined and documented things) and add unit tests for those abstractions. I'm OK with splitting this patch up into two though, so that one iteration only optimizes for unknown domains while the second one takes the optimization to most known domains (again, this is ~9500 domains out of ~11,000 domains in EasyList+AA) and their subdomains. I still think it works best as a single patch though, as it allows us to look at the bigger picture and take everything into consideration at once. There are pros and cons. What do you think?
On 2018/05/18 17:18:27, Manish Jethani wrote: > We have a 100+ million user base though, if I understand correctly... Please don't talk publicly about usage numbers without checking first, also we talk about "active devices" not users. For the record I'm not sure what our numbers are at the moment. Either way I get your point that we have lots of users, but to be fair that's also a case for keeping the code manageable. As is that we don't have nearly as many developers. > I'm OK with splitting this patch up into two though, so that one iteration only > optimizes for unknown domains while the second one takes the optimization to > most known domains... Yes please, that sounds good to me. A patch set that keeps the complexity down whilst still improving performance for the unknown domain case would be great.
Patch Set 17: Strip out optimization for "generic-friendly" domains On 2018/05/22 13:59:40, kzar wrote: > On 2018/05/18 17:18:27, Manish Jethani wrote: > > We have a 100+ million user base though, if I understand correctly... > > Please don't talk publicly about usage numbers without checking first, also we > talk about "active devices" not users. For the record I'm not sure what our > numbers are at the moment. OK, it's on the website so I thought these numbers are well known [1] [1]: https://adblockplus.org/blog/100-million-users-100-million-thank-yous > Either way I get your point that we have lots of users, but to be fair that's > also a case for keeping the code manageable. As is that we don't have nearly as > many developers. OK, let's see how this tradeoff works out. In my opinion it should be possible to keep it manageable and optimize the heck out of it, especially when it comes to applying this optimization to ~9,500 domains in EasyList+AA (since we're already applying it to unknown domains now). But let's do this one step at a time. > > I'm OK with splitting this patch up into two though, so that one iteration > only > > optimizes for unknown domains while the second one takes the optimization to > > most known domains... > > Yes please, that sounds good to me. A patch set that keeps the complexity down > whilst still improving performance for the unknown domain case would be great. Done.
Patch Set 18: Avoid unnecessary Array.concat
On 2018/05/23 00:12:26, Manish Jethani wrote: > > On 2018/05/18 17:18:27, Manish Jethani wrote: > > > We have a 100+ million user base though, if I understand correctly... > > > > Please don't talk publicly about usage numbers without checking first, also we > > talk about "active devices" not users. For the record I'm not sure what our > > numbers are at the moment. > > OK, it's on the website so I thought these numbers are well known [1] > > [1]: https://adblockplus.org/blog/100-million-users-100-million-thank-yous Fair enough, I stand corrected! |