Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(31)

Issue 29773570: Issue 6652 - Implement fast selector lookups for most domains

Can't Edit
Can't Publish+Mail
Start Review
Created:
1 week, 6 days ago by Manish Jethani
Modified:
2 days, 13 hours ago
Reviewers:
kzar
CC:
sergei
Base URL:
https://hg.adblockplus.org/adblockpluscore/
Visibility:
Public.

Description

Based on https://hg.adblockplus.org/adblockpluscore/rev/96cabd413f6d ElemHide.getSelectorsForDomain does filter matching for every domain. This is not necessary; instead, it can do filter matching only for known domains (i.e. those that have filters that would apply to them). For any random domain like x4aabd7c0ddcd.com or manishjethani.com the lookups can be much, much faster, because there are no domain-specific filters that apply and no exceptions to check against (except for generic exceptions like "#@#foo", which would still apply to "##foo" on x4aabd7c0ddcd.com or manishjethani.com). The same optimization also works for known domains that have no domain-specific exceptions for any generic filters. In EasyList+AA, out of 10,899 domains, 9,419 fall in this category (as of 12 May 2018). In other words, ElemHide.getSelectorsForDomain can do faster lookups for all but 1,480 domains. Since the list of generic selectors that apply on most domains is the same, ElemHide.getSelectorsForDomain can cache this list and, for most domains, return the cached list concatenated with the generated list of domain-specific selectors. With these changes, the performance improvement for most domains is 10x for EastList+AA (as of 12 May 2018) and possibly significantly higher as the list of subscriptions grows. There is no performance penalty for domains to which this optimization does not apply.

Patch Set 1 #

Total comments: 10

Patch Set 2 : Check generic exception set size before looking up #

Total comments: 10

Patch Set 3 : Look up filters before deciding path #

Patch Set 4 : Merge getConditionalGenericSelectors into getConditionalSelectorsForDomain #

Total comments: 1

Patch Set 5 : Rebase #

Patch Set 6 : Split out getConditionalGenericSelectors again and cache result #

Total comments: 2

Patch Set 7 : Used cached list for generic-friendly domains #

Total comments: 2

Patch Set 8 : Limit size of generic-friendly domains set #

Patch Set 9 : Rebase on master #

Patch Set 10 : Clean up #

Patch Set 11 : Fix JSDoc #

Patch Set 12 : Use longest subdomain as key in generic-friendly domains set #

Patch Set 13 : Move selector matching into its own function #

Total comments: 19

Patch Set 14 : Rebase #

Patch Set 15 : Inline doesFilterApply #

Patch Set 16 : Reinclude test/elemHide.js #

Unified diffs Side-by-side diffs Delta from patch set Stats (+222 lines, -46 lines) Patch
M lib/elemHide.js View 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 5 chunks +213 lines, -46 lines 0 comments Download
M test/elemHide.js View 15 1 chunk +9 lines, -0 lines 0 comments Download

Messages

Total messages: 30
Manish Jethani
1 week, 6 days ago (2018-05-07 15:38:36 UTC) #1
Manish Jethani
Patch Set 1 I'll file an issue for this shortly, but see the description here. ...
1 week, 6 days ago (2018-05-07 16:03:03 UTC) #2
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773571/lib/elemHide.js#newcode98 lib/elemHide.js:98: function getConditionalGenericSelectors() To give you some numbers, there are ...
1 week, 6 days ago (2018-05-07 16:10:34 UTC) #3
Manish Jethani
Patch Set 2: Check generic exception set size before looking up https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): ...
1 week, 6 days ago (2018-05-07 16:15:09 UTC) #4
Manish Jethani
Now about how I tested this. First I generated a list of 3,000 filters using ...
1 week, 6 days ago (2018-05-07 16:31:14 UTC) #5
sergei
I think it is worth an issue.
1 week, 6 days ago (2018-05-07 17:23:33 UTC) #6
Manish Jethani
On 2018/05/07 17:23:33, sergei wrote: > I think it is worth an issue. Done. https://issues.adblockplus.org/ticket/6652
1 week, 6 days ago (2018-05-07 17:53:47 UTC) #7
kzar
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#newcode335 lib/elemHide.js:335: if (isDomainKnown(currentDomain)) I don't quite understand how your change ...
1 week, 5 days ago (2018-05-08 13:46:47 UTC) #8
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#newcode335 lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 13:46:47, kzar wrote: > I ...
1 week, 5 days ago (2018-05-08 14:20:43 UTC) #9
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#newcode335 lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 14:20:43, Manish Jethani wrote: > ...
1 week, 5 days ago (2018-05-08 15:49:01 UTC) #10
kzar
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#newcode335 lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 15:49:00, Manish Jethani wrote: > ...
1 week, 5 days ago (2018-05-08 17:13:44 UTC) #11
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#newcode335 lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 17:13:43, kzar wrote: > On ...
1 week, 5 days ago (2018-05-08 17:37:13 UTC) #12
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29773574/lib/elemHide.js#newcode335 lib/elemHide.js:335: if (isDomainKnown(currentDomain)) On 2018/05/08 17:37:13, Manish Jethani wrote: > ...
1 week, 5 days ago (2018-05-08 17:49:30 UTC) #13
kzar
On 2018/05/08 17:49:30, Manish Jethani wrote: > Anyway I have something in mind here so ...
1 week, 4 days ago (2018-05-09 10:18:11 UTC) #14
Manish Jethani
On 2018/05/09 10:18:11, kzar wrote: > On 2018/05/08 17:49:30, Manish Jethani wrote: > > Anyway ...
1 week, 4 days ago (2018-05-09 15:40:00 UTC) #15
Manish Jethani
Patch Set 3: Look up filters before deciding path Now rebased on patch #29774573. We ...
1 week, 4 days ago (2018-05-09 19:57:56 UTC) #16
Manish Jethani
Patch Set 4: Merge getConditionalGenericSelectors into getConditionalSelectorsForDomain I'm still not convinced merging these two is ...
1 week, 4 days ago (2018-05-09 21:35:45 UTC) #17
Manish Jethani
Patch Set 5: Rebase Patch Set 6: Split out getConditionalGenericSelectors again and cache result Patch ...
1 week, 2 days ago (2018-05-11 10:00:32 UTC) #18
Manish Jethani
Patch Set 8: Limit size of generic-friendly domains set https://codereview.adblockplus.org/29773570/diff/29778566/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29778566/lib/elemHide.js#newcode175 lib/elemHide.js:175: ...
1 week, 2 days ago (2018-05-11 10:09:09 UTC) #19
Manish Jethani
Patch Set 9: Rebase on master
1 week, 2 days ago (2018-05-11 13:31:55 UTC) #20
Manish Jethani
Patch Set 10: Clean up Patch Set 11: Fix JSDoc A note about the "generic-friendly ...
1 week, 1 day ago (2018-05-12 10:35:05 UTC) #21
Manish Jethani
Patch Set 12: Use longest subdomain as key in generic-friendly domains set We use the ...
1 week, 1 day ago (2018-05-12 17:39:35 UTC) #22
Manish Jethani
Patch Set 13: Move selector matching into its own function Fewer lines of code. Later ...
1 week ago (2018-05-13 12:21:44 UTC) #23
kzar
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#oldcode271 lib/elemHide.js:271: if (specificOnly && currentDomain == "") Why don't we ...
5 days, 17 hours ago (2018-05-15 13:26:31 UTC) #24
Manish Jethani
Patch Set 14: Rebase Patch Set 15: Inline doesFilterApply https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#oldcode271 lib/elemHide.js:271: ...
5 days, 15 hours ago (2018-05-15 16:00:08 UTC) #25
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#newcode471 lib/elemHide.js:471: let selectors = getConditionalSelectorsForDomain(domain, specificOnly); On 2018/05/15 16:00:08, Manish ...
5 days, 14 hours ago (2018-05-15 16:16:15 UTC) #26
Manish Jethani
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (right): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#newcode139 lib/elemHide.js:139: function getSpecificFiltersForDomain(domain) Note: getSpecificFiltersForDomain could be reused by ElemHideEmulation ...
5 days, 14 hours ago (2018-05-15 16:20:59 UTC) #27
Manish Jethani
Patch Set 16: Reinclude test/elemHide.js
5 days, 14 hours ago (2018-05-15 16:27:19 UTC) #28
kzar
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js File lib/elemHide.js (left): https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#oldcode271 lib/elemHide.js:271: if (specificOnly && currentDomain == "") On 2018/05/15 16:00:08, ...
2 days, 16 hours ago (2018-05-18 14:34:54 UTC) #29
Manish Jethani
2 days, 13 hours ago (2018-05-18 17:18:27 UTC) #30
https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js
File lib/elemHide.js (left):

https://codereview.adblockplus.org/29773570/diff/29780555/lib/elemHide.js#old...
lib/elemHide.js:271: if (specificOnly && currentDomain == "")
On 2018/05/18 14:34:53, kzar wrote:
> On 2018/05/15 16:00:08, Manish Jethani wrote:
> > On 2018/05/15 13:26:30, kzar wrote:
> > > Why don't we instead keep track of if filtersByDomain.has(currentDomain)
> (for
> > > non "") is ever true here. If it wasn't we could cache the result of the
> > > selectors which applied for "", or reuse the cached value. It seems to me
> that
> > > would be a simpler way to use your insight about known/unknown domains
> without
> > > adding as much complexity.
> > 
> > filtersByDomain.has(domain) will be true for ~11,000 domains. The
optimization
> > in question still applies to ~9,500 out of those ~11,000 domains, but for
that
> > we need to know specifically if there are any domain-specific exceptions
(for
> > the domain in question) for any generic filters.
> > 
> > What you're suggesting works for unknown domains, not for known domains.
> 
> Alright, how about this, perhaps a dumb idea but hear me out...
> 
> - We no longer store null for exception filters in filtersByDomain.
> - We keep a lookup of exception domains, whenever we add an exception we add
the
> (non-generic) domains to that.

If I understanding the suggestion correctly, you're saying that instead of
putting the domains with only exceptions in the filtersByDomain map but with the
value set to null, we put them in a separate set. This would work, it would just
mean two lookups instead of one (which is probably OK).

But ...

> - We keep a cache of the generic selectors that generally match for domains.

In order to make sure that we're not talking past each other, let's use the
terminology used in the patch.

Selectors fall into three categories:

 (1) unconditional selectors: apply on all domains (e.g. "##.ad" with no
domain-specific exceptions)

 (2) conditional generic selectors: generally apply on all domains, except for
domains with domain-specific exceptions (or those excluded in the filter itself)
for some generic selectors (e.g. "~no-foo.example.com###foo" does not apply on
no-foo.example.com, and if there's an exception like "bar.com#@##foo" then it
does not apply on bar.com either, but will still apply on all other domains)

 (3) domain-specific selectors: apply on specific domains (e.g.
"bar.example.com##div[data-bar]" applies only on bar.example.com)

If you're suggesting that we merge the caches of unconditional selectors and
conditional generic selectors into a single cache, that would be OK for most
domains, but it won't be good for the ~9,500 domains in EasyList+AA that have no
domain-specific exceptions for generic selectors; for those cases then, we'd
have to generate the list of unconditional selectors (~18,000 selectors in
EasyList+AA) (while we're currently caching this list) as well as the list of
conditional generic selectors. In the current patch, for such domains, we only
generate the list of conditional generic selectors.

So if we really want to do this in the best way then we need two caches.

> Then
> - If the domain is found in the exceptions lookup we just do things the slow
> way.

This is what the patch did initially (see Patch Set 1) before I realized that
even most known domains could benefit from this optimization.

Also, as you pointed out in the initial comments, if we were going to extract
all the subdomains of a domain to check if any of the subdomains was in the set
of known domains (for foo.example.com we need to look for foo.example.com,
example.com, and com in the filtersByDomain map), why not get the filters for
those subdomains as well, since we need the filters for specific selector
matching anyway? This is why there's a getSpecificFiltersForDomain function now.
If there are no specific filters for a domain or any of its subdomains, clearly
then we can just return the cached list of generic selectors.

> - Otherwise if the domain is found in filtersByDomain we do things the slow
way
> for the domain, but concatinate the generic selectors cache to that.

This would not work for ~1,500 domains out of ~11,000 domains in EasyList+AA, it
would yield incorrect results, because these domains have domain-specific
exceptions for some generic selectors.

> - Otherwise we just return the generic selectors cache.
> 
> Basically I think your insight about generally being able to return the
> generically applied selectors is good, but I think your changes so far are
quite
> sweeping and add a lot of complexity. I'd like to see if we could get most of
> the performance benefit from your insight without such a large change that
adds
> so much complexity.

So I think the problem is complex and the solution must necessarily reflect some
of that complexity if we want to optimize maximally for every case.

In general I'n not concerned about complexity as long as it yields correct
results and does it efficiently, but of course the more complex the logic
becomes the harder it becomes to maintain, so there's a good case for keeping
things simple. We have a 100+ million user base though, if I understand
correctly, and given that this code runs on every single HTML document
(including on mobile), I'm still very much in favor of optimizing this as much
as possible. If it becomes too complex, we can address that with abstractions
(i.e. separate things into into functions that do certain well-defined and
documented things) and add unit tests for those abstractions.

I'm OK with splitting this patch up into two though, so that one iteration only
optimizes for unknown domains while the second one takes the optimization to
most known domains (again, this is ~9500 domains out of ~11,000 domains in
EasyList+AA) and their subdomains. I still think it works best as a single patch
though, as it allows us to look at the bigger picture and take everything into
consideration at once. There are pros and cons.

What do you think?
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld 87257f5