Issue 29426594: Issue 3673 - Merge closely matching rules

Issue 29426594: Issue 3673 - Merge closely matching rules (Closed)

Created:
May 1, 2017, 7:37 p.m. by Manish Jethani

Modified:
July 31, 2017, 8:45 a.m.

Reviewers:
Sebastian Noack, kzar

CC:
Felix Dahlke

Base URL:
https://hg.adblockplus.org/abp2blocklist

Visibility:
Public.

Description

Issue 3673 - Merge closely matching rules SCOPE Our goal is to merge two or more rules into a single rule and to do this for as many rules as is feasible given limited developer time and limitations of runtime performance. We do not aim to merge all possible rules that can be merged. We may want to do this in the future [1], but it is outside the scope of this issue. HIGH-LEVEL ALGORITHM What follows is a high-level description of the algorithm. Two objects are said to be mergeable by property X if they are otherwise identical except for the value of property X. For example, {X: 1, Y: 1} and {X: 2, Y: 1} can be merged into {X: [1, 2], Y: 1}, therefore they are "mergeable by X". {X: 1, Y: 1} and {X: 1, Y: 2} on the other hand are not mergeable by X, but they are mergeable by Y. {X: 1, Y: 1} and {X: 2, Y: 2} are neither mergeable by X nor mergeable by Y. They may or may not be "mergeable by X and Y", depending on the semantics of X and Y [2], but we do not handle this case. The rule set is divided into logical groups: CSS rules, CSS exception rules, blocking rules, and blocking exception rules. While the order of the logical groups is significant, the order of the rules within each logical group is not. We can only safely merge rules that belong to the same logical group. We start by dividing each logical group further into groups of rules mergeable by URL filter (rule.trigger["url-filter"]). We then try to merge as many rules within each group as we "feasibly" can using a simple algorithm, which is described below. Next we take the output of the above and divide the rules again into groups of rules mergeable by resource type (rule.trigger["resource-type"]). We then merge all the rules within each group into a single rule. Since the resource type is a list, merging a group of two or more rules is as simple as making a union of all the resource types in the group. Next we take the output of the above and do the same thing for the "if-domain" property (rule.trigger["if-domain"]), the value of which is also a mergeable list. The output of the above is the merged logical group. The final rule set is a concatenation of all the merged logical groups. MERGING BY URL FILTER The algorithm for merging by URL filter is as follows. For each pair combination of rules in the group [3], we check if the URL filters are a "close match". Two URL filters are a close match either if they differ only in one character (i.e. a Levenshtein distance of 1) or if it's possible to convert one into the other by deleting or inserting one and only one contiguous range of characters from one into the other. For example, "ad" and "ads", "adv" and "ads", and "adv" and "advertisement" are all close matches. Next we sort the group by the number of close matches, such that a rule with N close matches appears before a rule with N - 1 close matches in the group. We then iterate over the group and try to merge each rule with as many of its close matches as we can. For example, the rules with URL filters "ad", "ads", and "adv" are all merged into one with the URL filter "ad[sv]?". Note that the URL filter "adv" is a close match with "advertisement" but also with "ad" and "ads". In this case we merge it into "ad[vs]?" and not "adv(ertisement)?", since the former combines more rules into one. LIMITATIONS & AREAS FOR IMPROVEMENT We only merge with one set of parentheses. For example, "ad" and "advertisement" will be merged into "ad(vertisement)?", but "ad", "advert", and "advertisement" will not be merged into "ad(vert(isement)?)?" [4]. Further, we never mix parentheses with other kinds of merging. For example, "ads", "adv", and "adsserver", and "advserver" will not be merged into "ad[sv](server)?" [4]. Also, since URL filters are regular expressions to begin with, if the delta between two URL filters contains a regular expression special character like "\", "+", "?", and so on, the two are not considered to be a close match [4]. RESULTS Below are the results of running the merge function on the various subscription options available in Adblock Plus for Safari on macOS. * EasyList+AA Rule merging took 918ms Before: 49,261 rules After: 43,617 rules * EasyList Germany+EasyList+AA Rule merging took 960ms Before: 55,646 rules After: 49,674 rules * EasyList China+EasyList+AA Rule merging took 968ms Before: 57,725 rules After: 51,912 rules [5] * ABPindo+EasyList+AA Rule merging took 861ms Before: 50,250 rules After: 44,561 rules * Bulgarian list+EasyList+AA Rule merging took 877ms Before: 50,446 rules After: 44,777 rules * EasyList Czech and Slovak+EasyList+AA Rule merging took 864ms Before: 50,188 rules After: 44,520 rules * EasyList Dutch+EasyList+AA Rule merging took 882ms Before: 50,399 rules After: 44,710 rules * EasyList Hebrew+EasyList+AA Rule merging took 860ms Before: 50,069 rules After: 44,342 rules * EasyList Italy+EasyList+AA Rule merging took 934ms Before: 52,148 rules After: 46,404 rules * EasyList Lithuania+EasyList+AA Rule merging took 880ms Before: 50,262 rules After: 44,599 rules * Latvian List+EasyList+AA Rule merging took 845ms Before: 49,596 rules After: 43,937 rules * Liste AR+Liste FR+EasyList+AA Rule merging took 977ms Before: 60,398 rules After: 53,769 rules [5] * Liste FR+EasyList+AA Rule merging took 955ms Before: 58,875 rules After: 52,352 rules [5] * ROList+EasyList+AA Rule merging took 858ms Before: 49,556 rules After: 43,897 rules * RuAdList+EasyList+AA Rule merging took 1,070ms Before: 62,919 rules After: 56,663 rules [5] Footnotes: 1: Owing to the relative simplicity of the algorithm, there may still be mergeable rules that remain unmerged. A simple way to address this would be to run the algorithm recursively until it is no longer able to merge any more rules (essentially brute force). We could also come up with something cleverer. Based on tests with real data, it appears not worth trying. 2: For content blocker rules, there are no two properties that can be merged together. If "if-domain" and "unless-domain" were allowed to be specified together, two rules could be merged by both properties, but this is not the case. 3: This is necessarily O(N^2). We can reduce it to O(N) by specifying a "heuristic range" within which to look for close matches. In fact, we have tried with a heuristic range of 10, and it appears to yield good results. 4: This is an area for improvement. 5: This is still above the 50,000 rules limit.

Patch Set 1 #

Patch Set 2 : Ignore special characters #

Total comments: 1

Patch Set 3 : Look for matches only within a certain range #

Patch Set 4 : Fix bugs and add unit tests #

Patch Set 5 : Add advanced merge support #

Total comments: 27

Patch Set 6 : Improved matching algorithm #

Patch Set 7 : Make it work in Safari #

Patch Set 8 : Simplify and optimize #

Patch Set 9 : Simplify closeMatch function #

Patch Set 10 : Merge rules by resource-type and if-domain #

Patch Set 11 : Clean up #

Patch Set 12 : Filter out redundant rules before merging #

Total comments: 27

Patch Set 13 : Move merge options to constructor and update comments #

Total comments: 6

Patch Set 14 : Use for..of instead of forEach where possible #

Total comments: 2

Patch Set 15 : Change options object into a single boolean option #

Patch Set 16 : Merge rules in groups #

Total comments: 7

Patch Set 17 : Make generateRules asynchronous #

Patch Set 18 : Make more code asynchronous #

Patch Set 19 : Implement merge options "all", "auto", and "none" #

Patch Set 20 : Remove redundant async #

Total comments: 6

Patch Set 21 : Split variable declaration statement #

Total comments: 3

Patch Set 22 : Rebase with minor changes #

Total comments: 4

Patch Set 23 : Map async callees asynchronously #

Patch Set 24 : Rebase #

Created: July 28, 2017, 1:31 p.m.

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+990 lines, -284 lines)			Patch
M	abp2blocklist.js	View	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21	1 chunk	+15 lines, -14 lines	0 comments	Download
M	lib/abp2blocklist.js	View	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23	4 chunks	+572 lines, -10 lines	0 comments	Download
M	test/abp2blocklist.js	View	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23	1 chunk	+403 lines, -260 lines	0 comments	Download

Messages

Total messages: 67

Expand All Messages | Collapse All Messages

kzar

Wow, I'm impressed you came up with something so fast. Could you add some unit ...

May 1, 2017, 7:52 p.m. (2017-05-01 19:52:55 UTC) #3

kzar

Also could you run the script on EasyList before and after your changes? I'd be ...

May 2, 2017, 9:24 a.m. (2017-05-02 09:24:46 UTC) #4

Manish Jethani

On 2017/05/02 09:24:46, kzar wrote: > Also could you run the script on EasyList before ...

May 2, 2017, 3:38 p.m. (2017-05-02 15:38:25 UTC) #5

Manish Jethani

Patch Set 2: Ignore special characters It's a minor change, but it seemed like we ...

May 2, 2017, 3:39 p.m. (2017-05-02 15:39:53 UTC) #6

Manish Jethani

Patch Set 3: Look for matches only within a certain range Now we're talking. With ...

May 2, 2017, 4:53 p.m. (2017-05-02 16:53:09 UTC) #7

Manish Jethani

On 2017/05/02 09:24:46, kzar wrote: > Another thing to check, have you made sure the ...

May 2, 2017, 7:02 p.m. (2017-05-02 19:02:02 UTC) #8

Manish Jethani

Patch Set 4: Fix bugs and add unit tests Fixed some bugs so that there ...

May 3, 2017, 12:31 a.m. (2017-05-03 00:31:19 UTC) #9

Manish Jethani

Patch Set 5: Add advanced merge support This is beyond the scope of the issue, ...

May 3, 2017, 4:46 a.m. (2017-05-03 04:46:54 UTC) #10

kzar

> Also, the added execution is easily offset by using a more efficient > implementation ...

May 3, 2017, 10:22 a.m. (2017-05-03 10:22:18 UTC) #11

> Also, the added execution is easily offset by using a more efficient
> implementation of tldjs.getDomain. I was able to make the pre-change
> script run ~85% faster by swapping in my own implementation. We
> could also just use the implementation in adblockpluschrome. I'm not
> yet fully aware of the context and where this script is supposed to
> be running, but I'll look into it separately.

So abp2blocklist is used in two ways as far as I'm aware.

 1. For the generation of our blocking lists used by our iOS and other
    mobile apps. This happens using Node.js and therefore the slow tldjs
    module. IIRC this is actually done manually since false-positives are
    still common enough to be a problem. Speed is not important here.
 2. For the content blocking feature of Adblock Plus for Safari. This
    happens when the user enabled the experimental option from the
    extension's option page. The lists are generated at runtime, so speed
    is important, but the slow tldjs module isn't used. This use case is
    less of a focus now since we've moved Safari support onto a separate
    branch of adblockpluschrome, but I guess it's still worth considering.

Things that matter:

 - Safari currently has a limit of 50,000 blocking rules. This means that
   EasyList + AA can be enabled, but isn't enough for the addition of
   region specific filter lists such as EasyList Dutch. If we can reduce
   the number of blocking rules generated enough to allow for extra lists
   to be enabled that would be a big win.
 - Accuracy of blocking. Since Adblock Plus filters don't translate perfectly
   to content blocking rules we have had issues with false negatives and
   (much worse) false positives. If we can trust abp2blocklist enough to
   generate the blocking lists for our mobile apps automatically it would be
   great.

I've also just realised I'm pretty much repeating what Felix said in a recent
email to Hubert, I'll forward that on to you now.

kzar

On 2017/05/03 10:22:18, kzar wrote: > - Accuracy of blocking. Since Adblock Plus filters don't ...

May 3, 2017, 10:27 a.m. (2017-05-03 10:27:49 UTC) #12

kzar

Just checking in IRC and we do still care about supporting Adblock Plus for Safari ...

May 3, 2017, 10:47 a.m. (2017-05-03 10:47:24 UTC) #13

kzar

I need more time to get my head around what you're doing here but I've ...

May 3, 2017, 11:17 a.m. (2017-05-03 11:17:24 UTC) #14

Manish Jethani

On 2017/05/03 10:27:49, kzar wrote: > If you're interested here's[1] the repository for those manually ...

May 3, 2017, 1:57 p.m. (2017-05-03 13:57:20 UTC) #15

Manish Jethani

Comments inline. https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode369 lib/abp2blocklist.js:369: function closeMatch(s, t, {multi = false} = ...

May 3, 2017, 2:41 p.m. (2017-05-03 14:41:54 UTC) #16

Comments inline.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:369: function closeMatch(s, t, {multi = false} = {})
On 2017/05/03 11:17:24, kzar wrote:
> I've not seen this syntax before `{multi = false} = {}` could you explain?

This:

  function func(param1, param2, {option1 = false} = {})
  {
    ...
  }

Is the equivalent of this:

  function func(param1, param2, options)
  {
    let option1 = false;

    if (options && options.option1 !== undefined) // Note: strict equality
      option1 = options.option1;

    ...
  }

It's part of the ES6 standard.

http://2ality.com/2015/01/es6-destructuring.html#simulating-named-parameters-...

The idea is that a function will take a list of parameters that are strictly
required for it to do its job, and then a bunch of options. The list of options
can keep growing over time. Here they can be added in a backwards compatible way
without much fuss. This is a common pattern in JavaScript.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:417: // We don't deal with special characters for now.
On 2017/05/03 11:17:24, kzar wrote:
> So we skip special characters in the url-filter regexp, but I wonder if that's
> enough to avoid messing up regexps. For example
> 
> s = "[ab]"
> b = "[bc]"
> 
> Wouldn't we end up with "[abc]"?

The above is not a good example because this is not even supported. The "edit
distance" (Levenshtein distance) between the two is greater than 1. It's going
to treat the above two inputs as "not close matches".

However let's take this one: "[ab]" and "[ac]". It's going to mess up and give
you "[a[bc]]".

I thought of supporting this, but then I decided against it after looking at the
actual values for the URL filters. There is not a single instance of this case.
The code right now makes the assumption that this never occurs in the input.

The ignoring of special characters is more for when the character appears as a
literal in the original filter. For example, "foo.com/?ad" becomes
"^https?://foo\.com/\?ad", and we just don't want to deal with the "?" because
it is escaped (which means we really have to consider the preceding backslash
with it, and that would complicate things).

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:476: rulesInfo[index] = {rule};
On 2017/05/03 11:17:24, kzar wrote:
> I'm not sure syntax like this will work for Safari 9 :(

I'll check, but if it doesn't work then I'll have to change this in a few
places.

kzar

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode369 lib/abp2blocklist.js:369: function closeMatch(s, t, {multi = false} = {}) On ...

May 3, 2017, 3:19 p.m. (2017-05-03 15:19:04 UTC) #17

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:369: function closeMatch(s, t, {multi = false} = {})
On 2017/05/03 14:41:54, Manish Jethani wrote:
> On 2017/05/03 11:17:24, kzar wrote:
> > I've not seen this syntax before `{multi = false} = {}` could you explain?
> 
> This:
> 
>   function func(param1, param2, {option1 = false} = {})
>   {
>     ...
>   }
> 
> Is the equivalent of this:
> 
>   function func(param1, param2, options)
>   {
>     let option1 = false;
> 
>     if (options && options.option1 !== undefined) // Note: strict equality
>       option1 = options.option1;
> 
>     ...
>   }
> 
> It's part of the ES6 standard.
> 
>
http://2ality.com/2015/01/es6-destructuring.html#simulating-named-parameters-...
> 
> The idea is that a function will take a list of parameters that are strictly
> required for it to do its job, and then a bunch of options. The list of
options
> can keep growing over time. Here they can be added in a backwards compatible
way
> without much fuss. This is a common pattern in JavaScript.

Acknowledged.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:417: // We don't deal with special characters for now.
On 2017/05/03 14:41:54, Manish Jethani wrote:
> However let's take this one: "[ab]" and "[ac]". It's going to mess up and give
> you "[a[bc]]".
> 
> I thought of supporting this, but then I decided against it after looking at
the
> actual values for the URL filters. There is not a single instance of this
case.
> The code right now makes the assumption that this never occurs in the input.

Maybe add a comment explaining that assumption?

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:454: let copy = {
How about `return Object.create(rule, {"url-filter": {value: undefined}});`

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:470: const heuristicRange = 100;
Since the code either runs in a place where speed really matters, or where it
doesn't matter at all, perhaps we should change this and the multi option?

I was thinking "multi" could be renamed "slow" or something like that. If slow's
true then match multiple character changes and have no heuristicRange limit. We
could perhaps use the option to make other decisions in the future too, if we
think of other ways to process the rules.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:502: if (rulesInfo[i].stringifiedWithoutURLFilter ==
I wonder if we could create a lookup table stringifiedWithoutURLFilter =>
[ruleInfo] instead?

Manish Jethani

Patch Set 6: Improved matching algorithm Now instead of trying to do both kinds of ...

May 4, 2017, 2:49 a.m. (2017-05-04 02:49:32 UTC) #18

Patch Set 6: Improved matching algorithm

Now instead of trying to do both kinds of matches within the same loop, I've
separated the logic out into two different algorithms. If the strings differ by
zero or one character in length, we try one; otherwise we try the other one. It
works beautifully, easier to understand as well.

I ran an "advanced" and "exhaustive" (test every rule against every other)
version of this on the same easylist.txt, here are the results:

Before:

 *  Execution time: ~15s
 *  Output size: 38,955 rules

After:

 *  Execution time: 1m ~25s
 *  Output size: 36,649 rules

It's smaller by more than 2,000 rules. I'm still getting ideas.

I also manually examined about 15% of the generated rules, looks OK to me except
one thing: the hyphen ("-") should appear first within square brackets,
otherwise it might be a parse error. i.e. "[-/_]" instead of "[/-_]". I'll fix
this.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:377: // If the string lenghts differ by more than one
character, we cannot arrive
On 2017/05/03 11:17:24, kzar wrote:
> Nit: Typo "lenghts".

Done.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:417: // We don't deal with special characters for now.
On 2017/05/03 15:19:04, kzar wrote:
> On 2017/05/03 14:41:54, Manish Jethani wrote:
> > However let's take this one: "[ab]" and "[ac]". It's going to mess up and
give
> > you "[a[bc]]".
> > 
> > I thought of supporting this, but then I decided against it after looking at
> the
> > actual values for the URL filters. There is not a single instance of this
> case.
> > The code right now makes the assumption that this never occurs in the input.
> 
> Maybe add a comment explaining that assumption?
> 

Added a comment to explain this.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:422: if (diff == 0)
On 2017/05/03 11:17:24, kzar wrote:
> Nit: Couldn't this be an `else if` too?

Done.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:432: else if (diff > 0)
On 2017/05/03 11:17:24, kzar wrote:
> Nit: Please use braces since the clause spans multiple lines.

Done.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:454: let copy = {
On 2017/05/03 15:19:04, kzar wrote:
> How about `return Object.create(rule, {"url-filter": {value: undefined}});`

That would not work for multiple reasons, but most notably because "url-filter"
is a property of rule.trigger, not rule itself.

The other reason is that an object created that way has no own, enumerable
properties, therefore the stringified version is "{}".

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:470: const heuristicRange = 100;
On 2017/05/03 15:19:04, kzar wrote:
> Since the code either runs in a place where speed really matters, or where it
> doesn't matter at all, perhaps we should change this and the multi option?
> 
> I was thinking "multi" could be renamed "slow" or something like that. If
slow's
> true then match multiple character changes and have no heuristicRange limit.
We
> could perhaps use the option to make other decisions in the future too, if we
> think of other ways to process the rules.

In the latest update the generateRules function takes both general options like
fastMerge and more specific options like advancedMerge (experimental or slow
stuff) and exhaustiveMerge (test every rule against every other rule). If fast
is true, it'll skip both these; if set to false, it'll be as experimental and as
slow as it can (but you can still turn off individual settings).

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:502: if (rulesInfo[i].stringifiedWithoutURLFilter ==
On 2017/05/03 15:19:04, kzar wrote:
> I wonder if we could create a lookup table stringifiedWithoutURLFilter =>
> [ruleInfo] instead?

I'm not sure what the benefit of that would be.

We have two ruleInfo objects, and we want to find out if they're identical
except for the URL filter. Now it's as simple as a string comparison. I actually
tried the lookup table approach (we don't even need to store an array there,
just an empty object), it is significantly slower.

But you give me a good idea: compare hashes instead. I just take the first 32
bits of the SHA-1 hash of the string, and it makes it more than 10% faster.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:564: ruleInfo.multiEditMatch);
On 2017/05/03 11:17:24, kzar wrote:
> Nit: Long line.

Done.

Manish Jethani

Patch Set 7: Make it work in Safari OK, so I got this working in ...

May 5, 2017, 4:51 a.m. (2017-05-05 04:51:32 UTC) #19

Manish Jethani

Patch Set 7: Simplify and optimize The main change is that we're now processing rules ...

May 5, 2017, 10:10 p.m. (2017-05-05 22:10:17 UTC) #20

Manish Jethani

On 2017/05/05 22:10:17, Manish Jethani wrote: > Patch Set 7: Simplify and optimize Sorry, that's ...

May 5, 2017, 10:10 p.m. (2017-05-05 22:10:54 UTC) #21

Manish Jethani

On 2017/05/04 02:49:32, Manish Jethani wrote: > On 2017/05/03 15:19:04, kzar wrote: > > I ...

May 5, 2017, 10:32 p.m. (2017-05-05 22:32:10 UTC) #22

Manish Jethani

Patch Set 10: Merge rules by resource-type and if-domain It turns out we had some ...

May 6, 2017, 5:24 p.m. (2017-05-06 17:24:41 UTC) #23

Manish Jethani

Updated description with something of a specification. Updated title to reflect expanded scope of the ...

May 7, 2017, 5:12 p.m. (2017-05-07 17:12:10 UTC) #24

Manish Jethani

Patch Set 12: Filter out redundant rules before merging If three rules that are otherwise ...

May 7, 2017, 10:50 p.m. (2017-05-07 22:50:37 UTC) #25

Manish Jethani

On 2017/05/07 22:50:37, Manish Jethani wrote: > "^https?://.*/ad", "^https?://.*/ads", and "^https?://.*/advertisement", where > previously we ...

May 7, 2017, 11:02 p.m. (2017-05-07 23:02:13 UTC) #26

kzar

> Below are the results of running the merge function on the > various subscription ...

May 8, 2017, 8:13 a.m. (2017-05-08 08:13:04 UTC) #27

> Below are the results of running the merge function on the 
> various subscription options available in Adblock Plus for
> Safari on macOS.

Awesome. Since some combinations have gone from above 50,000 to below it's going
to be worth including this in Adblock Plus for Safari. Could you open an issue
in Trac that's blocked by #3673 to update the abp2blocklist dependency?

Thanks for the detailed explanation of what's going on in the codereview
description. Even though that's not what I meant in my email it turned out to be
really useful!

Your changes are looking really good so far, thanks for adding thorough unit
tests as well.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:476: rulesInfo[index] = {rule};
On 2017/05/03 14:41:54, Manish Jethani wrote:
> On 2017/05/03 11:17:24, kzar wrote:
> > I'm not sure syntax like this will work for Safari 9 :(
> 
> I'll check, but if it doesn't work then I'll have to change this
> in a few places.

You mentioned testing the code on Safari now, but unless I missed it not the
version. Which version did you test with?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:371: // This function returns an edit operation, one of
"substitute", "delete",
Mind using the JSDoc syntax for the comment explaining this function?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:427: if (t.length - j == i || s[s.length - j - 1] !=
t[t.length - j - 1])
I find this part hard to grok, could you add a comment?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:443: // If there are any special characters in the delta,
bail.
Nit: IMO this comment doesn't add much.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:515: const heuristicRange = 10;
Could you mention about the behaviour (or lack thereof) of heursiticRange when
exhaustive is true in this comment?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:518: // Throw out obviously redundant rules.
Nit: Please surround with braces since it spans multiple lines.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:779: let rulesInfo = rules.map(rule => ({rule}));
What's the purpose of this line?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:782: .forEach(mergeableRulesInfo =>
Any reason you used forEach instead of for ... of ... here (and below)? We
prefer the latter unless there's a good reason.

Oh, there is one problem with that though, the JS Hydra transpiler produces code
that evaluates the right hand part of for ... of ... loops for each iteration.
So bear that in mind if you do use those, we often assigned the value to a
variable first to avoid expensive operations from being performed repeatedly!

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:795: .forEach(mergeableRulesInfo =>
This logic seems pretty much the same as above, I wonder if we could avoid
duplicating it somehow?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:801: rulesInfo = rulesInfo.filter(ruleInfo =>
!ruleInfo.merged);
Maybe a dumb question, but instead of setting the merged or redundant property
on ruleInfo Objects and then filtering those away couldn't we just delete them
immediately instead?

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:924: let mergeOptions = {
Couldn't you just pass options.exhaustiveMerge straight through instead of
putting it inside this Object?

Also wouldn't it make more sense to take the merging options in the constructor
instead of this method? I think the defaultOptions logic above looks especially
out of place here.

Manish Jethani

Comments inline. I acknowledge the rest of your comments and will make the suggested changes ...

May 8, 2017, 2:04 p.m. (2017-05-08 14:04:00 UTC) #28

Comments inline.

I acknowledge the rest of your comments and will make the suggested changes in
my next update.

I'll also file an issue for including this in Safari for macOS and update Trac
issue #3673 to reflect the current work.

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:476: rulesInfo[index] = {rule};
On 2017/05/08 08:13:02, kzar wrote:
> On 2017/05/03 14:41:54, Manish Jethani wrote:
> > On 2017/05/03 11:17:24, kzar wrote:
> > > I'm not sure syntax like this will work for Safari 9 :(
> > 
> > I'll check, but if it doesn't work then I'll have to change this
> > in a few places.
> 
> You mentioned testing the code on Safari now, but unless I missed it not the
> version. Which version did you test with?

I've been testing with Safari 10.

Anyway, this is not a problem because JS Hydra transpiles {rule} into {rule:
rule}.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:779: let rulesInfo = rules.map(rule => ({rule}));
On 2017/05/08 08:13:02, kzar wrote:
> What's the purpose of this line?

The purpose of this line is to create one "ruleInfo" object for each rule, where
the ruleInfo object has one property, "rule", pointing to the rule it belongs
to.

Essentially this:

let rulesInfo = new Array(rules.length);
for (let i = 0; i < rules.length; i++)
  rulesInfo[i] = {rule: rule};

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:782: .forEach(mergeableRulesInfo =>
On 2017/05/08 08:13:03, kzar wrote:
> Any reason you used forEach instead of for ... of ... here (and below)? We
> prefer the latter unless there's a good reason.

It was for (let value of map.values()) { ... } at first, but as you said JS
Hydra can't handle non-array iterables. It looks for the length property on the
iterable, trying to convert this into a regular for loop with an index.

i.e. for (let i = 0; i < map.values().length; i++) { let value =
map.values()[i]; ... }

The only other way to do this is:

let mapValues = Array.from(map.values());
for (let value of mapValues) { ... }

If we simply call the forEach method on the map object, that does the job as
well, without creating another intermediate array. I thought this was better
because these intermediate arrays would again take up more memory.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:795: .forEach(mergeableRulesInfo =>
On 2017/05/08 08:13:02, kzar wrote:
> This logic seems pretty much the same as above, I wonder if we could avoid
> duplicating it somehow?

I suppose we could merge them into one, but then we'd end up with an if...else
or a function lookup inside that loop. They're also sufficiently different that
I thought it made sense to keep them separate. The first one is really
expensive, the second one is relatively very cheap.

I could still merge them if you feel like that would be better. We'll also have
to check performance again. I've found making small changes can hurt performance
in a big way.

For example, this:

obj[flag ? "prop1" : "prop2"] = value;

Is way, way slower than this:

if (flag)
  obj.prop1 = value;
else
  obj.prop2 = value;

For whatever reason.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:801: rulesInfo = rulesInfo.filter(ruleInfo =>
!ruleInfo.merged);
On 2017/05/08 08:13:02, kzar wrote:
> Maybe a dumb question, but instead of setting the merged or redundant property
> on ruleInfo Objects and then filtering those away couldn't we just delete them
> immediately instead?

I could try that, but then we'd have to adjust the indexes i and j every time we
delete something from the array. Also deleting individual elements from an array
should be way more expensive than marking them first and then filtering them out
later using the filter method.

Manish Jethani

Patch Set 13: Move merge options to constructor and update comments Addressed all review comments. ...

May 8, 2017, 11:12 p.m. (2017-05-08 23:12:49 UTC) #29

Manish Jethani

On 2017/05/08 08:13:04, kzar wrote: > Thanks for the detailed explanation of what's going on ...

May 8, 2017, 11:29 p.m. (2017-05-08 23:29:56 UTC) #30

kzar

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode476 lib/abp2blocklist.js:476: rulesInfo[index] = {rule}; On 2017/05/08 14:03:58, Manish Jethani wrote: ...

May 9, 2017, 10:05 a.m. (2017-05-09 10:05:48 UTC) #31

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:476: rulesInfo[index] = {rule};
On 2017/05/08 14:03:58, Manish Jethani wrote:
> On 2017/05/08 08:13:02, kzar wrote:
> > On 2017/05/03 14:41:54, Manish Jethani wrote:
> > > On 2017/05/03 11:17:24, kzar wrote:
> > > > I'm not sure syntax like this will work for Safari 9 :(
> > > 
> > > I'll check, but if it doesn't work then I'll have to change this
> > > in a few places.
> > 
> > You mentioned testing the code on Safari now, but unless I missed it not the
> > version. Which version did you test with?
> 
> I've been testing with Safari 10.
> 
> Anyway, this is not a problem because JS Hydra transpiles {rule} into {rule:
> rule}.

I think you should test with Safari 9 at least once to make sure it still works.
If you don't have a access to a copy you can us our test machines from
https://app.webmate.io/

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:443: // If there are any special characters in the delta,
bail.
On 2017/05/08 23:12:48, Manish Jethani wrote:
> On 2017/05/08 08:13:02, kzar wrote:
> > Nit: IMO this comment doesn't add much.
> 
> I'm slightly rephrasing here, but since I've added comments all over the
> function now, I'll leave this here.

Fair enough. Thanks for those extra comments, they helped me grok the algorithm.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:779: let rulesInfo = rules.map(rule => ({rule}));
On 2017/05/08 14:03:59, Manish Jethani wrote:
> On 2017/05/08 08:13:02, kzar wrote:
> > What's the purpose of this line?
> 
> The purpose of this line is to create one "ruleInfo" object for each rule,
where
> the ruleInfo object has one property, "rule", pointing to the rule it belongs
> to.
> 
> Essentially this:
> 
> let rulesInfo = new Array(rules.length);
> for (let i = 0; i < rules.length; i++)
>   rulesInfo[i] = {rule: rule};

Argh, sorry that was an especially dumb question. In my defence a cat jumped on
my head at 4am that morning, I couldn't get back to sleep and was kind of dazed
all day!

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:782: .forEach(mergeableRulesInfo =>
On 2017/05/08 14:04:00, Manish Jethani wrote:
> On 2017/05/08 08:13:03, kzar wrote:
> > Any reason you used forEach instead of for ... of ... here (and below)? We
> > prefer the latter unless there's a good reason.
> 
> It was for (let value of map.values()) { ... } at first, but as you said JS
> Hydra can't handle non-array iterables. It looks for the length property on
the
> iterable, trying to convert this into a regular for loop with an index.
> 
> i.e. for (let i = 0; i < map.values().length; i++) { let value =
> map.values()[i]; ... }
> 
> The only other way to do this is:
> 
> let mapValues = Array.from(map.values());
> for (let value of mapValues) { ... }
> 
> If we simply call the forEach method on the map object, that does the job as
> well, without creating another intermediate array. I thought this was better
> because these intermediate arrays would again take up more memory.

Well groupRulesByMergeableProperty creates the array either way, so assuming you
didn't keep a reference to that after the second array is created I don't see
that it would use more memory. That said I don't care too much and IMO this is a
pretty reasonable use of forEach.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:795: .forEach(mergeableRulesInfo =>
On 2017/05/08 14:03:59, Manish Jethani wrote:
> On 2017/05/08 08:13:02, kzar wrote:
> > This logic seems pretty much the same as above, I wonder if we could avoid
> > duplicating it somehow?
> 
> I suppose we could merge them into one, but then we'd end up with an if...else
> or a function lookup inside that loop. They're also sufficiently different
that
> I thought it made sense to keep them separate. The first one is really
> expensive, the second one is relatively very cheap.
> 
> I could still merge them if you feel like that would be better. We'll also
have
> to check performance again. I've found making small changes can hurt
performance
> in a big way.
> 
> For example, this:
> 
> obj[flag ? "prop1" : "prop2"] = value;
> 
> Is way, way slower than this:
> 
> if (flag)
>   obj.prop1 = value;
> else
>   obj.prop2 = value;
> 
> For whatever reason.

Acknowledged.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:801: rulesInfo = rulesInfo.filter(ruleInfo =>
!ruleInfo.merged);
On 2017/05/08 14:03:58, Manish Jethani wrote:
> On 2017/05/08 08:13:02, kzar wrote:
> > Maybe a dumb question, but instead of setting the merged or redundant
property
> > on ruleInfo Objects and then filtering those away couldn't we just delete
them
> > immediately instead?
> 
> I could try that, but then we'd have to adjust the indexes i and j every time
we
> delete something from the array. Also deleting individual elements from an
array
> should be way more expensive than marking them first and then filtering them
out
> later using the filter method.

Acknowledged.

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.j...
lib/abp2blocklist.js:799: const defaultOptions = {exhaustive: false};
Have defaultOptions be a property on ContentBlockerList.prototype so you can use
it both here and in the constructor?

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.j...
lib/abp2blocklist.js:947: rules = mergeRules(rules, {exhaustive:
this.options.exhaustiveMerge});
Why wrap the exhaustiveMerge option in an Object here?

kzar

On 2017/05/08 23:29:56, Manish Jethani wrote: > On 2017/05/08 08:13:04, kzar wrote: > > > ...

May 9, 2017, 10:11 a.m. (2017-05-09 10:11:58 UTC) #32

Manish Jethani

Patch Set 14: Use for..of instead of forEach where possible https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode476 ...

May 9, 2017, 3:52 p.m. (2017-05-09 15:52:47 UTC) #33

Patch Set 14: Use for..of instead of forEach where possible

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.j...
lib/abp2blocklist.js:476: rulesInfo[index] = {rule};
On 2017/05/09 10:05:46, kzar wrote:
> On 2017/05/08 14:03:58, Manish Jethani wrote:
> > On 2017/05/08 08:13:02, kzar wrote:
> > > On 2017/05/03 14:41:54, Manish Jethani wrote:
> > > > On 2017/05/03 11:17:24, kzar wrote:
> > > > > I'm not sure syntax like this will work for Safari 9 :(
> > > > 
> > > > I'll check, but if it doesn't work then I'll have to change this
> > > > in a few places.
> > > 
> > > You mentioned testing the code on Safari now, but unless I missed it not
the
> > > version. Which version did you test with?
> > 
> > I've been testing with Safari 10.
> > 
> > Anyway, this is not a problem because JS Hydra transpiles {rule} into {rule:
> > rule}.
> 
> I think you should test with Safari 9 at least once to make sure it still
works.
> If you don't have a access to a copy you can us our test machines from
> https://app.webmate.io/

"{rule: rule}" ought to work in every single JS engine in existence. It makes
sense to test everything out in Safari 9 though, I'll do that and get back with
the results.

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:779: let rulesInfo = rules.map(rule => ({rule}));
On 2017/05/09 10:05:46, kzar wrote:

> Argh, sorry that was an especially dumb question. In my defence a cat jumped
on
> my head at 4am that morning, I couldn't get back to sleep and was kind of
dazed
> all day!

Haha, no worries :)

https://codereview.adblockplus.org/29426594/diff/29432555/lib/abp2blocklist.j...
lib/abp2blocklist.js:782: .forEach(mergeableRulesInfo =>
On 2017/05/09 10:05:46, kzar wrote:
> On 2017/05/08 14:04:00, Manish Jethani wrote:
> > On 2017/05/08 08:13:03, kzar wrote:
> > > Any reason you used forEach instead of for ... of ... here (and below)? We
> > > prefer the latter unless there's a good reason.
> > 
> > It was for (let value of map.values()) { ... } at first, but as you said JS
> > Hydra can't handle non-array iterables. It looks for the length property on
> the
> > iterable, trying to convert this into a regular for loop with an index.
> > 
> > i.e. for (let i = 0; i < map.values().length; i++) { let value =
> > map.values()[i]; ... }
> > 
> > The only other way to do this is:
> > 
> > let mapValues = Array.from(map.values());
> > for (let value of mapValues) { ... }
> > 
> > If we simply call the forEach method on the map object, that does the job as
> > well, without creating another intermediate array. I thought this was better
> > because these intermediate arrays would again take up more memory.
> 
> Well groupRulesByMergeableProperty creates the array either way, so assuming
you
> didn't keep a reference to that after the second array is created I don't see
> that it would use more memory. That said I don't care too much and IMO this is
a
> pretty reasonable use of forEach.

The intermediate array here would an array of the arrays created inside
groupRulesByMergeableProperty.

Since you brought this up, I noticed that we're using a forEach in that function
itself where we could use for..of. I've changed it now both in
groupRulesByMergeableProperty and mergeRulesByArrayProperty.

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js
File lib/abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.j...
lib/abp2blocklist.js:799: const defaultOptions = {exhaustive: false};
On 2017/05/09 10:05:47, kzar wrote:
> Have defaultOptions be a property on ContentBlockerList.prototype so you can
use
> it both here and in the constructor?

Actually the options for the ContentBlockerList constructor are different than
the options for the mergeRules function, which is not part of ContentBlockerList
at all. For what it's worth I think it might even be a good idea to move the
merging code out into a separate module. I don't know if this would be better or
worse for performance.

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.j...
lib/abp2blocklist.js:947: rules = mergeRules(rules, {exhaustive:
this.options.exhaustiveMerge});
On 2017/05/09 10:05:47, kzar wrote:
> Why wrap the exhaustiveMerge option in an Object here?

This is because mergeRules takes an option called "exhaustive", whereas the
ContentBlockerList constructor takes an option called "exhaustiveMerge". The
name of the latter is more specific because "exhaustive" could mean just
anything to the constructor (exhaustive merging, exhaustive rule generation,
exhaustive filter addition, possibly other things the ContentBlockerList class
starts to do over time as we add features). On the other hand, the mergeRules
function, which could very well be its own module, does one and only one thing:
it merges rules. There is no possible ambiguity about the meaning of the word
"exhaustive" there.

Now all of this would really matter if both ContentBlockerList and mergeRules
were public APIs. The latter is not even exposed outside of this file, it really
doesn't matter one way or another.

kzar

https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js#newcode947 lib/abp2blocklist.js:947: rules = mergeRules(rules, {exhaustive: this.options.exhaustiveMerge}); On 2017/05/09 15:52:47, Manish ...

May 9, 2017, 4:50 p.m. (2017-05-09 16:50:59 UTC) #34

kzar

I wonder if there should be the merge option at all? We don't want to ...

May 9, 2017, 4:54 p.m. (2017-05-09 16:54:46 UTC) #35

Manish Jethani

Patch Set 15: Change options object into a single boolean option https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js File lib/abp2blocklist.js (right): ...

May 9, 2017, 5:32 p.m. (2017-05-09 17:32:12 UTC) #36

Manish Jethani

On 2017/05/09 16:54:46, kzar wrote: > I wonder if there should be the merge option ...

May 9, 2017, 5:36 p.m. (2017-05-09 17:36:29 UTC) #37

kzar

LGTM assuming it still works for Safari 9. By the way are you definitely hitting ...

May 10, 2017, 8:40 a.m. (2017-05-10 08:40:40 UTC) #38

Manish Jethani

On 2017/05/10 08:40:40, kzar wrote: > LGTM assuming it still works for Safari 9. Now ...

May 10, 2017, 11:49 a.m. (2017-05-10 11:49:25 UTC) #39

Manish Jethani

On 2017/05/10 11:49:25, Manish Jethani wrote: > I'll file the issue for integrating this into ...

May 11, 2017, 8 p.m. (2017-05-11 20:00:42 UTC) #40

Manish Jethani

I just realized that this might be broken because the order of the "ignore-previous-rules" type ...

May 18, 2017, 11:33 p.m. (2017-05-18 23:33:46 UTC) #41

kzar

On 2017/05/18 23:33:46, Manish Jethani wrote: > I just realized that this might be broken ...

May 19, 2017, 12:13 p.m. (2017-05-19 12:13:38 UTC) #42

Manish Jethani

Patch Set 16: Merge rules in groups Generate rules in groups and then merge rules ...

May 20, 2017, 9:09 p.m. (2017-05-20 21:09:51 UTC) #43

Manish Jethani

On 2017/05/20 21:09:51, Manish Jethani wrote: > Patch Set 16: Merge rules in groups > ...

May 20, 2017, 9:59 p.m. (2017-05-20 21:59:05 UTC) #44

Sebastian Noack

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); Any ...

May 21, 2017, 9:07 p.m. (2017-05-21 21:07:29 UTC) #45

kzar

May 22, 2017, 8:47 a.m. (2017-05-22 08:47:01 UTC) #46

Manish Jethani

May 22, 2017, 9:28 a.m. (2017-05-22 09:28:29 UTC) #47

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js
File abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#ne...
abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true,
exhaustiveMerge: true});
On 2017/05/22 08:47:01, kzar wrote:
> On 2017/05/21 21:07:29, Sebastian Noack wrote:
> > Any particular reason why we even make these configurable, and not always
> merge
> > the rules as good as we can?
> 
> Well the exhaustiveMerge option makes sense, since turning that on slows
things
> down quite a bit and we probably don't want to do that for Safari users where
> the rules are being generated on their end. I agree about the merge option
> however, I can't see why wouldn't ever want it. I guess maybe it's useful for
> the tests?

Yes, the exhaustive merge gives us 5-10% extra gains, but at a significant cost,
it's not feasible to run that in the browser as it freezes the browser (the user
sees the beachball icon and can't do anything for a while).

The merge being an option, that's useful for the unit tests for now (we can fix
this), but also the Safari client might want to pass false here if the rule set
is already going to be within the 50,000-rule limit. This is true for the
default option (EasyList+AA) already. Skipping the merge just makes it a snappy
user experience.

I still think we can do more in terms of user experience by doing the merge in
stages, using a generator for example.

  function* mergeRules(rules)
  {
    // Merge some rules
    yield rules;
    // Merge some more rules
    yield rules;
    ...
  }

  ...

  let iterator = mergeRules(rules);
  setTimeout(function iterate() {
    let next = mergeRules.next();
    rules = next.value;
    if (!next.done)
      setTimeout(iterate, interval);
    else
      addRulesToSafari(rules);
  }, interval);

This will do it in stages so it won't freeze the browser.

Sebastian Noack

Also, as discussed on IRC, https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ...

May 22, 2017, 9:39 a.m. (2017-05-22 09:39:25 UTC) #48

Sebastian Noack

May 22, 2017, 9:54 a.m. (2017-05-22 09:54:58 UTC) #49

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js
File abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#ne...
abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true,
exhaustiveMerge: true});
On 2017/05/22 09:28:28, Manish Jethani wrote:
> On 2017/05/22 08:47:01, kzar wrote:
> > On 2017/05/21 21:07:29, Sebastian Noack wrote:
> > > Any particular reason why we even make these configurable, and not always
> > merge
> > > the rules as good as we can?
> > 
> > Well the exhaustiveMerge option makes sense, since turning that on slows
> things
> > down quite a bit and we probably don't want to do that for Safari users
where
> > the rules are being generated on their end. I agree about the merge option
> > however, I can't see why wouldn't ever want it. I guess maybe it's useful
for
> > the tests?
> 
> Yes, the exhaustive merge gives us 5-10% extra gains, but at a significant
cost,
> it's not feasible to run that in the browser as it freezes the browser (the
user
> sees the beachball icon and can't do anything for a while).
> 
> The merge being an option, that's useful for the unit tests for now (we can
fix
> this), but also the Safari client might want to pass false here if the rule
set
> is already going to be within the 50,000-rule limit. This is true for the
> default option (EasyList+AA) already. Skipping the merge just makes it a
snappy
> user experience.
> 
> I still think we can do more in terms of user experience by doing the merge in
> stages, using a generator for example.
> 
>   function* mergeRules(rules)
>   {
>     // Merge some rules
>     yield rules;
>     // Merge some more rules
>     yield rules;
>     ...
>   }
> 
>   ...
> 
>   let iterator = mergeRules(rules);
>   setTimeout(function iterate() {
>     let next = mergeRules.next();
>     rules = next.value;
>     if (!next.done)
>       setTimeout(iterate, interval);
>     else
>       addRulesToSafari(rules);
>   }, interval);
> 
> This will do it in stages so it won't freeze the browser.

Oh, if Safari entirely freezes, that should be avoided of course. I already
thought about doing the conversion into a WebWorker, but the problem with that
approach would be to get the bulk of filter data into the worker. Using stages
seems like a better idea.

On the other hand, we are plan to replace the Safari extension with a different
product which downloads the pre-generated lists from the server anyway. So any
improvements that are limited to the current Safari extension might not be worth
it. But I guess I leave it up to you.

Manish Jethani

May 22, 2017, 5:39 p.m. (2017-05-22 17:39:52 UTC) #50

Sebastian Noack

May 23, 2017, 11:59 a.m. (2017-05-23 11:59:10 UTC) #51

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js
File abp2blocklist.js (right):

https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#ne...
abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true,
exhaustiveMerge: true});
On 2017/05/22 17:39:52, Manish Jethani wrote:
> On 2017/05/22 09:39:25, Sebastian Noack wrote:
> > On 2017/05/22 08:47:01, kzar wrote:
> > > On 2017/05/21 21:07:29, Sebastian Noack wrote:
> > > > Any particular reason why we even make these configurable, and not
always
> > > merge
> > > > the rules as good as we can?
> > > 
> > > Well the exhaustiveMerge option makes sense, since turning that on slows
> > things
> > > down quite a bit and we probably don't want to do that for Safari users
> where
> > > the rules are being generated on their end. I agree about the merge option
> > > however, I can't see why wouldn't ever want it.
> > 
> > On the other hand, the Safari extension is currently mostly effected by the
> 50k
> > rules limit, hence would mostly benefit from these improvements, since we
> don't
> > even offer other filter lists in the iOS app yet.
> > 
> > As for the performance, it would be helpful if somebody could share some
> > benchmarks comparing the performance with an without (exhaustive) merging.
> 
> EasyList+AA
> 
> Non-exhaustive:
> 
>  *  Rule merging took 1530ms
>  *  Before: 49,198 rules
>  *  After: 43,668 rules
> 
> Exhaustive:
> 
>  *  Rule merging took 29667ms
>  *  Before: 49,198 rules
>  *  After: 42,794 rules
> 
> So the results improve by ~16% here if we do an exhaustive merge, but it's
> totally not feasible in the browser. I mean ~1.5s vs. ~30s.

Yeah, I think I agree. Even if we can avoid the freezing issue through using
timeouts, a delay of 30s (potentially more on slower hardware) on browser start,
extension installation, and filter change before the filters take effect would
be rather bad. So let's keep this behavior behind an option which we only use on
the server, not sure if we need two different options though.

Manish Jethani

Patch Set 17: Make generateRules asynchronous I've made the generateRules function and most of the ...

May 23, 2017, 4:25 p.m. (2017-05-23 16:25:29 UTC) #52

Patch Set 17: Make generateRules asynchronous

I've made the generateRules function and most of the merging process
asynchronous now.

  blockList.generateRules().then(rules =>
  {
    ...
  });

The benefit of doing this is that we're now able to do an exhaustive merge in
the browser.

I ran this in my Safari with EasyList+AA, it took 37 seconds to generate the
rule set (with exhaustive merging), and I did not notice much as I continued to
do my web browsing.

Now there might be an argument against this that the user cannot wait 37 seconds
after selecting a filter subscription. Well, they already have to wait for it to
download, they might as well wait a little longer. This happens in the
background anyway. We can show the appropriate messages in the UI (e.g. "The
subscription will be activated once it has been downloaded and processed",
etc.).

Also, if we don't want this to take 37 seconds in the background, we can still
not do an exhaustive merge but simply tweak the heuristicRange constant (e.g.
set it to 100 instead of 10). This will still give us better results, adding
only a few seconds to the process.

For the default subscription, we can simply do this:

  let rules = blockList.generateRules();
  if (rules.length >= 50000)
    blockList.mergeRules(rules);

The user doesn't have to wait to see ad blocking in action because the default
subscription is already within the 50,000-rule limit. For the rest of the
subscriptions, they have to wait anyway, they might as well wait a little
longer.

The other argument against this approach might be that this makes the code
harder to maintain. I'm going to say that it's a matter of opinion, but promise
chaining of this kind is pretty standard in JavaScript. JavaScript is mostly
asynchronous in the first place because we don't want the browser to freeze up.
Promises only make this easier to manage.

Let me know what you think.

I haven't "asynchronousized" the entire merge process. Right now it's about
140,000 asynchronous steps, some of them with pauses using setTimeout, but it's
not difficult to convert the rest of the code.

Sebastian Noack

The main problem with any delay here is that after starting up Safari (while potentially ...

May 23, 2017, 4:58 p.m. (2017-05-23 16:58:59 UTC) #53

Manish Jethani

Patch Set 18: Make more code asynchronous Now I've managed to narrow the gap between ...

May 24, 2017, 12:19 a.m. (2017-05-24 00:19:06 UTC) #54

Manish Jethani

Patch Set 19: Implement merge options "all", "auto", and "none" The merge option can now ...

May 24, 2017, 1:31 a.m. (2017-05-24 01:31:08 UTC) #55

kzar

https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js#newcode465 lib/abp2blocklist.js:465: let i = 0, j = 0; Nit: Please ...

May 29, 2017, 9:49 a.m. (2017-05-29 09:49:55 UTC) #57

Manish Jethani

Patch Set 21: Split variable declaration statement https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js#newcode465 lib/abp2blocklist.js:465: let i ...

May 31, 2017, 6:43 a.m. (2017-05-31 06:43:08 UTC) #58

kzar

https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js#newcode1033 lib/abp2blocklist.js:1033: console.error("Rule merging took " + (Date.now() - t) + ...

July 7, 2017, 1:23 p.m. (2017-07-07 13:23:29 UTC) #59

Manish Jethani

July 8, 2017, 12:13 p.m. (2017-07-08 12:13:20 UTC) #60

kzar

July 10, 2017, 12:41 p.m. (2017-07-10 12:41:27 UTC) #61

Manish Jethani

Patch Set 22: Rebase with minor changes Rebased on master, minor improvements, updated unit tests ...

July 20, 2017, 3:48 p.m. (2017-07-20 15:48:59 UTC) #62

kzar

https://codereview.adblockplus.org/29426594/diff/29493609/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29493609/lib/abp2blocklist.js#newcode1101 lib/abp2blocklist.js:1101: return async(Array.from(map.values()).map(mergeableRulesInfo => () => If async always took ...

July 25, 2017, 12:18 p.m. (2017-07-25 12:18:54 UTC) #63

Manish Jethani

Patch Set 23: Map async callees asynchronously Comments inline. I'm also wondering if we should ...

July 28, 2017, 9:17 a.m. (2017-07-28 09:17:37 UTC) #64

Manish Jethani

Patch Set 24: Rebase This is rebased on master. There were two conflicts but nothing ...

July 28, 2017, 1:33 p.m. (2017-07-28 13:33:35 UTC) #66

LGTM

Expand All Messages | Collapse All Messages