Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(755)

Issue 29426594: Issue 3673 - Merge closely matching rules (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
5 months, 3 weeks ago by Manish Jethani
Modified:
2 months, 3 weeks ago
Reviewers:
kzar, Sebastian Noack
CC:
Felix Dahlke
Base URL:
https://hg.adblockplus.org/abp2blocklist
Visibility:
Public.

Description

Issue 3673 - Merge closely matching rules SCOPE Our goal is to merge two or more rules into a single rule and to do this for as many rules as is feasible given limited developer time and limitations of runtime performance. We do not aim to merge all possible rules that can be merged. We may want to do this in the future [1], but it is outside the scope of this issue. HIGH-LEVEL ALGORITHM What follows is a high-level description of the algorithm. Two objects are said to be mergeable by property X if they are otherwise identical except for the value of property X. For example, {X: 1, Y: 1} and {X: 2, Y: 1} can be merged into {X: [1, 2], Y: 1}, therefore they are "mergeable by X". {X: 1, Y: 1} and {X: 1, Y: 2} on the other hand are not mergeable by X, but they are mergeable by Y. {X: 1, Y: 1} and {X: 2, Y: 2} are neither mergeable by X nor mergeable by Y. They may or may not be "mergeable by X and Y", depending on the semantics of X and Y [2], but we do not handle this case. The rule set is divided into logical groups: CSS rules, CSS exception rules, blocking rules, and blocking exception rules. While the order of the logical groups is significant, the order of the rules within each logical group is not. We can only safely merge rules that belong to the same logical group. We start by dividing each logical group further into groups of rules mergeable by URL filter (rule.trigger["url-filter"]). We then try to merge as many rules within each group as we "feasibly" can using a simple algorithm, which is described below. Next we take the output of the above and divide the rules again into groups of rules mergeable by resource type (rule.trigger["resource-type"]). We then merge all the rules within each group into a single rule. Since the resource type is a list, merging a group of two or more rules is as simple as making a union of all the resource types in the group. Next we take the output of the above and do the same thing for the "if-domain" property (rule.trigger["if-domain"]), the value of which is also a mergeable list. The output of the above is the merged logical group. The final rule set is a concatenation of all the merged logical groups. MERGING BY URL FILTER The algorithm for merging by URL filter is as follows. For each pair combination of rules in the group [3], we check if the URL filters are a "close match". Two URL filters are a close match either if they differ only in one character (i.e. a Levenshtein distance of 1) or if it's possible to convert one into the other by deleting or inserting one and only one contiguous range of characters from one into the other. For example, "ad" and "ads", "adv" and "ads", and "adv" and "advertisement" are all close matches. Next we sort the group by the number of close matches, such that a rule with N close matches appears before a rule with N - 1 close matches in the group. We then iterate over the group and try to merge each rule with as many of its close matches as we can. For example, the rules with URL filters "ad", "ads", and "adv" are all merged into one with the URL filter "ad[sv]?". Note that the URL filter "adv" is a close match with "advertisement" but also with "ad" and "ads". In this case we merge it into "ad[vs]?" and not "adv(ertisement)?", since the former combines more rules into one. LIMITATIONS & AREAS FOR IMPROVEMENT We only merge with one set of parentheses. For example, "ad" and "advertisement" will be merged into "ad(vertisement)?", but "ad", "advert", and "advertisement" will not be merged into "ad(vert(isement)?)?" [4]. Further, we never mix parentheses with other kinds of merging. For example, "ads", "adv", and "adsserver", and "advserver" will not be merged into "ad[sv](server)?" [4]. Also, since URL filters are regular expressions to begin with, if the delta between two URL filters contains a regular expression special character like "\", "+", "?", and so on, the two are not considered to be a close match [4]. RESULTS Below are the results of running the merge function on the various subscription options available in Adblock Plus for Safari on macOS. * EasyList+AA Rule merging took 918ms Before: 49,261 rules After: 43,617 rules * EasyList Germany+EasyList+AA Rule merging took 960ms Before: 55,646 rules After: 49,674 rules * EasyList China+EasyList+AA Rule merging took 968ms Before: 57,725 rules After: 51,912 rules [5] * ABPindo+EasyList+AA Rule merging took 861ms Before: 50,250 rules After: 44,561 rules * Bulgarian list+EasyList+AA Rule merging took 877ms Before: 50,446 rules After: 44,777 rules * EasyList Czech and Slovak+EasyList+AA Rule merging took 864ms Before: 50,188 rules After: 44,520 rules * EasyList Dutch+EasyList+AA Rule merging took 882ms Before: 50,399 rules After: 44,710 rules * EasyList Hebrew+EasyList+AA Rule merging took 860ms Before: 50,069 rules After: 44,342 rules * EasyList Italy+EasyList+AA Rule merging took 934ms Before: 52,148 rules After: 46,404 rules * EasyList Lithuania+EasyList+AA Rule merging took 880ms Before: 50,262 rules After: 44,599 rules * Latvian List+EasyList+AA Rule merging took 845ms Before: 49,596 rules After: 43,937 rules * Liste AR+Liste FR+EasyList+AA Rule merging took 977ms Before: 60,398 rules After: 53,769 rules [5] * Liste FR+EasyList+AA Rule merging took 955ms Before: 58,875 rules After: 52,352 rules [5] * ROList+EasyList+AA Rule merging took 858ms Before: 49,556 rules After: 43,897 rules * RuAdList+EasyList+AA Rule merging took 1,070ms Before: 62,919 rules After: 56,663 rules [5] Footnotes: 1: Owing to the relative simplicity of the algorithm, there may still be mergeable rules that remain unmerged. A simple way to address this would be to run the algorithm recursively until it is no longer able to merge any more rules (essentially brute force). We could also come up with something cleverer. Based on tests with real data, it appears not worth trying. 2: For content blocker rules, there are no two properties that can be merged together. If "if-domain" and "unless-domain" were allowed to be specified together, two rules could be merged by both properties, but this is not the case. 3: This is necessarily O(N^2). We can reduce it to O(N) by specifying a "heuristic range" within which to look for close matches. In fact, we have tried with a heuristic range of 10, and it appears to yield good results. 4: This is an area for improvement. 5: This is still above the 50,000 rules limit.

Patch Set 1 #

Patch Set 2 : Ignore special characters #

Total comments: 1

Patch Set 3 : Look for matches only within a certain range #

Patch Set 4 : Fix bugs and add unit tests #

Patch Set 5 : Add advanced merge support #

Total comments: 27

Patch Set 6 : Improved matching algorithm #

Patch Set 7 : Make it work in Safari #

Patch Set 8 : Simplify and optimize #

Patch Set 9 : Simplify closeMatch function #

Patch Set 10 : Merge rules by resource-type and if-domain #

Patch Set 11 : Clean up #

Patch Set 12 : Filter out redundant rules before merging #

Total comments: 27

Patch Set 13 : Move merge options to constructor and update comments #

Total comments: 6

Patch Set 14 : Use for..of instead of forEach where possible #

Total comments: 2

Patch Set 15 : Change options object into a single boolean option #

Patch Set 16 : Merge rules in groups #

Total comments: 7

Patch Set 17 : Make generateRules asynchronous #

Patch Set 18 : Make more code asynchronous #

Patch Set 19 : Implement merge options "all", "auto", and "none" #

Patch Set 20 : Remove redundant async #

Total comments: 6

Patch Set 21 : Split variable declaration statement #

Total comments: 3

Patch Set 22 : Rebase with minor changes #

Total comments: 4

Patch Set 23 : Map async callees asynchronously #

Patch Set 24 : Rebase #

Unified diffs Side-by-side diffs Delta from patch set Stats (+990 lines, -284 lines) Patch
M abp2blocklist.js View 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 chunk +15 lines, -14 lines 0 comments Download
M lib/abp2blocklist.js View 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 4 chunks +572 lines, -10 lines 0 comments Download
M test/abp2blocklist.js View 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 1 chunk +403 lines, -260 lines 0 comments Download

Messages

Total messages: 67
Manish Jethani
5 months, 3 weeks ago (2017-05-01 19:37:44 UTC) #1
Manish Jethani
Patch Set 1
5 months, 3 weeks ago (2017-05-01 19:39:40 UTC) #2
kzar
Wow, I'm impressed you came up with something so fast. Could you add some unit ...
5 months, 3 weeks ago (2017-05-01 19:52:55 UTC) #3
kzar
Also could you run the script on EasyList before and after your changes? I'd be ...
5 months, 3 weeks ago (2017-05-02 09:24:46 UTC) #4
Manish Jethani
On 2017/05/02 09:24:46, kzar wrote: > Also could you run the script on EasyList before ...
5 months, 3 weeks ago (2017-05-02 15:38:25 UTC) #5
Manish Jethani
Patch Set 2: Ignore special characters It's a minor change, but it seemed like we ...
5 months, 3 weeks ago (2017-05-02 15:39:53 UTC) #6
Manish Jethani
Patch Set 3: Look for matches only within a certain range Now we're talking. With ...
5 months, 3 weeks ago (2017-05-02 16:53:09 UTC) #7
Manish Jethani
On 2017/05/02 09:24:46, kzar wrote: > Another thing to check, have you made sure the ...
5 months, 3 weeks ago (2017-05-02 19:02:02 UTC) #8
Manish Jethani
Patch Set 4: Fix bugs and add unit tests Fixed some bugs so that there ...
5 months, 3 weeks ago (2017-05-03 00:31:19 UTC) #9
Manish Jethani
Patch Set 5: Add advanced merge support This is beyond the scope of the issue, ...
5 months, 3 weeks ago (2017-05-03 04:46:54 UTC) #10
kzar
> Also, the added execution is easily offset by using a more efficient > implementation ...
5 months, 2 weeks ago (2017-05-03 10:22:18 UTC) #11
kzar
On 2017/05/03 10:22:18, kzar wrote: > - Accuracy of blocking. Since Adblock Plus filters don't ...
5 months, 2 weeks ago (2017-05-03 10:27:49 UTC) #12
kzar
Just checking in IRC and we do still care about supporting Adblock Plus for Safari ...
5 months, 2 weeks ago (2017-05-03 10:47:24 UTC) #13
kzar
I need more time to get my head around what you're doing here but I've ...
5 months, 2 weeks ago (2017-05-03 11:17:24 UTC) #14
Manish Jethani
On 2017/05/03 10:27:49, kzar wrote: > If you're interested here's[1] the repository for those manually ...
5 months, 2 weeks ago (2017-05-03 13:57:20 UTC) #15
Manish Jethani
Comments inline. https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode369 lib/abp2blocklist.js:369: function closeMatch(s, t, {multi = false} = ...
5 months, 2 weeks ago (2017-05-03 14:41:54 UTC) #16
kzar
https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode369 lib/abp2blocklist.js:369: function closeMatch(s, t, {multi = false} = {}) On ...
5 months, 2 weeks ago (2017-05-03 15:19:04 UTC) #17
Manish Jethani
Patch Set 6: Improved matching algorithm Now instead of trying to do both kinds of ...
5 months, 2 weeks ago (2017-05-04 02:49:32 UTC) #18
Manish Jethani
Patch Set 7: Make it work in Safari OK, so I got this working in ...
5 months, 2 weeks ago (2017-05-05 04:51:32 UTC) #19
Manish Jethani
Patch Set 7: Simplify and optimize The main change is that we're now processing rules ...
5 months, 2 weeks ago (2017-05-05 22:10:17 UTC) #20
Manish Jethani
On 2017/05/05 22:10:17, Manish Jethani wrote: > Patch Set 7: Simplify and optimize Sorry, that's ...
5 months, 2 weeks ago (2017-05-05 22:10:54 UTC) #21
Manish Jethani
On 2017/05/04 02:49:32, Manish Jethani wrote: > On 2017/05/03 15:19:04, kzar wrote: > > I ...
5 months, 2 weeks ago (2017-05-05 22:32:10 UTC) #22
Manish Jethani
Patch Set 10: Merge rules by resource-type and if-domain It turns out we had some ...
5 months, 2 weeks ago (2017-05-06 17:24:41 UTC) #23
Manish Jethani
Updated description with something of a specification. Updated title to reflect expanded scope of the ...
5 months, 2 weeks ago (2017-05-07 17:12:10 UTC) #24
Manish Jethani
Patch Set 12: Filter out redundant rules before merging If three rules that are otherwise ...
5 months, 2 weeks ago (2017-05-07 22:50:37 UTC) #25
Manish Jethani
On 2017/05/07 22:50:37, Manish Jethani wrote: > "^https?://.*/ad", "^https?://.*/ads", and "^https?://.*/advertisement", where > previously we ...
5 months, 2 weeks ago (2017-05-07 23:02:13 UTC) #26
kzar
> Below are the results of running the merge function on the > various subscription ...
5 months, 2 weeks ago (2017-05-08 08:13:04 UTC) #27
Manish Jethani
Comments inline. I acknowledge the rest of your comments and will make the suggested changes ...
5 months, 2 weeks ago (2017-05-08 14:04:00 UTC) #28
Manish Jethani
Patch Set 13: Move merge options to constructor and update comments Addressed all review comments. ...
5 months, 2 weeks ago (2017-05-08 23:12:49 UTC) #29
Manish Jethani
On 2017/05/08 08:13:04, kzar wrote: > Thanks for the detailed explanation of what's going on ...
5 months, 2 weeks ago (2017-05-08 23:29:56 UTC) #30
kzar
https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode476 lib/abp2blocklist.js:476: rulesInfo[index] = {rule}; On 2017/05/08 14:03:58, Manish Jethani wrote: ...
5 months, 2 weeks ago (2017-05-09 10:05:48 UTC) #31
kzar
On 2017/05/08 23:29:56, Manish Jethani wrote: > On 2017/05/08 08:13:04, kzar wrote: > > > ...
5 months, 2 weeks ago (2017-05-09 10:11:58 UTC) #32
Manish Jethani
Patch Set 14: Use for..of instead of forEach where possible https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29427618/lib/abp2blocklist.js#newcode476 ...
5 months, 2 weeks ago (2017-05-09 15:52:47 UTC) #33
kzar
https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js#newcode947 lib/abp2blocklist.js:947: rules = mergeRules(rules, {exhaustive: this.options.exhaustiveMerge}); On 2017/05/09 15:52:47, Manish ...
5 months, 2 weeks ago (2017-05-09 16:50:59 UTC) #34
kzar
I wonder if there should be the merge option at all? We don't want to ...
5 months, 2 weeks ago (2017-05-09 16:54:46 UTC) #35
Manish Jethani
Patch Set 15: Change options object into a single boolean option https://codereview.adblockplus.org/29426594/diff/29433655/lib/abp2blocklist.js File lib/abp2blocklist.js (right): ...
5 months, 2 weeks ago (2017-05-09 17:32:12 UTC) #36
Manish Jethani
On 2017/05/09 16:54:46, kzar wrote: > I wonder if there should be the merge option ...
5 months, 2 weeks ago (2017-05-09 17:36:29 UTC) #37
kzar
LGTM assuming it still works for Safari 9. By the way are you definitely hitting ...
5 months, 1 week ago (2017-05-10 08:40:40 UTC) #38
Manish Jethani
On 2017/05/10 08:40:40, kzar wrote: > LGTM assuming it still works for Safari 9. Now ...
5 months, 1 week ago (2017-05-10 11:49:25 UTC) #39
Manish Jethani
On 2017/05/10 11:49:25, Manish Jethani wrote: > I'll file the issue for integrating this into ...
5 months, 1 week ago (2017-05-11 20:00:42 UTC) #40
Manish Jethani
I just realized that this might be broken because the order of the "ignore-previous-rules" type ...
5 months ago (2017-05-18 23:33:46 UTC) #41
kzar
On 2017/05/18 23:33:46, Manish Jethani wrote: > I just realized that this might be broken ...
5 months ago (2017-05-19 12:13:38 UTC) #42
Manish Jethani
Patch Set 16: Merge rules in groups Generate rules in groups and then merge rules ...
5 months ago (2017-05-20 21:09:51 UTC) #43
Manish Jethani
On 2017/05/20 21:09:51, Manish Jethani wrote: > Patch Set 16: Merge rules in groups > ...
5 months ago (2017-05-20 21:59:05 UTC) #44
Sebastian Noack
https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); Any ...
5 months ago (2017-05-21 21:07:29 UTC) #45
kzar
https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); On ...
5 months ago (2017-05-22 08:47:01 UTC) #46
Manish Jethani
https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); On ...
5 months ago (2017-05-22 09:28:29 UTC) #47
Sebastian Noack
Also, as discussed on IRC, https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ...
5 months ago (2017-05-22 09:39:25 UTC) #48
Sebastian Noack
https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); On ...
5 months ago (2017-05-22 09:54:58 UTC) #49
Manish Jethani
https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); On ...
5 months ago (2017-05-22 17:39:52 UTC) #50
Sebastian Noack
https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js File abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29443583/abp2blocklist.js#newcode25 abp2blocklist.js:25: var blockerList = new ContentBlockerList({merge: true, exhaustiveMerge: true}); On ...
5 months ago (2017-05-23 11:59:10 UTC) #51
Manish Jethani
Patch Set 17: Make generateRules asynchronous I've made the generateRules function and most of the ...
5 months ago (2017-05-23 16:25:29 UTC) #52
Sebastian Noack
The main problem with any delay here is that after starting up Safari (while potentially ...
5 months ago (2017-05-23 16:58:59 UTC) #53
Manish Jethani
Patch Set 18: Make more code asynchronous Now I've managed to narrow the gap between ...
5 months ago (2017-05-24 00:19:06 UTC) #54
Manish Jethani
Patch Set 19: Implement merge options "all", "auto", and "none" The merge option can now ...
5 months ago (2017-05-24 01:31:08 UTC) #55
Manish Jethani
Patch Set 20: Remove redundant async
5 months ago (2017-05-24 02:48:46 UTC) #56
kzar
https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js#newcode465 lib/abp2blocklist.js:465: let i = 0, j = 0; Nit: Please ...
4 months, 3 weeks ago (2017-05-29 09:49:55 UTC) #57
Manish Jethani
Patch Set 21: Split variable declaration statement https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29446577/lib/abp2blocklist.js#newcode465 lib/abp2blocklist.js:465: let i ...
4 months, 3 weeks ago (2017-05-31 06:43:08 UTC) #58
kzar
https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js#newcode1033 lib/abp2blocklist.js:1033: console.error("Rule merging took " + (Date.now() - t) + ...
3 months, 2 weeks ago (2017-07-07 13:23:29 UTC) #59
Manish Jethani
https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js#newcode1033 lib/abp2blocklist.js:1033: console.error("Rule merging took " + (Date.now() - t) + ...
3 months, 2 weeks ago (2017-07-08 12:13:20 UTC) #60
kzar
https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29452555/lib/abp2blocklist.js#newcode1033 lib/abp2blocklist.js:1033: console.error("Rule merging took " + (Date.now() - t) + ...
3 months, 1 week ago (2017-07-10 12:41:27 UTC) #61
Manish Jethani
Patch Set 22: Rebase with minor changes Rebased on master, minor improvements, updated unit tests ...
3 months ago (2017-07-20 15:48:59 UTC) #62
kzar
https://codereview.adblockplus.org/29426594/diff/29493609/lib/abp2blocklist.js File lib/abp2blocklist.js (right): https://codereview.adblockplus.org/29426594/diff/29493609/lib/abp2blocklist.js#newcode1101 lib/abp2blocklist.js:1101: return async(Array.from(map.values()).map(mergeableRulesInfo => () => If async always took ...
2 months, 3 weeks ago (2017-07-25 12:18:54 UTC) #63
Manish Jethani
Patch Set 23: Map async callees asynchronously Comments inline. I'm also wondering if we should ...
2 months, 3 weeks ago (2017-07-28 09:17:37 UTC) #64
kzar
Nice, LGTM. Go ahead and push when you're ready.
2 months, 3 weeks ago (2017-07-28 10:48:27 UTC) #65
Manish Jethani
Patch Set 24: Rebase This is rebased on master. There were two conflicts but nothing ...
2 months, 3 weeks ago (2017-07-28 13:33:35 UTC) #66
kzar
2 months, 3 weeks ago (2017-07-28 13:52:00 UTC) #67
LGTM
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld 87257f5