Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(421)

Issue 29789581: Issue 6834 - Index filter objects by hash

Can't Edit
Can't Publish+Mail
Start Review
Created:
4 months, 3 weeks ago by Manish Jethani
Modified:
3 weeks, 6 days ago
Reviewers:
CC:
kzar, sergei
Base URL:
https://hg.adblockplus.org/adblockpluscore/
Visibility:
Public.

Description

Based on https://codereview.adblockplus.org/29789578/ Please see #6834 on Trac for the latest description of the issue. The one below is a bit outdated, but it might yet contain some valuable information so I leave it here. This patch is NOT READY FOR REVIEW. --- The first step towards reducing memory consumption is to stop holding on to the filter text. All strings derived from the filter text are still "sliced strings", which means anything we do to "optimize" them right now is only going to make it worse. Once we've lost the references to the filter text itself, then we can start optimizing the strings by "interning" them and reusing the interned strings. For example, take the following two filters: foo.com,bar.com###foo foo.com,bar.com###bar This creates two strings and two "sliced strings" (the latter for the domains part). Our goal is to reuse "foo.com,bar.com###" so that we end up with three strings ("foo.com,bar.com###", "foo", and "bar") and two "concatenated strings" ("foo.com,bar.com###foo" and "foo.com,bar.com###bar"). In order for this to work, we need to lose any references to the original filter text. We should then generate the filter text on demand by concatenating the substrings whenever required (e.g. during serialization). But first we should stop using the filter text as the key for the filter object cache. This patch uses a fast hash function to hash the filter text and then use the digest as the key in the Filter.knownFilters map. By itself this patch only makes the situation worse overall: Filter.fromText now takes ~1.5x as much time to execute, and the overall memory consumption goes up a bit. We'll see any benefits of this change only when we really lose the other references to the filter text (i.e. in the Filter objects) and start optimizing the string allocations based on what I said above. Therefore I don't intend to land this patch anytime soon, but please have a look if you have time. PS: In case you're concerned about hash collisions, I did not find any in EasyList+AA, but of course it's possible, which is why I have taken care of it.

Patch Set 1 #

Patch Set 2 : Add tests #

Patch Set 3 : Rebase #

Patch Set 4 : Store collisions as arrays #

Patch Set 5 : Rebase #

Unified diffs Side-by-side diffs Delta from patch set Stats (+148 lines, -11 lines) Patch
M lib/common.js View 1 chunk +41 lines, -0 lines 0 comments Download
M lib/filterClasses.js View 1 2 3 4 3 chunks +27 lines, -6 lines 0 comments Download
M lib/filterStorage.js View 1 2 3 1 chunk +16 lines, -5 lines 0 comments Download
A test/common.js View 1 1 chunk +64 lines, -0 lines 0 comments Download

Messages

Total messages: 6
Manish Jethani
4 months, 3 weeks ago (2018-05-25 01:51:21 UTC) #1
Manish Jethani
Patch Set 1 I'll file an issue, but see the description here for now.
4 months, 3 weeks ago (2018-05-25 03:14:55 UTC) #2
Manish Jethani
By the way, I think we'll hash domain names eventually (if it works out for ...
4 months, 3 weeks ago (2018-05-25 07:45:24 UTC) #3
Manish Jethani
On 2018/05/25 07:45:24, Manish Jethani wrote: > For example, if you're looking for filters that ...
4 months, 3 weeks ago (2018-05-25 07:49:33 UTC) #4
sergei
Could you please move the description of the idea to an issue? I very like ...
4 months, 3 weeks ago (2018-05-25 08:48:45 UTC) #5
Manish Jethani
2 months, 2 weeks ago (2018-08-03 09:21:17 UTC) #6
On 2018/05/25 08:48:45, sergei wrote:
> Could you please move the description of the idea to an issue?

Thanks, I have filed an issue now:

https://issues.adblockplus.org/ticket/6834

> I very like the idea but I have a lot of questions. Especially I'm concerned
> with the collisions and replacing of already added filters, then with the
exact

I have uploaded another patch now which stores collisions as arrays. In ~82,000
filters I got one collision, for what it's worth.

> "interning" of the strings, and the direction should be thoroughly profiled. I
> would even say perhaps we should rather slightly complicate the knownFilters
by
> adding one or two more layers (basically turn it into a tree) and consider the
> way filters are serialized, but the latter can be a next step of course, e.g.
in
> order to improve performance if we observe that it gives a significant
advantage
> in the memory usage but hurts the performance.

OK, let's talk about it.

But first I want to try some optimizations to this patch to see if I can speed
it up. Please do not review yet. I'll get back.
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld 87257f5