compiled/Filter.cpp - Issue 29333474: Issue 4125 - [emscripten] Convert filter classes to C++

Side by Side Diff: compiled/Filter.cpp

Issue 29333474: Issue 4125 - [emscripten] Convert filter classes to C++ (Closed)

Patch Set: Optimized hash lookup performance a bit Created Feb. 8, 2016, 7:11 p.m.


OLD	NEW
(Empty)
	1 #include "Filter.h"

	2 #include "CommentFilter.h"

	3 #include "InvalidFilter.h"

	4 #include "RegExpFilter.h"

	5 #include "WhitelistFilter.h"

	6 #include "ElemHideBase.h"

	7 #include "ElemHideFilter.h"

	8 #include "ElemHideException.h"

	9 #include "CSSPropertyFilter.h"

	10 #include "StringMap.h"

	11

	12 namespace

	13 {

	14 StringMap<Filter*> knownFilters(8192);

	15

	16 void NormalizeWhitespace(DependentString& text)

	17 {

	18 String::size_type start = 0;

	19 String::size_type end = text.length();

	20

	21 // Remove leading spaces and special characters like line breaks

	22 for (; start < end; start++)

	23 if (text[start] > ' ')

	24 break;

	25

	26 // Now look for invalid characters inside the string

	27 String::size_type pos;

	28 for (pos = start; pos < end; pos++)

	29 if (text[pos] < ' ')

	30 break;

	31

	32 if (pos < end)

	33 {

	34 // Found invalid characters, copy all the valid characters while skipping

	35 // the invalid ones.

	36 String::size_type delta = 1;

	37 for (pos = pos + 1; pos < end; pos++)

	38 {

	39 if (text[pos] < ' ')

	40 delta++;

	41 else

	42 text[pos - delta] = text[pos];

	43 }

	44 end -= delta;

	45 }

	46

	47 // Remove trailing spaces

	48 for (; end > 0; end--)

	49 if (text[end - 1] != ' ')

	50 break;

	51

	52 // Set new string boundaries

	53 text.reset(text, start, end - start);

	54 }

	55 }

	56

	57 Filter::Filter(const String& text)

	58 : ref_counted(), mText(text)
	sergei 2016/02/17 12:54:37
	`ref_counted()` is not necessary here.

	Wladimir Palant 2016/02/18 16:06:41
	I preferred to spell this out explicitly nevertheless. Anyway, removed.
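	A minimal sketch of the point made in this thread: a default-constructible base class is initialized whether or not it appears in the member initializer list. `RefCountedBase`, `ExplicitInit` and `ImplicitInit` are hypothetical stand-ins; the real `ref_counted` class and its initial count are not shown in this patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the ref_counted base class (assumption: it has a
// default constructor that sets up the counter).
struct RefCountedBase
{
  int refCount;
  RefCountedBase() : refCount(1) {}
};

// Spells out the base constructor call, as in the patch under review.
struct ExplicitInit : RefCountedBase
{
  int value;
  explicit ExplicitInit(int v) : RefCountedBase(), value(v) {}
};

// Omits it; the compiler still runs RefCountedBase() before the members.
struct ImplicitInit : RefCountedBase
{
  int value;
  explicit ImplicitInit(int v) : value(v) {}
};
```

	Both variants end up with the same base state, so listing the base explicitly is a matter of style only.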
	59 {

	60 annotate_address(this, "Filter");

	61 }

	62

	63 Filter::~Filter()

	64 {

	65 // TODO: This should be removing from knownFilters

	66 }

	67

	68 OwnedString Filter::Serialize() const

	69 {

	70 OwnedString result(u"[Filter]\ntext="_str);

	71 result.append(mText);

	72 result.append(u'\n');

	73 return std::move(result);
	sergei 2016/02/17 12:54:36
	`std::move` is not necessary here. BTW, in C++ copy elision applies to this return.

	Wladimir Palant 2016/02/18 16:06:42
	Done.
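	To illustrate the remark about copy elision, here is a small sketch that counts copy and move constructions. `Tracked` is a hypothetical stand-in for `OwnedString`, and the two functions only loosely mirror `Filter::Serialize()`. Most compilers also flag the second variant with a pessimizing-move warning.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for OwnedString that counts copies and moves.
struct Tracked
{
  static int copies;
  static int moves;
  std::u16string data;

  explicit Tracked(const char16_t* s) : data(s) {}
  Tracked(const Tracked& other) : data(other.data) { copies++; }
  Tracked(Tracked&& other) : data(std::move(other.data)) { moves++; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

// Returning the local by name: the compiler may build `result` directly in
// the caller's storage (NRVO); if it does not, the object is moved, not copied.
Tracked SerializePlain()
{
  Tracked result(u"[Filter]\n");
  result.data += u"text=example\n";
  return result;
}

// Wrapping the local in std::move forces a move construction and rules out NRVO.
Tracked SerializeWithMove()
{
  Tracked result(u"[Filter]\n");
  result.data += u"text=example\n";
  return std::move(result);
}
```

	Neither version copies the string, but only the plain `return result;` leaves the compiler free to skip the move as well.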
	74 }

	75

	76 Filter* Filter::FromText(DependentString& text)

	77 {

	78 NormalizeWhitespace(text);
	sergei 2016/02/17 12:54:33
	For me personally, the approach here is really confusing in terms of C++ because it can easily lead to something unexpected.

	Eventually we copy `text` to `Filter::mText`, so I would propose to pass `text` as a constant reference, copy it to `OwnedString` (`Filter::mText`) (here we can trim it) and then operate on the `OwnedString` to remove undesired characters in the middle, without modifying the `DependentString`. BTW, instead of passing it as a constant reference, it can even be an r-value `OwnedString&&`. I guess it won't cause a significant overhead.

	Having that, I guess, we will be able to move non-constant methods from `String` to the `OwnedString` class and remove `DependentString(String& str,...)` and `DependentString(value_type* buf, size_type len)` (constructors which accept non-constant buffers).

	What do you think about it?

	Wladimir Palant 2016/02/18 16:06:43
	I've been through multiple iterations here, and the current version is the one that saves significant overhead. While messing with the incoming buffers isn't the best coding style, we can do so "for free" - they are allocated on the stack and will be automatically released anyway as soon as we return. Once we allocate our own buffer, however, we had better get everything right - reallocating will be expensive.

	For example, how do we trim leading whitespace with an owned buffer? We have the choice between moving the content and keeping two pointers around (one for the buffer we need to release, another one for the start of the content). DependentString allows doing this very efficiently.

	sergei 2016/02/22 12:45:36
	> Once we allocate our own buffer however we better make everything right - reallocating will be expensive.

	Clear. BTW, can we avoid copying strings here at all? In the extension we read filters from a file, and we can read the file into a contiguous buffer, then create `DependentString`s which use that buffer and even modify it in place (it's our buffer on the heap). It looks fragile but it should be very efficient. Of course we should do something to be able to add and remove filters.

	> For example, how do we trim leading whitespace with an owned buffer? We have the choice between moving the content and keeping two pointers around (one for the buffer we need to release, another one for start of the content). DependentString allows doing this very efficiently.

	We can do it in a similar way to what we do now. Currently we find `start` and `end` of the expected trimmed and normalized string in `NormalizeWhitespace`, then call `DependentString::reset(text, start, end - start);` and then copy the piece of the buffer determined by the fields of `DependentString` into `Filter::mText`. I proposed to find the offset (`start`) and the length (`end - start`) of the expected trimmed string and then copy it into `Filter::mText` without modifying the `DependentString` and its buffer, and then remove undesired characters in the middle of the string. The latter step can leave an unused tail in the `OwnedString`; how often does that actually happen? To do that we would need to add `OwnedString::OwnedString(const String& src, size_type offset, size_type length)` or use `OwnedString(DependentString(text, offset, length))`.

	Anyway, since we know what is happening on the caller side, I would propose to leave it as is because it's more descriptive now.

	Wladimir Palant 2016/02/23 12:37:20
	> BTW, can we avoid copying of strings here at all? In the extension we read filters from a file, and we can read the file into a contiguous buffer, then we can create `DependentString`s which use that buffer and even modify it in place (it's our buffer on the heap). It looks fragile but it should be very efficient.

	Reading that file will eventually move into C++, so at that point we could try tricks like that. There is an encoding issue however: the file uses UTF-8, which we will have to decode anyway. And even without it, it would merely lead to memory fragmentation - that file contains way more info than the filter strings, and we wouldn't be able to reuse that memory easily. I think a fixed buffer (e.g. 128 kB) which we would read chunks of the file into, parse and then reuse will be way more memory-efficient.

	> I proposed to find offset (`start`) and the length (`end - start`) of the expected trimmed string and then copy it into `Filter::mText` without modifying of `DependentString` and its buffer,

	I had this implemented like that, to some degree. The main problem is: we have to perform the normalization before creating the filter, in order to do the lookup in knownFilters. We only create the filter if that lookup yields no result. I'd rather not create an owned buffer only to free it again because we didn't need it.

	sergei 2016/02/23 15:07:24
	Acknowledged.
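	The trade-off discussed in this thread can be sketched roughly as follows: normalize inside the caller's buffer through a non-owning view, look the result up, and allocate an owned string only on a miss. Everything here is an assumption made for illustration: `View` loosely imitates `DependentString`, `known` and `Intern` stand in for `knownFilters` and `Filter::FromText`, and a `std::map` with a transparent comparator (C++14) replaces the project's `StringMap`. Unlike the loop in the patch, the trailing-space loop below stops at `start`, so an all-space input yields an empty view.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical non-owning view over a caller-provided buffer.
struct View
{
  char16_t* data;
  std::size_t length;
};

// Trims the text and drops control characters without allocating.
void NormalizeInPlace(View& text)
{
  std::size_t start = 0;
  std::size_t end = text.length;

  // Skip leading spaces and control characters.
  while (start < end && text.data[start] <= u' ')
    start++;

  // Compact the remainder, dropping characters below U+0020.
  std::size_t out = start;
  for (std::size_t pos = start; pos < end; pos++)
    if (text.data[pos] >= u' ')
      text.data[out++] = text.data[pos];
  end = out;

  // Drop trailing spaces.
  while (end > start && text.data[end - 1] == u' ')
    end--;

  // Trimming the front only moves the start of the view; nothing is copied.
  text.data += start;
  text.length = end - start;
}

// Transparent comparator: lets the map be searched with a View directly,
// without building a temporary std::u16string.
struct ViewLess
{
  using is_transparent = void;

  static int Compare(const char16_t* a, std::size_t alen,
                     const char16_t* b, std::size_t blen)
  {
    int r = std::char_traits<char16_t>::compare(a, b, alen < blen ? alen : blen);
    if (r != 0)
      return r;
    return alen < blen ? -1 : (alen > blen ? 1 : 0);
  }
  bool operator()(const std::u16string& a, const std::u16string& b) const
  { return a < b; }
  bool operator()(const std::u16string& a, const View& b) const
  { return Compare(a.data(), a.size(), b.data, b.length) < 0; }
  bool operator()(const View& a, const std::u16string& b) const
  { return Compare(a.data, a.length, b.data(), b.size()) < 0; }
};

// Hypothetical registry: normalized text -> id of the existing entry.
std::map<std::u16string, int, ViewLess> known;

// Returns the id for the text; an owned string is allocated only on a miss.
int Intern(View text)
{
  NormalizeInPlace(text);
  auto it = known.find(text);
  if (it != known.end())
    return it->second;
  int id = static_cast<int>(known.size());
  known.emplace(std::u16string(text.data, text.length), id);
  return id;
}
```

	Calling `Intern` twice with buffers that normalize to the same text returns the same id and leaves a single entry in `known`, which is the "don't allocate only to free it again" behaviour argued for above.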
	79 if (text.empty())

	80 return nullptr;

	81

	82 // Parsing also normalizes the filter text, so it has to be done before the

	83 // lookup in knownFilters.

	84 union

	85 {

	86 RegExpFilterData regexp;

	87 ElemHideData elemhide;

	88 } data;

	89 OwnedString error;

	90

	91 Filter::Type type = CommentFilter::Parse(text);

	92 if (type == Filter::Type::UNKNOWN)

	93 type = ElemHideBase::Parse(text, data.elemhide);

	94 if (type == Filter::Type::UNKNOWN)

	95 type = RegExpFilter::Parse(text, error, data.regexp);

	96

	97 FilterPtr filter(GetKnownFilter(text));

	98 if (filter)

	99 return filter;

	100

	101 switch (type)

	102 {

	103 case Filter::Type::COMMENT:

	104 filter = new CommentFilter(text);

	105 break;

	106 case Filter::Type::INVALID:

	107 filter = new InvalidFilter(text, error);

	108 break;

	109 case Filter::Type::BLOCKING:

	110 filter = new RegExpFilter(text, data.regexp);

	111 break;

	112 case Filter::Type::WHITELIST:

	113 filter = new WhitelistFilter(text, data.regexp);

	114 break;

	115 case Filter::Type::ELEMHIDE:

	116 filter = new ElemHideFilter(text, data.elemhide);

	117 break;

	118 case Filter::Type::ELEMHIDEEXCEPTION:

	119 filter = new ElemHideException(text, data.elemhide);

	120 break;

	121 case Filter::Type::CSSPROPERTY:

	122 filter = new CSSPropertyFilter(text, data.elemhide);

	123 if (reinterpret_cast<CSSPropertyFilter*>(filter.get())->IsGeneric())
	sergei 2016/02/17 12:54:34
	It's better to use `static_cast` here.

	Wladimir Palant 2016/02/18 16:06:44
	Done.
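	A short sketch of why `static_cast` is the safer choice for a base-to-derived conversion. The hierarchy below is hypothetical and deliberately uses two bases so that the difference becomes visible; the filter classes in this patch may well use single inheritance, where both casts would typically yield the same address.

```cpp
#include <cassert>

// Hypothetical hierarchy: the Second subobject sits at an offset inside Derived.
struct First { int a = 1; virtual ~First() {} };
struct Second { int b = 2; virtual ~Second() {} };
struct Derived : First, Second
{
  int c = 3;
  bool IsGeneric() const { return c == 3; }
};

// static_cast applies the base-to-derived pointer adjustment.
Derived* Downcast(Second* base)
{
  return static_cast<Derived*>(base);
}

// reinterpret_cast reuses the address unchanged. The result is only compared
// here, never dereferenced, because it may not point at the Derived object.
bool ReinterpretMatches(Derived& object)
{
  Second* base = &object;
  return reinterpret_cast<Derived*>(base) == &object;
}
```

	`Downcast(&d)` always gives back `&d`, whereas `ReinterpretMatches(d)` is typically false on mainstream ABIs for a layout like this one.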
	124 filter = new InvalidFilter(text,

	125 u"No active domain specified for CSS property filter"_str);

	126 break;

	127 default:

	128 // This should never happen but just in case

	129 return nullptr;

	130 }

	131

	132 enter_context("Adding to known filters");

	133 knownFilters[filter->mText] = filter.get();

	134 exit_context();

	135

	136 // TODO: We intentionally leak the filter here - currently it won't be used

	137 // for anything and would be deleted immediately.

	138 filter->AddRef();

	139

	140 return filter;

	141 }

	142

	143 Filter* Filter::GetKnownFilter(const String& text)

	144 {

	145 auto it = knownFilters.find(text);

	146 if (it != knownFilters.end())

	147 return it->second;

	148 else

	149 return nullptr;

	150 }
