|
|
Created:
Sept. 26, 2017, 9:34 p.m. by hub Modified:
Oct. 11, 2017, 9:55 a.m. Base URL:
https://hg.adblockplus.org/adblockpluscore/ Visibility:
Public. |
DescriptionIssue 5141 - Convert filter match to C++
Patch Set 1 #
Total comments: 14
Patch Set 2 : Cleanup. Fixed the bindings to export what we actually need. #Patch Set 3 : Reworked the code, added test, they fail. #Patch Set 4 : Some more cleanup #
Total comments: 91
Patch Set 5 : Addressed most of the comment. Fixed some issues. #
Total comments: 3
Patch Set 6 : Fixed many issues. One test left out. #
Total comments: 20
MessagesTotal messages: 15
This is a rough conversion of the JavaScript code. I didn't change the logic including for matchesAny, and there is some ugliness. I don't mind getting some feedback, there are thing I'm not sure I did right, but this is not "ready for review" per see. Also I realize there don't seem to be proper test coverage since the test did pass the first time. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/String.h File compiled/String.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/String.h#n... compiled/String.h:202: bool match(int id, ReMatchResults*) const; String API that mimic JS string API. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/String.h#n... compiled/String.h:419: class ReMatchResults : public ref_counted This it the hack to be able to get the array of string that `String::prototype::match()` returns, out of library.js. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Fil... File compiled/filter/Filter.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Fil... compiled/filter/Filter.h:61: static const DependentString optionsRegExp; these RegExp where in Filter in the JS code. Maybe I should move/add them to `Matcher.cpp` instead. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.cpp:25: StringMap<Filter*> mResultCache; I wanted to use FilterPtr in there, but it didn't want to compile. I'll look at what it is, it was a mutability issue. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.cpp:45: filter.second->ReleaseRef(); See above: if we could have the FilterPtr as the data type for mResultCache, this wouldn't be needed :-/ https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.cpp:284: auto activeFilter = static_cast<ActiveFilter*>(filter.get()); This is done without checking. And it is ugly. And given that we have RTTI disabled, we can't even dynamic_cast<>. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.cpp:288: auto reFilter = static_cast<RegExpFilter*>(activeFilter); SImilarly as above: this is unchecked. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... File compiled/filter/Matcher.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.h:27: class MatcherBase : public ref_counted the base class exposing the interface. Regular matcher and CombinedMatcher implements it much like it is done in JavaScript. But this is hidden from the interface clients. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.h:54: class Matcher : public MatcherBase now that I re-read this I ponder whether this `Matcher` should be moved to the cpp file. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/library.h File compiled/library.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/library.h#... compiled/library.h:41: int GenerateRegExp(const String& regexp, bool matchCase, bool global); I needed to add support for the `g` flag in RegExp.
Slightly updated the patch. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... File compiled/filter/Matcher.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.h:54: class Matcher : public MatcherBase On 2017/09/26 21:49:00, hub wrote: > now that I re-read this I ponder whether this `Matcher` should be moved to the > cpp file. We need it in a public header as notification.js want to instantiate directly.
https://codereview.adblockplus.org/29556737/diff/29559667/compiled/bindings/m... File compiled/bindings/main.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/bindings/m... compiled/bindings/main.cpp:145: .function("findKeyword", &CombinedMatcher::FindKeyword); the function bindings are mostly needed for the tests. https://codereview.adblockplus.org/29556737/diff/29559667/test/matcher.js File test/matcher.js (right): https://codereview.adblockplus.org/29556737/diff/29559667/test/matcher.js#new... test/matcher.js:17: This test wasn't in the emscripten branch. This is a verbatim copy of the one in master.
https://codereview.adblockplus.org/29556737/diff/29556738/compiled/String.h File compiled/String.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/String.h#n... compiled/String.h:419: class ReMatchResults : public ref_counted On 2017/09/26 21:48:59, hub wrote: > This it the hack to be able to get the array of string that > `String::prototype::match()` returns, out of library.js. I thin it's fine, I had experienced a demand in something like this too, so I think we can accumulate some cases and then will see how to deal with it. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Fil... File compiled/filter/Filter.h (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Fil... compiled/filter/Filter.h:61: static const DependentString optionsRegExp; On 2017/09/26 21:48:59, hub wrote: > these RegExp where in Filter in the JS code. Maybe I should move/add them to > `Matcher.cpp` instead. If they are not used anywhere else, then I would like to have them in the matcher. https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29556738/compiled/filter/Mat... compiled/filter/Matcher.cpp:284: auto activeFilter = static_cast<ActiveFilter*>(filter.get()); On 2017/09/26 21:49:00, hub wrote: > This is done without checking. And it is ugly. And given that we have RTTI > disabled, we can't even dynamic_cast<>. Although we don't pass other filters here, what do you think about having some `Filter::As` (see Subscriptions) method down-casting with the checking of its type and returning nullptr if the type is incorrect. Here we could have an assert that it should never be a nullptr. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.cpp File compiled/String.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.cpp... compiled/String.cpp:26: return OwnedString(&mBuf[pos], len); Will it be safer to return DependentString(*this, pos, len)? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.h File compiled/String.h (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.h#n... compiled/String.h:202: bool match(int id, ReMatchResults*) const; It might be an overkill but this function does not support nullptr for the results, so what about using a reference instead of the pointer? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.h File compiled/StringMap.h (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.... compiled/StringMap.h:243: resize(0); I'm not sure that `resize` expects a value, lower than the previous size, and `find_bucket` does not expected `mBucketCount` to have zero value. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:24: OwnedString CombinedMatcher::FindKeyword(const FilterPtr& filter) It's not important but still it would be better to keep the order more or less similar to the order in the headers, at least for classes, firstly Matcher and then CombinedMatcher. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:24: OwnedString CombinedMatcher::FindKeyword(const FilterPtr& filter) the method should be const. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:26: if (filter->mType == Filter::Type::WHITELIST) What do you think about having an inline function Matcher& CombinedMatcher::GetMatcher(const Filter& filter) { return filter.mType == Filter::Type::WHITELIST ? mWhitelist : mBlacklist; } and using of it everywhere below? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:36: void CombinedMatcher::Add(const FilterPtr& filter) Should the argument be `Filter&`? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:46: void CombinedMatcher::Remove(const FilterPtr& filter) Should the argument be `const Filter&`? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:63: bool CombinedMatcher::HasFilter(const FilterPtr& filter) const Should the argument be `const Filter&`? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:70: const String& CombinedMatcher::GetKeywordForFilter(const FilterPtr& filter) Should the argument be `const Filter&`? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:70: const String& CombinedMatcher::GetKeywordForFilter(const FilterPtr& filter) the method should be const. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:77: FilterPtr CombinedMatcher::MatchesAnyInternal(const String& location, the method should be const if it's possible. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:84: auto match_re_id = GenerateRegExp(u"[a-z0-9%]{3,}"_str, true, true); It should be in anonymous namespace, otherwise a new regexp is generated on each call of the method, isn't it? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:85: text.match(match_re_id, &reResult); Although it seems it does work here, I think for present we should avoid instantiation of ref_counted on the stack, therefore I would recommend to use intrusive_ptr<ReMatchResults> and pass a pointer to the object on heap to text.match. However, I'm fine if a third opinion is to keep it on the stack in this particular case. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:93: if (mWhitelist.mFilterByKeyword.find(substr)) It's already changed in the master, do you mind to change it here accordingly? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:112: const String& sitekey, bool specificOnly) The method should be const if it's possible. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:143: return result.get(); It would be better to `return result.release();`. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:155: OwnedString Matcher::FindKeyword(const FilterPtr& filter) Should the argument be `const Filter&`? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:155: OwnedString Matcher::FindKeyword(const FilterPtr& filter) should it be a const method? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:159: auto re_id = GenerateRegExp(DependentString(regexpRegExp), true, false); It and all other regexps below should be in the anonymous namespace because we should not create them on each call of the current method. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:166: if (index != -1) It would be better to use String::npos than -1. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:170: if (text[0] == '@' && text[1] == '@') Firstly we should check the length of the `text`. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:188: auto count = (hash.find(candidate) ? hash[candidate].size() : 0); Basically braces are not needed here. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:188: auto count = (hash.find(candidate) ? hash[candidate].size() : 0); It seems it could be optimized by auto ii_hash = hash.find(candidate); auto count = ii_hash ? ii_hash->second->size() : 0; https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:201: void Matcher::Add(const FilterPtr& filter) What about passing `Filter&`? https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:211: mFilterByKeyword[keyword].push_back(filter); StringMap::operator[](const String& key) creates an entry if it does not exists, so, it seems more efficient to be just mFilterByKeyword[keyword].push_back(filter); https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:212: mKeywordByFilter[filter->GetText()] = keyword; mKeywordByFilter stores DependentString, what if the `filter` is deleted? I would not like to rely on the fact that we can remove the entry when a filter is removed, I would prefer to have something more reliable. BTW, in the current master the filter object is used as a key, I wonder whether it's somehow possible here as well. For instance without digging into a new hash structure, I can propose a bit hacky approach by replacing `first` member of the StringMapEntry by a getter and probably extending it with is_invalid and having hash object storing FilterPtr and using its text property to calculate the hash. It all will be based on templates and all methods are inlined, so there should not be any degradation or difference in the compiled code for already existing `StringMap`s. I think we should try it in a separate codereview. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:215: void Matcher::Remove(const FilterPtr& filter) It seems the argument can be a const reference. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:220: auto keyword = mKeywordByFilter[filter->GetText()]; There is also no need for double looking up. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:227: list.erase(iter); It can be one line but it does not matter. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:238: bool Matcher::HasFilter(const FilterPtr& filter) const the argument should be a const reference. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:243: static DependentString emptyString = u""_str; Although static in the compilation unit achieves the goal, I think we should rather use anonymous namespaces. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:245: const String& Matcher::GetKeywordForFilter(const FilterPtr& filter) the argument should be a const reference and the method should be const because it does not modify mKeywordByFilter. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:249: return emptyString; There is also no need for double looking up. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:255: const String& sitekey, bool specificOnly) basically this method and the one below do not modify the state, so I would propose to make them const if it's possible. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:259: auto activeFilter = static_cast<ActiveFilter*>(filter.get()); opening brace { should be on the new line. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:290: return result.get(); just return `result.release();` https://codereview.adblockplus.org/29556737/diff/29559667/test/matcher.js File test/matcher.js (right): https://codereview.adblockplus.org/29556737/diff/29559667/test/matcher.js#new... test/matcher.js:17: On 2017/09/29 16:17:35, hub wrote: > This test wasn't in the emscripten branch. This is a verbatim copy of the one in > master. Actually, I think, we should merge the emscripten branch on master, in a different review, though, in order to bring the latest changes from the master
I have not checked the everything yet, only quickly run through the most essential parts. In general I find it a good step towards.
test still don't pass. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:24: OwnedString CombinedMatcher::FindKeyword(const FilterPtr& filter) On 2017/10/02 12:02:35, sergei wrote: > It's not important but still it would be better to keep the order more or less > similar to the order in the headers, at least for classes, firstly Matcher and > then CombinedMatcher. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:24: OwnedString CombinedMatcher::FindKeyword(const FilterPtr& filter) On 2017/10/02 12:02:36, sergei wrote: > the method should be const. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:26: if (filter->mType == Filter::Type::WHITELIST) On 2017/10/02 12:02:33, sergei wrote: > What do you think about having an inline function > Matcher& CombinedMatcher::GetMatcher(const Filter& filter) > { > return filter.mType == Filter::Type::WHITELIST ? mWhitelist : mBlacklist; > } > and using of it everywhere below? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:36: void CombinedMatcher::Add(const FilterPtr& filter) On 2017/10/02 12:02:36, sergei wrote: > Should the argument be `Filter&`? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:46: void CombinedMatcher::Remove(const FilterPtr& filter) On 2017/10/02 12:02:37, sergei wrote: > Should the argument be `const Filter&`? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:63: bool CombinedMatcher::HasFilter(const FilterPtr& filter) const On 2017/10/02 12:02:34, sergei wrote: > Should the argument be `const Filter&`? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:70: const String& CombinedMatcher::GetKeywordForFilter(const FilterPtr& filter) On 2017/10/02 12:02:36, sergei wrote: > the method should be const. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:70: const String& CombinedMatcher::GetKeywordForFilter(const FilterPtr& filter) On 2017/10/02 12:02:35, sergei wrote: > Should the argument be `const Filter&`? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:77: FilterPtr CombinedMatcher::MatchesAnyInternal(const String& location, On 2017/10/02 12:02:34, sergei wrote: > the method should be const if it's possible. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:85: text.match(match_re_id, &reResult); On 2017/10/02 12:02:35, sergei wrote: > Although it seems it does work here, I think for present we should avoid > instantiation of ref_counted on the stack, therefore I would recommend to use > intrusive_ptr<ReMatchResults> and pass a pointer to the object on heap to > text.match. > However, I'm fine if a third opinion is to keep it on the stack in this > particular case. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:93: if (mWhitelist.mFilterByKeyword.find(substr)) On 2017/10/02 12:02:36, sergei wrote: > It's already changed in the master, do you mind to change it here accordingly? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:112: const String& sitekey, bool specificOnly) On 2017/10/02 12:02:34, sergei wrote: > The method should be const if it's possible. sadly the use of the cache makes it non-const. I could eventually mark mResultCache as mutable. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:143: return result.get(); On 2017/10/02 12:02:34, sergei wrote: > It would be better to `return result.release();`. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:155: OwnedString Matcher::FindKeyword(const FilterPtr& filter) On 2017/10/02 12:02:36, sergei wrote: > should it be a const method? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:155: OwnedString Matcher::FindKeyword(const FilterPtr& filter) On 2017/10/02 12:02:34, sergei wrote: > Should the argument be `const Filter&`? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:159: auto re_id = GenerateRegExp(DependentString(regexpRegExp), true, false); On 2017/10/02 12:02:37, sergei wrote: > It and all other regexps below should be in the anonymous namespace because we > should not create them on each call of the current method. The mistake here is that a create a new DependentString from a DependentString(). https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:166: if (index != -1) On 2017/10/02 12:02:34, sergei wrote: > It would be better to use String::npos than -1. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:170: if (text[0] == '@' && text[1] == '@') On 2017/10/02 12:02:37, sergei wrote: > Firstly we should check the length of the `text`. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:188: auto count = (hash.find(candidate) ? hash[candidate].size() : 0); On 2017/10/02 12:02:35, sergei wrote: > Basically braces are not needed here. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:188: auto count = (hash.find(candidate) ? hash[candidate].size() : 0); On 2017/10/02 12:02:37, sergei wrote: > It seems it could be optimized by > auto ii_hash = hash.find(candidate); > auto count = ii_hash ? ii_hash->second->size() : 0; I have to do that for to make the function `const` since StringMap<>::operator[] isn't. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:201: void Matcher::Add(const FilterPtr& filter) On 2017/10/02 12:02:36, sergei wrote: > What about passing `Filter&`? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:211: mFilterByKeyword[keyword].push_back(filter); On 2017/10/02 12:02:37, sergei wrote: > StringMap::operator[](const String& key) creates an entry if it does not exists, > so, it seems more efficient to be just > mFilterByKeyword[keyword].push_back(filter); Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:215: void Matcher::Remove(const FilterPtr& filter) On 2017/10/02 12:02:37, sergei wrote: > It seems the argument can be a const reference. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:220: auto keyword = mKeywordByFilter[filter->GetText()]; On 2017/10/02 12:02:37, sergei wrote: > There is also no need for double looking up. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:227: list.erase(iter); On 2017/10/02 12:02:35, sergei wrote: > It can be one line but it does not matter. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:238: bool Matcher::HasFilter(const FilterPtr& filter) const On 2017/10/02 12:02:35, sergei wrote: > the argument should be a const reference. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:243: static DependentString emptyString = u""_str; On 2017/10/02 12:02:37, sergei wrote: > Although static in the compilation unit achieves the goal, I think we should > rather use anonymous namespaces. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:245: const String& Matcher::GetKeywordForFilter(const FilterPtr& filter) On 2017/10/02 12:02:36, sergei wrote: > the argument should be a const reference and the method should be const because > it does not modify mKeywordByFilter. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:249: return emptyString; On 2017/10/02 12:02:37, sergei wrote: > There is also no need for double looking up. Done (needed for making the method `const`) https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:255: const String& sitekey, bool specificOnly) On 2017/10/02 12:02:34, sergei wrote: > basically this method and the one below do not modify the state, so I would > propose to make them const if it's possible. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:259: auto activeFilter = static_cast<ActiveFilter*>(filter.get()); On 2017/10/02 12:02:35, sergei wrote: > opening brace { should be on the new line. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:290: return result.get(); On 2017/10/02 12:02:36, sergei wrote: > just return `result.release();` Done.
https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.h File compiled/StringMap.h (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.... compiled/StringMap.h:243: resize(0); On 2017/10/02 12:02:33, sergei wrote: > I'm not sure that `resize` expects a value, lower than the previous size, and > `find_bucket` does not expected `mBucketCount` to have zero value. This is not addressed. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:26: if (filter->mType == Filter::Type::WHITELIST) On 2017/10/03 19:33:11, hub wrote: > On 2017/10/02 12:02:33, sergei wrote: > > What do you think about having an inline function > > Matcher& CombinedMatcher::GetMatcher(const Filter& filter) > > { > > return filter.mType == Filter::Type::WHITELIST ? mWhitelist : mBlacklist; > > } > > and using of it everywhere below? > > Done. I meant that the code of CombinedMatcher::SomeMethod(Filter& filter) would be { return GetMatcher(filter).SomeMethod(filter). } https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:84: auto match_re_id = GenerateRegExp(u"[a-z0-9%]{3,}"_str, true, true); On 2017/10/02 12:02:36, sergei wrote: > It should be in anonymous namespace, otherwise a new regexp is generated on each > call of the method, isn't it? This is not addressed. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:159: auto re_id = GenerateRegExp(DependentString(regexpRegExp), true, false); On 2017/10/03 19:33:14, hub wrote: > On 2017/10/02 12:02:37, sergei wrote: > > It and all other regexps below should be in the anonymous namespace because we > > should not create them on each call of the current method. > > The mistake here is that a create a new DependentString from a > DependentString(). Each call of GenerateRegExp increases global _regexp_counter and calls `_regexp_data[id] = new RegExp(readString(source), flag);`, so it creates a new regexp each time and stores it in _regexp_data, it's a sort of memory leak. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:188: auto count = (hash.find(candidate) ? hash[candidate].size() : 0); On 2017/10/03 19:33:12, hub wrote: > On 2017/10/02 12:02:37, sergei wrote: > > It seems it could be optimized by > > auto ii_hash = hash.find(candidate); > > auto count = ii_hash ? ii_hash->second->size() : 0; > > I have to do that for to make the function `const` since StringMap<>::operator[] > isn't. It's just a side effect of the present code, there could be a const version of StringMap<>::operator[], the optimization is in the elimination of the double look up, firstly in find and secondly in operator[]. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:212: mKeywordByFilter[filter->GetText()] = keyword; On 2017/10/02 12:02:34, sergei wrote: > mKeywordByFilter stores DependentString, what if the `filter` is deleted? > I would not like to rely on the fact that we can remove the entry when a filter > is removed, I would prefer to have something more reliable. BTW, in the current > master the filter object is used as a key, I wonder whether it's somehow > possible here as well. > For instance without digging into a new hash structure, I can propose a bit > hacky approach by replacing `first` member of the StringMapEntry by a getter and > probably extending it with is_invalid and having hash object storing FilterPtr > and using its text property to calculate the hash. It all will be based on > templates and all methods are inlined, so there should not be any degradation or > difference in the compiled code for already existing `StringMap`s. I think we > should try it in a separate codereview. What about having some struct FilterKeyword { FilterKeyword(OwnedString&& keyword, Filter& filter) : keyword(std::move(keyword)), filter(&filter) {} explicit operator const String&() const { return keyword; } private: OwnedString keyword; FilterPtr filter; } and declaring StringMap<FilterKeyword> mKeywordByFilter;? BTW, taking into account that the keyword is just a substring, we could save the memory here by a custom implementation of the regular expression and storing the keyword as DependentString but in another commit. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/library.js File compiled/library.js (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/library.js... compiled/library.js:111: result.push(createString(matches[i])); matches[i] is JS String, but result.push expects our C++ const String&, so createString should not be removed. However, since the string is created on the stack, this method should contain var sp = Runtime.stackSave(); at the beginning and Runtime.stackRestore(sp); before returning. https://codereview.adblockplus.org/29556737/diff/29563595/compiled/String.cpp File compiled/String.cpp (right): https://codereview.adblockplus.org/29556737/diff/29563595/compiled/String.cpp... compiled/String.cpp:24: if (len == npos) I think this method should not contain that logic, it does not consider many cases, like 'pos > length()', however `DependentString(const String& str, size_type pos = 0, size_type len = npos)` already covers all such cases and OwnedString can be constructed from it, so I think here simply DependentString(*this, pos, len) should be returned. In addition, the return type can be DependentString. https://codereview.adblockplus.org/29556737/diff/29563595/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29563595/compiled/filter/Mat... compiled/filter/Matcher.cpp:162: { Earlier return would be better here, in my opinion.
https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.h File compiled/StringMap.h (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.... compiled/StringMap.h:243: resize(0); On 2017/10/04 08:54:31, sergei wrote: > On 2017/10/02 12:02:33, sergei wrote: > > I'm not sure that `resize` expects a value, lower than the previous size, and > > `find_bucket` does not expected `mBucketCount` to have zero value. > > This is not addressed. What about creation of an `init(size_type expectedEntries)` method and call it from both the contructor and from `clear`?
I have commented ONE test assert that fails. I need to find why before we can land this. Also I should file an issue about StringMap<>. We need one that uses OwnedString as keys. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.cpp File compiled/String.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.cpp... compiled/String.cpp:26: return OwnedString(&mBuf[pos], len); On 2017/10/02 12:02:33, sergei wrote: > Will it be safer to return DependentString(*this, pos, len)? Safer, not sure, but yes it seems we can. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.h File compiled/String.h (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/String.h#n... compiled/String.h:202: bool match(int id, ReMatchResults*) const; On 2017/10/02 12:02:33, sergei wrote: > It might be an overkill but this function does not support nullptr for the > results, so what about using a reference instead of the pointer? Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.h File compiled/StringMap.h (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.... compiled/StringMap.h:243: resize(0); On 2017/10/04 08:54:31, sergei wrote: > On 2017/10/02 12:02:33, sergei wrote: > > I'm not sure that `resize` expects a value, lower than the previous size, and > > `find_bucket` does not expected `mBucketCount` to have zero value. > > This is not addressed. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/StringMap.... compiled/StringMap.h:243: resize(0); On 2017/10/06 09:33:56, sergei wrote: > On 2017/10/04 08:54:31, sergei wrote: > > On 2017/10/02 12:02:33, sergei wrote: > > > I'm not sure that `resize` expects a value, lower than the previous size, > and > > > `find_bucket` does not expected `mBucketCount` to have zero value. > > > > This is not addressed. > What about creation of an `init(size_type expectedEntries)` method and call it > from both the contructor and from `clear`? that's the approach I took. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:26: if (filter->mType == Filter::Type::WHITELIST) On 2017/10/04 08:54:32, sergei wrote: > On 2017/10/03 19:33:11, hub wrote: > > On 2017/10/02 12:02:33, sergei wrote: > > > What do you think about having an inline function > > > Matcher& CombinedMatcher::GetMatcher(const Filter& filter) > > > { > > > return filter.mType == Filter::Type::WHITELIST ? mWhitelist : mBlacklist; > > > } > > > and using of it everywhere below? > > > > Done. > > I meant that the code of CombinedMatcher::SomeMethod(Filter& filter) would be > { > return GetMatcher(filter).SomeMethod(filter). > } Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:84: auto match_re_id = GenerateRegExp(u"[a-z0-9%]{3,}"_str, true, true); On 2017/10/04 08:54:31, sergei wrote: > On 2017/10/02 12:02:36, sergei wrote: > > It should be in anonymous namespace, otherwise a new regexp is generated on > each > > call of the method, isn't it? > > This is not addressed. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:159: auto re_id = GenerateRegExp(DependentString(regexpRegExp), true, false); On 2017/10/04 08:54:32, sergei wrote: > On 2017/10/03 19:33:14, hub wrote: > > On 2017/10/02 12:02:37, sergei wrote: > > > It and all other regexps below should be in the anonymous namespace because > we > > > should not create them on each call of the current method. > > > > The mistake here is that a create a new DependentString from a > > DependentString(). > > Each call of GenerateRegExp increases global _regexp_counter and calls > `_regexp_data[id] = new RegExp(readString(source), flag);`, so it creates a new > regexp each time and stores it in _regexp_data, it's a sort of memory leak. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:188: auto count = (hash.find(candidate) ? hash[candidate].size() : 0); On 2017/10/04 08:54:32, sergei wrote: > On 2017/10/03 19:33:12, hub wrote: > > On 2017/10/02 12:02:37, sergei wrote: > > > It seems it could be optimized by > > > auto ii_hash = hash.find(candidate); > > > auto count = ii_hash ? ii_hash->second->size() : 0; > > > > I have to do that for to make the function `const` since > StringMap<>::operator[] > > isn't. > > It's just a side effect of the present code, there could be a const version of > StringMap<>::operator[], the optimization is in the elimination of the double > look up, firstly in find and secondly in operator[]. I addressed that. Just as I said making this const prevent from using operator[] which is gone at that point. I don't see a reasonable way to implement operator[] on a const. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/filter/Mat... compiled/filter/Matcher.cpp:212: mKeywordByFilter[filter->GetText()] = keyword; On 2017/10/04 08:54:32, sergei wrote: > On 2017/10/02 12:02:34, sergei wrote: > > mKeywordByFilter stores DependentString, what if the `filter` is deleted? > > I would not like to rely on the fact that we can remove the entry when a > filter > > is removed, I would prefer to have something more reliable. BTW, in the > current > > master the filter object is used as a key, I wonder whether it's somehow > > possible here as well. > > For instance without digging into a new hash structure, I can propose a bit > > hacky approach by replacing `first` member of the StringMapEntry by a getter > and > > probably extending it with is_invalid and having hash object storing FilterPtr > > and using its text property to calculate the hash. It all will be based on > > templates and all methods are inlined, so there should not be any degradation > or > > difference in the compiled code for already existing `StringMap`s. I think we > > should try it in a separate codereview. > > What about having some > struct FilterKeyword > { > FilterKeyword(OwnedString&& keyword, Filter& filter) > : keyword(std::move(keyword)), filter(&filter) > {} > explicit operator const String&() const > { > return keyword; > } > private: > OwnedString keyword; > FilterPtr filter; > } > > and declaring StringMap<FilterKeyword> mKeywordByFilter;? > > BTW, taking into account that the keyword is just a substring, we could save the > memory here by a custom implementation of the regular expression and storing the > keyword as DependentString but in another commit. Sounds like a good idea. Done. https://codereview.adblockplus.org/29556737/diff/29559667/compiled/library.js File compiled/library.js (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/library.js... compiled/library.js:111: result.push(createString(matches[i])); On 2017/10/04 08:54:32, sergei wrote: > matches[i] is JS String, but result.push expects our C++ const String&, so > createString should not be removed. > > However, since the string is created on the stack, this method should contain > var sp = Runtime.stackSave(); at the beginning > and > Runtime.stackRestore(sp); before returning. No. push() argument is plain JS string. The bindings takes care of that: ```` exports.ReMatchResults = createClass(null, 0, { push: function(arg0) { var sp = Runtime.stackSave(); __ZN14ReMatchResults4pushERK6String(this._pointer, createString(arg0)); Runtime.stackRestore(sp); }, }); ```` Before: I had "empty" strings for the right count. After: I get the right content. https://codereview.adblockplus.org/29556737/diff/29563595/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29563595/compiled/filter/Mat... compiled/filter/Matcher.cpp:162: { On 2017/10/04 08:54:33, sergei wrote: > Earlier return would be better here, in my opinion. Done.
We are no longer writing JavaScript code, so all the JavaScript-specific hacks of the current implementation don't need to be copied. The issue description already indicates that the new implementation should be very different from the current one, you should read it. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/String.h File compiled/String.h (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/String.h#n... compiled/String.h:204: } This functionality doesn't belong here, it isn't basic String functionality. If we need to share regexp-specific code between multiple classes, we should create a new RegExp class for that. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/StringMap.h File compiled/StringMap.h (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/StringMap.... compiled/StringMap.h:120: size_type mExpectedEntries; No point recalculating bucket count if the map is cleared, mInitialBucketCount will do. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/StringMap.... compiled/StringMap.h:238: init(mExpectedEntries); IMHO, the shared function between constructor and clear() shouldn't involve calculating the bucket count - merely assigning mBucketCount and reallocation. So here it should be: mEntryCount = 0; allocate(mInitialBucketCount); Same function can be called in resize() then. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... File compiled/bindings/main.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:136: .function("push", &ReMatchResults::push); Why do we need to export what is clearly an internal class with an internal method? https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:146: .function("findKeyword", &CombinedMatcher::FindKeyword); The three methods above shouldn't be exported, these are implementation details now that JS code doesn't need to call. In fact, the distinction between CombinedMatcher and Matcher was an implementation details that shouldn't be exposed any more (mentioned in the issue description). https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.cpp:112: } As mentioned in the issue description, we should not reuse the algorithm currently implemented in JavaScript - we only chose that one because it would run efficiently in JavaScript. C++ allows using better algorithms however, with Rabin-Karp most likely being the best choice here. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.h (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.h:41: OwnedString mKeyword; A "keyword" is a substring of a filter text, so it can be a dependent string - no need for an additional allocation here. Note how we keep a reference to the filter here, so the filter cannot be garbage collected before the keyword is. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.h:112: StringMap<CacheEntry> mResultCache; Before we copy the caching approach from the JS code, we should be certain that caching is still worth doing. At the very least, 1024 cache entries are definitely pointless, a small circular buffer (8 entries?) should do the job if necessary - this would also make concatenating strings unnecessary.
https://codereview.adblockplus.org/29556737/diff/29559667/compiled/library.js File compiled/library.js (right): https://codereview.adblockplus.org/29556737/diff/29559667/compiled/library.js... compiled/library.js:111: result.push(createString(matches[i])); On 2017/10/06 13:49:17, hub wrote: > On 2017/10/04 08:54:32, sergei wrote: > > matches[i] is JS String, but result.push expects our C++ const String&, so > > createString should not be removed. > > > > However, since the string is created on the stack, this method should contain > > var sp = Runtime.stackSave(); at the beginning > > and > > Runtime.stackRestore(sp); before returning. > > No. push() argument is plain JS string. The bindings takes care of that: > ```` > exports.ReMatchResults = createClass(null, 0, { > push: function(arg0) > { > var sp = Runtime.stackSave(); > __ZN14ReMatchResults4pushERK6String(this._pointer, createString(arg0)); > Runtime.stackRestore(sp); > }, > }); > ```` > > Before: I had "empty" strings for the right count. > After: I get the right content. Acknowledged. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/String.h File compiled/String.h (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/String.h#n... compiled/String.h:204: } On 2017/10/09 08:39:45, Wladimir Palant wrote: > This functionality doesn't belong here, it isn't basic String functionality. If > we need to share regexp-specific code between multiple classes, we should create > a new RegExp class for that. I would suggest for the beginning simply remove the member function `String::match` and use `MatchRegExp`. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... File compiled/bindings/main.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:136: .function("push", &ReMatchResults::push); On 2017/10/09 08:39:47, Wladimir Palant wrote: > Why do we need to export what is clearly an internal class with an internal > method? This class is used as a callback to obtain the results from JS implementation of regular expressions. I expect it to be removed after complete switching to C++. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:146: .function("findKeyword", &CombinedMatcher::FindKeyword); On 2017/10/09 08:39:47, Wladimir Palant wrote: > The three methods above shouldn't be exported, these are implementation details > now that JS code doesn't need to call. > > In fact, the distinction between CombinedMatcher and Matcher was an > implementation details that shouldn't be exposed any more (mentioned in the > issue description). I think they are currently exported for the sake of tests (https://codereview.adblockplus.org/29556737/diff/29559667/compiled/bindings/m...). I have no objection regarding removing the tests using them and not exposing these methods but for present it seems they can be very useful to identify a regression. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.cpp:112: } On 2017/10/09 08:39:47, Wladimir Palant wrote: > As mentioned in the issue description, we should not reuse the algorithm > currently implemented in JavaScript - we only chose that one because it would > run efficiently in JavaScript. C++ allows using better algorithms however, with > Rabin-Karp most likely being the best choice here. On 2017/10/09 08:39:48, Wladimir Palant wrote: > We are no longer writing JavaScript code, so all the JavaScript-specific hacks > of the current implementation don't need to be copied. The issue description > already indicates that the new implementation should be very different from the > current one, you should read it. Although it merely converts the existing JS code and approaches to C++, I would like to have it landed because of the following reasons. It already provides with the C++ API of the matcher and corresponding tests, though in JS. We know that the existing approach does work in JS, so it should work in C++ and should not show worse results (I'm actually afraid of a lot of C++<>JS string conversions, but anyway it should work). Having it in C++ now will allow to compare the performance (speed and especially the memory usage) of the existing approach and a new one later. This change is already pretty huge and the change of only the implementation, which is BTW pretty well localized, especially involving some algorithmic work, is better to review and commit separately. BTW, right now some dependencies required to implement the proposed approach are missing and there are actually some questions. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.h (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.h:41: OwnedString mKeyword; On 2017/10/09 08:39:47, Wladimir Palant wrote: > A "keyword" is a substring of a filter text, so it can be a dependent string - > no need for an additional allocation here. Note how we keep a reference to the > filter here, so the filter cannot be garbage collected before the keyword is. The current way of obtaining of the keyword string is actually C++ String -> JS string -> JS regexp -> JS string -> C++ OwnedString, so it cannot be the DependentString, so far. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.h:112: StringMap<CacheEntry> mResultCache; On 2017/10/09 08:39:47, Wladimir Palant wrote: > Before we copy the caching approach from the JS code, we should be certain that > caching is still worth doing. It seems not trivial because it requires at least certain measurements to estimate whether that caching is still worth, and it also depends on the actual implementation (the proposed Rabin-Karp is not implemented yet, maybe we could even use a modified version of the Bloom filter) and its tweaks. So, could you please file a separate issue for the evaluation and proper implementation in case of necessity of the caching here? > At the very least, 1024 cache entries are > definitely pointless, a small circular buffer (8 entries?) should do the job if > necessary - this would also make concatenating strings unnecessary. Taking into account that a single visit of a web-page results in at least around 100 requests thus ~100 calls of the "MatchesAny" method with different parameters and that the cache likely has a hit on the second visit of the same web site, it seems that the size of cache should be rather at least 100. If it's a circular buffer with a dozen of entries then they will be never hit. So, the cache with ~1000 entries looks reasonable. The cache structure should indeed behave like circular buffer, but there is no such functionality yet, so I would propose to keep the current approach and change it in another commit. I would rather encapsulate the combined matcher in a caching matcher and keep the interface in order to simplify possible usage and profiling of the matcher with different caching approaches and without any caching, but for the present it's fine.
https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... File compiled/bindings/main.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:146: .function("findKeyword", &CombinedMatcher::FindKeyword); On 2017/10/09 15:27:52, sergei wrote: > I think they are currently exported for the sake of tests Yes, that's probably it. We should remove/rewrite tests relying on internals however. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.cpp:112: } On 2017/10/09 15:27:53, sergei wrote: > Although it merely converts the existing JS code and approaches to C++, I would > like to have it landed because of the following reasons. It already provides > with the C++ API of the matcher and corresponding tests, though in JS. We know > that the existing approach does work in JS, so it should work in C++ and should > not show worse results (I'm actually afraid of a lot of C++<>JS string > conversions, but anyway it should work). Having it in C++ now will allow to > compare the performance (speed and especially the memory usage) of the existing > approach and a new one later. This change is already pretty huge and the change > of only the implementation, which is BTW pretty well localized, especially > involving some algorithmic work, is better to review and commit separately. BTW, > right now some dependencies required to implement the proposed approach are > missing and there are actually some questions. I strongly disagree. Landing crappy code is always a bad idea. Landing crappy code that actually works is even worse, because nobody will have a strong incentive to replace it once it landed, and it will stay around as a maintenance burden way too long. As it is now, this code is way too complicated for the job that it is doing. It also overspends memory, introduces unnecessary allocations and wastes performance for no good reason on every corner. Comparing well-optimized JavaScript code with badly optimized C++ code is *not* useful. The matcher is where we can really make a difference performance-wise, but not with this approach. I can tell you already that the overhead introduced here will make this code perform considerably worse than our current JavaScript code. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.h (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.h:112: StringMap<CacheEntry> mResultCache; On 2017/10/09 15:27:53, sergei wrote: > It seems not trivial because it requires at least certain measurements to > estimate whether that caching is still worth, and it also depends on the actual > implementation (the proposed Rabin-Karp is not implemented yet, maybe we could > even use a modified version of the Bloom filter) and its tweaks. So, could you > please file a separate issue for the evaluation and proper implementation in > case of necessity of the caching here? You are getting it backwards. The caching shouldn't be implemented here in the first place. We need a separate issue for evaluating the necessity of caching and implementing it if necessary. > Taking into account that a single visit of a web-page results in at least around > 100 requests thus ~100 calls of the "MatchesAny" method with different > parameters and that the cache likely has a hit on the second visit of the same > web site, it seems that the size of cache should be rather at least 100. That's unlikely. A cache needs to introduce less overhead due to storing values and lookups than one saves by not redoing certain calculations. With a sufficiently efficient matching algorithm (and that's what we are aiming for here), caching results between different requests shouldn't be worth it. The cache should only be useful if a website requests the same resource all the time in quick succession.
https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... File compiled/bindings/main.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:136: .function("push", &ReMatchResults::push); On 2017/10/09 08:39:47, Wladimir Palant wrote: > Why do we need to export what is clearly an internal class with an internal > method? The sole purpose of this class is to be able to get the results from a regexp match. See in library.js. So it needs to have push() exported into JS. https://codereview.adblockplus.org/29556737/diff/29567770/compiled/bindings/m... compiled/bindings/main.cpp:146: .function("findKeyword", &CombinedMatcher::FindKeyword); On 2017/10/10 07:39:05, Wladimir Palant wrote: > On 2017/10/09 15:27:52, sergei wrote: > > I think they are currently exported for the sake of tests > > Yes, that's probably it. We should remove/rewrite tests relying on internals > however. Note that currently these are test that exist in master as well. Also notification.js currently explicitely instanciate a Matcher in addition to using the defaultMatcher.
Message was sent while issue was closed.
https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... File compiled/filter/Matcher.cpp (right): https://codereview.adblockplus.org/29556737/diff/29567770/compiled/filter/Mat... compiled/filter/Matcher.cpp:49: mFilterByKeyword[keyword].push_back(FilterPtr(&filter)); Although the review is already closed I think it's important to let everyone know that here is a bug in this line to avoid it in a future. The type of key of StringMap is DependentString, the type of keyword is OwnedString, so after exiting of the scope of this method the stored DependentString points to a removed string. |