|
|
Patch Set 1 #
Total comments: 6
Patch Set 2 : convert to UTF-8 #Patch Set 3 : adjust comments #Patch Set 4 : revert and change referrer map #Patch Set 5 : rename BuildReferrerChain to BuildFrameStructure #
Total comments: 34
Patch Set 6 : fix according to comments and add some tests #
Total comments: 9
Patch Set 7 : fix according to comments #
MessagesTotal messages: 9
Assuming all types of requests should be transitive, this LGTM, with minor changes in comments. http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... File src/FilterEngine.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... src/FilterEngine.cpp:281: lastDocumentUrl); I would add a comment before CheckFilterMatch something like: // Check if this request is whitelisted from any of the parent documents http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... src/FilterEngine.cpp:286: // we whitelist the everything which is requested by the page example.org. This comment is confusing, I would remove it. http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... src/FilterEngine.cpp:292: CONTENT_TYPE_DOCUMENT, lastDocumentUrl); Before this CheckFilterMatch I'll add a comment like: // Check if any of the parent documents are whitelisted
Thanks, comments are adjusted. Also I would ask whether it makes sense to comment that we are doing it not from the bottom to upward, in direction to the root page, but in the reverse order, from the root page to the downward. For example, for the chain A->B->C->D the operations are in the following order: check(D, D_type, A) check(A, DOCUMENT, A) check(B, DOCUMENT, A) check(C, DOCUMENT, A) check(D, D_type, B) check(B, DOCUMENT, B) check(C, DOCUMENT, B) check(D, D_type, C) check(C, DOCUMENT, C) http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... File src/FilterEngine.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... src/FilterEngine.cpp:281: lastDocumentUrl); On 2014/11/13 13:32:18, Oleksandr wrote: > I would add a comment before CheckFilterMatch something like: > // Check if this request is whitelisted from any of the parent documents added http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... src/FilterEngine.cpp:286: // we whitelist the everything which is requested by the page example.org. On 2014/11/13 13:32:18, Oleksandr wrote: > This comment is confusing, I would remove it. See the updated version. It says ''parent'' but we start from the same page `ii = it` not `ii = it + 1`. I think it looks like a bug and it's important to comment that fact because without such definition of ''document'' type I would expect to start really from parent. http://codereview.adblockplus.org/5768603836088320/diff/5629499534213120/src/... src/FilterEngine.cpp:292: CONTENT_TYPE_DOCUMENT, lastDocumentUrl); On 2014/11/13 13:32:18, Oleksandr wrote: > Before this CheckFilterMatch I'll add a comment like: > // Check if any of the parent documents are whitelisted add before the for loop
No, the domain option isn't transitive. As noted in the issue, the problem actually lies elsewhere. What we need here is a way to fix the frame structure, so that a Flash object isn't considered to be a frame.
Fix the building of frame structure.
http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/incl... File include/AdblockPlus/ReferrerMapping.h (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/incl... include/AdblockPlus/ReferrerMapping.h:71: FilterEngine::ContentType type; It is unnecessary to store the actual type here - all you needs is |boolean isLeaf| (false for frames and true for everything else). http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/incl... include/AdblockPlus/ReferrerMapping.h:73: std::map</*url*/std::string, RequestInfo> mapping; How about |using URL = std::string| so that you can actually use type URL here instead of commenting that std::string is a URL? http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... File src/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:33: mapping.emplace(url, RequestInfo(referrer, requestType)); My understanding is that map::emplace won't replace existing entries. That's wrong - we should keep the most current referrer for each request. Consider for example multiple websites using the same ad provider. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:47: std::vector<std::string> referrerChain; If you rename the function you should also rename the name of this variable. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:50: return referrerChain; Why is this special case necessary? An empty string simply won't be found in the map. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:56: // elegant. This snarky comment seems completely pointless to document the code below. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:57: const int maxChainLength = 20; Why increase this value? 10 should be enough for any real-world scenario. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:58: referrerChain.reserve(maxChainLength); I doubt that this makes sense. While we might get a lengthy chain, this is an absolutely exceptional case. In the overwhelming majority of cases the resulting vector will have length 1 - no point reserving space for 32 elements (vector size will be "rounded up" to the next power of two). http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:59: std::map<std::string, RequestInfo>::const_iterator currentEntry = Given that you use std::map<std::string, RequestInfo> several times, why not set up ReferrerMap as an alias for it? http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:64: // not empty then add it. I assume that the code this comment refers to will be removed. However, the comment is currently completely unintelligible. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:82: referrerChain.push_back(url); This special case is unnecessary. An empty referrer chain is perfectly valid and FilterEngine::Matches() can deal with it. So please don't duplicate code outside the loop. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:85: currentEntry = mapping.find(currentEntry->second.referrer)) Please put the assignment to currentEntry at the end of the loop body, where it was before. The code is much easier to read that way, in particular prevEntry and currentEntry should always be set in the same place. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:89: referrerChain.push_back(currentEntry->first); This should be .insert(referrerChain.begin(), ...) - same as before. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:98: // of copying the tail of the referrerChain on each adding. It should also be pretty pointless, with the referrer chain having only one element most of the time... http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/test... File test/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/test... test/ReferrerMapping.cpp:23: TEST(ReferrerMappingTest, BuildFrameStructureShouldIncludeUrlItselfWhenThereIsNoMappingEntryForIt) Now that you've earned the price for the longest class name ever, could you please change it back? Note that the test name should generally describe the setup and not the result. The result is defined in the asserts below.
http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/incl... File include/AdblockPlus/ReferrerMapping.h (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/incl... include/AdblockPlus/ReferrerMapping.h:71: FilterEngine::ContentType type; On 2014/11/25 15:46:37, Wladimir Palant wrote: > It is unnecessary to store the actual type here - all you needs is |boolean > isLeaf| (false for frames and true for everything else). I would use `isFrame` because it's not necessary to be a leaf. But using such bool in function signatures reduces the readability and maintainability. So I've created enum FrameIndicator. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/incl... include/AdblockPlus/ReferrerMapping.h:73: std::map</*url*/std::string, RequestInfo> mapping; On 2014/11/25 15:46:37, Wladimir Palant wrote: > How about |using URL = std::string| so that you can actually use type URL here > instead of commenting that std::string is a URL? Added typedef for `Url` and `Urls`. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... File src/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:33: mapping.emplace(url, RequestInfo(referrer, requestType)); On 2014/11/25 15:46:37, Wladimir Palant wrote: > My understanding is that map::emplace won't replace existing entries. That's > wrong - we should keep the most current referrer for each request. Consider for > example multiple websites using the same ad provider. Excellent point, I forget about it. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:47: std::vector<std::string> referrerChain; On 2014/11/25 15:46:37, Wladimir Palant wrote: > If you rename the function you should also rename the name of this variable. renamed to `frames`. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:50: return referrerChain; On 2014/11/25 15:46:37, Wladimir Palant wrote: > Why is this special case necessary? An empty string simply won't be found in the > map. Because it's rather an invalid argument and we should neither even try to find it in the container nor increase the complexity of the logic below by it. I would even add it to ReferrerMapping::Add as well, but I guess it's inconsistent with the current approach. Removed. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:56: // elegant. On 2014/11/25 15:46:37, Wladimir Palant wrote: > This snarky comment seems completely pointless to document the code below. removed http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:57: const int maxChainLength = 20; On 2014/11/25 15:46:37, Wladimir Palant wrote: > Why increase this value? 10 should be enough for any real-world scenario. It was needed for tests. removed. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:58: referrerChain.reserve(maxChainLength); On 2014/11/25 15:46:37, Wladimir Palant wrote: > I doubt that this makes sense. While we might get a lengthy chain, this is an > absolutely exceptional case. In the overwhelming majority of cases the resulting > vector will have length 1 - no point reserving space for 32 elements (vector > size will be "rounded up" to the next power of two). I've tested, it's indeed has only one frame in about 70% cases and in about 30% it has 2 frames. removed. BTW, could you point to the reference where it's stated that `reserve` rounds up to the next power of two? It was not before and the current implementation does not do it. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:59: std::map<std::string, RequestInfo>::const_iterator currentEntry = On 2014/11/25 15:46:37, Wladimir Palant wrote: > Given that you use std::map<std::string, RequestInfo> several times, why not set > up ReferrerMap as an alias for it? changed. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:82: referrerChain.push_back(url); On 2014/11/25 15:46:37, Wladimir Palant wrote: > This special case is unnecessary. An empty referrer chain is perfectly valid and > FilterEngine::Matches() can deal with it. So please don't duplicate code outside > the loop. Not clear which special case is mentioned, could you point to the incorrect test, provide another test which demonstrates the expected behaviour or explain how it can be in the loop. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:85: currentEntry = mapping.find(currentEntry->second.referrer)) On 2014/11/25 15:46:37, Wladimir Palant wrote: > Please put the assignment to currentEntry at the end of the loop body, where it > was before. The code is much easier to read that way, in particular prevEntry > and currentEntry should always be set in the same place. Not sure whether it's easier to read, because 'for loop' has 'initialization', 'condition' and 'increment' (post iteration statement), here we spread the 'increment' among `for(....)` and the body of the loop. May be it should not be a for-loop? http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:89: referrerChain.push_back(currentEntry->first); On 2014/11/25 15:46:37, Wladimir Palant wrote: > This should be .insert(referrerChain.begin(), ...) - same as before. In general it's inefficient, I don't understand why we need it, but it's OK here, because as I see its size does not usually grow more than 2 elements. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:98: // of copying the tail of the referrerChain on each adding. On 2014/11/25 15:46:37, Wladimir Palant wrote: > It should also be pretty pointless, with the referrer chain having only one > element most of the time... The main reason of it was to use push_back and reverse it at the end.
http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... File src/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:50: return referrerChain; On 2014/11/26 10:34:43, sergei wrote: > Because it's rather an invalid argument and we should neither even try to find > it in the container nor increase the complexity of the logic below by it. I realized that I misunderstood how this method is supposed to work - see below. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:58: referrerChain.reserve(maxChainLength); On 2014/11/26 10:34:43, sergei wrote: > BTW, could you point to the reference where it's stated that `reserve` rounds up > to the next power of two? Any sane allocation approach does, to minimize memory fragmentation - but that's an implementation detail of course, so it won't be mentioned in any documentation. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:82: referrerChain.push_back(url); I just realized that this was designed to work differently from what I expected :-( I thought that one would call this function with the URL of a request, but apparently it is supposed to be called with the URL of its referrer - and that referrer will always be listed in the frame structure. Given your changes here, the current design seems wrong. What do you think? http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:89: referrerChain.push_back(currentEntry->first); On 2014/11/26 10:34:43, sergei wrote: > In general it's inefficient, I don't understand why we need it, but it's OK > here, because as I see its size does not usually grow more than 2 elements. That's what I meant - for tiny lists like what we have here, doing this kind of optimizations (making the logic more complicated to follow) just isn't worth it. http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... File include/AdblockPlus/ReferrerMapping.h (right): http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... include/AdblockPlus/ReferrerMapping.h:40: /// Contains the ordered list of URLs. Nit: *an* ordered list http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... include/AdblockPlus/ReferrerMapping.h:46: FRAME_INDICATOR_NOT_A_FRAME = 0, FRAME_INDICATOR_FRAME= 1 Extreme nit: articles seem wrong in constant names, FRAME_INDICATOR_NOT_FRAME? http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... include/AdblockPlus/ReferrerMapping.h:77: FrameIndicator frameIndicatorArg = FrameIndicator::FRAME_INDICATOR_NOT_A_FRAME) Shouldn't you have that default value in method Add() rather than here? Add() is being called by all kind of code, this constructor on the other hand is for internal use only and never called without the second parameter. Also, I'm pretty sure that the default should be FRAME_INDICATOR_FRAME - that's what we have to assume for referrers if we don't have any further information. http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/src/... File src/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/src/... src/ReferrerMapping.cpp:51: // process the current url beyond the loop below because we should add it beyond => outside http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/src/... src/ReferrerMapping.cpp:68: // it's a frame, despite it's not necessary to be a frame. Nit: "despite it not necessarily being a frame" http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/src/... src/ReferrerMapping.cpp:71: } Can't we check prevEntry after the loop instead of duplicating its content? Something along these lines: auto currentEntry = mapping.find(url); auto prevEntry = mapping.end(); for (...) { ... } if (prevEntry == mapping.end()) { // No data, just assume that the url is a frame. frames.push_back(url); } else if (!prevEntry->second.referrer.empty()) { ... }
http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... File src/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:58: referrerChain.reserve(maxChainLength); On 2014/12/09 21:20:03, Wladimir Palant wrote: > On 2014/11/26 10:34:43, sergei wrote: > > BTW, could you point to the reference where it's stated that `reserve` rounds > up > > to the next power of two? > > Any sane allocation approach does, to minimize memory fragmentation - but that's > an implementation detail of course, so it won't be mentioned in any > documentation. Actually on a lower level usual allocator rounds it to some granularity which depends on the machine. But it's not the proper place to discuss it. http://codereview.adblockplus.org/5768603836088320/diff/5738600293466112/src/... src/ReferrerMapping.cpp:82: referrerChain.push_back(url); On 2014/12/09 21:20:03, Wladimir Palant wrote: > I just realized that this was designed to work differently from what I expected > :-( I thought that one would call this function with the URL of a request, but > apparently it is supposed to be called with the URL of its referrer - and that > referrer will always be listed in the frame structure. Given your changes here, > the current design seems wrong. What do you think? I also misunderstood it the same way initially and there was even the issue 1160. So it's definitely a tricky point, feel free to suggest the fix of the documentation or API. Eventually, I actually came with the conclusion that it's done the correct way because we should not and we don't have to modify ReferrerMapping to get the frame structure for new request. It's very good from the design point of view. What concerns the current design, let's consider the following example Resource relation structure ;requestType; referrer http://root-page/ +http://some-frame/ ;subdocument; http://root-page/ +http://flash/player.swf ;object ; http://some-frame/ | +http://cdn/img ;image ; http://flash/player.swf +http://some-frame/script;script ; http://some-frame/ For the call matches(url: "http://some-frame/", subdocument, referrer: "http://root-page/") or matches(url: "http://some-frame/script", script, referrer: "http://some-frame/") we should build the frame structure based on "http://root-page/" or "http://some-frame/" respectively and it seems we need *always* include it into the frame structure. But for the call matches(url: "http://cdn/img", image, referrer: "http://flash/player.swf") we should not include "http://flash/player.swf" into the frame structure, the frame structure is "http://root-page/" -> "http://some-frame/". So, yes, it's wrong and the aim of this fix is to adjust it. http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... File include/AdblockPlus/ReferrerMapping.h (right): http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... include/AdblockPlus/ReferrerMapping.h:46: FRAME_INDICATOR_NOT_A_FRAME = 0, FRAME_INDICATOR_FRAME= 1 On 2014/12/09 21:20:03, Wladimir Palant wrote: > Extreme nit: articles seem wrong in constant names, FRAME_INDICATOR_NOT_FRAME? I know, I thought about it but don't remember the reason I decided to add the article. fixed http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/incl... include/AdblockPlus/ReferrerMapping.h:77: FrameIndicator frameIndicatorArg = FrameIndicator::FRAME_INDICATOR_NOT_A_FRAME) On 2014/12/09 21:20:03, Wladimir Palant wrote: > Shouldn't you have that default value in method Add() rather than here? Add() is > being called by all kind of code, this constructor on the other hand is for > internal use only and never called without the second parameter. > > Also, I'm pretty sure that the default should be FRAME_INDICATOR_FRAME - that's > what we have to assume for referrers if we don't have any further information. These default arguments are required by std::map, in particular if `ReferrerMapping::mapping` does not contain `url` then the statement `mapping[url] = RequestInfo(referrer, isFrame);` firstly creates `RequestInfo` using default ctr and then calls `RequestInfo::operator=`. It's a question whether we want `mapping[url] = ...` and internal class with default ctr or a little bit more of additional logic but without default ctr. What concerns Add(), that's exactly because it is called by all kind of code it's difficult to assume what should be the default there and even more, do we really need something default there? http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/src/... File src/ReferrerMapping.cpp (right): http://codereview.adblockplus.org/5768603836088320/diff/5743114304094208/src/... src/ReferrerMapping.cpp:71: } On 2014/12/09 21:20:03, Wladimir Palant wrote: > Can't we check prevEntry after the loop instead of duplicating its content? > Something along these lines: > > auto currentEntry = mapping.find(url); > auto prevEntry = mapping.end(); > for (...) > { > ... > } > if (prevEntry == mapping.end()) > { > // No data, just assume that the url is a frame. > frames.push_back(url); > } > else if (!prevEntry->second.referrer.empty()) > { > ... > } That does work, please check the updated version. Although I would say that the complexity of the algorithm is higher despite the amount of code is less and there is no repeating. |