|
|
DescriptionIssue 3237 - Whitelisted AFS ad is not shown on netzwelt.de
Patch Set 1 #
Total comments: 9
Patch Set 2 : Move subdocument detection to the top. #
Total comments: 2
MessagesTotal messages: 6
This fixes the issue in 2 steps. 1. The whitelisted ad was blocked in the APP, because incorrect type was detected (IMAGE instead of SUBDOCUMENT). Some heuristics is used here, but seems to work good enough. 2. The whitelisted ad was also hidden by the traverser, because whitelisting filter: @@||google.com/cse?$subdocument,document,domain=netzwelt.de could not be applied. That is because domain was not resolved correctly for the Matches call. Third parameter in ShouldBlock should be a URL, ie should start with http, but it was missing the protocol part. 2a. Also changed the type of the m_documentName from CString to std::wstring and renamed it.
On 2015/12/01 02:23:21, Oleksandr wrote: > This fixes the issue in 2 steps. > 1. The whitelisted ad was blocked in the APP, because incorrect type was > detected (IMAGE instead of SUBDOCUMENT). Some heuristics is used here, but seems > to work good enough. > 2. The whitelisted ad was also hidden by the traverser, because whitelisting > filter: > mailto:@@||google.com/cse?$subdocument,document,domain=netzwelt.de > could not be applied. That is because domain was not resolved correctly for the > Matches call. Third parameter in ShouldBlock should be a URL, ie should start > with http, but it was missing the protocol part. > 2a. Also changed the type of the m_documentName from CString to std::wstring and > renamed it. Could you put it into description and use as a commit message body? https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... File src/plugin/PluginWbPassThrough.cpp (right): https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:151: if ((mimeType.find(L"text/") != std::wstring::npos) || In IE11 on my machine Accept header for "subdocument" seems "text/html, application/xhtml+xml, image/jxr, */*" pretty stable. Just wonder, when and what are there some different "text/???" and "application/???"? May be it makes sense to restrict the rule here. What are the values on IE{8-10}?
I'm good with fix 2 in the present change set; Sergei also had no comment on it. We could push that part immediately if it were split out. Fix #1, the part on inferring the content type, is rather more delicate. My fear is that we inadvertently break something else by accident fixing this one. If we go so far as to gather a collection of Accept headers, we should probably also add unit tests for our content type detection algorithm. This seems like a piece of code we may need to revisit again even after this fix. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... File src/plugin/PluginWbPassThrough.cpp (right): https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:145: ContentType WBPassthruSink::GetContentTypeFromMimeType(const std::wstring& mimeType) Here's the relevant section on the Accept header from the standard http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html It specifies an algorithm to determine precedence of the various media types. Unfortunately, that precedence is not the order they appear in the media type list. Nevertheless, it would seem that the highest precedence media types says more about the request type than the lower ones. The present algorithm doesn't distinguish precedence. On the other hand, I don't know what IE actually generates to know if this is important. That all said, the present algorithm looks throughout the whole media type list, ignoring anything like precedence. If IE is generating these lists sanely, then the first media type in the list would be the highest precedence one. If that's the case, we'd want to loop through the media types and check each in turn. I can guess that IE doesn't send Accept-Charset or Accept-Language when it's requesting images or objects. (I haven't tried to find out if this is so.) We might be able to get better distinctions if we look at these headers in addition to Accept by itself. Here's a recent, mostly-authoritative (it's from IANA) list of known media types. http://www.iana.org/assignments/media-types/media-types.xhtml Clearly we don't need all of these in the code, but perusing it might be useful. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:151: if ((mimeType.find(L"text/") != std::wstring::npos) || On 2015/12/01 10:53:59, sergei wrote: > when and what are there some different "text/???" and "application/???"? We don't have anything like a crawler for IE, but it doesn't seem that it would be all that difficult to log the media type list here, crawl manually, and gather a collection of the media type lists that IE actually generates. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:170: if (mimeType.find(L"text/html") != std::wstring::npos) Rather than testing for "text/html" when there's a known image, you could just move this test up to the top and test for it first. I would also guess that anything including "application/xhtml+xml" (XML-serialized HTML5) is also a document. I don't know, though, if IE generates that header in requests.
Patchset 2 is a rebase and addressing comments. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... File src/plugin/PluginWbPassThrough.cpp (right): https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:145: ContentType WBPassthruSink::GetContentTypeFromMimeType(const std::wstring& mimeType) On 2015/12/01 20:32:56, Eric wrote: > Here's the relevant section on the Accept header from the standard > http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html > It specifies an algorithm to determine precedence of the various media types. > Unfortunately, that precedence is not the order they appear in the media type > list. Nevertheless, it would seem that the highest precedence media types says > more about the request type than the lower ones. The present algorithm doesn't > distinguish precedence. On the other hand, I don't know what IE actually > generates to know if this is important. > That all said, the present algorithm looks throughout the whole media type list, > ignoring anything like precedence. If IE is generating these lists sanely, then > the first media type in the list would be the highest precedence one. If that's > the case, we'd want to loop through the media types and check each in turn. It is not known really if the order has any effect on the possible type of the request. It looks like it might, but it is a little far fetched, I think. There's also very little gain had we implemented the precedence algorithm anyway. > I can guess that IE doesn't send Accept-Charset or Accept-Language when it's > requesting images or objects. (I haven't tried to find out if this is so.) We > might be able to get better distinctions if we look at these headers in addition > to Accept by itself. It does send Accept-Language for images as well. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:151: if ((mimeType.find(L"text/") != std::wstring::npos) || On 2015/12/01 20:32:57, Eric wrote: > On 2015/12/01 10:53:59, sergei wrote: > > when and what are there some different "text/???" and "application/???"? > > We don't have anything like a crawler for IE, but it doesn't seem that it would > be all that difficult to log the media type list here, crawl manually, and > gather a collection of the media type lists that IE actually generates. I have only been able to log 2 types here: text/html, application/xhtml+xml, image/jxr, */* image/png, image/svg+xml, image/jxr, image/*;q=0.8, */*;q=0.5 In IE9 the image/jxr part is missing, but other then that the header looks very similar. So I agree, we can be precise here. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:170: if (mimeType.find(L"text/html") != std::wstring::npos) On 2015/12/01 20:32:56, Eric wrote: > Rather than testing for "text/html" when there's a known image, you could just > move this test up to the top and test for it first. > > I would also guess that anything including "application/xhtml+xml" > (XML-serialized HTML5) is also a document. I don't know, though, if IE generates > that header in requests. Seems like it does, based on my tests. Done.
LGTM. I wouldn't mind a new patch set with the suggestions in my comment. But the present one does the job, even though it does pass up some easy opportunities to clean up the code. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... File src/plugin/PluginWbPassThrough.cpp (right): https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:145: ContentType WBPassthruSink::GetContentTypeFromMimeType(const std::wstring& mimeType) On 2015/12/15 02:10:16, Oleksandr wrote: > There's also very little gain had we implemented the precedence algorithm > anyway. Given the data you've gathered, it seems that IE isn't doing anything sophisticated here. It seems like it might just have some string constants it uses. https://codereview.adblockplus.org/29331669/diff/29331670/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:151: if ((mimeType.find(L"text/") != std::wstring::npos) || On 2015/12/15 02:10:16, Oleksandr wrote: > I have only been able to log 2 types here: > text/html, application/xhtml+xml, image/jxr, */* > image/png, image/svg+xml, image/jxr, image/*;q=0.8, */*;q=0.5 > > In IE9 the image/jxr part is missing, but other then that the header looks very > similar. So I agree, we can be precise here. These are the strings I'd like to see in a comment. https://codereview.adblockplus.org/29331669/diff/29332669/src/plugin/PluginWb... File src/plugin/PluginWbPassThrough.cpp (right): https://codereview.adblockplus.org/29331669/diff/29332669/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:145: ContentType WBPassthruSink::GetContentTypeFromMimeType(const std::wstring& mimeType) While not strictly necessary, there are a few things that would improve the code, as long as we're touching it here. They're listed in the order I care. - Add a comment here listing the media type strings found in testing. That's some useful information that doesn't deserve to get buried in a comment in a review. - "MimeType" should be renamed "MediaTypeList". The present name is somewhat confusing unless you track down where it's called from. - I'd use "InferContentType" rather than "Get...". This is a heuristic, not just a parse-and-fetch operation like its name seems to indicate. - This function could be in an anonymous namespace. It doesn't need to be a static class member. I'm not going to hold the review up if these aren't present, but I would like to see them, particularly the first, since it directly related to the defect being fixed in the present change set.
LGTM https://codereview.adblockplus.org/29331669/diff/29332669/src/plugin/PluginWb... File src/plugin/PluginWbPassThrough.cpp (right): https://codereview.adblockplus.org/29331669/diff/29332669/src/plugin/PluginWb... src/plugin/PluginWbPassThrough.cpp:145: ContentType WBPassthruSink::GetContentTypeFromMimeType(const std::wstring& mimeType) On 2015/12/15 03:18:18, Eric wrote: > While not strictly necessary, there are a few things that would improve the > code, as long as we're touching it here. They're listed in the order I care. > - Add a comment here listing the media type strings found in testing. That's > some useful information that doesn't deserve to get buried in a comment in a > review. > - "MimeType" should be renamed "MediaTypeList". The present name is somewhat > confusing unless you track down where it's called from. > - I'd use "InferContentType" rather than "Get...". This is a heuristic, not just > a parse-and-fetch operation like its name seems to indicate. > - This function could be in an anonymous namespace. It doesn't need to be a > static class member. > > I'm not going to hold the review up if these aren't present, but I would like to > see them, particularly the first, since it directly related to the defect being > fixed in the present change set. The same from my side. |