|
|
Created:
March 14, 2016, 2:41 p.m. by sergei Modified:
Oct. 6, 2016, 2:26 p.m. Reviewers:
Wladimir Palant CC:
Thomas Greiner, tschuster Visibility:
Public. |
Description# abpcrawler project
# Tested on released Firefox 47.
Patch Set 1 #
Total comments: 17
Patch Set 2 : rebase on #3815 and address some trivial comments #Patch Set 3 : fix issue with beacons #Patch Set 4 : add missed space #Patch Set 5 : rebase, fix race condition and support canvas limits #Patch Set 6 : fix race condition #
Total comments: 42
Patch Set 7 : address comments #
Total comments: 19
Patch Set 8 : address comments #Patch Set 9 : fix #
Total comments: 2
Patch Set 10 : change comment #MessagesTotal messages: 15
https://codereview.adblockplus.org/29338242/diff/29338243/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29338243/lib/child/frameScri... lib/child/frameScript.js:1: "use strict"; We put "use strict" after the license header. https://codereview.adblockplus.org/29338242/diff/29338243/lib/child/frameScri... lib/child/frameScript.js:15: sendSyncMessage("abpcrawler:reportException", undefined, e); Don't send sync messages unless you absolutely cannot help it (no, this isn't such a case). https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js File lib/crawler.js (left): https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#oldc... lib/crawler.js:157: }; Why did you move this functionality into the content process? It should work perfectly well in the parent, you only need to change the way timeouts are being set. Going into the content process should only be necessary to serialize HTML and get the screenshot. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:227: mm.removeMessageListener("abpcrawler:pageInfoGathered", onDone); So, which tab did you get the page info for? The response should contain the outerWindowID so that you can check whether it applies to "your" tab. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:233: }); Please move this functionality into a separate function. Here it should be only: Object.assign(result, yield getPageInfo(tab)); https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:261: globalMessageManager.loadFrameScript(frameScriptPath, true); This should be a process script, no point using a frame script here. You can send the outerWindowID as parameter, the process script can then call nsIWindowMediator.getOuterWindowWithId() to retrieve it. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:265: reportException(msg.objects); Please don't use msg.objects - ever. That's a wrapper which will cause additional IPC traffic whenever you access its properties. This is also completely unnecessary, just report the exception from the process script. Both Cu.reportError() and dump() are available.
So far merely a couple of comments. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js File lib/crawler.js (left): https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#oldc... lib/crawler.js:157: }; On 2016/03/15 10:07:10, Wladimir Palant wrote: > Why did you move this functionality into the content process? It should work > perfectly well in the parent, you only need to change the way timeouts are being > set. Going into the content process should only be necessary to serialize HTML > and get the screenshot. In e10s it does not work in chrome process, we cannot receive headers and the following exception is thrown (printed be `reportException(e);` added after `// Exceptions are expected here`): [object CPOW [Exception... "It's illegal to pass a CPOW to native code arg 0 [nsIHttpChannel.visitResponseHeaders]" nsresult: "0x80570036 (NS_ERROR_XPC_CANT_PASS_CPOW_TO_NATIVE)" location: "JS frame :: chrome://global/content/browser-child.js :: _send :: line 112" data: no]] _send@chrome://global/content/browser-child.js:112:7 onStateChange@chrome://global/content/browser-child.js:135:5 As I understand from https://bugzilla.mozilla.org/show_bug.cgi?id=1081879#c4, there is a shim in the chrome process, so it seems it's even not supposed to be used from chrome process anymore (regardless whether someone needs `DOMWindow` or not). https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:214: let requestNotifier = new RequestNotifier(tab.linkedBrowser.outerWindowID, function({type, location, filter}, scanComplete) BTW, in addition, this part stops to work, I have filed an issue about it https://bugzilla.mozilla.org/show_bug.cgi?id=1256602
https://codereview.adblockplus.org/29338242/diff/29338243/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29338243/lib/child/frameScri... lib/child/frameScript.js:1: "use strict"; On 2016/03/15 10:07:10, Wladimir Palant wrote: > We put "use strict" after the license header. Done. https://codereview.adblockplus.org/29338242/diff/29338243/lib/child/frameScri... lib/child/frameScript.js:15: sendSyncMessage("abpcrawler:reportException", undefined, e); On 2016/03/15 10:07:10, Wladimir Palant wrote: > Don't send sync messages unless you absolutely cannot help it (no, this isn't > such a case). Acknowledged. JIC, it was sync here because CPOW is used (see comment in lib/crawler.js) here and to avoid race conditions we should pass it using synchronous message. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:214: let requestNotifier = new RequestNotifier(tab.linkedBrowser.outerWindowID, function({type, location, filter}, scanComplete) https://issues.adblockplus.org/ticket/3815 https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:227: mm.removeMessageListener("abpcrawler:pageInfoGathered", onDone); On 2016/03/15 10:07:10, Wladimir Palant wrote: > So, which tab did you get the page info for? For the `tab`, it's "browser message manager" it allows to avoid additional manual fiddling with `outerWindowID`. > > The response should contain the outerWindowID so that you can check whether it > applies to "your" tab. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:233: }); On 2016/03/15 10:07:10, Wladimir Palant wrote: > Please move this functionality into a separate function. Here it should be only: > > Object.assign(result, yield getPageInfo(tab)); Done. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:265: reportException(msg.objects); On 2016/03/15 10:07:10, Wladimir Palant wrote: > This is also completely unnecessary, just report the exception from the process > script. Both Cu.reportError() and dump() are available. Acknowledged. Actually, I wanted to avoid duplicating of `reportException`, may be it's not necessary. https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:265: reportException(msg.objects); On 2016/03/15 10:07:10, Wladimir Palant wrote: > Please don't use msg.objects - ever. That's a wrapper which will cause > additional IPC traffic whenever you access its properties. I know, I used `msg.objects` because we don't know anything about exception object, it's not necessarily that its properties are accessible from chrome process, so we should rather access it from chrome process through CPOW. Just wonder, what is the issue with additional IPC traffic in case of exception which should actually never happen and when it happens it likely means that there is a bug in our code?
https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29338243/lib/crawler.js#newc... lib/crawler.js:265: reportException(msg.objects); On 2016/03/16 14:44:23, sergei wrote: > Just wonder, what is the issue with additional IPC traffic in case of exception > which should actually never happen and when it happens it likely means that > there is a bug in our code? Worst-case scenario: deadlocks because all of that traffic needs to be synchronous. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:6: "use strict"; Nit: How about an empty line before that? https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:29: * https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interfa... Nit: How about: * Progress listener capturing the data of any page to finish loading and * sending this data via "abpcrawler:pageInfoGathered" message. * * @type nsIWebProgressListener Note that type is enough to look up documentation. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:35: // use isTopLevel to filter beacon requests out This comment doesn't make sense - not just beacon requests, you don't want to be bothered with frames either. Either way, we don't want to risk receiving messages for the wrong window (e.g. another tab). How about checking `webProgress.DOMWindow == content` instead of isTopLevel here? Note that isTopLevel was introduced for the scenario where your progress listener runs in a different process but that's not what you have here. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:42: let pageInfo = {headers: []}; What about `channelStatus: status`? We used to capture that. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; So what happens if a host name doesn't resolve? We just time out without collecting any data whatsoever? We have the channel status, it's useful information and should be sent. Maybe it's a good idea to not report exceptions in case of NS_ERROR_NOT_AVAILABLE, but just returning definitely isn't. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:57: sendAsyncMessage("abpcrawler:pageInfoGathered", pageInfo); The two lines above don't depend on the request object and should be outside the `if` block. If we happen to have a page served via FTP and we cannot get the headers - we would still like to capture all other page data. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:62: // definitions of the remaining functions see related documentation Nit: this comment doesn't add any value, remove? https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:64: onProgressChange: function(aWebProgress, aRequest, curSelf, maxSelf, curTot, maxTot) {}, So, Hungarian notation or not? Our style guide says: not. But you can also just leave out the parameters since you aren't actually implementing these methods (and they won't even be called with proper filtering in place). https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:68: QueryInterface: XPCOMUtils.generateQI([Ci.nsIWebProgressListener, Ci.nsISupportsWeakReference]) I don't think that progress listeners can be weak references, exposing nsIWebProgressListener should be sufficient. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:72: .createInstance(Ci.nsIWebProgress); Nit: we usually align the dot with the [ on the previous line. Either way, why do we even bother with browser-status-filter if we already registered our progress listener on the docShell for the current tab? https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:77: webProgress.addProgressListener(filter, Ci.nsIWebProgress.NOTIFY_ALL); The reason why the original code is receiving all progress events: tabbrowser doesn't support filtering. nsIWebProgress supports filtering however, so you should use it rather than requesting all kind of stuff you don't need. Ci.nsIWebProgress.NOTIFY_STATE_WINDOW should do. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:80: * Gathers information about page using DOM window. Just "Gathers information about a DOM window." maybe? https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:97: // http://stackoverflow.com/questions/6081483/maximum-size-of-a-canvas-element Great, you could have linked to http://stackoverflow.com/a/12644047/785541 just as well :-( As I mentioned back in 2012 - these are all implementation details and subject to change any time. Nor are the values you hardcoded here actually correct (didn't you consider the number 472907776 rather unlikely?). The reason why I decided to just let the canvas fail is because the actual restrictions are hard to pinpoint and one wouldn't want to cut off screenshots unnecessarily. This is an unrelated change, please put it into a separate review and we can discuss the best approach there. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:128: result.errors.push("Cannot gather page info"); The original code had the canvas in a separate try block - this code is known to fail, it shouldn't affect the other data we are capturing. https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:293: mm.addMessageListener("abpcrawler:pageInfoGathered", (msg) => onDone(msg.data));; That's not the callback you are removing above. Also: duplicate semicolon. https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:294: timerID = setTimeout(onDone.bind(this, {error: "timeout"}), timeout); How about not using bind() here, for clarity and consistency? () => onDone({data: {error: "timeout"}})
https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:13: Cu.import("resource://gre/modules/Promise.jsm"); Just realized that this import hasn't been removed - please do so, we want to use the default Promise object. https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:14: Cu.import("resource://gre/modules/Timer.jsm"); Nit: We don't import like that any more, symbols should be imported explicitly. let {setTimeout, clearTimeout} = Cu.import("resource://gre/modules/Timer.jsm", {});
https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:6: "use strict"; On 2016/09/14 16:11:47, Wladimir Palant wrote: > Nit: How about an empty line before that? Done. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:29: * https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interfa... On 2016/09/14 16:11:48, Wladimir Palant wrote: > Nit: How about: > > * Progress listener capturing the data of any page to finish loading and * > sending this data via "abpcrawler:pageInfoGathered" message. > * > * @type nsIWebProgressListener > > Note that type is enough to look up documentation. I have changed the comment. In addition webProgressListener does not send the message, instead it calls onPageLoaded because in future version this function will be more complicated and it's better to keep such code out of webProgressListener. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:35: // use isTopLevel to filter beacon requests out On 2016/09/14 16:11:46, Wladimir Palant wrote: > This comment doesn't make sense - not just beacon requests, you don't want to be > bothered with frames either. Either way, we don't want to risk receiving > messages for the wrong window (e.g. another tab). How about checking > `webProgress.DOMWindow == content` instead of isTopLevel here? Note that > isTopLevel was introduced for the scenario where your progress listener runs in > a different process but that's not what you have here. Done. It seems it does indeed work, however I think I have tried it and it was not working. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:42: let pageInfo = {headers: []}; On 2016/09/14 16:11:48, Wladimir Palant wrote: > What about `channelStatus: status`? We used to capture that. Restored. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; On 2016/09/14 16:11:48, Wladimir Palant wrote: > So what happens if a host name doesn't resolve? We just time out without > collecting any data whatsoever? We have the channel status, it's useful > information and should be sent. Maybe it's a good idea to not report exceptions > in case of NS_ERROR_NOT_AVAILABLE, but just returning definitely isn't. If the hostname is not resolved we don't get here anyway, so, yes, we do wait for the timeout, https://issues.adblockplus.org/ticket/3975. It was used here as workaround for race condition, the exception is thrown when we try to access request.responseStatus but the page has not been loaded yet, so I think that simply returning and waiting for the real finish of loading was a good idea. I have added the check of the location.protocol above and removed that check, so if it happens on a real page it will dump the exception info and call onPageLoaded. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:57: sendAsyncMessage("abpcrawler:pageInfoGathered", pageInfo); On 2016/09/14 16:11:47, Wladimir Palant wrote: > The two lines above don't depend on the request object and should be outside the > `if` block. If we happen to have a page served via FTP and we cannot get the > headers - we would still like to capture all other page data. Moved. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:62: // definitions of the remaining functions see related documentation On 2016/09/14 16:11:48, Wladimir Palant wrote: > Nit: this comment doesn't add any value, remove? Done. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:64: onProgressChange: function(aWebProgress, aRequest, curSelf, maxSelf, curTot, maxTot) {}, On 2016/09/14 16:11:47, Wladimir Palant wrote: > So, Hungarian notation or not? Our style guide says: not. But you can also just > leave out the parameters since you aren't actually implementing these methods > (and they won't even be called with proper filtering in place). Done. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:68: QueryInterface: XPCOMUtils.generateQI([Ci.nsIWebProgressListener, Ci.nsISupportsWeakReference]) On 2016/09/14 16:11:48, Wladimir Palant wrote: > I don't think that progress listeners can be weak references, exposing > nsIWebProgressListener should be sufficient. According to the documentation, "This object must also implement nsISupportsWeakReference." https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interfa.... https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:72: .createInstance(Ci.nsIWebProgress); On 2016/09/14 16:11:48, Wladimir Palant wrote: > Nit: we usually align the dot with the [ on the previous line. > > Either way, why do we even bother with browser-status-filter if we already > registered our progress listener on the docShell for the current tab? Removed. BTW, in the example from mozilla they are also using filter but it does indeed work without it. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:77: webProgress.addProgressListener(filter, Ci.nsIWebProgress.NOTIFY_ALL); On 2016/09/14 16:11:48, Wladimir Palant wrote: > The reason why the original code is receiving all progress events: tabbrowser > doesn't support filtering. nsIWebProgress supports filtering however, so you > should use it rather than requesting all kind of stuff you don't need. > Ci.nsIWebProgress.NOTIFY_STATE_WINDOW should do. Done. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:80: * Gathers information about page using DOM window. On 2016/09/14 16:11:47, Wladimir Palant wrote: > Just "Gathers information about a DOM window." maybe? Done. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:97: // http://stackoverflow.com/questions/6081483/maximum-size-of-a-canvas-element On 2016/09/14 16:11:48, Wladimir Palant wrote: > Great, you could have linked to http://stackoverflow.com/a/12644047/785541 just > as well :-( > > As I mentioned back in 2012 - these are all implementation details and subject > to change any time. Nor are the values you hardcoded here actually correct > (didn't you consider the number 472907776 rather unlikely?). The reason why I > decided to just let the canvas fail is because the actual restrictions are hard > to pinpoint and one wouldn't want to cut off screenshots unnecessarily. This is > an unrelated change, please put it into a separate review and we can discuss the > best approach there. Removed from here. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:128: result.errors.push("Cannot gather page info"); On 2016/09/14 16:11:47, Wladimir Palant wrote: > The original code had the canvas in a separate try block - this code is known to > fail, it shouldn't affect the other data we are capturing. Done. https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:14: Cu.import("resource://gre/modules/Timer.jsm"); On 2016/09/14 16:13:18, Wladimir Palant wrote: > Nit: We don't import like that any more, symbols should be imported explicitly. > > let {setTimeout, clearTimeout} = Cu.import("resource://gre/modules/Timer.jsm", > {}); Fixed and the rest related to coding style is addressed in codereview https://codereview.adblockplus.org/29355247/ https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:293: mm.addMessageListener("abpcrawler:pageInfoGathered", (msg) => onDone(msg.data));; On 2016/09/14 16:11:48, Wladimir Palant wrote: > That's not the callback you are removing above. Also: duplicate semicolon. Fixed. Sorry, overlooked. https://codereview.adblockplus.org/29338242/diff/29340782/lib/crawler.js#newc... lib/crawler.js:294: timerID = setTimeout(onDone.bind(this, {error: "timeout"}), timeout); On 2016/09/14 16:11:49, Wladimir Palant wrote: > How about not using bind() here, for clarity and consistency? > > () => onDone({data: {error: "timeout"}}) Done. Good idea.
https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; On 2016/09/29 09:58:13, sergei wrote: > It was used here as workaround for race condition, the exception is thrown when > we try to access request.responseStatus but the page has not been loaded yet, I don't really see why the page isn't loaded when you get STATE_STOP. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:72: .createInstance(Ci.nsIWebProgress); On 2016/09/29 09:58:12, sergei wrote: > Removed. BTW, in the example from mozilla they are also using filter but it does > indeed work without it. Yes, there is an example mentioning it - but no documentation explaining what it does. The point of this filter is apparently limiting the number of times C++ needs to call into JavaScript, for performance reasons. For that it filters out some messages using an undocumented approach. This has all the signs on an internal component meant to work with the progress listener installed by Firefox but non necessary doing the right thing for us. We shouldn't use it, even if the results are correct right now they won't necessarily be in the next Firefox release. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:7: 'use strict'; Double quotation marks please. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:39: (flags & Ci.nsIWebProgressListener.STATE_IS_WINDOW)) You don't have to check STATE_IS_WINDOW any more. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:42: return; This needs an explanation. Is that check necessary because you load about:blank into unused tabs and you don't want to report that load? At the very least, this needs a comment saying so. Still, what issue does this check solve? My guess is that without this check the code will collect data from about:blank, send a message and this message will be ignored. Wasting a few CPU cycles, so what? Do we really need this special case? Or maybe we should rather work correctly even if we get something like about:newtab on our URL list? https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:43: const pageInfo = {channelStatus: status}; Please don't declare regular variables as constants. Just because you don't assign to pageInfo itself but only its properties this doesn't make it a constant. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:75: const webProgress = docShell.QueryInterface(Ci.nsIInterfaceRequestor).getInterface(Ci.nsIWebProgress); As above, please don't mark any variable that you happen to only assign once (in this revision of the code at least, but it might change later) as a constant. Please get rid of the pseudo-constants below as well. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:92: result.errors.push('No document.documentElement'); Double quotation marks please. https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js#newc... lib/crawler.js:112: // navigate away from early opened URL early => previously https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js#newc... lib/crawler.js:113: tab.linkedBrowser.loadURI('about:blank', null, null); Double quotation marks please. https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js#newc... lib/crawler.js:164: '&info=' + encodeURIComponent(JSON.stringify(info)); Double quotation marks please. What's the point of sending the info object to the frame script? It doesn't need this information.
https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; On 2016/09/29 11:44:57, Wladimir Palant wrote: > On 2016/09/29 09:58:13, sergei wrote: > > It was used here as workaround for race condition, the exception is thrown > when > > we try to access request.responseStatus but the page has not been loaded yet, > > I don't really see why the page isn't loaded when you get STATE_STOP. On 2016/09/29 11:44:58, Wladimir Palant wrote: > This needs an explanation. Is that check necessary because you load about:blank > into unused tabs and you don't want to report that load? At the very least, this > needs a comment saying so. > > Still, what issue does this check solve? My guess is that without this check the > code will collect data from about:blank, send a message and this message will be > ignored. Wasting a few CPU cycles, so what? Do we really need this special case? > Or maybe we should rather work correctly even if we get something like > about:newtab on our URL list? Sorry for confusion, I meant under the page the page with URL distinct from about:blank. First time we receive STATE_STOP for about:blank and the second time for our URL. We should not process about:blank because it can happen that the message with information about about:blank is delivered when the code in crawler.js is already waiting for a message from that tab. When about:blank is loaded first time, request.responseStatus is not available. I have put a comment about it into the code. BTW, I have just discovered yet one issue with https://github.com/adblockplus/abpcrawler/blob/master/lib/crawler.js#L111. When we navigate to about:blank from our URL we get STATE_STOP the third time but the value of content.location.href is sometimes (!) still old (our URL) and it tries to get the page info yet one time, and according to the log it fails trying to get screenshot. I guess it happens because the tab is going to be closed. It's easy to reproduce if run the crawler with one tab because in that case it happens very often. I would propose to come back to creation of a new tab in TabAllocator.releaseTab to avoid any side effects of state of already processed tab. https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:72: .createInstance(Ci.nsIWebProgress); On 2016/09/29 11:44:57, Wladimir Palant wrote: > On 2016/09/29 09:58:12, sergei wrote: > > Removed. BTW, in the example from mozilla they are also using filter but it > does > > indeed work without it. > > Yes, there is an example mentioning it - but no documentation explaining what it > does. The point of this filter is apparently limiting the number of times C++ > needs to call into JavaScript, for performance reasons. For that it filters out > some messages using an undocumented approach. This has all the signs on an > internal component meant to work with the progress listener installed by Firefox > but non necessary doing the right thing for us. We shouldn't use it, even if the > results are correct right now they won't necessarily be in the next Firefox > release. Acknowledged. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:7: 'use strict'; On 2016/09/29 11:44:58, Wladimir Palant wrote: > Double quotation marks please. Done. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:39: (flags & Ci.nsIWebProgressListener.STATE_IS_WINDOW)) On 2016/09/29 11:44:59, Wladimir Palant wrote: > You don't have to check STATE_IS_WINDOW any more. Done. I left it just in case. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:42: return; On 2016/09/29 11:44:58, Wladimir Palant wrote: > This needs an explanation. Is that check necessary because you load about:blank > into unused tabs and you don't want to report that load? At the very least, this > needs a comment saying so. > > Still, what issue does this check solve? My guess is that without this check the > code will collect data from about:blank, send a message and this message will be > ignored. Wasting a few CPU cycles, so what? Do we really need this special case? > Or maybe we should rather work correctly even if we get something like > about:newtab on our URL list? Answer is in https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:43: const pageInfo = {channelStatus: status}; On 2016/09/29 11:44:58, Wladimir Palant wrote: > Please don't declare regular variables as constants. Just because you don't > assign to pageInfo itself but only its properties this doesn't make it a > constant. const in javascript does not mean immutable, it's useful to find unexpected assignments. Do we use it as indicator of immutability? I have changed to "let" most of the declarations. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:75: const webProgress = docShell.QueryInterface(Ci.nsIInterfaceRequestor).getInterface(Ci.nsIWebProgress); On 2016/09/29 11:44:58, Wladimir Palant wrote: > As above, please don't mark any variable that you happen to only assign once (in > this revision of the code at least, but it might change later) as a constant. > Please get rid of the pseudo-constants below as well. Done. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:92: result.errors.push('No document.documentElement'); On 2016/09/29 11:44:58, Wladimir Palant wrote: > Double quotation marks please. Done. https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js#newc... lib/crawler.js:112: // navigate away from early opened URL On 2016/09/29 11:44:59, Wladimir Palant wrote: > early => previously Done. https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js#newc... lib/crawler.js:113: tab.linkedBrowser.loadURI('about:blank', null, null); On 2016/09/29 11:44:59, Wladimir Palant wrote: > Double quotation marks please. Done. https://codereview.adblockplus.org/29338242/diff/29355263/lib/crawler.js#newc... lib/crawler.js:164: '&info=' + encodeURIComponent(JSON.stringify(info)); On 2016/09/29 11:44:59, Wladimir Palant wrote: > Double quotation marks please. Done. > > What's the point of sending the info object to the frame script? It doesn't need > this information. I use info.addonRoot to load modules, however it should be in the commit when it is actually needed, so I have removed it.
LGTM but please improve the comment as outlined below. https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29355263/lib/child/frameScri... lib/child/frameScript.js:43: const pageInfo = {channelStatus: status}; On 2016/09/29 15:36:21, sergei wrote: > const in javascript does not mean immutable, it's useful to find unexpected > assignments. Do we use it as indicator of immutability? Yes, that's how const is normally used - to indicate values that are inherently immutable, not merely unchanged due to current implementation details. https://codereview.adblockplus.org/29338242/diff/29355374/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29355374/lib/child/frameScri... lib/child/frameScript.js:45: // Another not interesting for us case is about:newtab. Nit: "Another case we are not interested in is about:newtab."
https://codereview.adblockplus.org/29338242/diff/29355374/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29355374/lib/child/frameScri... lib/child/frameScript.js:45: // Another not interesting for us case is about:newtab. On 2016/09/30 07:43:12, Wladimir Palant wrote: > Nit: "Another case we are not interested in is about:newtab." Done.
LGTM
https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; On 2016/09/29 15:36:21, sergei wrote: > On 2016/09/29 11:44:57, Wladimir Palant wrote: > > On 2016/09/29 09:58:13, sergei wrote: > > > It was used here as workaround for race condition, the exception is thrown > > when > > > we try to access request.responseStatus but the page has not been loaded > yet, > > > > I don't really see why the page isn't loaded when you get STATE_STOP. > > On 2016/09/29 11:44:58, Wladimir Palant wrote: > > This needs an explanation. Is that check necessary because you load > about:blank > > into unused tabs and you don't want to report that load? At the very least, > this > > needs a comment saying so. > > > > Still, what issue does this check solve? My guess is that without this check > the > > code will collect data from about:blank, send a message and this message will > be > > ignored. Wasting a few CPU cycles, so what? Do we really need this special > case? > > Or maybe we should rather work correctly even if we get something like > > about:newtab on our URL list? > > Sorry for confusion, I meant under the page the page with URL distinct from > about:blank. First time we receive STATE_STOP for about:blank and the second > time for our URL. We should not process about:blank because it can happen that > the message with information about about:blank is delivered when the code in > crawler.js is already waiting for a message from that tab. When about:blank is > loaded first time, request.responseStatus is not available. > I have put a comment about it into the code. > > BTW, I have just discovered yet one issue with > https://github.com/adblockplus/abpcrawler/blob/master/lib/crawler.js#L111. When > we navigate to about:blank from our URL we get STATE_STOP the third time but the > value of content.location.href is sometimes (!) still old (our URL) and it tries > to get the page info yet one time, and according to the log it fails trying to > get screenshot. I guess it happens because the tab is going to be closed. It's > easy to reproduce if run the crawler with one tab because in that case it > happens very often. > > I would propose to come back to creation of a new tab in TabAllocator.releaseTab > to avoid any side effects of state of already processed tab. What about creation of a new tab in TabAllocator.releaseTab?
Message was sent while issue was closed.
https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; On 2016/10/04 10:57:31, sergei wrote: > What about creation of a new tab in TabAllocator.releaseTab? No, I'd rather not make the code more complicated than it already is for this edge case. I'd suggest that you push your patch and deal with it in a follow-up issue - this took long enough already. > When we navigate to about:blank from our URL we get STATE_STOP the third time but the value of content.location.href is sometimes (!) still old (our URL) Sure that this is STATE_STOP for about:blank? I'd suspect that the page timed out, so it's actually getting its first STATE_STOP when you navigate away to about:blank - and content.location.href is set correctly at that point. I'm not sure why creating a screenshot fails at that point but maybe document.readyState will help you recognize this situation.
Message was sent while issue was closed.
https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... File lib/child/frameScript.js (right): https://codereview.adblockplus.org/29338242/diff/29340782/lib/child/frameScri... lib/child/frameScript.js:53: return; On 2016/10/04 11:10:38, Wladimir Palant wrote: > On 2016/10/04 10:57:31, sergei wrote: > > What about creation of a new tab in TabAllocator.releaseTab? > > No, I'd rather not make the code more complicated than it already is for this > edge case. I'd suggest that you push your patch and deal with it in a follow-up > issue - this took long enough already. Of course it should be in a follow-up issue, the question is whether we need it or not. > > > When we navigate to about:blank from our URL we get STATE_STOP the third time > but the value of content.location.href is sometimes (!) still old (our URL) > > Sure that this is STATE_STOP for about:blank? I'd suspect that the page timed > out, so it's actually getting its first STATE_STOP when you navigate away to > about:blank - and content.location.href is set correctly at that point. I'm not > sure why creating a screenshot fails at that point but maybe document.readyState > will help you recognize this situation. Yes, sure, there is simply no space to time out. Here is the log when we get STATE_STOP and the number of tabs is 1. console.log: owinID: 14, href: about:blank, rstate: uninitialized console.log: owinID: 16, href: about:newtab, rstate: complete console.log: owinID: 14, href: ftp://speedtest.tele2.net/, rstate: complete console.log: owinID: 14, href: ftp://speedtest.tele2.net/, rstate: complete console.log: owinID: 20, href: about:blank, rstate: uninitialized console.log: owinID: 20, href: https://www.amazon.com/, rstate: complete console.log: owinID: 20, href: https://www.amazon.com/, rstate: complete console.log: owinID: 27, href: about:blank, rstate: uninitialized console.log: owinID: 27, href: https://codereview.adblockplus.org/, rstate: complete console.log: owinID: 27, href: https://codereview.adblockplus.org/, rstate: complete console.log: owinID: 30, href: about:blank, rstate: uninitialized console.log: owinID: 30, href: https://twitter.com/, rstate: complete console.log: owinID: 30, href: about:blank, rstate: complete console.log: owinID: 43, href: about:blank, rstate: complete console.log: owinID: 43, href: http://edition.cnn.com/, rstate: complete console.log: owinID: 43, href: http://edition.cnn.com/, rstate: complete console.log: owinID: 94, href: about:blank, rstate: uninitialized console.log: owinID: 94, href: http://bcc.com/, rstate: complete console.log: owinID: 94, href: http://bcc.com/, rstate: complete console.log: owinID: 97, href: about:blank, rstate: uninitialized console.log: owinID: 97, href: https://adblockplus.org/, rstate: complete console.log: owinID: 97, href: about:blank, rstate: complete owinID - outerWindowID rstate - document.readyState I have tried it with commented line tab.linkedBrowser.loadURI("about:blank", null, null); in releaseTab and it looks very similar however there is no any third entry for the same outer window ID. |