|
|
Created:
March 11, 2016, 3:55 p.m. by sergei Modified:
March 15, 2016, 7:25 p.m. Reviewers:
Wladimir Palant Visibility:
Public. |
Description# abpcrawler project
The code also assumes that after loading of filters there is at least one subscription.
Patch Set 1 #
Total comments: 16
Patch Set 2 : address comments #
Total comments: 6
Patch Set 3 : move .catch() on the same line with .then() #Patch Set 4 : make crawl_urls function #Patch Set 5 : remove empty line #Patch Set 6 : don't use internal FilterStorage._loading #
Total comments: 4
Patch Set 7 : remove onFiltersLoaded listener and fit 80 chars #MessagesTotal messages: 9
https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:224: new Promise(function(resolve, reject) Nit: Why not use arrow functions consistently? (resolve, reject) => https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:226: if (FilterStorage.subscriptions.length > 0 && !FilterStorage._loading) _loading is an internal flag to prevent reentrance, you cannot rely on it. Checking FilterStorage.subscriptions.length should do. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:233: if (action === "load") Nit: We don't usually use strict equality, and it doesn't make much sense here. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:238: }).then(_ => No pointless parameter please, () => https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:265: // Be careful, `catch` does not catch exeptions from any asynchronous calls exeptions => exceptions. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:268: }).catch(reportException); I'm not really happy with the way this is structured. At the very least, then callback should be moved into a separate function, maybe called crawl_urls. Then this lengthy comment won't be necessary, nobody will expect that catch callback to capture errors from everything above.
https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:226: if (FilterStorage.subscriptions.length > 0 && !FilterStorage._loading) On 2016/03/15 09:13:10, Wladimir Palant wrote: > _loading is an internal flag to prevent reentrance, you cannot rely on it. > Checking FilterStorage.subscriptions.length should do. Actually, I think that this is a bad assumption from the start - the crawler might be run without any subscriptions whatsoever, just to see what requests are being made. Since we are running on startup however we can assume that the filters didn't load yet and just wait for the "load" notification.
https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:224: new Promise(function(resolve, reject) On 2016/03/15 09:13:10, Wladimir Palant wrote: > Nit: Why not use arrow functions consistently? (resolve, reject) => We don't need to capture `this` here, so I have not used arrow function to indicate that we are not using anything from `this`. Changed to arrow function for consistency. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:226: if (FilterStorage.subscriptions.length > 0 && !FilterStorage._loading) On 2016/03/15 09:18:11, Wladimir Palant wrote: > On 2016/03/15 09:13:10, Wladimir Palant wrote: > > _loading is an internal flag to prevent reentrance, you cannot rely on it. > > Checking FilterStorage.subscriptions.length should do. > > Actually, I think that this is a bad assumption from the start - the crawler > might be run without any subscriptions whatsoever, just to see what requests are > being made. Since we are running on startup however we can assume that the > filters didn't load yet and just wait for the "load" notification. I thought about it. It does happen that "load" event has been already sent and we will never receive it, thus wait forever, so there should be some way to know whether we should wait for "load" or not. It happens not so often with easylist, to easier reproduce it I used a custom filter file with merely a couple of filters (to test one can add "-f custom-fitler-file=" to command line). I'm not sure that testing `FilterStorage.subscriptions.length` against zero is enough because I can easily assume that it's already bigger than zero but not all subscriptions are loaded yet (because they can be loaded asynchronously), can it be the truth? `_loading` is set to false only right before sending of the "load" event, of course except the initial value, so it seems the best candidate for the role of the flag whether we need to wait for "load" or not. In addition, it's not so easy to run it without subscribtions, one need to change "run.py", either set `extensions.adblockplus.currentVersion` or specify some pre-configured profile, otherwise there are always several subscriptions. So I think that if someone wants to run with zero subscriptions he knows what he is doing and can also adapt crawler. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:233: if (action === "load") On 2016/03/15 09:13:11, Wladimir Palant wrote: > Nit: We don't usually use strict equality, and it doesn't make much sense here. Fixed. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:238: }).then(_ => On 2016/03/15 09:13:11, Wladimir Palant wrote: > No pointless parameter please, () => Acknowledged. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:265: // Be careful, `catch` does not catch exeptions from any asynchronous calls On 2016/03/15 09:13:11, Wladimir Palant wrote: > exeptions => exceptions. Done. Basically, I have removed the comment. https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:268: }).catch(reportException); On 2016/03/15 09:13:11, Wladimir Palant wrote: > I'm not really happy with the way this is structured. At the very least, then > callback should be moved into a separate function, maybe called crawl_urls. Done. > Then > this lengthy comment won't be necessary, nobody will expect that catch callback > to capture errors from everything above. Yeah, the comment in general is not necessary, I added it rather to attract attention to the structure of the code above. I would say it's not a good idea to have such free-running tasks, instead of "rejected" `function(url, exception)...` it would be better to use `.catch` because if `function(result)` somehow throws it will be lost, loadListener.stop(); windowCloser.stop(); onDone(); from `taskDone` should be rather in some-kind of "finally" block.
https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:224: new Promise(function(resolve, reject) On 2016/03/15 12:16:23, sergei wrote: > We don't need to capture `this` here, so I have not used arrow function to > indicate that we are not using anything from `this`. We generally use arrow functions for inline callbacks, it's simply a more convenient syntax and one doesn't have to think about side-effects (meaning: `this` pointer changing). https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:226: if (FilterStorage.subscriptions.length > 0 && !FilterStorage._loading) On 2016/03/15 12:16:23, sergei wrote: > I'm not sure that testing `FilterStorage.subscriptions.length` against zero is > enough because I can easily assume that it's already bigger than zero but not > all subscriptions are loaded yet (because they can be loaded asynchronously), > can it be the truth? No, subscriptions are added all at once when loading finishes. So checking _loading is unnecessary. Fair enough - I guess when we decide to run this without filters we'll simply add an empty subscription. https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js#newc... lib/crawler.js:24: Nit: Why the empty line here? https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js#newc... lib/crawler.js:224: let crawl_urls = () => I really meant having crawl_urls as a regular function - |function clawl_urls(urls)|, just above |function* crawl_url|. https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js#newc... lib/crawler.js:267: .catch(reportException); Nit: put catch() on the same line as then()?
https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338154/lib/crawler.js#newc... lib/crawler.js:226: if (FilterStorage.subscriptions.length > 0 && !FilterStorage._loading) On 2016/03/15 13:44:22, Wladimir Palant wrote: > On 2016/03/15 12:16:23, sergei wrote: > > I'm not sure that testing `FilterStorage.subscriptions.length` against zero is > > enough because I can easily assume that it's already bigger than zero but not > > all subscriptions are loaded yet (because they can be loaded asynchronously), > > can it be the truth? > > No, subscriptions are added all at once when loading finishes. So checking > _loading is unnecessary. > > Fair enough - I guess when we decide to run this without filters we'll simply > add an empty subscription. Acknowledged. https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js#newc... lib/crawler.js:24: On 2016/03/15 13:44:22, Wladimir Palant wrote: > Nit: Why the empty line here? Done. https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js#newc... lib/crawler.js:224: let crawl_urls = () => On 2016/03/15 13:44:22, Wladimir Palant wrote: > I really meant having crawl_urls as a regular function - |function > clawl_urls(urls)|, just above |function* crawl_url|. Done. https://codereview.adblockplus.org/29338153/diff/29338313/lib/crawler.js#newc... lib/crawler.js:267: .catch(reportException); On 2016/03/15 13:44:22, Wladimir Palant wrote: > Nit: put catch() on the same line as then()? Done.
https://codereview.adblockplus.org/29338153/diff/29338339/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338339/lib/crawler.js#newc... lib/crawler.js:219: FilterNotifier.addListener((action, item, newValue, oldValue) => Just realized: you have to remove that listener after it fires. https://codereview.adblockplus.org/29338153/diff/29338339/lib/crawler.js#newc... lib/crawler.js:226: }).then(() => crawl_urls(window, urls, timeout, maxtabs, targetURL, onDone)).catch(reportException); Sorry but now having catch() on the same line no longer makes sense - the line got too long...
https://codereview.adblockplus.org/29338153/diff/29338339/lib/crawler.js File lib/crawler.js (right): https://codereview.adblockplus.org/29338153/diff/29338339/lib/crawler.js#newc... lib/crawler.js:219: FilterNotifier.addListener((action, item, newValue, oldValue) => On 2016/03/15 14:50:58, Wladimir Palant wrote: > Just realized: you have to remove that listener after it fires. Acknowledged. https://codereview.adblockplus.org/29338153/diff/29338339/lib/crawler.js#newc... lib/crawler.js:226: }).then(() => crawl_urls(window, urls, timeout, maxtabs, targetURL, onDone)).catch(reportException); On 2016/03/15 14:50:58, Wladimir Palant wrote: > Sorry but now having catch() on the same line no longer makes sense - the line > got too long... Done.
LGTM |