|
|
Created:
April 24, 2015, 3:38 p.m. by Wladimir Palant Modified:
Sept. 14, 2016, 2:38 p.m. CC:
Felix Dahlke Visibility:
Public. |
DescriptionDon`t hesitate to nit pick, this code was written rather quickly so readability might not be ideal.
Patch Set 1 #
Total comments: 117
Patch Set 2 : Addressed comments #
Total comments: 4
MessagesTotal messages: 11
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/.hgi... File .hgignore (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/.hgi... .hgignore:5: *.pyc Nit: A while ago we started to list *.pyo along with *.pyc in ignore files. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... File README.md (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:4: Firefox extension that loads a range of websites and records which Apparently its not only a Firefox extension but also a script running Firefox with that extension against a list of websites. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:10: * [Python 2.x](https://www.python.org) We actually require Python 2.7 specifically, as we use argparse and collections.OrderedDict, and potentially other stuff introduced there. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:12: * [mozrunner module](https://pypi.python.org/pypi/mozrunner) I think you should add Firefox to that list as well. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.py File run.py (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:25: def __init__(self, parameters): Nit: There should be a new line between the class variable and the __init__ method. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:27: with io.open(self.parameters.list, 'r', encoding='utf-8') as handle: Nit: I prefer to name file objects just "file" rather than "handle" to match Python's terminology. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:28: self.urls = map(unicode.strip, handle.readlines()) I'd prefer to use a list comprehension here. Also did you know that you can read a file lazily line-by-line by iterating over the file object? self.urls = [line.strip() for line in handle] http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:39: elif path == '/save': Nit: This should be just "if" as we return above. You might want to add a newline above as well. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:42: except (ValueError): Nit: Redundant parenthesis. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:43: start_response('400 Bad Request', []) 411 Length Required? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:44: return '' Please, never return strings from a WSGI handler, but an empty list in this case. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:50: urlhash = hashlib.new('md5', data['url']).hexdigest() MD5 is generally considered deprecated and shouldn't be used in new systems, even if it's not security relevant. I sugggest to use SHA1 instead. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:51: timestamp = datetime.datetime.fromtimestamp(data['startTime'] / 1000.0).strftime('%Y-%m-%dT%H%M%S.%f') Any reason, you don't use time.strftime in the first place, rather than taking the detour over creating a datetime object first? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:63: if "screenshot" in data: As you are removing the item anyway how about using the .pop() method, rather than accessing the item 3 times? screenshot = data.pop("screenshot", None) if screenshot: .. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:65: handle.write(urllib.urlopen(data["screenshot"]).read()) Please close the file-like object returned by urlopen(). http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:74: handle.write(unicode(json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True)) + u'\n') How about json.dump(data, handle, ..)? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:154: print >>handle, 'url=%s' % url Missing quote? (Also in this case, adding a single string to then end of another, I wouldn't mind using the plus operator instead) http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:169: threading.Thread(target=lambda: server.serve_forever()).start() This should probably be daemon thread. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:169: threading.Thread(target=lambda: server.serve_forever()).start() Nit: You can just pass server.server_forever here. No need to wrap it into a lambda function.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.py File run.py (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:154: print >>handle, 'url=%s' % url On 2015/04/27 14:55:50, Sebastian Noack wrote: > Missing quote? Never mind, there is actually a quote. I just overlooked it.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.py File run.py (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:25: def __init__(self, parameters): On 2015/04/27 14:55:50, Sebastian Noack wrote: > Nit: There should be a new line between the class variable and the __init__ > method. Seems like you can omit that variable in the first place. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:168: app.server = server Seems like you never access app.server. So this assignment is unneeded.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... File README.md (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:26: Optionally, you can provide the path to the Adblock Plus repository - Adblock Maybe make sense to also add some small notes about -f flag so users will know they can customize the subscriptions ? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:29: License Is there a purpose why we use MPL instead of GPL ? Same questions goes to other JS files License headers. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/commandLine.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:8: * @module commandLine I think this should be file overview, maybe smth like: @fileOverview handle command line parameters http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:28: for each (let category in this.xpcom_categories) for each in has been deprecated: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/... Please use For...of instead. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:33: for each (let category in this.xpcom_categories) Please use For...of instead. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/crawler.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:8: * @module crawler I think this should be file overview, maybe smth like: @fileOverview loads a range of websites and records which elements are filtered by Adblock Plus http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:17: let result = {}; We can create here Object without Object prototype: Object.create(null); http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:38: function TabAllocator(browser, maxtabs) Why don't we use default amount of maxtab equal to the urls count specified in the textfile ? And can you please give me a hint how it's being calculated in case it's not specified ? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:55: * @result {Promise} Nit: @return ? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:63: let deferred = Promise.defer(); The Deffered object is obsolete starting from Gecko 30: https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/Prom... Please use Promise constructor instead. Also please update the metadata.gecko for the compatibility. Not sure if it make sense to make it backward compatible with Deffered. Promises should be supported by Firefox starting from v29 according to HTML5Rocks article, couldn't find official info: http://www.html5rocks.com/en/tutorials/es6/promises/#toc-browser-support http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:81: this._deferred.shift().resolve(tab); As mentioned above, the Deffered object is obsolete. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:109: * @result {Promise} Nit: @return http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:113: let deferred = Promise.defer(); As mentioned above this is Obsolete, this comment actually goes to other Deffered object usage in the script. Will stop comment on each related row. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:134: if ((flags & Ci.nsIWebProgressListener.STATE_STOP) && (flags & Ci.nsIWebProgressListener.STATE_IS_WINDOW)) Nit: This case can be splitted into multiple lines. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:151: // Exceptions are expected here Please also handle the exception. reportException(e); http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:168: Services.obs.addObserver(this, "xul-window-registered", true) Missing semicolon http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:176: Services.obs.removeObserver(this, "xul-window-registered") Missing semicolon http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:182: .getInterface(Ci.nsIDOMWindow) Missing semicolon http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:185: if (window.document.documentElement.localName == 'dialog') Nit: Please use double quote. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:200: * @result {Object} Nit: @return http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:218: }; Nit: Semicolon is redundant here. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:231: * URL that should receive the results Nit: Please also document onDone parameter for consistency. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:291: * @result {Object} Nit: @return {Object} Crawling result http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:319: result.screenshot = canvas.toDataURL("image/jpeg", 0.8); Maybe make sense to let user specify the quality of image same way it's done for "maxtabs" ? What you think ? Anyway we don't have any specifications for this review, so it's optional I think. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:324: result.error = "Capturing screenshot failed: " + e; Isn't result.error redundant ? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:327: // TODO: Capture frames as well? Nit: I think we shouldn't have TODO comments in released product. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:375: * Call the original processNode. If the original throws, then we will too, so this is outside a try clause. Nit: the line is too long. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:407: }; Nit: Semicolon is redundant here. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/main.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/main.js:8: * @module main I think this should be file overview, maybe smth like: /* * @fileOverview Starts up Crawler /* http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/main.js:26: let deferred = Promise.defer(); As mentioned in crawler.js please also don't use Deffered object here, but Promises as well. This comment actually goes to other Deffered object usage in the script. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/main.js:91: Cu.reportError("Failed loading parameters"); Maybe also make sense to print the message in the console, not to repeat the code, I think it make sense to create separate "reportException" method, similar we have in crawler.js. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... File metadata.gecko (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... metadata.gecko:4: version=0.1 Version should be 0.2 at least. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... metadata.gecko:8: firefox=16.0/39.0 The minimal version I think should be as minimum same as for Adblock Plus while this add-on is depends on ABP.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/commandLine.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:8: * @module commandLine On 2015/05/04 18:13:43, saroyanm wrote: > I think this should be file overview, maybe smth like: > @fileOverview handle command line parameters If this is actually a module, then there should be a @module tag, at least if you might want to run JsDoc 3 on this code at some point, generating documentation that isn't totally confusing. You can add a @fileOverview tag as well, but please only do so if you plan to add any real information. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/crawler.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:17: let result = {}; On 2015/05/04 18:13:43, saroyanm wrote: > We can create here Object without Object prototype: > Object.create(null); It seems that we don't need to bother about conflicts with properties inherited form Object.prototypte here. Also the lookup below isn't a hotspot. So IMO using an object literal is totally fine, and even preferable to keep it simple. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:231: * URL that should receive the results Also note that JsDoc 3 will mistakenly document this (and similarly defined) function(s) as @inner (because it doesn't see that it is assigned to exports.*). In order to avoid that you could do one of following: 1. Explicitly add the @static tag 2. Directly assign the function to exports.* /** * .. */ exports.run = function(..) {..} 3. Assign the function both, a local variable and exports.* in one expression, but add the docs after the local variable: let run = /** * .. */ exports.run = function(..) {..} http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:375: * Call the original processNode. If the original throws, then we will too, so this is outside a try clause. Also I suggest to use simply // for comments explaining the logic inside a function, instead the more verbose JsDoc style. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... File metadata.gecko (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... metadata.gecko:8: firefox=16.0/39.0 On 2015/05/04 18:13:43, saroyanm wrote: > The minimal version I think should be as minimum same as for Adblock Plus while > this add-on is depends on ABP. IMO that line of arguing doesn't make sense. In theory you could use abpcrawler also with older ABP versions on older FF versions. Also we're probably not going to update this file, every time we update the compatible versions for ABP. But well, it seems if we upgrade deferreds -> promises, we'll need Gecko 29 indeed.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/commandLine.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:8: * @module commandLine On 2015/05/04 20:39:01, Sebastian Noack wrote: > On 2015/05/04 18:13:43, saroyanm wrote: > > I think this should be file overview, maybe smth like: > > @fileOverview handle command line parameters > > If this is actually a module, then there should be a @module tag, at least if > you might want to run JsDoc 3 on this code at some point, generating > documentation that isn't totally confusing. You can add a @fileOverview tag as > well, but please only do so if you plan to add any real information. fair enough. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... File metadata.gecko (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... metadata.gecko:8: firefox=16.0/39.0 On 2015/05/04 20:39:01, Sebastian Noack wrote: > On 2015/05/04 18:13:43, saroyanm wrote: > > The minimal version I think should be as minimum same as for Adblock Plus > while > > this add-on is depends on ABP. > > IMO that line of arguing doesn't make sense. In theory you could use abpcrawler > also with older ABP versions on older FF versions. Also we're probably not going > to update this file, every time we update the compatible versions for ABP. > > But well, it seems if we upgrade deferreds -> promises, we'll need Gecko 29 > indeed. agree.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/.hgi... File .hgignore (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/.hgi... .hgignore:5: *.pyc On 2015/04/27 14:55:50, Sebastian Noack wrote: > Nit: A while ago we started to list *.pyo along with *.pyc in ignore files. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... File README.md (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:4: Firefox extension that loads a range of websites and records which On 2015/04/27 14:55:50, Sebastian Noack wrote: > Apparently its not only a Firefox extension but also a script running Firefox > with that extension against a list of websites. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:10: * [Python 2.x](https://www.python.org) On 2015/04/27 14:55:50, Sebastian Noack wrote: > We actually require Python 2.7 specifically, as we use argparse and > collections.OrderedDict, and potentially other stuff introduced there. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:12: * [mozrunner module](https://pypi.python.org/pypi/mozrunner) On 2015/04/27 14:55:50, Sebastian Noack wrote: > I think you should add Firefox to that list as well. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:26: Optionally, you can provide the path to the Adblock Plus repository - Adblock On 2015/05/04 18:13:43, saroyanm wrote: > Maybe make sense to also add some small notes about -f flag so users will know > they can customize the subscriptions ? Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:29: License On 2015/05/04 18:13:43, saroyanm wrote: > Is there a purpose why we use MPL instead of GPL ? Same questions goes to other > JS files License headers. This extension was written before we switched to the GPL, and I didn't bother relicensing it. Frankly, none of our Firefox extensions have been relicensed so far - with exception of Adblock Plus itself. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/commandLine.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:28: for each (let category in this.xpcom_categories) On 2015/05/04 18:13:43, saroyanm wrote: > for each in has been deprecated: > https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/... > > > Please use For...of instead. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/commandLine.js:33: for each (let category in this.xpcom_categories) On 2015/05/04 18:13:43, saroyanm wrote: > Please use For...of instead. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/crawler.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:8: * @module crawler On 2015/05/04 18:13:43, saroyanm wrote: > I think this should be file overview, maybe smth like: > @fileOverview loads a range of websites and records which elements are filtered > by Adblock Plus As Sebastian noted elsewhere, @module is actually appropriate here. And - yes, we need to fix that in Adblock Plus for Firefox as well. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:17: let result = {}; Yes, this isn't a place where property conflicts could be an issue. This code is actually identical to the require() function defined in chrome/content/ui/utils.js in adblockplus repository. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:38: function TabAllocator(browser, maxtabs) On 2015/05/04 18:13:43, saroyanm wrote: > Why don't we use default amount of maxtab equal to the urls count specified in > the textfile ? Because the text file has 50 thousand URLs. > And can you please give me a hint how it's being calculated in case it's not > specified ? run.py sets it to 15 by default. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:55: * @result {Promise} On 2015/05/04 18:13:43, saroyanm wrote: > Nit: @return ? Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:63: let deferred = Promise.defer(); On 2015/05/04 18:13:43, saroyanm wrote: > Promises should be supported > by Firefox starting from v29 according to HTML5Rocks article, couldn't find > official info: > http://www.html5rocks.com/en/tutorials/es6/promises/#toc-browser-support Promise constructor in Promise.jsm is actually supported starting with Firefox 30 (https://bugzilla.mozilla.org/show_bug.cgi?id=941757). Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:81: this._deferred.shift().resolve(tab); On 2015/05/04 18:13:43, saroyanm wrote: > As mentioned above, the Deffered object is obsolete. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:109: * @result {Promise} On 2015/05/04 18:13:43, saroyanm wrote: > Nit: @return Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:113: let deferred = Promise.defer(); On 2015/05/04 18:13:43, saroyanm wrote: > As mentioned above this is Obsolete, this comment actually goes to other > Deffered object usage in the script. Will stop comment on each related row. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:151: // Exceptions are expected here On 2015/05/04 18:13:43, saroyanm wrote: > Please also handle the exception. > reportException(e); As the comment explains, exceptions are expected here. If the connection wasn't established, request.responseStatus will throw an exception. Spamming the console with that is pointless, we are saving the channel status anyway. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:168: Services.obs.addObserver(this, "xul-window-registered", true) On 2015/05/04 18:13:43, saroyanm wrote: > Missing semicolon Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:176: Services.obs.removeObserver(this, "xul-window-registered") On 2015/05/04 18:13:43, saroyanm wrote: > Missing semicolon Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:182: .getInterface(Ci.nsIDOMWindow) On 2015/05/04 18:13:43, saroyanm wrote: > Missing semicolon Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:185: if (window.document.documentElement.localName == 'dialog') On 2015/05/04 18:13:43, saroyanm wrote: > Nit: Please use double quote. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:200: * @result {Object} On 2015/05/04 18:13:43, saroyanm wrote: > Nit: @return Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:218: }; On 2015/05/04 18:13:43, saroyanm wrote: > Nit: Semicolon is redundant here. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:231: * URL that should receive the results On 2015/05/04 20:39:01, Sebastian Noack wrote: > Also note that JsDoc 3 will mistakenly document this (and similarly defined) > function(s) as @inner (because it doesn't see that it is assigned to exports.*). > In order to avoid that you could do one of following: > > 1. Explicitly add the @static tag > 2. Directly assign the function to exports.* > > /** > * .. > */ > exports.run = function(..) {..} > > 3. Assign the function both, a local variable and exports.* in one expression, > but add the docs after the local variable: > > let run = > /** > * .. > */ > exports.run = function(..) {..} Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:291: * @result {Object} On 2015/05/04 18:13:43, saroyanm wrote: > Nit: @return {Object} Crawling result Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:319: result.screenshot = canvas.toDataURL("image/jpeg", 0.8); On 2015/05/04 18:13:43, saroyanm wrote: > Maybe make sense to let user specify the quality of image same way it's done for > "maxtabs" ? Well, changing maxtabs is very useful when testing. Given that there won't be a "user" here but merely a server running this, I cannot really imagine anybody tweaking this parameter. Still, we can easily add it if we need it. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:324: result.error = "Capturing screenshot failed: " + e; On 2015/05/04 18:13:43, saroyanm wrote: > Isn't result.error redundant ? No. The exception is merely reported to the console but not stored with the data - useful for debugging, nothing beyond that. The result variable on the other hand will be saved in a file, so somebody will be able to look up why the screenshot for this particular site wasn't saved. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:327: // TODO: Capture frames as well? On 2015/05/04 18:13:43, saroyanm wrote: > Nit: I think we shouldn't have TODO comments in released product. Why not? This is something we might want to do in future, not really important right now however. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:375: * Call the original processNode. If the original throws, then we will too, so this is outside a try clause. On 2015/05/04 20:39:01, Sebastian Noack wrote: > Also I suggest to use simply // for comments explaining the logic inside a > function, instead the more verbose JsDoc style. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:407: }; On 2015/05/04 18:13:43, saroyanm wrote: > Nit: Semicolon is redundant here. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/main.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/main.js:8: * @module main On 2015/05/04 18:13:43, saroyanm wrote: > I think this should be file overview, maybe smth like: > /* > * @fileOverview Starts up Crawler > /* As Sebastian noted elsewhere, @module is actually appropriate here. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/main.js:26: let deferred = Promise.defer(); On 2015/05/04 18:13:43, saroyanm wrote: > As mentioned in crawler.js please also don't use Deffered object here, but > Promises as well. This comment actually goes to other Deffered object usage in > the script. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/main.js:91: Cu.reportError("Failed loading parameters"); On 2015/05/04 18:13:43, saroyanm wrote: > Maybe also make sense to print the message in the console, > not to repeat the code, I think it make sense to create separate > "reportException" method, similar we have in crawler.js. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... File metadata.gecko (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/meta... metadata.gecko:4: version=0.1 On 2015/05/04 18:13:43, saroyanm wrote: > Version should be 0.2 at least. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.py File run.py (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:25: def __init__(self, parameters): On 2015/04/28 10:42:25, Sebastian Noack wrote: > On 2015/04/27 14:55:50, Sebastian Noack wrote: > > Nit: There should be a new line between the class variable and the __init__ > > method. > > Seems like you can omit that variable in the first place. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:27: with io.open(self.parameters.list, 'r', encoding='utf-8') as handle: On 2015/04/27 14:55:50, Sebastian Noack wrote: > Nit: I prefer to name file objects just "file" rather than "handle" to match > Python's terminology. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:28: self.urls = map(unicode.strip, handle.readlines()) On 2015/04/27 14:55:50, Sebastian Noack wrote: > I'd prefer to use a list comprehension here. Also did you know that you can read > a file lazily line-by-line by iterating over the file object? > > self.urls = [line.strip() for line in handle] Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:39: elif path == '/save': On 2015/04/27 14:55:50, Sebastian Noack wrote: > Nit: This should be just "if" as we return above. You might want to add a > newline above as well. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:42: except (ValueError): On 2015/04/27 14:55:50, Sebastian Noack wrote: > Nit: Redundant parenthesis. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:43: start_response('400 Bad Request', []) On 2015/04/27 14:55:50, Sebastian Noack wrote: > 411 Length Required? Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:44: return '' On 2015/04/27 14:55:50, Sebastian Noack wrote: > Please, never return strings from a WSGI handler, but an empty list in this > case. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:50: urlhash = hashlib.new('md5', data['url']).hexdigest() On 2015/04/27 14:55:50, Sebastian Noack wrote: > MD5 is generally considered deprecated and shouldn't be used in new systems, > even if it's not security relevant. I sugggest to use SHA1 instead. We've had that discussion before :) The reason I'm using md5 here is simply shorter hashes. Note that SHA1 is equally deprecated. I don't think we have to keep using the latest and greatest when the point isn't security. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:51: timestamp = datetime.datetime.fromtimestamp(data['startTime'] / 1000.0).strftime('%Y-%m-%dT%H%M%S.%f') On 2015/04/27 14:55:50, Sebastian Noack wrote: > Any reason, you don't use time.strftime in the first place, rather than taking > the detour over creating a datetime object first? Well, how do you convert a Unix timestamp to struct_time that time.strftime expects? http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:63: if "screenshot" in data: On 2015/04/27 14:55:50, Sebastian Noack wrote: > As you are removing the item anyway how about using the .pop() method, rather > than accessing the item 3 times? > > screenshot = data.pop("screenshot", None) > if screenshot: > .. > > Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:65: handle.write(urllib.urlopen(data["screenshot"]).read()) On 2015/04/27 14:55:50, Sebastian Noack wrote: > Please close the file-like object returned by urlopen(). Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:74: handle.write(unicode(json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True)) + u'\n') On 2015/04/27 14:55:50, Sebastian Noack wrote: > How about json.dump(data, handle, ..)? Well, json module will try to write str instead of unicode objects - won't work. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:168: app.server = server On 2015/04/28 10:42:25, Sebastian Noack wrote: > Seems like you never access app.server. So this assignment is unneeded. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:169: threading.Thread(target=lambda: server.serve_forever()).start() On 2015/04/27 14:55:50, Sebastian Noack wrote: > Nit: You can just pass server.server_forever here. No need to wrap it into a > lambda function. Done. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:169: threading.Thread(target=lambda: server.serve_forever()).start() On 2015/04/27 14:55:50, Sebastian Noack wrote: > This should probably be daemon thread. Done.
http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.py File run.py (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:50: urlhash = hashlib.new('md5', data['url']).hexdigest() On 2015/05/07 00:04:59, Wladimir Palant wrote: > On 2015/04/27 14:55:50, Sebastian Noack wrote: > > MD5 is generally considered deprecated and shouldn't be used in new systems, > > even if it's not security relevant. I suggest to use SHA1 instead. > > We've had that discussion before :) > > The reason I'm using md5 here is simply shorter hashes. Note that SHA1 is > equally deprecated. I don't think we have to keep using the latest and greatest > when the point isn't security. Well, SHA1 isn't equally deprecated. AFAIK, SHA2 is pretty much the same algo just providing larger digests. And when NIST started the SHA3 challenge a couple of years ago, they assumed that SHA1/2 will be broken by now, but it turned out that SHA1/2 is still secure. As opposed to MD5 which is known to various collision attacks by now. I know security isn't a concern here. But why do you insist to use something that is ancient and best practices suggest to avoid for the past couple of years? If you are bothered about digest sizes, do the 4 bytes more really make a difference? If so truncating SHA1+ digests to 16 bits would still be better than using MD5. Also note that Keccak (the SHA3 winner) is a sponge functions with variable length digests. So you can have whatever digest size you want. Unfortunately it's not built into hashlib yet. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:51: timestamp = datetime.datetime.fromtimestamp(data['startTime'] / 1000.0).strftime('%Y-%m-%dT%H%M%S.%f') On 2015/05/07 00:04:59, Wladimir Palant wrote: > On 2015/04/27 14:55:50, Sebastian Noack wrote: > > Any reason, you don't use time.strftime in the first place, rather than taking > > the detour over creating a datetime object first? > > Well, how do you convert a Unix timestamp to struct_time that time.strftime > expects? Nothing easier than that. I suppose you want UTC, time.gmtime() then, otherwise time.localtime(). Both take a number of seconds since epoch (aka UNIX timestamp) and return a time tuple, suitable to pass to time.strftime(). http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/run.... run.py:74: handle.write(unicode(json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True)) + u'\n') On 2015/05/07 00:04:59, Wladimir Palant wrote: > On 2015/04/27 14:55:50, Sebastian Noack wrote: > > How about json.dump(data, handle, ..)? > > Well, json module will try to write str instead of unicode objects - won't work. Yeah, and the json module will also encode it as UTF-8 (if not specified other with the "encoding" argument). So just use open() instead io.open()? http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... File lib/debug.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... lib/debug.js:23: dump(e + "\n" + stack + "\n"); Note that the trailing newline will be duplicated.
http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... File lib/commandLine.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... lib/commandLine.js:31: onShutdown.add((function() Nit: I'd rather use an arrow function instead (function() {}).bind(this) http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... File lib/crawler.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... lib/crawler.js:296: function* crawl_url(url, tabAllocator, loadListener) Nit: camel case
LGTM with small Nit. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... File README.md (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/READ... README.md:29: License On 2015/05/07 00:04:59, Wladimir Palant wrote: > On 2015/05/04 18:13:43, saroyanm wrote: > > Is there a purpose why we use MPL instead of GPL ? Same questions goes to > other > > JS files License headers. > > This extension was written before we switched to the GPL, and I didn't bother > relicensing it. Frankly, none of our Firefox extensions have been relicensed so > far - with exception of Adblock Plus itself. Got it, thanks. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... File lib/crawler.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:319: result.screenshot = canvas.toDataURL("image/jpeg", 0.8); On 2015/05/07 00:04:59, Wladimir Palant wrote: > On 2015/05/04 18:13:43, saroyanm wrote: > > Maybe make sense to let user specify the quality of image same way it's done > for > > "maxtabs" ? > > Well, changing maxtabs is very useful when testing. Given that there won't be a > "user" here but merely a server running this, I cannot really imagine anybody > tweaking this parameter. Still, we can easily add it if we need it. Good point. http://codereview.adblockplus.org/5288886037118976/diff/5629499534213120/lib/... lib/crawler.js:324: result.error = "Capturing screenshot failed: " + e; On 2015/05/07 00:04:59, Wladimir Palant wrote: > On 2015/05/04 18:13:43, saroyanm wrote: > > Isn't result.error redundant ? > > No. The exception is merely reported to the console but not stored with the data > - useful for debugging, nothing beyond that. The result variable on the other > hand will be saved in a file, so somebody will be able to look up why the > screenshot for this particular site wasn't saved. Yes, missed that. Good point. http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... File lib/crawler.js (right): http://codereview.adblockplus.org/5288886037118976/diff/5673385510043648/lib/... lib/crawler.js:132: if ((flags & Ci.nsIWebProgressListener.STATE_STOP) && Nit: We usually also move operators to new line, for consistency. if ((flags & Ci.nsIWebProgressListener.STATE_STOP) && (flags & Ci.nsIWebProgressListener.STATE_IS_WINDOW)) |